Class SimpleXMLParser

java.lang.Object
de.stefanfrings.utils.SimpleXMLParser

public abstract class SimpleXMLParser extends Object
Very efficient parser for large XML documents, based on SAX. This class can be used to read huge files because it parses the XML document line by line and uses very little memory.

The main benefit over plain SAX is that SimpleXMLParser provides the hierarchy of parent elements and concatenates the text fragments between the start and end tags into a single string.
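The two additions described above can be sketched with the JDK's own SAX API. This is only an illustration of the idea, not the actual implementation of SimpleXMLParser; the class name PathTrackingSketch and the method collect are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: tracks the element path and joins text fragments on top of SAX.
public class PathTrackingSketch {

    /** Returns one "path=text" entry for every end tag, in document order. */
    static List<String> collect(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        List<String> path = new ArrayList<>();          // names of the open elements, outermost first
        List<StringBuilder> text = new ArrayList<>();   // one text buffer per open element

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                path.add(qName);
                text.add(new StringBuilder());
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver the text of one element in several fragments
                if (!text.isEmpty()) {
                    text.get(text.size() - 1).append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String content = text.remove(text.size() - 1).toString().trim();
                events.add(String.join("/", path) + "=" + content);
                path.remove(path.size() - 1);
            }
        };

        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : collect("<shops><shop><name>Mc Donald</name></shop></shops>")) {
            System.out.println(event);
        }
    }
}
```

Keeping one buffer per open element means that the text of a child is never mixed into the text of its parent.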

To process the data, override the start(XMLElement) method, the end(XMLElement) method, or both. Example:

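 // Requires: import java.io.FileInputStream;
 SimpleXMLParser parser = new SimpleXMLParser()
 {
     // Called for every start tag; the text content is not yet available here.
     protected void start(XMLElement element)
     {
         System.out.print("start()  ");
         System.out.println(element.toString());
     }

     // Called for every end tag; the text content is available as "__characters".
     protected void end(XMLElement element)
     {
         System.out.print("end()    ");
         System.out.println(element.toString());
     }
 };
 // "test" is the symbolic name used in log messages; false disables validation.
 parser.parse(new FileInputStream("test.xml"), "test", false);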
 

Example input:

 <shops>
     <shop type="fast food" favorite="false">
         <name language="en">Mc Donald</name>
         <description language="en">
             Well known for burgers
             and salads
         </description>
     </shop>
     <shop type="books" favorite="true">
         <name language="en">Readers Place</name>
         <description language="en">
             They really know what they sell.
             Ask the employees for recommendations.
         </description>
     </shop>
 </shops>
 
For this document, the start() and end() methods would each be called 7 times, with the following element paths and attributes in {}:
 start()  shops {}
 start()  shops/shop {type=fast food, favorite=false}
 start()  shops/shop/name {language=en}
 end()    shops/shop/name {language=en, __characters=Mc Donald}
 start()  shops/shop/description {language=en}
 end()    shops/shop/description {language=en, __characters=Well known for burgers\nand salads}
 end()    shops/shop {type=fast food, favorite=false}
 start()  shops/shop {type=books, favorite=true}
 start()  shops/shop/name {language=en}
 end()    shops/shop/name {language=en, __characters=Readers Place}
 start()  shops/shop/description {language=en}
 end()    shops/shop/description {language=en, __characters=They really know what they sell.\nAsk the employees for recommendations.}
 end()    shops/shop {type=books, favorite=true}
 end()    shops {}
 
The text characters collected between the start and end tags are returned like the other XML attributes, but under the special name "__characters". While parsing the XML, the following parts are removed from these characters: leading and trailing whitespace, leading and trailing line feeds, duplicate whitespace, duplicate line feeds, indentation and all other control characters.
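The cleanup rules above can be approximated as follows. This is a sketch of the described behaviour, not the code used by SimpleXMLParser; the class name WhitespaceSketch and the method normalize are hypothetical, and replacing control characters inside a line with a space is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the text cleanup described above.
public class WhitespaceSketch {

    static String normalize(String raw) {
        List<String> lines = new ArrayList<>();
        for (String line : raw.split("\\R")) {          // split at any kind of line break
            String cleaned = line
                    .replaceAll("\\p{Cntrl}", " ")      // assumption: tabs and other control characters become a space
                    .replaceAll(" +", " ")              // collapse duplicate spaces
                    .trim();                            // drop indentation and trailing spaces
            if (!cleaned.isEmpty()) {                   // drop empty lines, i.e. duplicate line feeds
                lines.add(cleaned);
            }
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        String raw = "\n            Well known for burgers\n            and salads\n        ";
        System.out.println(normalize(raw));
    }
}
```

Applied to the first description element of the example input, this yields the two lines "Well known for burgers" and "and salads", separated by a single line feed.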

XML namespaces and DTDs are supported as well but have no effect on the output. For example, an element "s:shop" is called "shop" in the output.
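How a prefixed name such as "s:shop" can be reduced to "shop" is shown below with a namespace-aware JDK SAX parser. Whether SimpleXMLParser works exactly this way is an assumption; the class name NamespaceSketch, the method localNames and the namespace URI urn:example are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: a namespace-aware SAX parser reports names without their prefix.
public class NamespaceSketch {

    /** Returns the local names of all elements in document order. */
    static List<String> localNames(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);   // required, otherwise localName may be empty
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName, String qName, Attributes atts) {
                        names.add(localName);   // qName would still be "s:shop"
                    }
                });
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<s:shops xmlns:s=\"urn:example\"><s:shop/></s:shops>";
        System.out.println(localNames(xml));
    }
}
```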

Author:
Stefan Frings, http://stefanfrings.de/javautils
  • Constructor Details

    • SimpleXMLParser

      public SimpleXMLParser()
  • Method Details

    • parse

      public void parse(InputStream byteStream, String name, boolean validate) throws XMLParseException
      Parses an XML document. Every time the start of an XML element has been read, the method start(XMLElement) is called. Every time the end of an XML element has been read, the method end(XMLElement) is called. The text between the start and end tags is available only in end(XMLElement).
      Parameters:
      byteStream - Source of the XML document
      name - A symbolic name for the source, used in log messages
      validate - Whether to validate the document against its DTD and XML namespaces (slows down parsing)
      Throws:
      XMLParseException - If the XML is invalid.
    • start

      protected void start(XMLElement element) throws Exception
      This method is called whenever the start of a new element is reached. The default implementation does nothing.

      All attributes of the XML element are available via element.getAttribute(name).

      To access the text content of an XML element, override the end() method instead. The characters are not yet available at this stage.

      Parameters:
      element - The current XML element
      Throws:
      Exception - In case of any exception
    • end

      protected void end(XMLElement element) throws Exception
      This method is called whenever the end of an element is reached. The default implementation does nothing.

      All attributes of the XML element are available via element.getAttribute(name).

      The text content of the XML element is made available via element.getAttribute("__characters").

      Parameters:
      element - The current XML element
      Throws:
      Exception - In case of any exception