Java – Parsing an XML document with SAX

Java offers both DOM and SAX mechanisms for parsing XML documents. For lightweight, single-pass requirements on large documents, SAX will usually win. Here we work through a simple example of how to do it.

If you’re working on a requirement that involves reasonably complicated processing of an XML document, especially one that needs the entire document to be traversed several times, the DOM (Document Object Model) approach will probably be your go-to move.

Loading the entire document into a memory model you can work on at your leisure may feel like a one-size-fits-all solution, but in this case size can quickly become a problem. If you’re working with a large document the heap requirements can be considerable as can the time it takes to read the entire document and build that model.

Enter SAX, the Simple API for XML, which offers an event driven solution for working through an XML document and programmatically responding to just the bits you are interested in. If you only need to process it once, only need to pull a small amount of information out of it and especially if you mightn’t need to read the whole thing it can offer a serious performance boost.
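
The "mightn’t need to read the whole thing" point deserves a quick illustration. One common way to abandon a SAX parse early is to throw an exception from a handler method once you have what you came for. The following is only a sketch under my own assumptions: the FirstTitleFinder class and the <title> element are hypothetical and have nothing to do with the art collection data.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: grab the text of the first <title> element, then stop.
class FirstTitleFinder extends DefaultHandler {

    private final StringBuilder title = new StringBuilder();
    private boolean inTitle = false;
    private boolean found = false;
    private int elementsSeen = 0;

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        elementsSeen++;
        inTitle = "title".equals(qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) {
            title.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if ("title".equals(qName)) {
            found = true;
            // Throwing from a handler method abandons the rest of the parse
            throw new SAXException("Stopping early: title found");
        }
    }

    String getTitle() {
        return title.toString().trim();
    }

    int getElementsSeen() {
        return elementsSeen;
    }

    static FirstTitleFinder find(String xml)
            throws IOException, ParserConfigurationException, SAXException {
        FirstTitleFinder handler = new FirstTitleFinder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            if (!handler.found) {
                throw e; // a genuine parse error, not our deliberate stop
            }
        }
        return handler;
    }
}
```

Calling FirstTitleFinder.find() on a document whose <title> appears near the top means none of the elements after it are ever handed to the handler, which is something a DOM parse cannot offer.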

It’s a crime!

For this example I’m going to don my stripy top and facemask and dust off the big brown sack with swag written on it. The UK Government’s Open Data initiative includes XML data on the Government Art Collection and it just so happens I’ve got a space on my sitting room wall to fill.

Our overlords appear to have been quite busy spending my money on pimping their offices as the master list (Entity.xml) contains over 11,000 artworks in an 18MB file. Before planning my, ahem, “tax refund”, I need to narrow the options down a bit. The space on my lounge wall is quite limited so size would seem a logical place to start.

Each work of art appears in an <Entity/> element which looks something like this:-

<Entity>
  <entity_key>11407</entity_key>
  …
  <object_dimensions>
    height: 39.50 cm, width: 58.00 cm
  </object_dimensions>
  …
</Entity>

So what I need to do is locate each compatible <object_dimensions/> element and get a list of associated <entity_key/> values to look at further. It’s one pass and needs only a little information from the document – perfect for SAX.

Here’s how we do it.

Creating a content handler

SAX parsing is event driven, with a method being invoked for each “thing” found in the XML document. Tidily, all these methods are collected together in a single class, DefaultHandler, which supplies a do-nothing implementation of each one. To do our own XML processing we just need to create a subclass and override the event methods we’re interested in. There are quite a lot of events we can tap into but for most requirements, certainly the one we’re working on here, startElement, endElement and characters are all we need.
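
To get a feel for the order in which these events arrive, here is a minimal sketch of a handler that does nothing but record them. The EventLogger class and its log() helper are my own invention for illustration, and it parses an in-memory string rather than a file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that simply records each event it receives
class EventLogger extends DefaultHandler {

    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the slice ch[start .. start + length) belongs to this event
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("text:" + text);
        }
    }

    // Parses the supplied XML string and returns the recorded events in order
    static List<String> log(String xml)
            throws IOException, ParserConfigurationException, SAXException {
        EventLogger handler = new EventLogger();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }
}
```

Feeding it a cut-down <Entity/> shows a start event for the parent, then start, text and end events for each child, and finally the end event for the parent. That ordering is what the rest of this example leans on.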

Let’s get the boilerplate code out of the way and then look at the implementation of each of these three methods:-

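import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;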
 
public class SwagContentHandler extends DefaultHandler {

    // Constructor assigned
    private int targetWidth = 0;
    private int targetHeight = 0;
    private int tolerance = 0;

    // Parsing state information
    private String currentElement = null;
    private StringBuilder swagId = new StringBuilder();
    private StringBuilder swagDimensions = new StringBuilder();

    // Results holder
    private List<String> results = new ArrayList<String>();

    // Regex matcher for the dimensions string
    private Pattern dimensionsPattern =
        Pattern.compile("height: ([\\d\\.]+) cm, width: ([\\d\\.]+) cm");

    public SwagContentHandler(int targetWidth, int targetHeight, 
            int tolerance) {
        this.targetWidth = targetWidth;
        this.targetHeight = targetHeight;
        this.tolerance = tolerance;
    }

    public List<String> getResults() {
        return results;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {
        super.startElement(uri, localName, qName, attributes);
        // TODO
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, localName, qName);
        // TODO
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        super.characters(ch, start, length);
        // TODO
    }
}

My SwagContentHandler has a constructor to specify the dimensions I’m looking for and I’ve added a list of strings to hold the results along with an accessor getResults() method.

To explain the other fields it’s important to remember again that SAX is event driven – the only context you have when a particular element appears is context that you store yourself. Our source document contains a list of <Entity/> elements containing a number of child elements, only two of which we are interested in. So a sensible way to handle this is to save the values of <object_dimensions/> and <entity_key/> as we encounter them, and then process them when the parent element is closed. In order to do this I need fields to store the current swagDimensions and swagId along with a currentElement so I know which to write values to and when.

Finally I’ve defined a regular expression dimensionsPattern which will match candidate dimension values and let me get at the numbers quickly.
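
To see what that pattern actually captures, here is a small standalone sketch. The DimensionsParser class and its parse() helper are hypothetical and exist only for this illustration; the pattern itself is the one from the handler.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing what the dimensions pattern captures
class DimensionsParser {

    private static final Pattern DIMENSIONS =
        Pattern.compile("height: ([\\d\\.]+) cm, width: ([\\d\\.]+) cm");

    // Returns {height, width} in cm, or null if the text is not in the expected form
    static float[] parse(String raw) {
        Matcher matcher = DIMENSIONS.matcher(raw.trim());
        if (!matcher.matches()) {
            return null;
        }
        try {
            return new float[] {
                Float.parseFloat(matcher.group(1)),  // group 1 is the height
                Float.parseFloat(matcher.group(2))   // group 2 is the width
            };
        } catch (NumberFormatException e) {
            // [\d.]+ also accepts oddities such as "1.2.3" which are not numbers
            return null;
        }
    }
}
```

Note that matches() insists on the whole string fitting the pattern, so a value with extra text or a differently worded measurement is quietly skipped rather than half-parsed.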

Now we can fill in our event handlers. Let’s start at the beginning with startElement which will get invoked each time a new element is encountered. We have two tasks here. If it’s the start of an <Entity/> then I need to reset the values of the fields in which we’re keeping the current entity context and I also need to update currentElement so the other methods know where we are in the document:-

@Override
public void startElement(String uri, String localName, String qName,
        Attributes attributes) throws SAXException {
    super.startElement(uri, localName, qName, attributes);
    currentElement = qName;
    if("Entity".equals(currentElement)) {
        swagId.setLength(0);
        swagDimensions.setLength(0);
    }
}

Next we’ll look at the characters method which gets invoked whenever a block of non-markup text is encountered. What we want to do here is see if the currentElement is one we’re interested in and if so, append the characters to the current value:-

@Override
public void characters(char[] ch, int start, int length)
        throws SAXException {
    super.characters(ch, start, length);
    if("object_dimensions".equals(currentElement)) {
        swagDimensions.append(ch, start, length);
    }
    if("entity_key".equals(currentElement)) {
        swagId.append(ch, start, length);
    }
}

And finally we can write our endElement handler which has only one job to do – detect the closing of an <Entity/> element and if the dimensions are suitable add its identifier to the results. Our regular expression, whilst perhaps a little brittle in practice, helps to keep this reasonably clean:-

@Override
public void endElement(String uri, String localName, String qName)
        throws SAXException {
    super.endElement(uri, localName, qName);
    if("Entity".equals(qName)) {
        Matcher dimensionsMatcher = dimensionsPattern.matcher(
                swagDimensions.toString().trim());
        if(dimensionsMatcher.matches()) {
            Float width = Float.valueOf(dimensionsMatcher.group(2));
            Float height = Float.valueOf(dimensionsMatcher.group(1));
            if((width > targetWidth - tolerance && width < targetWidth + tolerance) &&
                    (height < targetHeight + tolerance && height > targetHeight - tolerance)) {
                results.add(swagId.toString().trim());
            }
        }
    }
}

Note the use of trim() when processing the values we’ve pulled out of the XML – whitespace between an element value and its tags is far from uncommon and will be included in the value string.
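
The whitespace point is easy to check for yourself. The sketch below uses a hypothetical RawTextCollector class of my own that gathers every character the parser reports, untrimmed. It appends to a StringBuilder because a SAX parser is allowed to deliver one run of text across several characters() calls.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects all character data exactly as reported
class RawTextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once for a single run of text, so append
        text.append(ch, start, length);
    }

    // Returns the untrimmed character data found in the supplied XML string
    static String collect(String xml)
            throws IOException, ParserConfigurationException, SAXException {
        RawTextCollector handler = new RawTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }
}
```

Run against an <object_dimensions/> element laid out over three lines like the one in the sample above, the collected value begins with a line break and indentation, and only matches the regular expression once trim() has been applied.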

We’ve now defined the logic that we need to carry out as the SAX parser works its way through our data; now we need to make that happen.

Processing the XML

We’ll create a SwagParser class to parse the XML and with a nod to reusability we’ll decouple it from the content handler implementation by having the caller supply the content handler they want to use in the constructor (we could also pass in the XML file as well in which case it would be totally generic):-

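import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SwagParser {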

    private DefaultHandler contentHandler = null;

    public SwagParser(DefaultHandler contentHandler) {
        this.contentHandler = contentHandler;
    }

    public void processSwag() {
        // TODO
    }
}

Now to implement processSwag() which will parse the XML using the supplied content handler. Anyone at all familiar with Java’s XML support is probably expecting a block of boilerplate code here and won’t be too disappointed:-

public void processSwag() throws
        IOException, ParserConfigurationException, SAXException {
    try (FileInputStream xmlInputStream =
                 new FileInputStream(new File("Entity.xml"))) {
        SAXParserFactory parserFactory = 
                 SAXParserFactory.newInstance();
        SAXParser saxParser = parserFactory.newSAXParser();

        XMLReader xmlReader = saxParser.getXMLReader();
        xmlReader.setContentHandler(contentHandler);

        InputSource xmlSource = new InputSource();
        xmlSource.setByteStream(xmlInputStream);

        xmlReader.parse(xmlSource);
    }
}

First we create an InputStream to read the Entity.xml file. Next we follow the typical pattern of getting a parser factory and creating a parser instance to work with, in this case a SAXParser instance.

We then get the XMLReader from this parser instance and assign our content handler to it so that our event handling methods get invoked.

Finally we create an InputSource to wrap our input stream and pass this as a parameter to the XMLReader’s parse() method. We’re not interested in return values here (parse() is actually a void method anyway) because the SAX parser isn’t building anything for us – instead our content handler will hold the results of our custom event processing when parsing completes.
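
For the "totally generic" variant mentioned earlier, where the caller supplies the XML as well as the handler, a sketch might look like the following. It is not the code used for this article: GenericSaxRunner and ElementCounter are hypothetical names, the secure processing feature is an optional extra I’ve added, and it uses SAXParser’s parse(InputStream, DefaultHandler) convenience overload instead of fetching the XMLReader.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical generic runner: the caller supplies both the XML and the handler
class GenericSaxRunner {

    static void run(InputStream xml, DefaultHandler handler)
            throws IOException, ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Optional hardening (my assumption, not part of the original example)
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        SAXParser parser = factory.newSAXParser();
        // Convenience overload: no XMLReader or InputSource required
        parser.parse(xml, handler);
    }

    // Tiny hypothetical handler used to exercise run(): counts one element name
    static class ElementCounter extends DefaultHandler {

        private final String name;
        private int count = 0;

        ElementCounter(String name) {
            this.name = name;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes attributes) {
            if (name.equals(qName)) {
                count++;
            }
        }

        int getCount() {
            return count;
        }
    }
}
```

The caller stays responsible for opening and closing the stream, for example with try-with-resources as in processSwag(), which keeps the runner free of any knowledge about where the XML lives.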

Planning our heist

We now just need to throw together a bit of code to invoke our SwagParser with our SwagContentHandler. Enter SwagLocator:-

public List<String> locate(int width, int height, int tolerance)
        throws IOException, ParserConfigurationException, SAXException {
    SwagContentHandler contentHandler =
            new SwagContentHandler(width, height, tolerance);
    SwagParser swagParser = new SwagParser(contentHandler);
    swagParser.processSwag();
    return contentHandler.getResults();
}

This one hardly needs explaining; we instantiate a content handler with the appropriate dimensions, then instantiate a parser with this content handler, run the parse and return the results to the caller.

I’m quite picky about my lounge wall so I’m going to look for a canvas 100cm x 50cm with only a 5cm tolerance:-

List<String> mySwag = new SwagLocator().locate(100, 50, 5);

Which yields seven matches warranting further investigation:-

[19525, 23372, 25016, 10215, 27780, 28835, 32286]

Moreover it yields those seven matches in an average of 750ms on my ageing MacBook, which isn’t at all bad for parsing and processing an 18MB XML document.

Now, where did I leave my crowbar…
