Java Tutorial

Using SAX

To process an XML document with SAX, you first have to establish contact with the parser that you want to use. The first step towards this is to create a SAXParserFactory object like this:

SAXParserFactory spf = SAXParserFactory.newInstance();

The SAXParserFactory class is defined in the javax.xml.parsers package along with the SAXParser class that encapsulates a parser. The SAXParserFactory class is abstract but the static newInstance() method will return a reference to an object of a class type that is a concrete implementation of SAXParserFactory. This will be the factory object for creating a particular parser object, normally the default parser that comes with the SDK. To use a different parser, you would need to obtain a reference to a factory object for that parser. We will see how you can arrange to do this a little later in this chapter.

The SAXParserFactory object has methods for determining whether the parser that it will attempt to create will be namespace aware, or will validate the XML as it is parsed:

isNamespaceAware()	Returns true if the parser to be created is namespace aware, and false otherwise.
isValidating()	Returns true if the parser to be created will validate the XML during parsing, and false otherwise.

You can set the factory object to produce namespace aware parsers by calling its setNamespaceAware() method with an argument value of true. An argument of false sets the factory object to produce parsers that are not namespace aware. A parser that is namespace aware recognizes the structure of names in a namespace – with a colon separating the namespace prefix from the name. A namespace aware parser will report the URI and local name separately for each element and attribute. A parser that is not namespace aware will only report an element or attribute name as a single name even when it contains a colon. In other words, a parser that is not namespace aware will treat a colon as just another character that is part of a name.
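To see the difference for yourself, here is a minimal sketch that parses the same one-element document twice, once with a namespace aware parser and once without, and records the names passed to the handler. The class name NamespaceReportDemo, the rootNames() helper, and the sample document are made up for illustration. It relies on the parse() method and the DefaultHandler class, both covered later in this chapter, and it assumes Java 7 or later.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NamespaceReportDemo {
  // Hypothetical sample document, not taken from the text
  static final String XML = "<p:note xmlns:p=\"urn:example:notes\"/>";

  // Parses XML and returns the names the parser reports for the root element
  static String rootNames(boolean namespaceAware) {
    final StringBuilder names = new StringBuilder();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName,
                               String qName, Attributes attrs) {
        names.append("uri=[").append(uri)
             .append("] localName=[").append(localName)
             .append("] qName=[").append(qName).append("]");
      }
    };
    try {
      SAXParserFactory spf = SAXParserFactory.newInstance();
      spf.setNamespaceAware(namespaceAware);       // The setting under test
      spf.newSAXParser().parse(new InputSource(new StringReader(XML)), handler);
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);          // Keeps the sketch short
    }
    return names.toString();
  }

  public static void main(String[] args) {
    System.out.println("Namespace aware:     " + rootNames(true));
    System.out.println("Not namespace aware: " + rootNames(false));
  }
}
```

With the namespace aware parser you should see the URI and the local name reported as separate values. Without namespace awareness, only the qualified name p:note is reliably available.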

Similarly, calling the setValidating() method with an argument of true will cause the factory object to produce parsers that will validate the XML as a document is parsed. A validating parser will verify that the document body has a DTD and that the document content is consistent with the DTD and any internal subset that is included in the DOCTYPE declaration. Of course, if you configure the factory object to create a parser that is either namespace aware or validating, the parser you intend to use must include the capability, otherwise a request to create a parser will fail.

We can now use our SAXParserFactory object to create a SAXParser object as follows:

SAXParser parser = null;
try {
 parser = spf.newSAXParser();
}catch(SAXException e){
  e.printStackTrace(System.err);
  System.exit(1);
} catch(ParserConfigurationException e) {
  e.printStackTrace(System.err);
  System.exit(1);
}

The SAXParser object that is created here will encapsulate the default parser. The newSAXParser() method for the factory object can throw the two exceptions we are catching here. A ParserConfigurationException will be thrown if a parser cannot be created consistent with the configuration determined by the SAXParserFactory object, and a SAXException will be thrown if any other error occurs. For instance, if you call setValidating(true) for the factory object and the parser does not have the capability for validating documents, a ParserConfigurationException would be thrown. This should not arise with the parser supplied by default, though, since it supports both namespace awareness and validation.

The ParserConfigurationException class is defined in the javax.xml.parsers package, but the SAXException class is in the org.xml.sax package so we need a separate import statement for that. Now let's see what the default parser is by putting the code fragments we have looked at so far together in a working example.

Try It Out – Accessing a SAX Parser

Here's the code to create a SAXParser object and output some details about it to the command line:

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
public class TrySAX {
  public static void main(String args[]) {
    // Create factory object
    SAXParserFactory spf = SAXParserFactory.newInstance(); 
    System.out.println("Parser will "+(spf.isNamespaceAware()?"":"not ") + 
                       "be namespace aware");
    System.out.println("Parser will "+(spf.isValidating()?"":"not ") +
                       "validate XML");

    SAXParser parser = null;                          // Stores parser reference
    try {
      parser = spf.newSAXParser();                    // Create parser object
    } catch(ParserConfigurationException e) {
      // Thrown if a parser cannot be created that is
      // consistent with the configuration in spf
      e.printStackTrace(System.err);
      System.exit(1);
    } catch(SAXException e) {                         // Thrown for any other error
      e.printStackTrace(System.err);
      System.exit(1);
    }

    System.out.println("Parser object is: "+ parser);
  }
}

When I ran this I got the output:

Parser will not be namespace aware
Parser will not validate XML
Parser object is: org.apache.xerces.jaxp.SAXParserImpl@fd13b5

How It Works

The output shows that the default configuration for the SAX parser produced by our SAXParserFactory object, spf, will be neither namespace aware nor validating. The parser supplied with the SDK is one that was first developed by Sun and was subsequently donated to the XML Apache Project. It is referred to by the name Crimson. You can find information on the advantages and limitations of this particular SAX parser on the http://xml.apache.org web site.

The code to create the parser works as we have already discussed. Once we have a factory object, we use it to create an object encapsulating the parser. Although the reference is returned as type SAXParser, the object is of type SAXParserImpl, which is a concrete implementation of the abstract SAXParser class for a particular parser.

The Crimson parser is capable of validating XML and can be namespace aware. All we need to do is to specify which of these options we require by calling the appropriate method. We can set the parser configuration for the factory object, spf, so that we get a validating and namespace aware parser by adding two lines to the program:

    // Create factory object
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setNamespaceAware(true);
    spf.setValidating(true);

If you compile and run the code again, you should get output something like:

Parser will be namespace aware
Parser will validate XML
Parser object is: org.apache.xerces.jaxp.SAXParserImpl@13582d

We arrive at a SAXParser instance without tripping any exceptions and we clearly now have a namespace aware and validating parser.

Using a Different Parser

You might like to try a different parser at this point. There are SAX parsers available from a number of sources but the Xerces parser produced by the XML Apache Project is easy and free to obtain, and a snap to set up. As well as supporting SAX version 2 (SAX2), it also supports DOM level 2 (DOM2). You can download the latest version, currently Xerces 2, from the download page on their web site at http://xml.apache.org. The binaries are distributed in a .zip file that you can unzip to a suitable location on your hard drive – the archive unzips to create its own directory structure. You will find everything you need in there, including documentation.

The simplest way to try out an alternative parser without making it a permanent selection over the default is to include the path to the .jar file that contains the parser implementation in the -classpath option on the command line. For instance, if you have downloaded the Xerces 2 parser from the Apache web site and extracted the file from the zip directly to your C:\ drive, you can run the example with the Xerces parser like this:

java -classpath .;C:\xerces-2_0_0\xercesImpl.jar -enableassertions TrySAX

Don't forget the period in the classpath definition that specifies the current directory. Without it the TrySAX.class file will not be found. If you omit the -classpath option, the program will revert to using the default parser. Of course, you can use this technique to select a particular parser when you have several installed on your PC. Just add the path to the JAR file for the parser you want to the classpath.
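When you switch parsers through the classpath, it helps to confirm which implementation was actually picked up. The following is a minimal sketch that reports the concrete class behind the factory and the parser, together with where each class was loaded from. The class name WhichParser and the describe() helper are hypothetical. Classes that belong to the Java runtime itself may have no code source, so the sketch allows for that case.

```java
import java.security.CodeSource;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

class WhichParser {
  // Returns the class name of an object and the location its class came from
  static String describe(Object o) {
    Class<?> type = o.getClass();
    CodeSource source = type.getProtectionDomain().getCodeSource();
    String location = (source == null || source.getLocation() == null)
                    ? "the runtime's own classes"   // No separate JAR to report
                    : source.getLocation().toString();
    return type.getName() + " loaded from " + location;
  }

  public static void main(String[] args) {
    SAXParserFactory spf = SAXParserFactory.newInstance();
    System.out.println("Factory: " + describe(spf));
    try {
      System.out.println("Parser:  " + describe(spf.newSAXParser()));
    } catch (ParserConfigurationException | SAXException e) {
      e.printStackTrace(System.err);
      System.exit(1);
    }
  }
}
```

Running this with and without the extra JAR in the -classpath option should show the reported class names and locations change accordingly.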

If you want to make the choice of the Xerces parser more permanent, you can copy the xercesImpl.jar file to the ext directory for the JRE. This will be the jdk1.4\jre\lib\ext directory. A JAR containing a parser in the ext directory will always be found before the default parser.

Updating the Default Parser

The Crimson parser that comes with the JDK is developed independently by the Apache Project, so there may well be newer releases that you want to install, if only to pick up bug fixes. You can override the default parser by placing a .jar archive containing the Crimson release you want to use in the directory jdk1.4\jre\lib\endorsed. Indeed you can use this directory to override any of the externally developed packages that are distributed with the SDK.

Parser Features and Properties

Specific parsers such as Xerces define their own features and properties that control and report on the processing of XML documents. A feature is an option that is either on or off, so it is set as a boolean value, either true or false. Namespace awareness and validating capability are both features of a parser. A property is an option with a value that is an object, usually a String object. Some properties have values that you set to influence the parser's operation while the values for other properties are set by the parser for you to retrieve to provide information about the parsing process.

You will find details of the features and properties supported by the Xerces 2 parser in the documentation that appears in the /doc directory that was created when you unzipped the Xerces archive.

Features of a Parser

As we have seen, namespace awareness and whether a parser is validating or not are both features, and these can be set by calling either setNamespaceAware() or setValidating() for the SAXParserFactory object, but a parser will typically have a number of other features. You can set other features for a parser by calling the setFeature() method for the SAXParserFactory object before you create the SAXParser instance. The first argument is a String that identifies a particular feature and the second argument is a boolean value that you specify as true or false to set the feature on or off. Of course, all features that you want must be set before you create the parser object. You will find details of the features that a SAX parser may have at http://www.saxproject.org.

Here's how we might set the parser configuration to use string interning so that fast comparisons for string equality can be used by the parser:

try {
  spf.setFeature("http://xml.org/sax/features/string-interning", true);
} catch(ParserConfigurationException e) { // Serious parser configuration error
  e.printStackTrace();
  System.exit(1);
} catch(SAXNotRecognizedException e) {    // Feature name not recognized
  e.printStackTrace();
  System.exit(1);
} catch(SAXNotSupportedException e) {     // Feature recognized but not supported
  e.printStackTrace();
  System.exit(1);
}

The SAXNotRecognizedException and SAXNotSupportedException classes are subclasses of SAXException defined in the org.xml.sax package. You could therefore catch either of these with a single catch block for an exception object of type SAXException. With explicit catch blocks as we have here, you would need an import statement for each of the class names.
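As a sketch of the single catch block alternative, the helper below sets a feature and reports success or failure, catching both SAX exception types through their common superclass. The class name FeatureSetter and the trySetFeature() method are made up for illustration, and only an import for SAXException is needed for the SAX side.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

class FeatureSetter {
  // Attempts to set a feature on the factory; returns true if it was accepted
  static boolean trySetFeature(SAXParserFactory spf, String featureUri,
                               boolean state) {
    try {
      spf.setFeature(featureUri, state);
      return true;
    } catch (SAXException e) {
      // Covers SAXNotRecognizedException and SAXNotSupportedException
      System.err.println("Feature rejected: " + featureUri
                         + " (" + e.getClass().getSimpleName() + ")");
      return false;
    } catch (ParserConfigurationException e) {
      System.err.println("Parser configuration error: " + e.getMessage());
      return false;
    }
  }

  public static void main(String[] args) {
    SAXParserFactory spf = SAXParserFactory.newInstance();
    String feature = "http://xml.org/sax/features/string-interning";
    System.out.println(feature + " accepted: "
                       + trySetFeature(spf, feature, true));
  }
}
```

The trade-off is that the single catch block no longer distinguishes an unknown feature from an unsupported one, which is why the sketch prints the exception class name.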

There is no set collection of features for a SAX2 parser, so a parser may implement any number of arbitrary features. Although support is not mandatory, there is a standard set of features that most, if not all, SAX2 parsers are likely to support. They all have names of the form http://xml.org/sax/features/name, and you can find details of these on the official web site for SAX noted earlier.

If you need to check whether a particular feature is set (you might want to check the default status, for instance), you just call the getFeature() method for the SAXParserFactory object with a reference to a string containing the URI for the feature. The method returns a boolean value indicating the status of the feature. Note that it can throw the same exceptions as the setFeature() method so you have to put the call in a try block.
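Here is a minimal sketch of such a check. It queries the standard SAX2 namespaces feature, http://xml.org/sax/features/namespaces, and turns each possible outcome into a short status string. The class name FeatureQuery and the featureStatus() helper are hypothetical names chosen for this example.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

class FeatureQuery {
  static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

  // Reports the status of a feature as a short descriptive string
  static String featureStatus(SAXParserFactory spf, String featureUri) {
    try {
      return spf.getFeature(featureUri) ? "on" : "off";
    } catch (SAXNotRecognizedException e) {   // Feature name not recognized
      return "not recognized";
    } catch (SAXNotSupportedException e) {    // Recognized but not supported
      return "not supported";
    } catch (ParserConfigurationException e) {
      return "configuration error";
    }
  }

  public static void main(String[] args) {
    SAXParserFactory spf = SAXParserFactory.newInstance();
    System.out.println("Default:    " + featureStatus(spf, NAMESPACES));
    spf.setNamespaceAware(true);
    System.out.println("After set:  " + featureStatus(spf, NAMESPACES));
  }
}
```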

Properties of a Parser

You can set the properties for a parser by calling the setProperty() method for the SAXParser object once you have created it. The first argument is the name of the property as type String and the second argument is the value for the property. The property value can be of any class type as the parameter type is Object but it is usually of type String. The setProperty() method will throw a SAXNotRecognizedException if the property name is not recognized or SAXNotSupportedException if the property name is recognized but not supported. Both of these exception classes are defined in the org.xml.sax package.

You can also retrieve the values for some properties during parsing to obtain additional information about the most recent parsing event. You use the parser's getProperty() method in this case. The argument to the method is the name of the property and the method returns a reference to the property's value.
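The mechanics of setProperty() and getProperty() can be sketched as follows. The example uses the SAX2 lexical-handler property purely as a convenient property whose value is an object. What the handler itself does is outside the scope of this discussion, and whether a given parser supports the property is an assumption you should check against its documentation. The class name PropertyDemo and the roundTrip() helper are made up for illustration.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.ext.DefaultHandler2;

class PropertyDemo {
  static final String LEXICAL_HANDLER =
      "http://xml.org/sax/properties/lexical-handler";

  // Sets the property on a new parser and checks that the same object
  // comes back from getProperty(); returns false if the parser rejects it
  static boolean roundTrip(Object value) {
    try {
      SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
      parser.setProperty(LEXICAL_HANDLER, value);
      return parser.getProperty(LEXICAL_HANDLER) == value;
    } catch (SAXException e) {
      // Includes SAXNotRecognizedException and SAXNotSupportedException
      System.err.println("Property rejected: " + e.getMessage());
      return false;
    } catch (ParserConfigurationException e) {
      System.err.println("Parser configuration error: " + e.getMessage());
      return false;
    }
  }

  public static void main(String[] args) {
    // DefaultHandler2 is a do-nothing handler that implements LexicalHandler
    System.out.println("Property accepted: " + roundTrip(new DefaultHandler2()));
  }
}
```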

As with features, there is no defined set of parser properties so you need to consult the parser documentation for information on these. At the time of writing there are four standard properties for a SAX parser, but since these involve the more advanced features of SAX parser operation, they are beyond the scope of this book.

Parsing Documents with SAX

To parse a document using the SAXParser object you simply call its parse() method. You have to supply two arguments to the parse() method. The first identifies the XML document and the second is a reference of type DefaultHandler to a handler object that you will have created to process the contents of the document. The DefaultHandler object must contain a specific set of public methods that the SAXParser object expects to be able to call for each event, where each type of event corresponds to a particular syntactic element it finds in the document.

The DefaultHandler class that is defined in the org.xml.sax.helpers package already contains do-nothing definitions of all the callback methods that the SAXParser object expects to be able to call. Thus all you have to do is to define a class that extends the DefaultHandler class, and then override the methods in the DefaultHandler class for the events that you are interested in. But let's not gallop too far ahead. We need to look into the versions of the parse() method that we have available before we get into handling parsing events.

The SAXParser class defines ten overloaded versions of the parse() method but we are only interested in five of them. The other five use a now deprecated handler type, HandlerBase, that was applicable to SAX1, so we shall ignore those and just look at the versions that relate to SAX2. All versions of the method have a return type of void, and the five varieties of the parse() method that we will consider are:

parse(File aFile, DefaultHandler handler)	Parses the document in the file specified by aFile using handler as the object containing the callback methods called by the parser. This will throw an exception of type IOException if an I/O error occurs and of type IllegalArgumentException if aFile is null.
parse(String uri, DefaultHandler handler)	Parses the document specified by uri using handler as the object defining the callback methods. This will throw an exception of type IllegalArgumentException if uri is null, an exception of type IOException if an I/O error occurs, and an exception of type SAXException if a parsing error occurs.
parse(InputStream input, DefaultHandler handler)	Parses input as the source of the XML with handler as the event handler. This will throw an exception of type IOException if an I/O error occurs, and of type IllegalArgumentException if input is null.
parse(InputStream input, DefaultHandler handler, String systemID)	Parses input as above, but uses systemID to resolve any relative URIs.
parse(InputSource source, DefaultHandler handler)	Parses the document specified by source using handler as the object providing the callback methods to be called by the parser.

The InputSource class is defined in the org.xml.sax package. It defines an object that wraps a variety of sources for an XML document that you can use to pass a document reference to a parser. You can create an InputSource object from an InputStream object, a Reader object encapsulating a character stream, or a String specifying a URI – either a public name or a URL. If you specify the document source as a URL, it must be fully qualified.
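As a minimal sketch of the parse() version that takes an InputSource, the class below wraps a String in a StringReader, hands it to the parser, and counts the elements in the document. The class name ElementCounter and the countElements() helper are hypothetical, and the handler overrides only the one callback it needs from DefaultHandler. The sketch assumes Java 7 or later.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ElementCounter extends DefaultHandler {
  private int count;                       // Number of start tags seen so far

  @Override
  public void startElement(String uri, String localName,
                           String qName, Attributes attrs) {
    count++;                               // Called once for each element
  }

  // Parses the XML text supplied and returns the number of elements found
  static int countElements(String xml) {
    ElementCounter handler = new ElementCounter();
    try {
      SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
      // Wrap a character stream in an InputSource for the parser
      parser.parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);  // Keeps the sketch short
    }
    return handler.count;
  }

  public static void main(String[] args) {
    String xml = "<list><item/><item/></list>";   // Made-up sample document
    System.out.println("Elements found: " + countElements(xml));
  }
}
```

Reading from a Reader rather than a URI avoids any need for a fully qualified URL, which makes this form convenient for XML that is already in memory.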

Implementing a SAX Handler

As we have already seen, the DefaultHandler class in the org.xml.sax.helpers package provides a default do-nothing implementation of each of the callback methods a SAX parser may call when parsing a document. These methods are declared in four interfaces that are implemented by the DefaultHandler class:

The ContentHandler interface declares methods that will be called to identify the content of a document to an application. You will usually want to implement all the methods defined in this interface in your subclass of DefaultHandler.
The EntityResolver interface declares one method, resolveEntity(), that is called by a parser to pass a public and/or system ID to your application to allow external entities in the document to be resolved.
The DTDHandler interface declares two methods that will be called to notify your application of DTD-related events.
The ErrorHandler interface defines three methods that will be called when the parser has identified an error of some kind in the document.

All four interfaces are defined in the org.xml.sax package. Of course you can define a handler class that implements these interfaces, but you will probably find it easier to extend the DefaultHandler class.
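To illustrate the ErrorHandler side of DefaultHandler, here is a minimal sketch of a handler that collects the problems a validating parser reports. The three methods it overrides, warning(), error(), and fatalError(), are the ones declared by ErrorHandler. The class name ValidationReport, the validate() helper, and the sample documents are made up for illustration, and the sketch assumes Java 7 or later.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ValidationReport extends DefaultHandler {
  private final List<String> problems = new ArrayList<>();

  @Override
  public void warning(SAXParseException e) {
    problems.add("warning: " + e.getMessage());
  }

  @Override
  public void error(SAXParseException e) {      // Validity errors arrive here
    problems.add("error: " + e.getMessage());
  }

  @Override
  public void fatalError(SAXParseException e) throws SAXException {
    problems.add("fatal: " + e.getMessage());
    throw e;                                    // Abandon parsing
  }

  // Parses xml with a validating parser and returns the problems reported
  static List<String> validate(String xml) {
    ValidationReport handler = new ValidationReport();
    try {
      SAXParserFactory spf = SAXParserFactory.newInstance();
      spf.setValidating(true);
      spf.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
    } catch (SAXException e) {
      // Parsing was abandoned; fatalError() has already recorded the cause
      // when the parser reported one
    } catch (ParserConfigurationException | IOException e) {
      throw new IllegalStateException(e);
    }
    return handler.problems;
  }

  public static void main(String[] args) {
    // The DTD says <a> must be empty, but this document puts <b/> inside it
    String invalid = "<!DOCTYPE a [<!ELEMENT a EMPTY>]><a><b/></a>";
    for (String problem : validate(invalid)) {
      System.out.println(problem);
    }
  }
}
```

Because the inherited error() method does nothing, a validating parser would otherwise carry on past validity errors without telling you, so overriding it is usually worthwhile.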

The basic methods that you must implement to deal with parsing events related to document content are those declared by the ContentHandler interface, so let's concentrate on those first. All the methods have a void return type, and they are as follows:

startDocument()	Called when the start of a document is recognized.
endDocument()	Called when the end of a document is recognized.
startElement(String uri,