We have seen several small examples of XML and in each case it was fairly obvious what the content was meant to represent, but where are the rules that ensure such data is represented consistently and correctly in different documents? Do the <radius> and <position> elements have to be in that sequence in a <circle> element and could we omit one or other of them?
Clearly there has to be a way to determine what is correct and what is incorrect for any particular element in a document. As we mentioned earlier, a Document Type Definition (DTD) defines how valid elements are constructed for a particular type of document, so the XML for purchase order documents in a company could be defined by one DTD, and sales invoice documents by another. The document type definition for a document is specified in a document type declaration, commonly known as a DOCTYPE declaration, that appears in the document prolog following any XML declaration. A DTD essentially defines a vocabulary for describing data of a particular kind; in other words, the set of elements that you use to identify the data. It also defines the possible relationships between these elements, that is, how they can be nested. The contents of a document of the type identified by a particular DTD must be defined and structured according to the rules that make up the DTD. Any document of a given type can be checked for validity against its DTD.
A DTD can be an integral part of a document but it is usually, and more usefully, defined separately. Including a DTD in an XML document makes the document self-contained, but it does increase its bulk. It also means that the DTD has to appear within each document of the same type. A separate DTD that is external to a document avoids this and provides a single reference point for all documents of a particular type. An external DTD also makes maintenance of the DTD for a document type easier as it only needs to be changed in one place for all documents that make use of it. Let's look at how we identify the DTD for a document and then investigate some of the ways in which elements and their attributes can be defined in a DTD.
You use a document type declaration (a DOCTYPE declaration) in the prolog of an XML document to specify the DTD for the document. An XML 1.0 document can only have one DOCTYPE declaration. You can include the markup declarations for elements used in the document explicitly within the DOCTYPE statement, in which case the declarations are referred to as the internal subset. You can also specify a URI that identifies the DTD for the document, usually in the form of a URL. In this case the set of declarations is referred to as the external subset. If you include explicit declarations as well as a URI referencing an external DTD, the document has both an internal and an external subset. Here is an example of an XML document that has an external subset:
The name following the DOCTYPE keyword must always match the root element name in the document so the DOCTYPE declaration here indicates that the root element in the document has the name address. The declaration also indicates that the DTD in which this and the other elements in the document are declared is an external DTD located at the URI following the SYSTEM keyword. This URI, which is invariably a URL, is called the system ID for the DTD.
In principle you can also specify an external DTD by a public ID, using the keyword PUBLIC in place of SYSTEM. A public ID is just a unique public name that identifies the DTD, much as a URN does. As you probably know, the idea behind URNs is to get over the problem of changes to URLs. Public IDs are intended for DTDs that are available as public standards for documents of particular types, such as SVG. However, there is a slight snag. No mechanism is defined for resolving a public ID to find the corresponding URL, so if you specify a public ID you still have to supply a system ID with a URL so that the XML processor can find the DTD. As a result, you won't see public IDs in use much.
If the file containing the DTD is stored on the local machine, you can specify its location relative to the directory containing the XML document. For example, the following DOCTYPE declaration implies the DTD is in the same directory as the document itself:
<!DOCTYPE address SYSTEM "AddressDoc.dtd">
The AddressDoc.dtd file includes definitions for the elements that may be included in a document containing an address. In general a relative URL is assumed to be relative to the location of the document containing the reference.
In looking at the details of how we put a DTD together we will use examples where the DTD is an internal subset, but the declarations in an external DTD are exactly the same. Here's an example of a document with an integral DTD:
<?xml version="1.0"?>
<!DOCTYPE proverb [
   <!ELEMENT proverb (#PCDATA)>
]>
<proverb>A little knowledge is a dangerous thing.</proverb>
All the internal definitions for elements used within the document appear between the square brackets in the DOCTYPE declaration. In this case there is just one element declared, the root element, and the element content is PCDATA – parsed character data.
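To see a DTD being put to work, here is a minimal sketch that asks a JAXP parser to validate the proverb document against its internal subset. The class name ProverbValidation and the method validityErrors() are our own inventions for illustration, as is the second, deliberately invalid document containing an undeclared <author> element. The sketch assumes the default parser that ships with the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class for illustration; not part of JAXP or Sketcher.
final class ProverbValidation {
    static final String DOCTYPE = "<!DOCTYPE proverb [ <!ELEMENT proverb (#PCDATA)> ]>";
    static final String VALID = "<?xml version=\"1.0\"?>" + DOCTYPE
            + "<proverb>A little knowledge is a dangerous thing.</proverb>";
    // <author> is not declared in the DTD, so this document is well-formed but invalid.
    static final String INVALID = "<?xml version=\"1.0\"?>" + DOCTYPE
            + "<proverb><author>Anon</author></proverb>";

    /** Parses the document with DTD validation switched on and collects the validity errors. */
    static List<String> validityErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* not needed here */ }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Document is not well-formed or could not be read", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("Errors in valid document:   " + validityErrors(VALID));
        System.out.println("Errors in invalid document: " + validityErrors(INVALID));
    }
}
```

Validity problems are reported through the error() method of the ErrorHandler, whereas a document that is not well-formed causes fatalError() to be called, which is why the sketch treats the two differently.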
We could define an external DTD in a file with the name proverbDoc.dtd in the same directory as the document. The file would contain just a single line:
<!ELEMENT proverb (#PCDATA)>
The XML document would then be:
<?xml version="1.0"?>
<!DOCTYPE proverb SYSTEM "proverbDoc.dtd">
<proverb>A little knowledge is a dangerous thing.</proverb>
The DTD is referenced by a relative URI that is relative to the directory containing the document.
When you want to have both an internal and an external subset you just put both in the DOCTYPE declaration with the external DTD reference appearing first. Entities from both are available for use in the document but where there is any conflict between them the entities defined in the internal subset take precedence over those declared in the external subset.
The syntax for defining elements and their attributes is rather different from the syntax for XML markup. It also can get quite complex so we won't be able to go into it comprehensively here. However, we do need to have a fair idea of how a DTD is put together in order to understand the operation of the Java API for XML, so let's look at some of the ways in which we can define elements in a DTD.
The DTD will define each type of element that can appear in the document using an ELEMENT type declaration. For example, the <address> element could be defined like this:
<!ELEMENT address (buildingnumber, street, city, state, zip)>
This defines the element with the name address. The information between the parentheses specifies what can appear within an <address> element. The definition states that an <address> element contains exactly one each of the elements <buildingnumber>, <street>, <city>, <state>, and <zip> in that sequence. This is an example of element content since only elements are allowed within an <address> element. Note the space that appears between the element name and the parentheses enclosing the content definition. This is required, and a parser will flag the absence of at least one space here as an error. The ELEMENT identifier must be in capital letters and must immediately follow the opening "<!".
The definition of the <address> above makes no provision for anything other than the five elements shown, and in that sequence. Any whitespace that you put between these elements in a document is therefore not part of the content and will be ignored by a parser, and therefore it is known as ignorable whitespace. That said, you can still find out if there is whitespace there when the document is parsed, as we shall see.
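As a taste of how ignorable whitespace shows up when a document is parsed, here is a small sketch using a SAX parser. It relies on the ignorableWhitespace() callback of the standard DefaultHandler class; the class name IgnorableWhitespaceCount, the method countCharacters(), and the cut-down two-element address DTD are assumptions made just for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class for illustration only.
final class IgnorableWhitespaceCount {
    // A cut-down address with element content, laid out with newlines and indentation.
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE address [\n"
        + "  <!ELEMENT address (street, city)>\n"
        + "  <!ELEMENT street (#PCDATA)>\n"
        + "  <!ELEMENT city (#PCDATA)>\n"
        + "]>\n"
        + "<address>\n  <street>South Lasalle Street</street>\n  <city>Chicago</city>\n</address>";

    /** Returns { characters reported as ignorable whitespace, characters reported as content }. */
    static int[] countCharacters(String xml) {
        int[] counts = new int[2];
        DefaultHandler handler = new DefaultHandler() {
            @Override public void ignorableWhitespace(char[] ch, int start, int length) {
                counts[0] += length; // whitespace between child elements of <address>
            }
            @Override public void characters(char[] ch, int start, int length) {
                counts[1] += length; // text inside <street> and <city>
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // the parser needs the DTD to know which whitespace is ignorable
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Document could not be parsed", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countCharacters(XML);
        System.out.println("Ignorable whitespace characters: " + counts[0]);
        System.out.println("Content characters: " + counts[1]);
    }
}
```

The parser can only classify whitespace as ignorable because the DTD tells it that <address> has element content; without the DTD, the same whitespace would simply be character data.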
We can define the <buildingnumber> element like this:
<!ELEMENT buildingnumber (#PCDATA)>
This states that the element can only contain parsed character data, specified by #PCDATA. This is just ordinary text, and since it will be parsed, it cannot contain markup. The # character preceding the word PCDATA is necessary just to ensure it cannot be confused with an element or attribute name – it has no other significance. Since element and attribute names must start with a letter or an underscore, the # prefix to PCDATA ensures that it cannot be interpreted as such.
The PCDATA specification does provide for markup – child elements – to be mixed in with ordinary text. In this case you must specify the names of the elements that can occur mixed in with the text. If you wanted to allow a <suite> element specifying a suite number to appear alongside the text within a <buildingnumber> element you could express it like this:
<!ELEMENT buildingnumber (#PCDATA|suite)*>
This indicates that the content for a <buildingnumber> element is parsed character data and the text can be combined with <suite> elements. The | operator here has the same meaning as the | operator we met in the context of regular expressions in Chapter 13. It means one or other of the two operands but not both. The * following the parentheses is required here, and has the same meaning as the * operator that we also met in the context of regular expressions. It means that the operand to the left can appear zero or more times.
If you want to allow several element types to be optionally mixed in with the text, you separate them by |. Note that it is not possible to control the sequence in which mixed content appears.
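The following sketch illustrates mixed content by validating three small documents against the declaration above. The class name MixedContentCheck, the declaration of <suite> as (#PCDATA), and the undeclared <floor> element in the last document are all assumptions introduced for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class for illustration only.
final class MixedContentCheck {
    static final String PROLOG = "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE buildingnumber ["
            + " <!ELEMENT buildingnumber (#PCDATA|suite)*>"
            + " <!ELEMENT suite (#PCDATA)>"   // assumed declaration for <suite>
            + "]>";
    static final String TEXT_THEN_SUITE = PROLOG
            + "<buildingnumber>29 <suite>4B</suite></buildingnumber>";
    // The sequence of text and child elements is not constrained in mixed content.
    static final String SUITE_FIRST = PROLOG
            + "<buildingnumber><suite>4B</suite> 29 <suite>5</suite></buildingnumber>";
    // <floor> is not among the elements permitted in the mix.
    static final String UNDECLARED_CHILD = PROLOG
            + "<buildingnumber>29 <floor>2</floor></buildingnumber>";

    /** Parses with DTD validation on and returns the validity error messages. */
    static List<String> validityErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* not needed here */ }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Document could not be parsed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("Text then suite:  " + validityErrors(TEXT_THEN_SUITE));
        System.out.println("Suite first:      " + validityErrors(SUITE_FIRST));
        System.out.println("Undeclared child: " + validityErrors(UNDECLARED_CHILD));
    }
}
```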
The other elements used to define an address are similar, so we could define the whole document with its DTD like this:
<?xml version="1.0"?>
<!DOCTYPE address [
   <!ELEMENT address (buildingnumber, street, city, state, zip)>
   <!ELEMENT buildingnumber (#PCDATA)>
   <!ELEMENT street (#PCDATA)>
   <!ELEMENT city (#PCDATA)>
   <!ELEMENT state (#PCDATA)>
   <!ELEMENT zip (#PCDATA)>
]>
<address>
  <buildingnumber> 29 </buildingnumber>
  <street> South Lasalle Street</street>
  <city>Chicago</city>
  <state>Illinois</state>
  <zip>60603</zip>
</address>
One point to note is that we have no way to constrain the text in an element definition. It would be nice to be able to specify that the building number had to be numeric, for example, but the DTD grammar and syntax provide no way to do this. This is a serious limitation of DTDs and one of the driving forces behind the development of an alternative, XML Schemas. Schemas are beyond the scope of this book, but if you want to know more you should get hold of a copy of Pro XML Schemas by Jon Duckett, Oliver Griffin, et al., Wrox Press Ltd. (ISBN 1-861005-47-4).
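If we were to create the DTD for an address document as a separate file, the file contents would just consist of the element definitions:
<!ELEMENT address (buildingnumber, street, city, state, zip)>
<!ELEMENT buildingnumber (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zip (#PCDATA)>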
The DOCTYPE declaration identifies the DTD for a particular document so it is not part of the DTD. If the DTD above were stored in the AddressDoc.dtd file in the same directory as the document, the DOCTYPE declaration in the document would be:
<?xml version="1.0"?>
<!DOCTYPE address SYSTEM "AddressDoc.dtd">
<address>
  <buildingnumber> 29 </buildingnumber>
  <street> South Lasalle Street</street>
  <city>Chicago</city>
  <state>Illinois</state>
  <zip>60603</zip>
</address>
Of course, the DTD file would also include definitions for element attributes, if there were any. These will be useful later, so save the DTD as AddressDoc.dtd, and the XML file above (as Address.xml perhaps), in your Beg Java Stuff directory.
One further possibility we need to consider is that in many situations it is desirable to allow some child elements to be omitted. For instance, <buildingnumber> may not be included in some cases. The <zip> element, while highly desirable, might also be left out in practice. We can indicate that an element is optional by using the cardinality operator, ?. This operator expresses the same idea as the equivalent regular expression operator – that a child element may or may not appear. The DTD would then look like this:
<!DOCTYPE address [
   <!ELEMENT address (buildingnumber?, street, city, state, zip?)>
   <!ELEMENT buildingnumber (#PCDATA)>
   <!ELEMENT street (#PCDATA)>
   <!ELEMENT city (#PCDATA)>
   <!ELEMENT state (#PCDATA)>
   <!ELEMENT zip (#PCDATA)>
]>
The ? operator following an element indicates that the element may be omitted or may appear just once. This is just one of three cardinality operators that you use to specify how many times a particular child element can appear as part of the content for the parent. The other two cardinality operators are *, which we have already seen, and +. In each case the operator follows the operand to which it applies. We now have four operators that we can use in element declarations and they are each similar in action to their equivalent in the regular expression context:
+   This operator indicates that there can be one or more occurrences of its operand. In other words there must be at least one occurrence, but there may be more.
*   This operator indicates that there can be zero or more occurrences of its operand. In other words, there can be none or any number of occurrences of the operand to which it applies.
?   This indicates that its operand may appear once or not at all.
|   This operator indicates that there can be an occurrence of either its left operand or its right operand, but not both.
We might want to allow a building number or a building name in an address, in which case the DTD could be written:
<!DOCTYPE address [
   <!ELEMENT address ((buildingnumber | buildingname), street, city, state, zip?)>
   <!ELEMENT buildingnumber (#PCDATA)>
   <!ELEMENT buildingname (#PCDATA)>
   <!ELEMENT street (#PCDATA)>
   <!ELEMENT city (#PCDATA)>
   <!ELEMENT state (#PCDATA)>
   <!ELEMENT zip (#PCDATA)>
]>
The DTD now states that either <buildingnumber> or <buildingname> must appear as the first element in <address>. But we might want to allow neither, in which case we would write the third line as:
<!ELEMENT address ((buildingnumber | buildingname)?, street, city, state, zip?)>
The ? operator applies to the parenthesized expression (buildingnumber | buildingname), so it now states that either <buildingnumber> or <buildingname> may or may not appear, so we allow one, or the other, or none.
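To check our reading of the ? and | operators, the sketch below validates four addresses against this content model: two that should be accepted and two that should not. The class name AddressContentModel, the method names, and the sample data (including the building name) are assumptions introduced for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class for illustration only.
final class AddressContentModel {
    static final String PROLOG = "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE address ["
            + " <!ELEMENT address ((buildingnumber | buildingname)?, street, city, state, zip?)>"
            + " <!ELEMENT buildingnumber (#PCDATA)> <!ELEMENT buildingname (#PCDATA)>"
            + " <!ELEMENT street (#PCDATA)> <!ELEMENT city (#PCDATA)>"
            + " <!ELEMENT state (#PCDATA)> <!ELEMENT zip (#PCDATA)>"
            + "]>";
    static final String REST = "<street>South Lasalle Street</street><city>Chicago</city>"
            + "<state>Illinois</state>";

    // Neither building element and no zip: allowed because both are optional.
    static final String MINIMAL = PROLOG + "<address>" + REST + "</address>";
    // A building name plus a zip: allowed.
    static final String NAMED = PROLOG + "<address><buildingname>The Rookery</buildingname>"
            + REST + "<zip>60603</zip></address>";
    // Both alternatives together: rejected, since | means one or the other.
    static final String BOTH = PROLOG + "<address><buildingnumber>29</buildingnumber>"
            + "<buildingname>The Rookery</buildingname>" + REST + "</address>";
    // No <street>: rejected, since street has no cardinality operator and is mandatory.
    static final String NO_STREET = PROLOG
            + "<address><city>Chicago</city><state>Illinois</state></address>";

    /** Parses with DTD validation on and returns the validity error messages. */
    static List<String> validityErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* not needed here */ }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Document could not be parsed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("Minimal:   " + validityErrors(MINIMAL));
        System.out.println("Named:     " + validityErrors(NAMED));
        System.out.println("Both:      " + validityErrors(BOTH));
        System.out.println("No street: " + validityErrors(NO_STREET));
    }
}
```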
Of course, you can use the | operator repeatedly to express a choice between any number of elements, or indeed, subexpressions between parentheses. For example, given that you have defined elements Linux, Solaris, and Windows, you might define the element operatingsystem as:
<!ELEMENT operatingsystem (Linux | Solaris | Windows)>
If you wanted to allow an arbitrary operating system to be identified as a further alternative, you could write:
<!ELEMENT operatingsystem (AnyOS | Linux | Solaris | Windows)>
<!ELEMENT AnyOS (#PCDATA)>
You can combine the operators we have seen to define quite complicated content. For example:
<!ELEMENT breakfast ((tea|coffee), orangejuice?, ((egg+, (bacon|sausage)) | cereal), toast)>
This states that <breakfast> content is either a <tea> or <coffee> element, followed by an optional <orangejuice> element, followed by either one or more <egg> elements and a <bacon> or <sausage> element, or a <cereal> element, with a mandatory <toast> element bringing up the rear. However, while you can produce mind-boggling productions for defining elements it is wise to keep things as simple as possible.
After all this complexity, we mustn't forget that an element may also be empty, in which case it can be defined like this:
<!ELEMENT position EMPTY>
This states that the <position> element has no content. Elements can also have attributes so let's take a quick look at how they can be defined in a DTD.
You use an ATTLIST declaration in a DTD to define the attributes for a particular element. As you know, attributes are name-value pairs associated with a particular element and values are typically, but not exclusively, text. Where the value for an attribute is text, it is enclosed between quotation marks, so it is always unparsed character data. Attribute values that consist of text are therefore specified just as CDATA. No preceding # character is necessary in this context since there is no possibility of confusion.
We could declare the elements for a document containing circles as follows:
<?xml version="1.0"?>
<!DOCTYPE circle [
   <!ELEMENT circle (position)>
   <!ATTLIST circle
             radius CDATA #REQUIRED
   >
   <!ELEMENT position EMPTY>
   <!ATTLIST position
             x CDATA #REQUIRED
             y CDATA #REQUIRED
   >
]>
<circle radius="15">
  <position x="30" y="50"/>
</circle>
Three items define each attribute – the attribute name, the type of value (CDATA), and whether or not the attribute is mandatory. This third item may also define a default value for the attribute, in which case this value will be assumed if the attribute is omitted. The #REQUIRED specification against an attribute name indicates that it must appear in the corresponding element. You specify the attribute as #IMPLIED if it need not be included. In this case the XML processor will not supply a default value for the attribute. An application is expected to have a default value of its own for the attribute value that is implied by the attribute's omission.
Save this XML in your /Beg Java Stuff directory with a suitable name such as "circle with DTD.xml"; it will come in handy later.
You specify a default value for an attribute between double quotes. For example:
<!ATTLIST circle radius CDATA "1" >
This indicates that the value of radius will be 1 if the attribute is not specified for a <circle> element.
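Here is a small sketch of how a default value surfaces in a program. It parses a <circle> element into a DOM tree and reads back the radius attribute, once where the attribute is omitted and once where it is given explicitly. The class name DefaultAttributeSketch and the method radiusOf() are made up for the example, and <circle> is declared as EMPTY here purely to keep the document short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class for illustration only.
final class DefaultAttributeSketch {
    static final String PROLOG = "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE circle ["
            + " <!ELEMENT circle EMPTY>"             // simplified for the example
            + " <!ATTLIST circle radius CDATA \"1\">"
            + "]>";
    static final String OMITTED = PROLOG + "<circle/>";
    static final String EXPLICIT = PROLOG + "<circle radius=\"15\"/>";

    /** Returns the value of the radius attribute on the root element as the parser reports it. */
    static String radiusOf(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            return document.getDocumentElement().getAttribute("radius");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Document could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("radius when omitted:  " + radiusOf(OMITTED));
        System.out.println("radius when explicit: " + radiusOf(EXPLICIT));
    }
}
```

Note that the sketch does not switch validation on. The parser reads the declarations in the internal subset anyway, so the default is supplied whether or not the document is being validated.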
You can also insist that a value for an attribute must be one of a fixed set. For instance, suppose we had a color attribute for our circle that could only be red, blue, or green. We could define it like this:
<!ATTLIST circle color (red|blue|green) #IMPLIED >
The value for the color attribute in a <circle> element must be one of the options between the parentheses. In this case the attribute can be omitted because it is specified as #IMPLIED, and an application processing it will supply a default value. To make the inclusion of the attribute mandatory, we would define it as:
<!ATTLIST circle color (red|blue|green) #REQUIRED >
An important aspect of defining possible attribute values by an enumeration like this is that an XML editor can help the author of a document by prompting with the list of possible attribute values from the DTD when the element is being created.
An attribute that is declared as #FIXED must always have the default value. For example:
<!ATTLIST circle
          color (red|blue|green) #REQUIRED
          line_thickness CDATA #FIXED "medium"
>
Here the XML processor will only supply an application with the value medium for the line_thickness attribute. If you specify this attribute for a <circle> element in the body of the document, you can only use the default value; any other value is an error.
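The sketch below exercises an enumerated attribute and a fixed attribute together. It uses the standard form of a fixed attribute declaration, in which the attribute type comes first and the fixed value follows the #FIXED keyword between quotes. The class name CircleAttributeRules and the four sample documents are assumptions for illustration, and <circle> is declared EMPTY to keep things brief.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class for illustration only.
final class CircleAttributeRules {
    static final String PROLOG = "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE circle ["
            + " <!ELEMENT circle EMPTY>"   // simplified for the example
            + " <!ATTLIST circle"
            + "   color (red|blue|green) #REQUIRED"
            + "   line_thickness CDATA #FIXED \"medium\">"
            + "]>";
    static final String COLOR_ONLY = PROLOG + "<circle color=\"red\"/>";
    static final String FIXED_REPEATED = PROLOG
            + "<circle color=\"blue\" line_thickness=\"medium\"/>";
    // purple is not in the enumeration.
    static final String BAD_COLOR = PROLOG + "<circle color=\"purple\"/>";
    // A fixed attribute may only be given its declared value.
    static final String BAD_THICKNESS = PROLOG
            + "<circle color=\"green\" line_thickness=\"thick\"/>";

    /** Parses with DTD validation on and returns the validity error messages. */
    static List<String> validityErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* not needed here */ }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Document could not be parsed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("Color only:     " + validityErrors(COLOR_ONLY));
        System.out.println("Fixed repeated: " + validityErrors(FIXED_REPEATED));
        System.out.println("Bad color:      " + validityErrors(BAD_COLOR));
        System.out.println("Bad thickness:  " + validityErrors(BAD_THICKNESS));
    }
}
```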
You will often need to repeat a block of information in different places in a DTD. A parameter entity identifies a block of parsed text by a name that you can use to insert the text at various places within a DTD. Note that parameter entities are only for use within a DTD. You cannot use parameter entity references in the body of a document. You declare general entities in the DTD when you want to repeat text within the document body.
The form for a parameter entity is very similar to what we saw for general entities except that a % character appears between ENTITY and the entity name separated from both by a space. For example, it is quite likely that you would want to repeat the x and y attributes that we defined in the <position> element in the previous section in other elements. We could define a parameter entity for these attributes and then use that wherever these attributes should appear in an element declaration. Here's the parameter entity declaration:
<!ENTITY % coordinates "x CDATA #REQUIRED y CDATA #REQUIRED">
Now we can use the entity name to insert the x and y attribute definitions in an attribute declaration:
<!ATTLIST position %coordinates; >
A parameter entity declaration must precede its use in a DTD.
The substitution string in a parameter entity declaration is parsed, and can include parameter and general entity references. As with general entities, a parameter entity can also be defined by a reference to a URI containing the substitution string.
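One practical point is worth knowing before trying this out. XML 1.0 only permits a parameter entity reference inside a markup declaration, as in the ATTLIST declaration above, when the declaration is in the external subset. The sketch below therefore supplies the DTD as an external subset, using an EntityResolver to hand the DTD text to the parser from memory rather than from a file. The class name ParameterEntitySketch, the file name coords.dtd, and the base location file:///sketcher/position.xml are all made up for the example and nothing is actually read from disk.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class for illustration only.
final class ParameterEntitySketch {
    // Text of the external subset; coords.dtd is an invented name for it.
    static final String DTD =
          "<!ENTITY % coordinates \"x CDATA #REQUIRED y CDATA #REQUIRED\">\n"
        + "<!ELEMENT position EMPTY>\n"
        + "<!ATTLIST position %coordinates;>\n";
    static final String PROLOG = "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE position SYSTEM \"coords.dtd\">";
    static final String COMPLETE = PROLOG + "<position x=\"30\" y=\"50\"/>";
    static final String MISSING_Y = PROLOG + "<position x=\"30\"/>";

    /** Validates the document against the in-memory DTD and returns the validity errors. */
    static List<String> validityErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Whatever system ID is requested, answer with the DTD text held in memory.
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader(DTD)));
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* not needed here */ }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            InputSource source = new InputSource(new StringReader(xml));
            // Invented base location so that the relative system ID can be resolved.
            source.setSystemId("file:///sketcher/position.xml");
            builder.parse(source);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Document could not be parsed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("Both coordinates: " + validityErrors(COMPLETE));
        System.out.println("Missing y:        " + validityErrors(MISSING_Y));
    }
}
```

The second document is rejected because the entity expands to two #REQUIRED attribute definitions, which shows that the substitution really has taken place.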
Besides CDATA and an enumeration of values, an attribute value can be declared with any of the following types:
ENTITY     An entity defined in the DTD. An entity here is a name identifying an unparsed entity defined elsewhere in the DTD by an ENTITY tag. The entity may or may not contain text. An entity could represent something very simple such as &lt;, which refers to a single character, or it could represent something more substantial such as an image.
ENTITIES   A list of entities defined in the DTD separated by spaces.
ID         An ID is a unique name identifying an element in a document. This is to enable internal references to a particular element from elsewhere in the document.
IDREF      A reference to an element elsewhere in a document via its ID.
IDREFS     A list of references to IDs separated by spaces.
NMTOKEN    A name conforming to the XML definition of a name. This just says that the value of the attribute will be consistent with the XML rules for a name.
NMTOKENS   A list of name tokens separated by spaces.
NOTATION   A name identifying a notation, which is typically a format specification for an entity such as a JPEG or PostScript file. The notation will be identified elsewhere in the DTD using a NOTATION tag that may also identify an application capable of processing an entity in the given format.
With what we know of XML and DTDs, we can have a stab at putting together a DTD for storing Sketcher files as XML. As we said before, an XML language has already been defined for representing and communicating two-dimensional graphics. This is called Scalable Vector Graphics, and you can find it at http://www.w3.org/TR/SVG/. While this would be the choice for transferring 2D graphics as XML documents in a real-world context, our objective is to exercise our knowledge of XML and DTDs, so we will reinvent our own version of this wheel even though it will have fewer spokes and may wobble a bit.
First, let's consider what our general approach is going to be. Since our objective is to define a DTD that will enable us to exercise the Java API for XML with Sketcher, we will define the language to make it an easy fit to Sketcher, rather than worry about the niceties of the best way to represent each geometric element. Since Sketcher itself was a vehicle for trying out various capabilities of the Java class libraries, it evolved in a somewhat topsy-like fashion with the result that the classes defining geometric entities are not necessarily ideal. However, we will just map these directly in XML in order to avoid the mathematical jiggery-pokery that would be necessary if we adopted a more formal representation of geometry in XML.
A sketch is a very simple document. It's basically a sequence of lines, circles, rectangles, curves, and text. We can therefore define the root element, <sketch>, in the DTD as:
<!ELEMENT sketch (line|circle|rectangle|curve|text)*>
This just says that a sketch consists of zero or more of any of the elements between the parentheses. We now need to define each of these elements.
A line is easy. It is defined by its location, which is its start point, and an end point. It also has an orientation – its rotation angle – and a color. We could define a <line> element like this:
<!ELEMENT line (color, position, endpoint)>
<!ATTLIST line
          angle CDATA #REQUIRED
>
A line is fully defined by two points, but our Line class includes a rotation field so we have included that too. Of course, a position is also a point so it would be possible to use a <point> element for this, but differentiating the position for a geometric element will make it a bit easier for a human reader to read an XML document containing a sketch.
We could define color by a color attribute to the <line> element with a set of alternative values, but to allow the flexibility for lines of any color, it would be better to define a <color> element with three attributes for RGB values. In this case we can define the <color> element as:
<!ELEMENT color EMPTY>
<!ATTLIST color
          R CDATA #REQUIRED
          G CDATA #REQUIRED
          B CDATA #REQUIRED
>
We must now define the <position> and <endpoint> elements. These are both points defined by an (x,y) coordinate pair so you would sensibly define them consistently. Empty elements with attributes are the most economical way here and we can use a parameter entity for the attributes:
<!ENTITY % coordinates "x CDATA #REQUIRED y CDATA #REQUIRED">
<!ELEMENT position EMPTY>
<!ATTLIST position %coordinates;>
<!ELEMENT endpoint EMPTY>
<!ATTLIST endpoint %coordinates;>
A rectangle will be defined very similarly to a line since it is defined by its position, which corresponds to the top left corner, plus the coordinates of the bottom right corner. It also has a color and a rotation angle. Here's how this will look in the DTD:
<!ELEMENT rectangle (color, position, bottomright)>
<!ATTLIST rectangle
          angle CDATA #REQUIRED
>
<!ELEMENT bottomright EMPTY>
<!ATTLIST bottomright %coordinates;>
We don't need to define the <color> and <position> elements since we have already defined these earlier for the <line> element.
The <circle> element is no more difficult. Its position is the center, and it has a radius and a color. It also has a rotation angle. We can define it like this:
<!ELEMENT circle (color, position)>
<!ATTLIST circle
          radius CDATA #REQUIRED
          angle CDATA #REQUIRED
>
The <curve> element is a little more complicated because it is defined by an arbitrary number of points, but it is still quite easy:
<!ELEMENT curve (color, position, point+)>
<!ATTLIST curve angle CDATA #REQUIRED>
<!ELEMENT point EMPTY>
<!ATTLIST point %coordinates;>
Lastly we have the element that defines a text element in Sketcher terms. We need to allow for the font name and its style and point size, a rotation angle for the text, and a color – plus the text itself, of course, and its position. A Text element is also a little different from the other elements, as its bounding rectangle is required to construct it, so we must also include that. We have some choices as to how we define this element. We could use mixed element content in a <text> element, combining the text string with <font> and <position> elements, for instance.
The disadvantage of this is that we cannot limit the number of occurrences of the child elements and how they are intermixed with the text. We can make the definition more precisely controlled by enclosing the text in its own element. Then we can define the <text> element as having element content – like this:
<!ELEMENT text (color, position, font, string)>
<!ATTLIST text angle CDATA #REQUIRED>
<!ELEMENT font EMPTY>
<!ATTLIST font
          fontname CDATA #REQUIRED
          fontstyle (plain|bold|italic|bold-italic) #REQUIRED
          pointsize CDATA #REQUIRED
>
<!ELEMENT string (#PCDATA|bounds)*>
<!ELEMENT bounds EMPTY>
<!ATTLIST bounds
          width CDATA #REQUIRED
          height CDATA #REQUIRED
>
The <string> element content will be a <bounds> element defining the height and width of the bounding rectangle plus the text to be displayed. The <font> element provides the name, style, and size of the font as attribute values and since nothing is required beyond that it is an empty element. Children of the <text> element that we have already defined specify the color and position of the text.
That's all we need. The complete DTD for Sketcher documents will be:
<!ELEMENT sketch (line|circle|rectangle|curve|text)*>

<!ELEMENT color EMPTY>
<!ATTLIST color
          R CDATA #REQUIRED
          G CDATA #REQUIRED
          B CDATA #REQUIRED
>

<!ENTITY % coordinates "x CDATA #REQUIRED y CDATA #REQUIRED">
<!ELEMENT position EMPTY>
<!ATTLIST position %coordinates;>
<!ELEMENT endpoint EMPTY>
<!ATTLIST endpoint %coordinates;>

<!ELEMENT line (color, position, endpoint)>
<!ATTLIST line
          angle CDATA #REQUIRED
>

<!ELEMENT rectangle (color, position, bottomright)>
<!ATTLIST rectangle
          angle CDATA #REQUIRED
>
<!ELEMENT bottomright EMPTY>
<!ATTLIST bottomright %coordinates;>

<!ELEMENT circle (color, position)>
<!ATTLIST circle
          radius CDATA #REQUIRED
          angle CDATA #REQUIRED
>

<!ELEMENT curve (color, position, point+)>
<!ATTLIST curve angle CDATA #REQUIRED>
<!ELEMENT point EMPTY>
<!ATTLIST point %coordinates;>

<!ELEMENT text (color, position, font, string)>
<!ATTLIST text angle CDATA #REQUIRED>
<!ELEMENT font EMPTY>
<!ATTLIST font
          fontname CDATA #REQUIRED
          fontstyle (plain|bold|italic|bold-italic) #REQUIRED
          pointsize CDATA #REQUIRED
>
<!ELEMENT string (#PCDATA|bounds)*>
<!ELEMENT bounds EMPTY>
<!ATTLIST bounds
          width CDATA #REQUIRED
          height CDATA #REQUIRED
>
We can use this DTD to represent any sketch in XML. Stash it away in your Beg Java Stuff directory as sketcher.dtd. We will try it out later.