An XML document basically consists of two parts, a prolog and a document body:
The prolog provides information necessary for the interpretation of the contents of the document body. It contains two optional components, and since you can omit both, the prolog itself is optional. The two components of the prolog, in the sequence in which they must appear, are:
An XML declaration that defines the version of XML that applies to the document, and may also specify the particular Unicode character encoding used in the document and whether the document is standalone or not. Either the character encoding or the standalone specification can be omitted from the XML declaration but if they do appear they must be in the given sequence.
A document type declaration specifying an external Document Type Definition (DTD) that identifies markup declarations for the elements used in the body of the document, or explicit markup declarations, or both.
The document body contains the data. It comprises one or more elements where each element is defined by a start tag and an end tag. The elements in the document body define the structure of the data. There is always a single root element that contains all the other elements. All of the data within the document is contained within the elements in the document body.
Processing instructions (PI) for the document may also appear at the end of the prolog and at the end of the document body. Processing instructions are instructions intended for an application that will process the document in some way. You can include comments that provide explanations or other information for human readers of the XML document as part of the prolog and as part of the document body.
When an XML document is said to be well-formed, it just means that it conforms to the rules for writing XML, as defined by the XML specification. Essentially an XML document is well-formed if its prolog and body are consistent with the rules for creating these. In a well-formed document there must be only one root element and all elements must be properly nested. We will summarize more specifically what is required to make a document well-formed a little later in this chapter, after we have looked into the rules for writing XML.
An XML processor is a software module that is used by an application to read an XML document and gain access to the data and its structure. An XML processor also determines whether an XML document is well-formed or not. Processing instructions are passed through to an application without any checking or analysis by the XML processor. The XML specification describes how an XML processor should behave when reading XML documents, including what information should be made available to an application for various types of document content.
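To make the idea of a well-formedness check concrete, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module as a stand-in for an XML processor. The helper name is_well_formed is our own invention for illustration, and the parser used here is non-validating, so it only checks that the text follows the rules for writing XML.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML.

    is_well_formed is a hypothetical helper name, not part of any library.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# A single root element with matching start and end tags.
print(is_well_formed("<proverb>Too many cooks spoil the broth.</proverb>"))

# Two top-level elements, so there is no single root element.
print(is_well_formed("<proverb>One</proverb><proverb>Two</proverb>"))

# A start tag that is never closed by an end tag.
print(is_well_formed("<proverb>Too many cooks"))
```

Only the first document should be accepted. The other two break the single root and matching tag rules respectively.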
Here's an example of a well-formed XML document:
<proverb>Too many cooks spoil the broth.</proverb>
The document just consists of a root element that defines a proverb. There is no prolog and, formally, you don't have to supply one, but it would be much better if the document did include at least the XML version that is applicable, like this:
<?xml version="1.0"?>
<proverb>Too many cooks spoil the broth.</proverb>
The first line is the prolog and it consists of just an XML declaration, which specifies that the document is consistent with XML version 1.0. The XML declaration must start with <?xml with no spaces within this five character sequence. We could also include an encoding declaration following the version specification in the prolog. For example:
<?xml version="1.0" encoding="UTF-8"?>
<proverb>Too many cooks spoil the broth.</proverb>
The first line states that as well as being XML version 1.0, the document uses the "UTF-8" Unicode encoding. If you omit the encoding specification, "UTF-8" or "UTF-16" will be assumed, and since "UTF-8" includes ASCII as a subset, you don't need to specify an encoding if all you are using is ASCII text. The version and the character encoding specifications must appear in the order shown. If you reverse them you have broken the rules so the document would no longer be well-formed.
If we want to specify that the document is not dependent on any external definitions of markup, we can add a standalone specification to the prolog like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<proverb>Too many cooks spoil the broth.</proverb>
Specifying the value for standalone as "yes" indicates to an XML processor that the document is self-contained. A value of "no" would indicate that the document is dependent on an external definition of the markup used.
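The following sketch shows one way to see what a parser extracts from the XML declaration. It assumes Python's standard xml.parsers.expat module rather than any particular processor discussed in this chapter, and read_declaration is a hypothetical helper name. It also shows that reversing the version and encoding specifications causes the parser to reject the document.

```python
import xml.parsers.expat


def read_declaration(data: bytes):
    """Return (version, encoding, standalone) from the XML declaration.

    Returns None if the document has no XML declaration. The standalone
    value follows the expat convention: 1 for "yes", 0 for "no", and -1
    if the standalone specification was omitted.
    """
    found = []
    parser = xml.parsers.expat.ParserCreate()

    def on_declaration(version, encoding, standalone):
        found.append((version, encoding, standalone))

    parser.XmlDeclHandler = on_declaration
    parser.Parse(data, True)
    return found[0] if found else None


BODY = b"<proverb>Too many cooks spoil the broth.</proverb>"

full = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n' + BODY
print(read_declaration(full))

# No prolog at all: the document is still well-formed.
print(read_declaration(BODY))

# Encoding before version breaks the required sequence.
reversed_order = b'<?xml encoding="UTF-8" version="1.0"?>\n' + BODY
try:
    read_declaration(reversed_order)
except xml.parsers.expat.ExpatError as error:
    print("rejected:", error)
```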
A valid XML document is a well-formed document that has an associated Document Type Definition or DTD (we will learn more about DTDs later in this chapter). In a valid document the DTD must be consistent with the rules for creating a DTD and the document body must be consistent with the DTD. A DTD essentially defines a markup language for a given type of document and is identified in the DOCTYPE declaration in the document prolog. It specifies how all the elements that may be used in the document can be structured, and the elements in the body of the document must be consistent with it.
The previous example is well-formed, but not valid, since it does not have an associated DTD that defines the <proverb> element. Note that there is nothing wrong with an XML document that is not valid. It may not be ideal, but it is a perfectly legal XML document. Valid in this context is a technical term that only means that a document has a DTD and is consistent with it.
An XML processor may be validating or non-validating. A validating XML processor will check that an XML document has a DTD and that its contents are correctly specified. It will also verify that the document is consistent with the rules expressed in the DTD and report any errors that it finds. A non-validating XML processor will not check that the document body is consistent with the DTD. As we shall see, you can usually choose whether the XML processor that you use to read a document is validating or non-validating simply by switching the validating feature on or off.
Here's a variation on the example from the previous section with a document type declaration added:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE proverb SYSTEM "proverb.dtd">
<proverb>Too many cooks spoil the broth.</proverb>
A document type declaration always starts with <!DOCTYPE so it is easily recognized. The name that appears in the DOCTYPE declaration, in this case proverb, must always match that of the root element for the document. We have specified the value for standalone as "no", but it would still be correct if we left it out because the default value for standalone is "no" if there are external markup declarations in the document. The DOCTYPE declaration indicates that the markup used in this document can be found in the DTD at the URI proverb.dtd. We will see a lot more about the DOCTYPE declaration later in this chapter.
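The rule that the name in the DOCTYPE declaration must match the root element can be checked with a few lines of code. This is only a sketch, assuming Python's standard xml.parsers.expat module, which is non-validating and does not read proverb.dtd. The helper name doctype_matches_root is hypothetical.

```python
import xml.parsers.expat


def doctype_matches_root(data: bytes) -> bool:
    """Return True if the DOCTYPE name equals the root element name."""
    seen = {}
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Record the name and the URI of the external DTD, if any.
        seen["doctype"] = name
        seen["system_id"] = system_id

    def on_start_element(name, attributes):
        # The first start tag encountered belongs to the root element.
        seen.setdefault("root", name)

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start_element
    parser.Parse(data, True)
    return "doctype" in seen and seen["doctype"] == seen.get("root")


GOOD = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE proverb SYSTEM "proverb.dtd">
<proverb>Too many cooks spoil the broth.</proverb>"""

# Same document, but the DOCTYPE names a different element type.
BAD = GOOD.replace(b"<!DOCTYPE proverb", b"<!DOCTYPE saying")

print(doctype_matches_root(GOOD))
print(doctype_matches_root(BAD))
```

Note that the second document is still accepted by the non-validating parser. The mismatch only becomes an error when a validating processor checks the document against its DTD.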
Having an external DTD for documents of a given type does not eliminate all the problems that may arise when exchanging data. Obviously confusion may arise when several people independently create DTDs for the same type of document. My DTD for documents containing sketches created by Sketcher is unlikely to be the same as yours. Other people with sketching applications may be inventing their own versions of a DTD for representing a sketch, so the potential for conflicting definitions for markup is considerable. To obviate the difficulties that this sort of thing would cause, standard markup languages are being developed in XML that can be used universally for documents of common types. For instance, the Mathematical Markup Language (MathML) is a language defined in XML for mathematical documents and the Synchronized Multimedia Integration Language (SMIL) is a language for creating documents that contain multimedia presentations. There is also the Scalable Vector Graphics (SVG) language for representing 2D graphics such as design drawings or even sketches created by Sketcher.
XML markup divides the contents of a document up into elements by enclosing segments of the data between tags. As we said, there will always be one root element that contains all the other elements in a document. In the example above, the following is an element:
<proverb>Too many cooks spoil the broth.</proverb>
In this case this is the only element and is therefore the root element. A start tag, <proverb>, indicates the beginning of an element, and an end tag, </proverb>, marks its end. The name of the element, proverb in this case, always appears in both the start and end tags. The text between the start and end tags for an element is referred to as element content and in general may consist of just data, which is referred to as character data, other elements, which is described as markup, or a combination of character data and markup, or it may be empty. The latter is referred to as an empty element.
When an element contains plain text, then the content is described as parsed character data (PCDATA). This means that the XML processor will parse it – it will analyze it in other words – looking to see if it can be broken down further. In fact PCDATA allows for a mixture of ordinary data and other elements, referred to as mixed content, so a parser will be looking for the characters that delimit the start and end of markup tags. Consequently, ordinary text must not contain characters that might cause it to be recognized as a tag. Thus you can't include < or & characters explicitly as part of the text within an element, for instance. Since it could be a little inconvenient to completely prohibit such characters within ordinary text, you can include them using predefined entities when you need to. XML recognizes the following predefined entities that represent characters that would otherwise be recognized as part of markup:
&amp;    represents &
&apos;   represents '
&quot;   represents "
&lt;     represents <
&gt;     represents >
Here's an element that makes use of a predefined entity:
<text> This is parsed character data within a &lt;text&gt; element.</text>
The content of this element is the string:
This is parsed character data within a <text> element.
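As a quick illustration of how predefined entities behave in practice, the sketch below uses Python's standard xml.etree.ElementTree module to read the element content, and the escape function from xml.sax.saxutils to produce the entities when writing text. The sample string raw is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The parser replaces each predefined entity with the character it stands for.
source = "<text>This is parsed character data within a &lt;text&gt; element.</text>"
element = ET.fromstring(source)
print(element.text)

# Going the other way: escape text before placing it inside an element.
raw = "Fish & chips cost < 10"
escaped = escape(raw)
print(escaped)

# The escaped text survives a round trip through the parser unchanged.
round_trip = ET.fromstring("<text>" + escaped + "</text>").text
print(round_trip == raw)

# Without escaping, the < character is taken as the start of a tag.
try:
    ET.fromstring("<text>" + raw + "</text>")
except ET.ParseError as error:
    print("not well-formed:", error)
```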
Here's an example of an XML document containing several elements:
<?xml version="1.0"?>
<address>
  <buildingnumber>29</buildingnumber>
  <street>South Lasalle Street</street>
  <city>Chicago</city>
  <state>Illinois</state>
  <zip>60603</zip>
</address>
This document evidently defines an address. Each tag pair identifies and categorizes the information between the tags. The data between <address> and </address> is an address and this is a composite of five further elements that each contain character data that forms part of the address. We can easily identify what each of the components of the address is from the elements that enclose each sub-unit of the data.
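To see how an application might get at this structure, here is a small sketch that reads the address document with Python's standard xml.etree.ElementTree module and lists each sub-unit of the data by the name of the element that encloses it. The variable names are our own. This is just one possible way to access the data, not the approach developed later in the book.

```python
import xml.etree.ElementTree as ET

ADDRESS_XML = """<?xml version="1.0"?>
<address>
  <buildingnumber>29</buildingnumber>
  <street>South Lasalle Street</street>
  <city>Chicago</city>
  <state>Illinois</state>
  <zip>60603</zip>
</address>"""

# fromstring returns the root element of the document.
address = ET.fromstring(ADDRESS_XML)
print("root element:", address.tag)

# Iterating over an element yields the elements it directly encloses.
for part in address:
    print(f"  {part.tag}: {part.text}")

# Individual components can be looked up by element name.
city = address.find("city").text
print("city is", city)
```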
The tags that delimit an element have a precise form. Each element start tag must begin with < and end with > and each element end tag must start with </ and end with >. The tag name – also known as the element type name – identifies the element and differentiates it from the others. Note that the element name must immediately follow the opening < in the case of a start tag and the </ in the case of an end tag. If you insert a space here it is incorrect and will be flagged as an error by an XML processor.
Since the <address> element contains all of the other elements that appear in the document, this is the root element. When one element encloses another, it must always do so completely if the document is to be well-formed. Unlike HTML where a somewhat cavalier use of the language is usually tolerated, XML elements must never overlap. For instance, you can't have:
<address><street>South Lasalle Street<city></street>Chicago</city></address>
An element that is enclosed by another element is referred to as the child of the enclosing element, and the enclosing element is referred to as the parent of the child element. In our example, the <address> element is the parent of the other five because it directly encloses each of them and the enclosed elements are child elements of the <address> element. In a well-formed document each start tag must always be matched by a corresponding end tag, and vice versa. If this isn't the case, the document is not well-formed.
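The parent and child relationship, and the ban on overlapping elements, can both be demonstrated in a few lines. This sketch assumes Python's standard xml.etree.ElementTree module and uses a shortened version of the address document. The overlapping variant is our own illustration of what not to write.

```python
import xml.etree.ElementTree as ET

NESTED = (
    "<address>"
    "<street>South Lasalle Street</street>"
    "<city>Chicago</city>"
    "</address>"
)

# Here <city> starts inside <street> but ends outside it.
OVERLAPPING = (
    "<address>"
    "<street>South Lasalle Street<city></street>Chicago</city>"
    "</address>"
)

root = ET.fromstring(NESTED)

# Elements do not record their parent, so build a child-to-parent table.
parent_of = {child: parent for parent in root.iter() for child in parent}
for child, parent in parent_of.items():
    print(f"<{child.tag}> is a child of <{parent.tag}>")

# Overlapping elements are rejected because the end tag does not match
# the most recently opened start tag.
try:
    ET.fromstring(OVERLAPPING)
except ET.ParseError as error:
    print("not well-formed:", error)
```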
Don't forget that there must be only one root element that encloses all the other elements in a document. This implies that no element can appear outside the root element, either before its start tag or after its end tag. An element with the same name as the root element may still appear inside it, where it is simply another nested element.
We already know that an element that contains nothing at all, just a start tag immediately followed by an end tag, is called an empty element. For instance:
<commercial></commercial>
You have an alternative way to represent empty elements. Instead of writing a start and end tag with nothing between them, you can write an empty element as a single tag with a forward slash immediately following the tag name:
<commercial/>
You may be thinking at this point that an empty element is of rather limited use, whichever way you write it. Although by definition an empty element has no content, it can and often does contain additional information that is provided within attributes that appear within the tag. We shall see how we add attributes to an element a little later in this chapter. Additionally, an empty element can be used as a marker or flag to indicate something about the data within its parent. For example, you might use an empty <commercial/> element as part of the content for an <address> element to indicate that the address corresponds to a commercial property. Absence of the <commercial/> element would indicate a private residence.
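The sketch below illustrates both points: the two ways of writing an empty element are equivalent to a parser, and the presence of an empty element can act as a flag. It assumes Python's standard xml.etree.ElementTree module, and the function name is_commercial is a hypothetical helper for this example only.

```python
import xml.etree.ElementTree as ET

# Both forms of an empty element produce an element with no text
# and no child elements.
long_form = ET.fromstring("<commercial></commercial>")
short_form = ET.fromstring("<commercial/>")
print(long_form.text, len(long_form))
print(short_form.text, len(short_form))


def is_commercial(address: ET.Element) -> bool:
    """Treat the presence of a <commercial/> child as a flag."""
    return address.find("commercial") is not None


office = ET.fromstring(
    "<address><city>Chicago</city><commercial/></address>"
)
home = ET.fromstring("<address><city>Chicago</city></address>")

print(is_commercial(office))
print(is_commercial(home))
```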
When you create an XML document using an editor, it is often useful to add explanatory text to the document. You include comments in an XML document like this:
<!-- Prepared on 14th January 2002 -->
Comments can go just about anywhere in the prolog or the document body, but not inside a start tag or an end tag, or within an empty element tag. You can spread a comment over several lines if you wish, like this: