Many standards mentioned in this chapter, such as XML signatures and XML encryption, require a common way to represent XML documents. XML documents can have the same logical meaning but different physical representations, differing in character encoding, attribute ordering, or even physical structure. Let us look at a simple example:
<img src="burney.gif" width="100" height="50"/>
<img src="burney.gif" height="50" width="100"/>
A string comparison of the two elements above reports them as different, yet they are equivalent. The XML 1.0 specification states that the order of attributes is not significant. Nor is the amount of whitespace between attributes, or whether default attribute values are written out explicitly. Two equivalent XML documents may also differ in physical structure or character encoding. The ability to check XML documents for equivalence is important, especially in conjunction with checksums, digital signatures, and version control. The W3C has defined a canonical form for XML documents that provides a solution to these problems.
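The comparison described above can be sketched in Python. This is only an illustration, assuming Python 3.8 or later, whose standard library provides xml.etree.ElementTree.canonicalize (an implementation of the later Canonical XML 2.0 rules, which treat attribute order the same way). The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The two elements from the example: same attributes, different order.
first = '<img src="burney.gif" width="100" height="50"/>'
second = '<img src="burney.gif" height="50" width="100"/>'

# A plain string comparison treats them as different.
strings_equal = first == second

# Canonicalization sorts the attributes and writes the empty element
# as a start tag/end tag pair, so both inputs yield the same text.
canonical_first = ET.canonicalize(first)
canonical_second = ET.canonicalize(second)
canonical_equal = canonical_first == canonical_second

print(strings_equal)    # False
print(canonical_equal)  # True
print(canonical_first)
```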
The Canonical XML specification establishes the concept of equivalence between XML documents and provides the ability to test it at the syntactic level. Once two documents have been converted to canonical form, you can determine whether they are logically equivalent by checking whether their canonical forms are byte-for-byte identical. Canonical XML always encodes its output in UTF-8, which is an encoding of Unicode, rather than preserving the original encoding of the document. This is done because the same characters are stored as different bytes under different encodings, so two XML documents with equivalent content may contain differing byte sequences. For example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<lang>Español</lang>
Here, the character "ñ" is the Unicode code point #xF1, and the specified ISO-8859-1 encoding ("ISO Latin-1") stores it as the single byte #xF1. UTF-8 is a variable-length encoding that uses between one and four bytes per character, and it represents "ñ" as the two bytes #xC3 and #xB1. Many other XML constructs have similar alternative representations.
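The byte values quoted above can be checked with a short Python sketch. It uses only built-in string encoding and the standard-library ElementTree parser. The variable names are illustrative, and the character is written as the escape \u00f1 so the snippet does not depend on the encoding of the source file.

```python
import xml.etree.ElementTree as ET

n_tilde = "\u00f1"  # the character "ñ"

# One byte in ISO-8859-1, two bytes in UTF-8.
latin1_bytes = n_tilde.encode("iso-8859-1")
utf8_bytes = n_tilde.encode("utf-8")
print(latin1_bytes.hex())  # f1
print(utf8_bytes.hex())    # c3b1

# The example document, stored as ISO-8859-1 bytes.
document = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<lang>Espa\u00f1ol</lang>"
).encode("iso-8859-1")

# The parser honours the declared encoding and yields the same text,
# which can then be written out again as UTF-8.
root = ET.fromstring(document)
text_as_utf8 = root.text.encode("utf-8")
print(text_as_utf8)
```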
A canonicalized XML document is self-contained. Canonicalization replaces entity references, including references to external parsed entities, with their replacement text and removes the document type declaration, so the canonical form no longer depends on any external reference. Suppose a file named government.xml contains the sentence "The government is responsible!" and is stored in the same directory as the following document:
<!DOCTYPE d [
<!ENTITY lsb '['>
<!ENTITY rsb ']'>
<!ENTITY government SYSTEM "government.xml">
]>
<d>&lsb;&government;&rsb;</d>
The canonical form of this document would become:
<d>[The government is responsible!]</d>
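The entity replacement shown above can be approximated in Python. Python's standard-library parser does not fetch external entities, so this sketch makes an assumption for illustration: it declares government as an internal entity holding the same sentence instead of pointing at government.xml. It again assumes Python 3.8 or later for xml.etree.ElementTree.canonicalize.

```python
import xml.etree.ElementTree as ET

# Assumption: "government" is declared inline here, because the
# standard-library parser will not load the external file.
document = """<!DOCTYPE d [
<!ENTITY lsb '['>
<!ENTITY rsb ']'>
<!ENTITY government 'The government is responsible!'>
]>
<d>&lsb;&government;&rsb;</d>"""

# Canonicalization drops the DOCTYPE and replaces each entity
# reference with its replacement text.
canonical = ET.canonicalize(document)
print(canonical)
```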
Using the canonical form of an XML document is vital for digital signatures and encryption. Otherwise, a recipient may improperly determine that a document has been altered, when in fact it is still intact. This is important for signing and nonrepudiation in Web services architecture.
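The effect on integrity checks can be illustrated with a plain SHA-256 digest. This is a sketch only, not an implementation of XML Signature, which specifies its own canonicalization and digest algorithm identifiers. The helper canonical_digest is hypothetical, and Python 3.8 or later is assumed.

```python
import hashlib
import xml.etree.ElementTree as ET


def canonical_digest(xml_text: str) -> str:
    """Return the SHA-256 hex digest of the canonical form of xml_text."""
    canonical = ET.canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same element as sent and as re-serialized by an intermediary
# that changed attribute order and whitespace inside the tag.
sent = '<img src="burney.gif" width="100" height="50"/>'
received = '<img   height="50"  src="burney.gif" width="100" />'

# Digests over the raw text differ, which would wrongly suggest tampering.
raw_match = (
    hashlib.sha256(sent.encode("utf-8")).hexdigest()
    == hashlib.sha256(received.encode("utf-8")).hexdigest()
)

# Digests over the canonical forms agree.
canonical_match = canonical_digest(sent) == canonical_digest(received)

print(raw_match)        # False
print(canonical_match)  # True
```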
Because the Canonical XML specification is evolving, we recommend that you visit www.w3.org for the latest draft of the CharModel, Namespaces, and XML specifications.