Copyright 1998 Simon St.Laurent
The computing press has found a new savior for the ills that afflict computing and the web: XML. XML is new, it's exciting, and it's got to be good, because the specification for it looks indecipherable. XML's hype level has already drawn fire from some quarters, from those accusing it of 'balkanizing the web' or of increasing the load on an already strained Internet. Most important, many developers are wondering why exactly they need to learn yet another language.
XML's set of tools allows developers to create web pages - and much more. XML allows developers to set standards defining the information that should appear in a document, and in what sequence. XML, in combination with other standards, makes it possible to define the content of a document separately from its formatting, making it easy to reuse that content in other applications or for other presentation environments. Most important, XML provides a basic syntax that can be used to share information between different kinds of computers, different applications, and different organizations without needing to pass through many layers of conversion.
Web developers are the initial target audience, but database developers, document managers, desktop publishers, programmers, scientists, and other academics are all getting involved. XML provides a simple format that is flexible enough to accommodate wildly diverse needs. Even developers performing tasks on different types of applications with different interfaces and different data structures can share XML formats and tools for parsing those formats into data structures that applications can use. XML offers its users many advantages, including simplicity, extensibility, interoperability, openness, and a core of experienced professionals.
Extensible Markup Language (XML) provides a foundation for creating documents and document systems. XML operates on two main levels: first, it provides syntax for document markup; and second, it provides syntax for declaring the structures of documents. XML is clearly targeted at the Web, though it certainly has applications beyond it. Users who have worked with HTML before should be able to learn the basics of XML without too much difficulty. XML's simplicity is its key selling point, perhaps even its strongest feature.
XML is derived from (and is technically a subset of) the Standard Generalized Markup Language (SGML). SGML has found its main customer base in organizations handling enormous quantities of documents - the U.S. Government Printing Office, IBM, the U.S. Department of Defense and Internal Revenue Service, and many publishers. SGML's development provides the foundations for XML, but XML has a smaller and simpler syntax, targeted at web developers and others who need a simple solution to document creation, management, and display.
Note:
The following section is not an XML tutorial. It shows only a few pieces of XML, to give a sense of what a document and a DTD look like. For a full tutorial on XML, see my XML: A Primer, or any of a number of other good XML guides arriving at a bookstore near you.
Like HTML (an application of SGML), XML uses elements and attributes, which are indicated in a document using tags. Tags begin with a < and close with a >. End tags include a / before the name of the element; empty tags include a / before the closing >. For example, the following bit of a document includes three elements: two elements with content, and one empty element.
<FIGURE DESCRIPTION="Harvey"><IMAGE/><CAPTION>This is a picture of my invisible friend!</CAPTION></FIGURE>
The first start tag opens the FIGURE element, which has the attribute DESCRIPTION set to "Harvey", and contains an empty IMAGE element and the CAPTION element with its content. End tags close both the CAPTION and the FIGURE elements, producing a nested structure. These nested structures are fairly good at representing typical document and data structures, and are very easy for computer programs to store and manipulate.
XML enforces its rules harshly. Unlike HTML browsers, which have been extremely forgiving of bad markup, XML parsers are supposed to produce error messages for illegal or malformed markup. Forcing authors to clean up their markup allows the parsers on the receiving end to do much less work. It also gives authors confidence that their work will be interpreted consistently, without having to wonder how multiple browsers would interpret the same document.
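The strictness described above is easy to observe with any conforming parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The choice of Python and the helper name is_well_formed are assumptions made for illustration, not part of XML itself.

```python
import xml.etree.ElementTree as ET

GOOD = ('<FIGURE DESCRIPTION="Harvey"><IMAGE/>'
        '<CAPTION>This is a picture of my invisible friend!</CAPTION></FIGURE>')

# Same fragment with the CAPTION end tag left out. A forgiving HTML browser
# would guess at the intent; an XML parser is required to reject it.
BAD = ('<FIGURE DESCRIPTION="Harvey"><IMAGE/>'
       '<CAPTION>This is a picture of my invisible friend!</FIGURE>')


def is_well_formed(text):
    """Hypothetical helper: True if the parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Once parsed, the nested structure is available as a simple tree.
figure = ET.fromstring(GOOD)
children = [child.tag for child in figure]
```

Because the malformed fragment is rejected outright, an application that receives the parsed tree never has to guess where CAPTION was meant to end.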
In addition to providing syntax for document markup, XML provides syntax for specifying document structure. The Document Type Definition (DTD) provides XML parsers a set of rules with which they can validate the document. Validation doesn't imply that the contents of the document are correct, or that certain data fields are numbers or text; rather, it means that all the elements of the document fit into the structure specified by the DTD. For example, the fragment below specifies the structure used in the example above.
<!ELEMENT FIGURE (IMAGE, CAPTION)>
<!ATTLIST FIGURE
DESCRIPTION CDATA #IMPLIED>
<!ELEMENT IMAGE EMPTY>
<!ELEMENT CAPTION (#PCDATA)>
The FIGURE element must contain an IMAGE and a CAPTION element, in that order, and the FIGURE element may have a DESCRIPTION attribute. The IMAGE element must be empty, and would probably include a set of attributes providing information about the image if this example were more complicated than an 'invisible friend'. The CAPTION element may contain text, entities, processing instructions, and any other valid XML text except other elements. The full syntax for DTDs provides many more options for declaring elements and attributes and their location in the document structure, as well as entities, which allow the developer to define a chunk of XML content or DTD information and use it by reference.
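To make the declarations above concrete, here is a rough Python sketch of the checks a validating parser would perform for this one DTD. It is hand-written for illustration only: the function name check_figure is invented, the rules are hard-coded from the four declarations, and a real validating parser would read the DTD itself rather than rely on code like this.

```python
import xml.etree.ElementTree as ET


def check_figure(figure):
    """Return a list of problems; an empty list means the element fits.

    Hard-codes the rules of the DTD fragment above (illustration only).
    """
    if figure.tag != "FIGURE":
        return ["root element is not FIGURE"]
    problems = []
    # <!ELEMENT FIGURE (IMAGE, CAPTION)>: exactly these children, in order.
    if [child.tag for child in figure] != ["IMAGE", "CAPTION"]:
        problems.append("FIGURE must contain IMAGE then CAPTION")
    # <!ATTLIST FIGURE DESCRIPTION CDATA #IMPLIED>: optional, nothing else declared.
    if set(figure.attrib) - {"DESCRIPTION"}:
        problems.append("FIGURE has an undeclared attribute")
    for child in figure:
        # <!ELEMENT IMAGE EMPTY>: no text and no child elements.
        if child.tag == "IMAGE" and (len(child) or child.text):
            problems.append("IMAGE must be empty")
        # <!ELEMENT CAPTION (#PCDATA)>: text only, no child elements.
        if child.tag == "CAPTION" and len(child):
            problems.append("CAPTION may not contain elements")
    return problems


valid = ET.fromstring(
    '<FIGURE DESCRIPTION="Harvey"><IMAGE/><CAPTION>My friend</CAPTION></FIGURE>')
invalid = ET.fromstring(
    '<FIGURE><CAPTION>No picture <B>here</B></CAPTION></FIGURE>')
```

Note that the second fragment is perfectly well-formed; it fails only because its structure does not match what the declarations allow.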
XML permits the use of documents, called 'well-formed documents', that follow only its rules for document syntax, without specifying a DTD. Documents that contain (and/or refer to) a properly written DTD, and meet the requirements it sets, are referred to as 'valid'. Validation can be an important step in the authoring process, and may also be performed at any step in processing. Developers can choose how often, and when, to screen a document to check its structure. Applications which need to process lots of information quickly, or which can't afford the additional processing requirements imposed by validation, can stick to well-formed documents. Well-formed documents also provide an easy bottom rung on the XML learning ladder - by sticking to the basic syntax, developers can create parseable documents with any structure they choose, moving up to more formal DTDs when the need arises.
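The difference between well-formed and valid shows up clearly when a non-validating parser meets a document that breaks its own DTD. The sketch below assumes Python's standard xml.etree.ElementTree module, which checks well-formedness only; the document text and the entity name friend are made up for the example.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset says FIGURE needs an IMAGE, but the document
# leaves it out: well-formed, yet not valid.
DOC = """<!DOCTYPE FIGURE [
<!ELEMENT FIGURE (IMAGE, CAPTION)>
<!ELEMENT IMAGE EMPTY>
<!ELEMENT CAPTION (#PCDATA)>
<!ENTITY friend "Harvey">
]>
<FIGURE><CAPTION>&friend; is camera-shy.</CAPTION></FIGURE>"""

# ElementTree sits on a non-validating parser, so the missing IMAGE is not
# reported. The syntax rules are still enforced, and the internal entity
# declared in the DTD subset is still expanded.
root = ET.fromstring(DOC)
child_tags = [child.tag for child in root]
caption = root.find("CAPTION").text
```

An application that needs the stronger guarantee would run the same document through a validating parser at whatever point in its processing the check is worth the cost.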
In practice, though, XML itself should disappear into the background, hiding behind tools for most users. Most people won't need to create their own DTDs - once a standard DTD is created, users can simply apply it, making modifications when it becomes clear the structure needs improvement. As XML becomes ubiquitous (and tools improve), it should become invisible to all but a few, buried underneath authoring tools and plug-in parsers.
XML provides both programmers and document authors with a friendly environment, at least by computing standards. XML's rigid set of rules helps make documents more readable to both humans and machines. XML document syntax contains a fairly small set of rules, making it possible for developers to get started right away. DTDs can be developed through a standards process, set by experts, or through experimentation, based on the structures of documents that seem to work well. XML parsers are also reasonably simple to build, especially parsers that only check well-formedness.
XML documents are built upon a core set of basic nested structures. While the structures themselves can grow complex as layers and layers of detail are added, the mechanisms underlying those structures require very little implementation effort, from either authors or developers. These basic structures can be used to represent complex sets of information, from the full contents of a document to persistent object state information to a set of commands for a program, without needing to change the structures themselves.
XML is extensible in two senses. First, it allows developers to create their own DTDs, effectively creating 'extensible' tag sets that can be used for multiple applications. Second, XML itself is being extended with several additional standards that add styles, linking, and referencing ability to the core XML set of capabilities. As a core standard, XML provides a solid foundation around which other standards may grow.
Creating DTDs is most likely what the creators of XML had in mind when they called it Extensible Markup Language. XML is, after all, a meta-language, a set of rules that can be used to create sets of rules for documents. In a certain sense, there's no such thing as an 'XML document' - all the documents that use XML-compliant syntax are really using applications of XML, with tag sets chosen by their creators for that particular document. XML's facilities for creating DTDs give standard-builders a set of tools for specifying what document structures may or must appear in a document, making it easy to define sets of structures. These structures can then be used with XML tools for authoring, parsing, and processing, and used by applications as a guide to the data they should accept.
At the same time that XML is being used to create other standards, other supporting standards for XML are being defined. XML can already use many of the standards applied to HTML, like Cascading Style Sheets (CSS) and HyperText Transfer Protocol (HTTP). W3C working groups are developing additional supporting standards for XML. XML-Linking (XLink) provides linking facilities that are far more sophisticated than those in HTML. XPointers, derived from the Text Encoding Initiative's (TEI) extended pointers, provide a way to consistently reference portions of documents. Extensible Style Language (XSL) provides a more complete set of formatting tools than CSS, and is notable for using XML syntax to define its style sheets. Other standards, including support for data-typing, are under discussion.
XML can be used on a wide variety of platforms and interpreted with a wide variety of tools. Because the document structures behave consistently, parsers that interpret them can be built at relatively low cost in any of a number of languages. XML supports a number of key standards for character encoding, allowing it to be used all over the world in a number of different computing environments. XML complements Java, another force for interoperability, very well, and a considerable amount of early XML development has been in Java. A generic application programming interface (API) for parsers, the Simple API for XML (SAX), is freely available. Parsers are also available in C++, C, JavaScript, Tcl, and Python, with more on the way. XML parser development so far has focused on freeware plug-ins that provide parsing capabilities to XML applications, greatly lowering the cost of building XML-enabled applications.
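SAX delivers a document to the application as a stream of events (start tag, character data, end tag) rather than as a finished tree. SAX was first defined for Java; the sketch below uses the port in Python's standard library, xml.sax, and the handler class name TagCounter is invented for the example.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Counts start tags and collects character data as events arrive."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.text = []

    def startElement(self, name, attrs):
        # Called once for every start tag (and for every empty-element tag).
        self.counts[name] = self.counts.get(name, 0) + 1

    def characters(self, content):
        # May be called several times for one run of text, so collect pieces.
        self.text.append(content)


DOC = (b'<FIGURE DESCRIPTION="Harvey"><IMAGE/>'
       b'<CAPTION>This is a picture of my invisible friend!</CAPTION></FIGURE>')

handler = TagCounter()
xml.sax.parseString(DOC, handler)
caption = "".join(handler.text)
```

Because the handler depends only on the SAX interface, the parser underneath it can be swapped without touching the application code, which is the interoperability point made above.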
Although there have been some questions about the process used to create XML, the standard itself is completely open, freely available on the web. The W3C members have early access to standards (and, apart from invited experts, are the only ones who can participate directly in their creation), but once the standard is complete the results are public. The XML Working Group and the Working Groups for the supporting standards also release drafts of their work on a regular basis, making it possible to follow work in progress. Several non-W3C XML developments have also been extremely open, including SAX.
XML documents themselves are also considerably more open than their binary counterparts. Anyone can parse a well-formed XML document, and validate it if a DTD is provided. While companies may still create XML that behaves in a specific way bound to their application, the data in the XML document is available to any application. While developers could create obfuscated DTDs or encrypt their data in a proprietary manner, they would lose most of the benefits of using XML. XML doesn't bar the creation of proprietary formats, but its openness is one of its greatest advantages. Application developers can partition tasks among multiple tools, possibly even from different vendors, allowing them all to operate on the same structured data set.
Successful application of XML will require data modeling expertise and (eventually) the building of a new set of tools. Fortunately, the skills of the SGML community are mostly transferable, providing XML with a large group of 'experts' early in its existence. The XML specification was created for the most part by a group of experienced SGML developers, and has received vocal support from many sectors of the SGML community. Vendors are repurposing their tools, simplifying them for XML. Authors who had been writing SGML texts are focusing on XML as well, bringing markup structure to a wider audience. Companies that need to bring in outside vendors to help with large projects have a pool of firms with established track records to choose from.
XML was designed, first and foremost, to be "straightforwardly usable over the Internet." XML's first home, and the application it rides to ubiquity, is likely to be the Web. This doesn't mean that XML by itself will revolutionize the Web, or that the Web is the only field in which XML is going to be useful. In the long run, XML may become a common thread uniting a wide variety of applications, smoothly managing data across distributed applications.
The Web will (hopefully) be XML's starting point, the 'killer app' for XML. XML and CSS together provide developers with an easy and efficient way to mark up pages for presentation on the web. Simply as a web-development system, XML and CSS can already provide significant advantages over HTML, both for ease of use and for the amount of time needed to create a large site. Because CSS allows designers to centralize formatting information in a style sheet shared by any number of documents, designers only need to design repetitive pages once. The formatting information in the style sheet will link to the XML tags in the document, allowing documents to be marked up by editors based on content without the need to enter precise formatting. Complex pages, like the front pages of many sites, will still need extra attention, but the bulk of pages can be built at a reduced cost. Also, because style sheets can centralize all the formatting information needed for a site, XML sites may actually use less bandwidth than their older HTML equivalents.
While formatting pages is useful, many users are starting to realize that web sites are only marginally more useful than printed or faxed material. Although it's possible to cut and paste information out of a web browser, XML opens up the prospect of reusable page content. With appropriate supporting applications, a user could extract the XML data from a document and keep it in their own private data store, making it easy to manipulate the information later. This information could include site maps, price lists, product information, or nearly any kind of data that can be represented as text. Content-based XML markup enhances searchability as well, making it possible for agents and search engines to categorize data instead of wasting processing power on context-based full-text searches.
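As a sketch of what reusable page content might look like in practice, suppose a site published its price list with content-based markup. The element and attribute names here (PRICELIST, ITEM, NAME, PRICE, SKU, CURRENCY) are invented for the example, and Python's standard xml.etree.ElementTree module stands in for whatever supporting application a user might actually have.

```python
import xml.etree.ElementTree as ET

# A hypothetical price list as it might arrive from a web site.
PAGE = """<PRICELIST CURRENCY="USD">
  <ITEM SKU="A-100"><NAME>Widget</NAME><PRICE>4.50</PRICE></ITEM>
  <ITEM SKU="B-200"><NAME>Gadget</NAME><PRICE>12.00</PRICE></ITEM>
</PRICELIST>"""

root = ET.fromstring(PAGE)

# Pull the data out of the page into the user's own data structure.
prices = {item.findtext("NAME"): float(item.findtext("PRICE"))
          for item in root.findall("ITEM")}

# With the data extracted, later manipulation needs no screen-scraping.
cheapest = min(prices, key=prices.get)
```

The same extraction from an HTML page would depend on guessing which table cell or font change happened to mean 'price'.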
At the same time, XML is useful for much more than Web pages. XML's potential as a universal transfer format, allowing even applications of different types to exchange data smoothly, holds as much promise as its role as a document system. XML browsers are a key opening for XML, allowing users to read XML documents freely, but browsers are only the beginning. XML provides a gateway for communication between applications, even applications on wildly different systems. As long as applications can share data (through HTTP, file sharing, or another mechanism), and have an XML parser, they can share structured information that is easily processed. Databases can trade tables, business applications can trade updates, and document systems can share information.
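The idea of databases trading tables can be sketched in a few lines. The format below (a TABLE element holding ROW elements, with one child element per column) and the function names rows_to_xml and xml_to_rows are assumptions made up for illustration, not an established standard; the code uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(table_name, rows):
    """Serialize a list of dicts as TABLE/ROW elements, one child per column."""
    table = ET.Element("TABLE", NAME=table_name)
    for row in rows:
        row_el = ET.SubElement(table, "ROW")
        for column, value in row.items():
            ET.SubElement(row_el, column).text = str(value)
    return ET.tostring(table, encoding="unicode")


def xml_to_rows(text):
    """Parse the format above back into a list of dicts (values as text)."""
    table = ET.fromstring(text)
    return [{cell.tag: cell.text for cell in row}
            for row in table.findall("ROW")]


# The sending application serializes its rows...
rows = [{"id": 1, "name": "Harvey"}, {"id": 2, "name": "Elwood"}]
wire = rows_to_xml("friends", rows)

# ...and the receiving one, possibly on a very different system, reads them.
received = xml_to_rows(wire)
```

All values come back as text, which is one reason the data-typing standards mentioned earlier are of interest: the syntax carries structure, but agreeing on types is left to the applications or to a supporting standard.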
XML provides a core set of standards developers can use to create their own standards. While some have declared that XML will 'balkanize' the Web, the effects are likely to be much more complex - and less destructive - than that suggestion implies. By hammering down a common document syntax, but allowing developers to go their own ways on markup elements, XML makes it possible to create new systems for data management and organization without many of the incompatibilities and complexities that plagued older systems.