What is XML?

XML, which stands for eXtensible Markup Language, is a syntax that specifies a means of "marking up" textual content for a variety of purposes. It is a greatly simplified version of a parent language called SGML (for Standard Generalized Markup Language), which was also used to define HTML. (Despite the word "Language", it is much simpler to understand the basis for markup languages than programming languages.) XML resembles HTML in its use of <...> and </...> tags to enclose content, except that you can define tags of your own: that is what "extensible" means. XML is great for exchanging data, because it is self-describing: the tags describe every data item, so that one can often figure out what the data means even if one is not a programmer. For example, to record data on a student, we might have:

<STUDENT>
  <FIRST_NAME>
     JOHN
  </FIRST_NAME>
  <LAST_NAME>
     SMITH
  </LAST_NAME>
  <DATE_OF_BIRTH>
     12/1/78
  </DATE_OF_BIRTH>
  <SSN>
     000-111-2222
  </SSN>
</STUDENT>

The nice thing about using XML for exchanging textual data (as opposed to using other forms of text such as tab- or comma-delimited files) is that each item is partially "self-describing". If you're familiar with HTML, you will have little difficulty figuring out what the student-data XML sample above means, and even if you're not, you shouldn't find it that difficult to figure out. Also, a parser (a program that interprets the XML) can successfully handle the data even if the order of the fields is scrambled. For that matter, it can easily ignore the fields that it doesn't care about.
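The point about field order can be illustrated with a short sketch. It assumes Python and its standard xml.etree.ElementTree module (neither is prescribed by this article), and the NICKNAME field is a hypothetical extra added only to show that unknown fields can be skipped.

```python
import xml.etree.ElementTree as ET

# The student record from above, with the fields scrambled and one
# extra, hypothetical field (NICKNAME) that this reader does not use.
SCRAMBLED = """
<STUDENT>
  <SSN> 000-111-2222 </SSN>
  <NICKNAME> JOHNNY </NICKNAME>
  <LAST_NAME> SMITH </LAST_NAME>
  <FIRST_NAME> JOHN </FIRST_NAME>
  <DATE_OF_BIRTH> 12/1/78 </DATE_OF_BIRTH>
</STUDENT>
"""

def read_name(xml_text):
    """Return (first, last), looking fields up by tag name, not position."""
    student = ET.fromstring(xml_text)
    first = student.findtext("FIRST_NAME").strip()
    last = student.findtext("LAST_NAME").strip()
    return first, last

print(read_name(SCRAMBLED))
```

Because each field is found by its tag name, the same function works whatever the order of the fields, which is not true of a reader for a positional, comma-delimited file.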

There are limits to "self-description" - if the tag names are cryptic, for example, or if there are hundreds of different types of tags in a document, few of which are described elsewhere simply and comprehensively, the human reader can end up thoroughly confused. It's like COBOL: even though it was intended to be like plain English - so that non-programmers could supposedly be more comfortable with it - all that this resemblance to English did was to irritate professional programmers, while not significantly lowering the confusion quotient for everyone else. Despite what some misguided authors state about XML being human-readable, in reality, unless the meaning of individual tags is blindingly obvious, as in the above, it is more accurate to state that XML is only quasi-human-readable; it is still more suited to be processed by a computer.

XML specifies content, not appearance per se (as is the case with HTML). The fact that in some cases, the objective of the content is to specify a particular appearance (e.g. graphics) is coincidental. It can very well be used for any other purpose. It is the responsibility of an application to transform a stream of XML any way it wants to. (For example, the last name of a student could be rendered in bold blue letters.) The technology that is most preferred for this purpose is called Extensible Stylesheet Language (XSL). This is effectively a high level programming language (when combined with the scripting languages used for Web applications) that can transform a stream of XML into HTML. Actually, an XSL script can transform XML into anything it wants - even another form of XML.
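To make the content-versus-appearance split concrete, here is a minimal sketch of an application turning student XML into HTML. Note that it does not use XSL: it hand-codes the same kind of transformation in Python (an assumption of this example), and the function name student_to_html is hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

def student_to_html(xml_text):
    """Render a STUDENT record as HTML, with the last name in bold blue."""
    student = ET.fromstring(xml_text)
    first = escape(student.findtext("FIRST_NAME", default="").strip())
    last = escape(student.findtext("LAST_NAME", default="").strip())
    # The XML says nothing about appearance; the styling is decided here.
    return f'<p>{first} <b style="color: blue">{last}</b></p>'

html = student_to_html(
    "<STUDENT><FIRST_NAME>JOHN</FIRST_NAME>"
    "<LAST_NAME>SMITH</LAST_NAME></STUDENT>"
)
print(html)
```

An XSL stylesheet would express the same mapping declaratively, as templates matched against tags, rather than as procedural code.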

In XML, tags can be nested within tags (as in the above case, where FIRST_NAME is nested within STUDENT) to an arbitrary level. XML-aware browsers, such as the current versions of Internet Explorer and Mozilla, will display XML content in a hierarchical "tree"-like fashion - so that there is a little minus sign against the <STUDENT> tag. Clicking on this minus sign will hide the contents within, and change the minus to a plus. Clicking on a plus expands the contents.
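The tree-like display can be approximated in a few lines. This sketch assumes Python's standard xml.etree.ElementTree module, and the NAME wrapper tag is a hypothetical addition used only to get a third level of nesting.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Return one line per element, indented by its nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

doc = ET.fromstring(
    "<STUDENT><NAME><FIRST_NAME>JOHN</FIRST_NAME>"
    "<LAST_NAME>SMITH</LAST_NAME></NAME><SSN>000-111-2222</SSN></STUDENT>"
)
print("\n".join(outline(doc)))
```

The recursion mirrors the nesting: each element is handled the same way regardless of how deep it sits, which is why arbitrary depth poses no special problem.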

The individual tags can also have attributes, which are basically property-value pairs separated by the "equals" symbol. For example, in SVG (Scalable Vector Graphics, an XML dialect we discuss later), the following would define a particular circle:
<circle cx="600" cy="200" r="100" fill="red" stroke="blue" stroke-width="10" />
The attributes cx and cy are the x and y coordinates of the circle's center (in pixel coordinates), r is the radius (in the same units), fill is the fill color, stroke is the color of the circle's outline, and stroke-width is the outline's width.
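As a small illustration of how a program sees attributes, the sketch below reads the circle tag with Python's standard xml.etree.ElementTree module (the choice of Python is an assumption of this example). Attribute values always arrive as text, so numeric ones must be converted explicitly.

```python
import xml.etree.ElementTree as ET

circle = ET.fromstring(
    '<circle cx="600" cy="200" r="100" fill="red" '
    'stroke="blue" stroke-width="10" />'
)

# Attribute values are strings; convert the numeric ones before use.
center = (int(circle.get("cx")), int(circle.get("cy")))
radius = int(circle.get("r"))

print(circle.tag, center, radius, circle.get("fill"))
```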

Note that XML is verbose, because practically all content is sandwiched between begin-tags (<...>) and end-tags (</...>). Certain tags don't have content inside them, such as the "circle" tag defined in the previous paragraph, so the symbol "/>" can be used as a short-form for "></circle>". For many documents, the tag/markup content takes up a greater proportion of space than does the stuff inside the tags. This is not the point, however - the objective is to define the "structure" of what information we wish to record and transmit about a student. Space plays second fiddle to understandability and ease of validation (covered in the next section). In any case, if the XML is stored within an engine that understands it, as opposed to being stored as a text file, these tags are compressed, resulting in dramatic space savings.
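That the short form and the long form of an empty tag mean the same thing can be checked directly. This is a minimal sketch assuming Python's standard xml.etree.ElementTree module; the helper name same is hypothetical.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<circle r="100" />')
long_form = ET.fromstring('<circle r="100"></circle>')

def same(a, b):
    """Compare tag, attributes, text content, and number of children."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib
        and (a.text or "") == (b.text or "")
        and len(a) == len(b)
    )

print(same(short_form, long_form))
```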

Content Validation: XML Schemas

Of course, we cannot just put in any tag we like for a student. There must be a definition of the valid tags that describe a student, so that a tag like "<RASPBERRY>", for example, is rejected. This definition is stored in what is called the Document Type Definition (DTD) that accompanies a block of XML.
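The idea of rejecting unknown tags can be sketched as follows. This is not a DTD processor: Python's standard xml.etree.ElementTree module (assumed here) does not validate against DTDs, so the list of allowed tags is simply hard-coded, and the names ALLOWED_CHILDREN and unexpected_tags are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for what a DTD would declare for STUDENT.
ALLOWED_CHILDREN = {"FIRST_NAME", "LAST_NAME", "DATE_OF_BIRTH", "SSN"}

def unexpected_tags(xml_text):
    """Return the tags directly under the root that are not allowed."""
    student = ET.fromstring(xml_text)
    return [child.tag for child in student if child.tag not in ALLOWED_CHILDREN]

bad = "<STUDENT><FIRST_NAME>JOHN</FIRST_NAME><RASPBERRY>x</RASPBERRY></STUDENT>"
print(unexpected_tags(bad))
```

A real validating parser does this check, and more, from the definition itself rather than from code written for one document type.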

DTDs are rather harder to learn to use than XML itself, and the technology has several limitations in terms of being able to validate content. Notably, you can't specify many kinds of constraints on the contents inside the tags, e.g., some contents must be integer numbers, the numbers must lie within a range of values, or the contents must belong to a limited, predefined set of values. This is why Microsoft and other organizations have proposed an alternative way of defining XML structure, called XML Schemas. Here, the definition language itself is based on XML. XML schemas borrow ideas from the database world, so that the contents of individual tags can be strongly typed (to be integers within a range, for example). 
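The kinds of constraints mentioned here (integer content, a numeric range, a predefined set of values) are sketched below as hand-written Python checks. This is only an illustration of what such typing buys you, not a schema validator; the AGE and STATUS fields, their limits, and the function name check_student are all hypothetical.

```python
import xml.etree.ElementTree as ET

def check_student(xml_text):
    """Return a list of constraint violations (empty if none)."""
    student = ET.fromstring(xml_text)
    errors = []

    # Constraint 1: AGE must be an integer within a range.
    age_text = (student.findtext("AGE") or "").strip()
    try:
        age = int(age_text)
    except ValueError:
        errors.append("AGE must be an integer")
    else:
        if not 16 <= age <= 120:
            errors.append("AGE must be between 16 and 120")

    # Constraint 2: STATUS must come from a predefined set of values.
    status = (student.findtext("STATUS") or "").strip()
    if status not in {"FULL_TIME", "PART_TIME"}:
        errors.append("STATUS must be FULL_TIME or PART_TIME")
    return errors

print(check_student("<STUDENT><AGE>abc</AGE><STATUS>MAYBE</STATUS></STUDENT>"))
```

In an XML Schema, rules like these are stated declaratively in the schema document, so no per-document checking code has to be written.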

XML Schemas are right now the preferred technology for specifying and validating XML content. The nice thing about XML Schemas is that there are software tools that will let you specify that a particular XML document must conform to a particular schema. If a particular document does not conform to the desired schema, these tools will even highlight the point within the offending document where conformance is violated and report an error diagnostic. (These diagnostics are not very useful for end-users, but programmers generally know what to do after seeing them.)

Applications of XML

XML is powerful because it combines simplicity with extensibility. However, because it has taken the software world by storm, the phrase "XML" has become a buzzword. There is a lot of misinformation and hype surrounding the technology, and there is considerable potential for abuse - that is, use of the technology when a simpler or alternative one might be more appropriate. 

XHTML

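HTML itself can be defined in terms of XML. This definition, called XHTML, is also called "strict" HTML. Apart from the fact that XML allows you to define your own tags, XML differs from vanilla HTML in that every begin-tag must have a matching end-tag (or use the "/>" short-form), tags must be properly nested, attribute values must be enclosed in quotes, and tag names are case-sensitive.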

Violations of the above constraints would be treated as syntax errors. Web browsers do not enforce XHTML: if they did, 90% of the pages on the Web would possibly not work. However, when you work with modern HTML editing software and create your HTML through a GUI, the editor can be configured (and is typically set, by default) to generate XHTML rather than HTML. The use of XHTML makes it easier for such software to catch logical errors that would be very hard to catch with HTML.
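A quick way to see XML's stricter syntax in action is to feed HTML-style markup to an XML parser. The sketch below assumes Python's standard xml.etree.ElementTree module as a stand-in for an XHTML-aware tool; the function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, else False."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Tolerated by browsers as HTML, but <br> is never closed.
print(is_well_formed("<p>Line one<br>Line two</p>"))
# The same content with the empty tag closed.
print(is_well_formed("<p>Line one<br/>Line two</p>"))
# Unquoted attribute value: also rejected.
print(is_well_formed("<p class=note>Text</p>"))
```

This only tests well-formedness; checking that the tags used are genuine XHTML tags would additionally require validation against the XHTML definition.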

Extensible languages for defining document semantics for data interchange

Microsoft has been a leader in the use of XML. In the latest version of Microsoft Office (2000), a Word document, an Excel spreadsheet, or a PowerPoint presentation can be saved in XML format, allowing manipulation by any program that knows how to deal with XML (and is also aware of the semantics of the various MS-Office document models, i.e., XML schemas). While the Microsoft Office XML schemas are very complex (and also subject to change, as Microsoft decides to enhance the features of Office), one side effect of the XML representation is that, through the use of XSL, word-processor documents or presentations can be transformed readily into HTML so that the content can be viewed through a Web page without any loss of formatting when the page is viewed with Internet Explorer version 5 or later. (A spreadsheet can even be enabled so that calculations can be done on the Web, provided that you possess a copy of Microsoft Office, which knows what the tags mean.)

To be fair, the "self-describing" ability of XML has its limits for Office documents, as is the case for anything that is sufficiently complex. The Office XML schemas are not documented (in simple, explanatory English) as thoroughly as the third-party developer would like, so that the only software developers who seem to be able to operate on Office XML documents are those within Microsoft. For others, the preferred way of manipulating the document content is still through the Office Application Programming Interface (API).

Pitfall: When designing an interchange format for data in a new, rapidly evolving domain, don't try to define the underlying data model using XML technologies. XML Schemas are a very blunt tool for data modeling - you're better off using a technology such as the Unified Modeling Language (UML), which is highly visual/graphical, and whose "class diagrams" represent a superset of the functionality of the Entity-Relationship diagrams used in Relational Database modeling. UML diagrams allow iterative cycles of refinement and are far easier for non-technical people to understand than reams of XML. Once you have the data model in place in a reasonably stable version, the data interchange model will derive from the data model automatically.

The above lessons were learned by the Microarray and Gene Expression Data (MGED) consortium, who have documented them in some detail. (They wasted a lot of effort initially running in circles with XML schemas and DTDs.)

Extensible Semantics for Software Development: Microsoft ASP.NET

Possibly one of the most elegant examples of using XML to specify both appearance and functionality is seen in the ASP.NET technology invented by Microsoft. Microsoft's toolset specifies high-level user-interface widgets ("Web controls") such as a TreeView that displays arbitrarily hierarchical data, textboxes whose properties can be set interactively, or a "MultiView" that enables you to create an interface like a tabbed notebook (with one active page at a time). You can drag and drop widgets from a toolbox.

When inspecting the Web page in "Design View" you see a visual rendition of the widget. When, however, you switch to "Source/HTML View", what you see is custom XML for each control. Changing the XML content will change the visual rendition - e.g., foreground and background colors in the widget - and conversely, changing the properties of the control interactively will change the XML. When the Web page actually runs, the XML is translated to HTML, and the translation is often very sophisticated, because the widget may also render program code. Therefore, XML is used here as a means of hiding the complexity of interactive Web page development from the developer.

It is possible to write software tools that generate complete Web pages (or a suite of Web pages) for a database application, based on a specification of the database structure and the desired user interface. The pages contain the appropriate XML tags for the controls that you need.

The truly powerful thing about this technology is that you can build your own special-purpose controls to do things that the Microsoft widgets don't do. When placed in the page, these are also rendered both visually as well as in XML format, using the custom XML tags that you have created to define various aspects of your control's semantics. You can then write tools that generate Web pages that incorporate your own controls.

Graphics Rendering

As just stated, XML has been used as the basis of data interchange in specific scientific domains. An example is Chemical Markup Language (CML), which is used to describe the structure of molecules. Several software tools will render CML graphically, supporting features such as molecule rotation or highlighting of selected sub-molecular units, and other tools allow one to draw molecules and then generate the CML automatically.

An interesting application of XML is Scalable Vector Graphics (SVG), which is a W3C standard for the display and interchange of vector-based Web graphics, i.e., images based primarily on geometric shapes such as lines, circles and arcs rather than on bitmaps, though the latter are also supported. The main force behind SVG is Adobe (www.adobe.com/svg). Adobe distributes a free SVG plug-in that works with a variety of Web browsers. Whenever a Web browser receives a stream of XML that happens to be SVG (as indicated by specific tags), it passes control to the plug-in, which then renders the SVG as a graphic within a "window" inside the browser.

SVG is based on the client-server model - the Web browser that receives the XML stream is a client to the SVG plug-in (the server), sending it the XML stream. The server then renders this stream as a picture. The graphics can be highly interactive, allowing selection of individual objects with a mouse, resizing and repositioning, animations (such as fades and transitions), all achieved with a few XML/SVG tags. (For advanced functionality, SVG also supports embedded JavaScript code.) The benefits of SVG are several.

Internet Explorer used to support a predecessor to SVG called Vector Markup Language (VML): this technology appears to have stagnated since SVG was introduced. SVG, however, seems to lack the momentum it was expected to have when first introduced. This is partly because the SVG spec is complex enough that, to take full advantage of it, one needs interactive graphics tools that let you compose the SVG visually. Also, when you need more advanced functionality, working with (and debugging) embedded JavaScript is somewhat of a nightmare - the development tools that let you write such code and test it interactively are currently non-existent, and you debug using 1970s-style print statements. Adobe knows how to do graphics software, but they're not quite as good at building tools for software developers.

Microsoft, whose programmers do know how to write tools for developers, seems to have stayed off the SVG bandwagon - even the latest version of .NET (2005, beta) lacks SVG support. (While Microsoft Visio generates SVG after a fashion, most of the SVG content is a black box that the developer cannot modify.) The long-postponed and forthcoming version of Windows (Longhorn) is intended to support advanced graphics rendering using a flavor of XML which resembles SVG but is not quite SVG. Also, Adobe's Flash format does a lot of what SVG does, but is more developer-friendly. Adobe, which originally championed SVG, seems to have backed off SVG after its acquisition of Macromedia, which developed the Flash technology, so that SVG now seems to be somewhat of an orphan technology, though it spawned several good ideas that have been employed elsewhere.

Managing of XML Content: XML Databases

One use of XML is to mark up clinical text documents - for example, highlighting different sections of a radiology report. When the marked-up reports are stored in a database, one might want to query the database to fetch all tags of a particular kind, and limit the results based on Boolean criteria.
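The kind of query described here can be sketched in memory, without a database. The example assumes Python's standard xml.etree.ElementTree module and its limited XPath support; the report structure, tag names, and attribute names are all hypothetical. An XML-aware database would typically express such queries in a dedicated query language rather than in application code.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up radiology reports.
REPORTS = """
<reports>
  <report id="r1" modality="CT">
    <findings>Small nodule in right upper lobe.</findings>
    <impression>Follow-up recommended.</impression>
  </report>
  <report id="r2" modality="MR">
    <findings>No acute abnormality.</findings>
    <impression>Normal study.</impression>
  </report>
</reports>
"""

root = ET.fromstring(REPORTS)

# Fetch all <impression> tags, limited to reports whose modality is CT.
ct_impressions = [
    e.text for e in root.findall("./report[@modality='CT']/impression")
]

# Boolean criteria (modality AND a keyword in the findings), applied in Python.
nodule_ids = [
    r.get("id")
    for r in root.iter("report")
    if r.get("modality") == "CT" and "nodule" in r.findtext("findings").lower()
]

print(ct_impressions, nodule_ids)
```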

The latest versions of high-end databases such as Oracle and Microsoft SQL Server 2008 have several XML-related enhancements.

Pitfall: Just because you can store XML in a database doesn't mean that you always should. The first thing you should do is try to create a normalized model of the data. If you find that the problem is a natural fit, just use old-fashioned relational tables, and use XML (if at all) only to interchange data (if you need to), or for user interface development (if you're using ASP.NET). If and only if you find that your data is highly hierarchical, then XML might be a good match for it.

Having said this, XML appears to be a good match for storing marked up documents in a database (such as Microsoft Word or Excel documents, or for that matter SVG data) in a way that would also support querying. Here, using a relational database might involve creating numerous tables as well as having to implement recursive relationships. Assembling a document from its components (divided across these numerous tables) can be fairly resource-intensive, and in situations where one almost always wishes to operate on the document as a unit, and queries across documents are relatively less important, XML can be advantageous compared to relational designs.

The ability of XML to store arbitrarily hierarchical data is both a strength (when the data are strictly hierarchical) and an Achilles heel (when they are not, as when you have many-to-many relationships in the data). It is possible that XML may lend itself to applications oriented to hierarchical data, such as computer-aided design applications, without the need to purchase expensive and highly proprietary database engines with very limited or no third-party support. When you use an RDBMS that supports XML, you have hedged your bets - it is unlikely that ALL of your data will fit the XML paradigm, and there may be large parts of it - or even most of it - where traditional relational modeling is a better fit. The fact that pure XML engines seem to have been relegated to a niche market appears to verify the need for modeling flexibility.

Exporting XML Data from Traditional Relational Databases

If you have a relational model, you can readily export the contents of individual tables or views as XML, provided that the column names in your tables are the same as the names of the XML tags (this is how they are exported). You can generally do this without any programming at all, or with just a single SQL command (typically, you add the phrase "FOR XML" to the end of your query). Access 2007 also lets you do this when you export data from a table or query: you just choose "XML" as the output format.
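For a database that has no built-in XML export, the same "rectangular" output takes only a few lines of code. The sketch below assumes Python with its standard sqlite3 and xml.etree.ElementTree modules; the function name table_to_xml and the ROW tag are hypothetical choices, and the column names are assumed to be valid XML tag names.

```python
import sqlite3
import xml.etree.ElementTree as ET

def table_to_xml(conn, table, row_tag):
    """Export every row of `table`; column names become tag names."""
    # The table name is interpolated directly, so it must be trusted input.
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    root = ET.Element(table)
    for row in cursor:
        row_el = ET.SubElement(root, row_tag)
        for name, value in zip(columns, row):
            ET.SubElement(row_el, name).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (FIRST_NAME TEXT, LAST_NAME TEXT)")
conn.execute("INSERT INTO STUDENT VALUES ('JOHN', 'SMITH')")

xml_out = table_to_xml(conn, "STUDENT", "ROW")
print(xml_out)
```

The function knows nothing about students: it works unchanged for any table, which is what makes this flat style of export so cheap.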

The XML that is exported, however, is "rectangular" (like a row-and-column format) rather than highly-nested hierarchical. There is a pitfall here: Beware of inventing XML-based data interchange formats that are highly hierarchical/nested. For example, if you have information about patients, visits for patients, and the interventions performed within each visit, a highly hierarchical format would have the intervention data nested within the visit data for each patient. To do this, you must do a join of multiple relational tables, followed by fairly involved (though not especially challenging) programming to perform the nesting.
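To give a feel for the extra work that nesting involves, here is a minimal sketch of the join-then-nest step. It assumes Python with its standard sqlite3, itertools, and xml.etree.ElementTree modules; the table layout, column names, and sample rows are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER, name TEXT);
CREATE TABLE visit (visit_id INTEGER, patient_id INTEGER, visit_date TEXT);
CREATE TABLE intervention (visit_id INTEGER, description TEXT);
INSERT INTO patient VALUES (1, 'SMITH');
INSERT INTO visit VALUES (10, 1, '2005-01-03'), (11, 1, '2005-02-07');
INSERT INTO intervention VALUES (10, 'X-ray'), (10, 'Cast'), (11, 'Cast removal');
""")

# Step 1: join the three tables, sorted so that related rows are adjacent.
# (Inner joins: patients without visits would be dropped.)
rows = conn.execute("""
    SELECT p.patient_id, p.name, v.visit_id, v.visit_date, i.description
    FROM patient p
    JOIN visit v ON v.patient_id = p.patient_id
    JOIN intervention i ON i.visit_id = v.visit_id
    ORDER BY p.patient_id, v.visit_id, i.rowid
""").fetchall()

# Step 2: regroup the flat rows into patient > visit > intervention.
root = ET.Element("patients")
for (pid, name), p_rows in groupby(rows, key=lambda r: r[:2]):
    p_el = ET.SubElement(root, "patient", id=str(pid), name=name)
    for (vid, vdate), v_rows in groupby(p_rows, key=lambda r: r[2:4]):
        v_el = ET.SubElement(p_el, "visit", id=str(vid), date=vdate)
        for row in v_rows:
            ET.SubElement(v_el, "intervention").text = row[4]

nested = ET.tostring(root, encoding="unicode")
print(nested)
```

Unlike a flat table export, this code is tied to the specific tables and their relationships, so it has to be revisited whenever the data model changes.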

This is wonderful for programmers - it provides them with continued employment - but creates unnecessary work. You might as well have, without any programming, transmitted the data as three separate XML documents, each corresponding to one of the tables. In fact, in a rapidly evolving domain, it is highly likely that your data interchange model may undergo several cycles of change, simply because your data model is undergoing changes. Asking programmers to repeatedly modify their export code will earn you their enmity (as well as that of the people they work for) - it's akin to the old story of Sisyphus, the king who was condemned to roll a stone to the top of the hill only to have it roll down again.

The fact that an XML schema for a complex data format like Microsoft Word or Excel is liable to change each time a new version of Office comes out and new features are added means that, if a programmer writes code that attempts to manipulate the XML directly, that code is very likely to break with a new release of Office. This is one of the excuses that Microsoft gives for documenting their XML format minimally, and recommending that you manipulate files using the Office programming interface instead. (This interface hides from you the details of whether the file you're working with is stored in XML format or the original Microsoft-binary format.)

The moral of the story is that XML, like any powerful tool, has pitfalls, and you should be careful about when and how to use it. (Personally, I've never yet found the need to save any Office file in XML format, just because Microsoft says you can do so.)