Page content

Index of parts:
What is a DTD?
What is a schema?
Why use a DTD or Schema?
How do they go together?
Where is there further information?

Conventions:

Some text is highlighted with colour; this is what each colour means:

foo
This is the name of an element or a data type that has been chosen by me. It's an example.
wiz
This indicates that "wiz" is an XML-Schema defined name for a "thing" (an element, in XML terms)
bang
This indicates that "bang" is an attribute of some XML-schema element.
boff
This is a piece of example XML-schema code.

What is a DTD?

The purpose of a Document Type Definition (DTD) is to define the legal building blocks of any SGML-based (SGML = Standard Generalized Markup Language) document. It defines the document structure with a list of legal elements.

DTDs have been used since the 1970s.

What is a schema?

A schema (plural: schemata) is "a diagrammatic representation; an outline or model."

Something that formally describes the abstract structure of a set of data can therefore be called a schema.

An XML-schema is a document that describes the valid format of an XML data-set. This definition includes what elements are (and are not) allowed at any point, what the attributes for any element may be, the number of occurrences of elements, etc.

Note: XML-Schema are not known for their brevity. An XML-Schema document for a reasonably-sized XML instance-document will be fairly large. Disk space is cheap and bandwidth is not a huge bottleneck, so there is no need to worry about it.
It does mean that you will do a lot of typing, though.

Why use a DTD or Schema?

The majority of XML documents are "well-formed" rather than "valid". The former means that there is exactly one root element, that every sub-element (at every level of nesting) has delimiting start- and end-tags, and that the elements are properly nested within each other. A valid document, on the other hand, is well-formed and also conforms to a specified set of production rules.
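
As a small illustration of the difference in nesting, here is a sketch of a well-formed document, with a broken variant quoted in a trailing comment. The element names are only examples, borrowed from the vehicle documents used later on this page.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Well-formed: one root element, every start-tag has a matching
     end-tag, and the elements nest cleanly inside each other. -->
<vehicles>
  <vehicle>
    <nickname>Bog Hopper</nickname>
  </vehicle>
</vehicles>
<!-- NOT well-formed (quoted in a comment so that this file still parses):
     <vehicles><vehicle><nickname>Bog Hopper</vehicle></nickname></vehicles>
     The end-tags are crossed, so the elements are not properly nested. -->
```

Neither version says anything about validity: nothing here states which elements are allowed, so there are no rules to validate against.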

To validate an XML document, some set of validating rules needs to be provided. This can be done with a DTD (referenced from a Document Type Declaration) or with a schema.
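
For a rough idea of what rules supplied through a Document Type Declaration look like, here is a minimal sketch with the DTD written inline. The element names are illustrative and the content rules are an assumption about what such a document might want to enforce.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The DOCTYPE declaration names the root element and, here, carries
     the DTD rules inline (an "internal subset"). -->
<!DOCTYPE vehicles [
  <!-- vehicles must contain one or more nickname elements -->
  <!ELEMENT vehicles (nickname+)>
  <!-- nickname contains character data only -->
  <!ELEMENT nickname (#PCDATA)>
]>
<vehicles>
  <nickname>Bog Hopper</nickname>
  <nickname>Wee Beastie</nickname>
</vehicles>
```

A validating parser can now reject, say, an empty vehicles element, which a well-formedness check alone would accept.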

Why schemas instead of DTDs?

An XML-Schema sounds very much like a DTD; however, there are some critical differences, the most notable being that XML-Schema can deal with namespaces and DTDs can't (see the sidebar at http://www-106.ibm.com/developerworks/xml/library/xml-schema/#sidebar1 for some of the limitations of a DTD).

namespaces

As the main reason for using a schema instead of a DTD is the ability to mix namespaces, it must be mentioned that XML-schema are very dependent on namespaces - so we need to go over them first.

Question: What is a namespace?
Answer: From the W3C web site
We envision applications of Extensible Markup Language (XML) where a single XML document may contain elements and attributes (here referred to as a "markup vocabulary") that are defined for and used by multiple software modules. One motivation for this is modularity; if such a markup vocabulary exists which is well-understood and for which there is useful software available, it is better to re-use this markup rather than re-invent it.
Such documents, containing multiple markup vocabularies, pose problems of recognition and collision. Software modules need to be able to recognize the tags and attributes which they are designed to process, even in the face of "collisions" occurring when markup intended for some other software package uses the same element type or attribute name.
These considerations require that document constructs should have universal names, whose scope extends beyond their containing document. This specification describes a mechanism, XML namespaces, which accomplishes this.
[Definition:] An XML namespace is a collection of names, identified by a URI reference [RFC2396], which are used in XML documents as element types and attribute names. XML namespaces differ from the "namespaces" conventionally used in computing disciplines in that the XML version has internal structure and is not, mathematically speaking, a set. These issues are discussed in "A. The Internal Structure of XML Namespaces".

What this means, basically, is that the validating rules for some elements are defined in one place, and some others in another.

For example, HTML (and XHTML) is defined in one single place [by the W3C people]. This can be defined with a DTD.

The RDF (Resource Description Framework), on the other hand, is specifically designed to be a framework for various parties to share data using a common set of XML elements. In the bibliographic world, there is another framework (called the Dublin Core) which is often used in conjunction with RDF. This is far more complex, with multiple markup vocabularies, so it requires namespaces, which in turn require a schema.
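
To make the idea of multiple markup vocabularies concrete, here is a minimal sketch of a document that mixes RDF and Dublin Core elements. The two namespace URIs are the published ones for those vocabularies; the resource address and the values are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Two vocabularies in one document, kept apart by their namespaces:
     the rdf: prefix is bound to RDF, the dc: prefix to the Dublin Core. -->
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- http://example.org/... is a placeholder address, not a real resource -->
  <rdf:Description rdf:about="http://example.org/landrovers/count-zero">
    <dc:title>Count Zero</dc:title>
    <dc:description>A Series I Land Rover</dc:description>
  </rdf:Description>
</rdf:RDF>
```

The prefixes are just shorthand; it is the URI each one is bound to that tells software which vocabulary an element belongs to.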

Tying schema to documents

As this is such a fundamental part of schema, I will first cover defining what schema an XML document should use, and the various options that can be specified.

First off, here is an XML document that makes no reference to a schema. It is well-formed, but not valid.

A basic XML document. File: basic.xml

<?xml version = "1.0" encoding = "UTF-8"?>

<vehicles>
	<nickname>Bog Hopper</nickname>
	<nickname>Wee Beastie</nickname>
	<nickname>Count Zero</nickname>
</vehicles>

To provide validation, we need two things:

  1. A schema.
  2. A reference in the document to the schema-definition file.
A simple XML document, with a schema. File: simple.xml
<?xml version = "1.0" encoding = "UTF-8"?>

<vehicles
    xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation = "http://lucas.ucs.ed.ac.uk/xml-schema/xmlns/simple.xsd"
   >
	<nickname>Bog Hopper</nickname>
	<nickname>Wee Beastie</nickname>
	<nickname>Count Zero</nickname>
</vehicles>

The schema. File: simple.xsd
<?xml version = "1.0" encoding = "UTF-8"?>

<xsd:schema
   xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
   >
    <xsd:element name = "vehicles">
     <xsd:complexType>
      <xsd:sequence>
       <xsd:element name = "nickname" 
                    type = "xsd:string"
                    maxOccurs = "unbounded"/>
      </xsd:sequence>
     </xsd:complexType>
    </xsd:element>
</xsd:schema>

There are some important things to explain at this point:

Namespace declaration in the XML file (an Instance Document):
The line xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance" indicates that we want to use names defined in the http://www.w3.org/2001/XMLSchema-instance namespace (here, the noNamespaceSchemaLocation attribute). This namespace is hard-wired into schema processors, so it is always picked up.
The line xsi:noNamespaceSchemaLocation = "http://lucas.ucs.ed.ac.uk/xml-schema/xmlns/simple.xsd" indicates that we are using the schema defined at the location http://lucas.ucs.ed.ac.uk/xml-schema/xmlns/simple.xsd, but we do not want to associate any namespace with the definitions. Without it, the document has no validating schema.
Schema file definitions:
The line xmlns:xsd = "http://www.w3.org/2001/XMLSchema" indicates that all XML-Schema elements are to be prefixed with xsd:, hence the opening schema element is <xsd:schema.... Again, this is a namespace that is hard-wired, and will always be picked up.
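
The prefix itself is only a label chosen in the xmlns declaration; what matters is the namespace URI it is bound to. As a sketch, here is the same simple.xsd rewritten with xs: instead of xsd: (the choice of xs: is arbitrary, made purely for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Same schema as simple.xsd, but the XML-Schema namespace is bound
     to the prefix xs: rather than xsd:. The meaning is unchanged. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="vehicles">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="nickname"
                    type="xs:string"
                    maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Note that the prefix also has to change inside attribute values such as type = "xs:string", because those values are names in the same namespace.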

How do they go together?

Ah, the meat of the document!

An XML-schema document is, itself, an XML document, so the usual rules of well-formedness apply to its element structure.

To review how the schema defines what is valid (and what is not), let's work backwards from an XML instance document:

a sample instance document: file landrover.xml
<?xml version = "1.0" encoding = "UTF-8"?>
<vehicles
  xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation = "http://lucas.ucs.ed.ac.uk/xml-schema/xmlns/landrover.xsd">

 <vehicle>
  <nickname>Count Zero</nickname>
  <model>Series I, 80"</model>
  <construction>
   <start>
    <dom>21</dom>
    <month>July</month>
    <year>1949</year>
   </start>
   <end>
    <dom>9</dom>
    <month>August</month>
    <year>1949</year>
   </end>
  </construction>
  <mods>
   <mod>Change Engine</mod>
   <mod>Change pedals</mod>
   <mod>Change gearbox</mod>
   <mod>Fit Rollcage</mod>
  </mods>
 </vehicle>
</vehicles>

This is a relatively simple document, and a map of how it goes together will be something like this:

[Image: a pictorial graph of the landrover.xml data]

In this map, a (+) in front of an element indicates that one or more instances of the element may occur. The square bracket around a group of sub-elements indicates that all the ones between the top and bottom element should also be present.

Having got a plan of what the schema should be, here it is:

The Landrover schema: file landrover.xsd
<?xml version = "1.0" encoding = "UTF-8"?>
<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema">

  <xsd:element name = "vehicles">
   <xsd:complexType>
    <xsd:sequence>
     <xsd:element ref = "vehicle" maxOccurs = "unbounded"/>
    </xsd:sequence>
   </xsd:complexType>
  </xsd:element>

  <xsd:element name = "vehicle">
   <xsd:complexType>
    <xsd:sequence>
     <xsd:element name = "nickname" type = "xsd:string" maxOccurs = "unbounded"/>
     <xsd:element name = "model" type = "xsd:string"/>
     <xsd:element name = "construction">
      <xsd:complexType>
       <xsd:sequence>
        <xsd:element ref = "start"/>
        <xsd:element ref = "end"/>
       </xsd:sequence>
      </xsd:complexType>
     </xsd:element>
     <xsd:element name = "mods">
      <xsd:complexType>
       <xsd:sequence>
        <xsd:element name = "mod" type = "xsd:string" maxOccurs = "unbounded"/>
       </xsd:sequence>
      </xsd:complexType>
     </xsd:element>
    </xsd:sequence>
   </xsd:complexType>
  </xsd:element>

  <xsd:element name = "start">
   <xsd:complexType>
    <xsd:sequence>
     <xsd:element ref = "dom"/>
     <xsd:element ref = "month"/>
     <xsd:element ref = "year"/>
    </xsd:sequence>
   </xsd:complexType>
  </xsd:element>

  <xsd:element name = "end">
   <xsd:complexType>
    <xsd:sequence>
     <xsd:element ref = "dom"/>
     <xsd:element ref = "month"/>
     <xsd:element ref = "year"/>
    </xsd:sequence>
   </xsd:complexType>
  </xsd:element>

  <xsd:element name = "dom" type = "xsd:string"/>
  <xsd:element name = "month" type = "xsd:string"/>
  <xsd:element name = "year" type = "xsd:string"/>

</xsd:schema>

So, what are the important points raised in this example?

  1. Elements must have a name and a type (or reference another element, as in point 4).
  2. Elements can contain simple, predefined data-types:
    [Image: examples of elements with predefined data types]
  3. Elements can be defined to occur more than once:
    [Image: an example of an element that occurs more than once]
  4. Elements can reference some other element definition rather than contain their own name and type:
    [Image: an example of an element that refers to another element definition]
    Note: The element referred to must be "visible" to the referring element, i.e. it cannot be "down" another branch of the XML tree. In practice this means it is declared at the top level of the schema.
  5. Elements can have complex types (defined directly within the element definition):
    [Image: an example of an element that directly defines its own complex type]

In addition to having reference elements and locally defined complexTypes, a complexType can be defined as an entity in its own right. (This becomes more important later on, when we look at making a new type based on some other pre-existing type.)

Here is the same schema, but using a global complexType:

An alternative Landrover schema: file landrover2.xsd
<?xml version = "1.0" encoding = "UTF-8"?>

<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema">

 <xsd:element name = "vehicles">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref = "vehicle" maxOccurs = "unbounded"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name = "vehicle">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element name = "nickname" type = "xsd:string" maxOccurs = "unbounded"/>
    <xsd:element name = "model" type = "xsd:string"/>
    <xsd:element name = "construction">
     <xsd:complexType>
      <xsd:sequence>
       <xsd:element ref = "start"/>
       <xsd:element ref = "end"/>
      </xsd:sequence>
     </xsd:complexType>
    </xsd:element>
    <xsd:element name = "mods">
     <xsd:complexType>
      <xsd:sequence>
       <xsd:element name = "mod" type = "xsd:string" maxOccurs = "unbounded"/>
      </xsd:sequence>
     </xsd:complexType>
    </xsd:element>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name = "start" type = "myBuildDate"/>
 <xsd:element name = "end" type = "myBuildDate"/>

 <xsd:element name = "dom" type = "xsd:string"/>
 <xsd:element name = "month" type = "xsd:string"/>
 <xsd:element name = "year" type = "xsd:string"/>

 <xsd:complexType name = "myBuildDate">
  <xsd:sequence>
   <xsd:element ref = "dom"/>
   <xsd:element ref = "month"/>
   <xsd:element ref = "year"/>
  </xsd:sequence>
 </xsd:complexType>

</xsd:schema>

We've shown that you can define your own types, as shown by the line <xsd:element name = "start" type = "myBuildDate"/>. There are, in fact, two user-definable types: ComplexType and SimpleType.

SimpleType
Simple types are elements that contain data.
They may not contain attributes or sub-elements
New simple types are defined by deriving them from existing simple types (built-ins and derived).
simpleType definitions are used when a new data type needs to be defined, where this new type is a modification of some other existing simple type.
See the fuller explanation below for further details.
ComplexType
Complex types are elements that allow sub-elements and/or attributes.
Complex types are defined by listing the elements and/or attributes nested within them.
See the fuller explanation below for further details.

simpleType

simpleType is used to create a new datatype, one which is based on an existing simple-type. For example, we could be more definitive in what we mean by dom (DayOftheMonth):

An integer-only DayOftheMonth element
     <xsd:element name="dom" type="xsd:int" />
DayOftheMonth, as an Integer derivative
     <xsd:element name="dom" type="mySimpleDayOfMonth" />

     <xsd:simpleType name="mySimpleDayOfMonth" >
      <xsd:restriction base="xsd:positiveInteger" >
       <!-- positiveInteger defines the minimum to be 1 -->
       <xsd:maxInclusive value="31" /> 
      </xsd:restriction >
     </xsd:simpleType >
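
Restriction is not limited to numeric ranges. As a further sketch, the month element could be limited to a fixed list of names using enumeration facets. The type name myMonthName is invented for this example and does not appear in the landrover schemas.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="month" type="myMonthName"/>

  <!-- myMonthName is a hypothetical type: a string that must be
       exactly one of the twelve values listed below. -->
  <xsd:simpleType name="myMonthName">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="January"/>
      <xsd:enumeration value="February"/>
      <xsd:enumeration value="March"/>
      <xsd:enumeration value="April"/>
      <xsd:enumeration value="May"/>
      <xsd:enumeration value="June"/>
      <xsd:enumeration value="July"/>
      <xsd:enumeration value="August"/>
      <xsd:enumeration value="September"/>
      <xsd:enumeration value="October"/>
      <xsd:enumeration value="November"/>
      <xsd:enumeration value="December"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>
```

Under this definition <month>July</month> is acceptable, while <month>Jul</month> or <month>july</month> would be rejected, since enumeration values are matched exactly.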

complexType

complexType is used to define a complex type. When the type is defined globally, the element requires an attribute called name, which is used to refer to the complexType definition. The element then contains the list of sub-elements.

There are three examples of complexType definition in the main example, so I won't repeat them.

This is, however, the time to mention what the content of a complexType is:
  1. There may be an annotation
  2. This must be followed by one of the following:
    1. simpleContent
    2. complexContent
    3. In sequence, the following:
      1. zero or one from the following grouping terms:
        1. group
        2. all
        3. choice
        4. sequence
      2. followed by any number of either
        1. attribute
        2. attributeGroup
      3. then zero or one anyAttribute

(see section 3.4.2 of the Structures document for the meaning of each entity, what follows is a very simple summary)
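
The ordering described above can be sketched as a single definition that uses one item from each position. The type name myVehicleType and the registration attribute are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- myVehicleType is a hypothetical type showing the order of parts -->
  <xsd:complexType name="myVehicleType">
    <!-- 1. the optional annotation comes first -->
    <xsd:annotation>
      <xsd:documentation>A vehicle and its nickname</xsd:documentation>
    </xsd:annotation>
    <!-- 2. then at most one grouping term (here, sequence) -->
    <xsd:sequence>
      <xsd:element name="nickname" type="xsd:string"/>
    </xsd:sequence>
    <!-- 3. then any number of attribute or attributeGroup items -->
    <xsd:attribute name="registration" type="xsd:string"/>
    <!-- 4. and finally at most one anyAttribute -->
    <xsd:anyAttribute processContents="lax"/>
  </xsd:complexType>
</xsd:schema>
```

Moving the attribute above the sequence would make the schema itself invalid, even though the instance documents it describes would look the same.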

The simple explanations:

Here is an example of a complexType using simpleContent. We will modify the model element to include the attribute aka:

A (fairly simple) complex element
    <xsd:element name = "model">
     <xsd:complexType>
      <xsd:simpleContent>
       <xsd:extension base = "xsd:string">
        <xsd:attribute name = "aka" use = "required" type = "xsd:string"/>
       </xsd:extension>
      </xsd:simpleContent>
     </xsd:complexType>
    </xsd:element>

the XML element for the modified model definition

    <model aka="80" >Series I, 80"</model>
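
For comparison, complexContent is the counterpart used when the base type is itself a complex type. The sketch below derives a new type from a build-date type by extension. The names myBuildDateAndPlace and factory are invented for this example, and the child elements are declared locally to keep the fragment self-contained.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- The base type: day of month, month and year, in that order -->
  <xsd:complexType name="myBuildDate">
    <xsd:sequence>
      <xsd:element name="dom" type="xsd:string"/>
      <xsd:element name="month" type="xsd:string"/>
      <xsd:element name="year" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- A hypothetical derived type: everything in myBuildDate,
       followed by one extra element. -->
  <xsd:complexType name="myBuildDateAndPlace">
    <xsd:complexContent>
      <xsd:extension base="myBuildDate">
        <xsd:sequence>
          <xsd:element name="factory" type="xsd:string"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <xsd:element name="start" type="myBuildDateAndPlace"/>

</xsd:schema>
```

A start element of this type would contain dom, month and year, followed by factory: the elements added by an extension are appended after those of the base type.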

For the keen, here is a version of the schema of the zblsa service:

A version of the zblsa schema. File zblsa.xsd (and there is a matching XML data file)

<?xml version = "1.0" encoding = "UTF-8"?>

<schema xmlns = "http://www.w3.org/2001/XMLSchema"
  targetNamespace = "http://lucas.ucs.ed.ac.uk/test/"
  xmlns:zblsa = "http://lucas.ucs.ed.ac.uk/test/"
  xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
  version = "0.4"
  elementFormDefault = "qualified">
 
 <element name = "ZBLSA">
  <annotation>
   <appinfo>
    <xsd:documentation>The root element</xsd:documentation>
   </appinfo>
  </annotation>
  <complexType>
   <sequence>
    <element ref = "zblsa:source" maxOccurs = "unbounded"/>
   </sequence>
  </complexType>
 </element>

 <element name = "search">
  <annotation>
   <appinfo>
    <xsd:documentation>The data about a search on the data
        provider's data-set</xsd:documentation>
    <xsd:documentation>The following Dublin Core elements are
        used: dc:Description; dc:Type; dc:Format;
        dc:Rights</xsd:documentation>
    <xsd:documentation>The URI attribute contains the URI
        request that was used to query the data providers
        system</xsd:documentation>
    <xsd:documentation>attempted means that the request was
        attempted; available means that some form of reply was received;
        result means that we got some results; and verified means that we are
        sure that there is something there (but no mention is made of how
        useful :)</xsd:documentation>
   </appinfo>
  </annotation>
  <complexType>
   <sequence>
    <any namespace = "http://purl.org/dc/elements/1.1/" 
          processContents = "skip" minOccurs = "0"
          maxOccurs = "unbounded"/>
    <element name = "genre" type = "string" minOccurs = "0">
     <annotation>
      <appinfo>
       <xsd:documentation>This is the genre for the
           search</xsd:documentation>
      </appinfo>
     </annotation>
    </element>

    <element name = "field" minOccurs = "0"
        maxOccurs = "unbounded">
     <annotation>
      <appinfo>
       <xsd:documentation>This is the data that was used to
           search the data providers data-set.</xsd:documentation>
       <xsd:documentation>This is usually the same as the field
           requested.</xsd:documentation>
       <xsd:documentation>The attribute "name" states
           the name of the field that was searched in the data providers
           data-set</xsd:documentation>
      </appinfo>
     </annotation>
     <complexType>
      <simpleContent>
       <extension base = "string">
        <attribute name = "name" use = "optional" type = "string"/>
       </extension>
      </simpleContent>
     </complexType>
    </element>

    <element name = "datalist" minOccurs = "0">
     <annotation>
      <appinfo>
       <xsd:documentation>If present, this indicates that there
           is some physical data.</xsd:documentation>
      </appinfo>
     </annotation>
     <complexType>
      <sequence>
       <element name = "data" maxOccurs = "unbounded">
        <annotation>
         <appinfo>
          <xsd:documentation>This is a single unit of data,
              which may appear in a number of versions</xsd:documentation>
         </appinfo>
        </annotation>
        <complexType>
         <sequence>
          <element name = "version" maxOccurs = "unbounded">
           <annotation>
            <appinfo>
             <xsd:documentation>This is a version of the
                 data</xsd:documentation>
             <xsd:documentation>The type attribute indicates the
                 mime-type of the data</xsd:documentation>
            </appinfo>
           </annotation>
           <complexType mixed = "true">
            <sequence>
             <any namespace = "http://www.w3.org/2001/XMLSchema"
                 processContents = "skip"
                 minOccurs = "0" maxOccurs = "unbounded"/>
            </sequence>
            <attribute name = "type" use = "required" type = "string"/>
           </complexType>
          </element>
         </sequence>
        </complexType>
       </element>
      </sequence>
     </complexType>
    </element>
   </sequence>
   <attribute name = "URI"       use = "required" type = "anyURI"/>
   <attribute name = "attempted" use = "required" type = "string"/>
   <attribute name = "available" use = "required" type = "string"/>
   <attribute name = "result"    use = "required" type = "string"/>
   <attribute name = "verified"  use = "required" type = "string"/>
  </complexType>
 </element>

 <element name = "source">
  <annotation>
   <appinfo>
    <xsd:documentation>information about the data
        provider</xsd:documentation>
    <xsd:documentation>The URI refers to a page of information
        about the data provider</xsd:documentation>
    <xsd:documentation>The following Dublin Core tags are used:
        dc:Title; dc:Description; dc:Rights</xsd:documentation>
   </appinfo>
  </annotation>
  <complexType>
   <sequence>
    <any namespace = "http://purl.org/dc/elements/1.1/"
        processContents = "skip" maxOccurs = "unbounded"/>
    <element ref = "zblsa:infoURL"/>
    <element ref = "zblsa:logoURL"/>
    <element ref = "zblsa:search"/>
   </sequence>
   <attribute name = "URI" use = "optional" type = "anyURI"/>
  </complexType>
 </element>

 <element name = "infoURL" type = "anyURI">
  <annotation>
   <appinfo>
    <xsd:documentation>The URI that refers to information about
        the data provider</xsd:documentation>
   </appinfo>
  </annotation>
 </element>

 <element name = "logoURL" type = "anyURI">
  <annotation>
   <appinfo>
    <xsd:documentation>A URI that refers to a logo for the data
        provider</xsd:documentation>
   </appinfo>
  </annotation>
 </element>

</schema>
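
The matching data file is not reproduced here, but as a rough sketch, an instance document shaped to fit this schema might look like the following. Every value, the example.org addresses and the relative schema location are assumptions made for illustration. Because the schema has a targetNamespace, the instance uses xsi:schemaLocation (a namespace URI followed by a location) rather than xsi:noNamespaceSchemaLocation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance document: all values are placeholders. -->
<zblsa:ZBLSA
    xmlns:zblsa="http://lucas.ucs.ed.ac.uk/test/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://lucas.ucs.ed.ac.uk/test/ zblsa.xsd">
  <zblsa:source URI="http://example.org/provider">
    <!-- at least one Dublin Core element must come first -->
    <dc:Title>Example provider</dc:Title>
    <zblsa:infoURL>http://example.org/info</zblsa:infoURL>
    <zblsa:logoURL>http://example.org/logo.png</zblsa:logoURL>
    <zblsa:search URI="http://example.org/search?q=landrover"
                  attempted="yes" available="yes"
                  result="yes" verified="no">
      <dc:Description>A search of the example data-set</dc:Description>
      <zblsa:genre>article</zblsa:genre>
      <zblsa:field name="title">landrover</zblsa:field>
    </zblsa:search>
  </zblsa:source>
</zblsa:ZBLSA>
```

Since the schema sets elementFormDefault = "qualified", even the locally declared elements such as genre and field carry the zblsa: prefix in the instance.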


Where is there further information?

Definitive references

The main list of references is at http://www.w3.org/XML/Schema. I did my work against the 20010205 version, defined at:

Material written and maintained by Eric van der Vlist

Validators

I also verified my work with the following validators:


A page from lucas.ucs.ed.ac.uk

Page title: XML Schema, a brief introduction