[Notes] Markup languages: DTD and XML Schema

Miguel Menéndez

DTD and XML Schema.

DTD

A DTD (Document Type Definition) is an SGML document that contains the syntax rules for a specific document type. It declares the elements that are allowed and their attributes, as well as rules governing how the elements nest and which values the attributes may take. By checking a document against its DTD, you can determine whether it is valid.

A DTD is not itself written in XML.

Bind a DTD to an XML document

<!DOCTYPE rootElementName [declarations]>
<!DOCTYPE rootElementName SYSTEM "fileName.dtd">
<!DOCTYPE rootElementName PUBLIC "FPI" "URI">
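
As a minimal sketch of the first form, the declarations can sit inside the document itself. The note document type and its elements are hypothetical, chosen only for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical document type "note", declared in the internal subset -->
<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Ana</to>
  <body>See you on Monday.</body>
</note>
```

The name after DOCTYPE has to match the root element of the document, here note.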

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
          "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

Structure of a DTD

The contents that can be included in an XML document are defined in the DTD using four kinds of declaration: ELEMENT (to declare elements), ATTLIST (to declare attributes), ENTITY (to declare reusable content), and NOTATION (to declare the format of non-XML data).

ELEMENT

<!ELEMENT elementName specsContent …>
  1. Element type element content: Elements that are only allowed to contain other elements:
<!ELEMENT A (B, C)>

Element A contains element B followed by element C (in that order). The comma indicates that they must necessarily appear in that order.

  2. Element type text content: Elements that can only contain textual data (character strings). The element is also allowed to be empty:
<!ELEMENT A (#PCDATA)>

Element A contains textual data.

  3. Element type mixed: Elements that can contain anything (textual data, elements, both or nothing):
<!ELEMENT A ANY>
<!-- or -->
<!ELEMENT A (#PCDATA | child1 | child2)*>

With ANY, element A can contain text and any declared element. With the second form, it can contain text mixed with any number of child1 and child2 elements, in any order, or nothing at all. The vertical bar (|) separates alternatives: each item in the content is one of the options listed on either side of the bar.

  4. Element type empty element: Elements that have no content at all. In the document, such an element is written as <tag /> (or as <tag></tag> with nothing in between):
<!ELEMENT hr EMPTY>

The hr element contains nothing.

Operators (, |) can be nested using parentheses:

<!ELEMENT personaldata (name, address, (phone | mail))>

Elements that contain textual data (#PCDATA), nothing (EMPTY), or anything (ANY) are terminal elements.

Cardinality or frequency operators

  • Operator ? indicates that the element is optional, that is, it may or may not appear, but if it does appear, it can only do so once (0 or 1 times).
  • Operator + indicates that the element must appear at least once but can do so n times. (1 or n times).
  • Operator * indicates that the element in question is also optional but if it appears it can do so n times. (0 or n times).
<!ELEMENT personaldata (name+, address?, (phone | mail)*)>

If no cardinality operator is used, the element must appear exactly once.
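
To see the operators at work, here is a sketch of a document that is valid against the personaldata declaration above. The #PCDATA declarations for the child elements are an assumption, since the notes do not give them:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE personaldata [
  <!ELEMENT personaldata (name+, address?, (phone | mail)*)>
  <!-- Assumed content models for the children -->
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT mail (#PCDATA)>
]>
<personaldata>
  <!-- name+ : one or more, here two -->
  <name>Ana</name>
  <name>Maria</name>
  <!-- address? : optional, omitted here -->
  <!-- (phone | mail)* : any number of either, in any order -->
  <phone>555-0100</phone>
  <mail>ana@example.com</mail>
  <phone>555-0101</phone>
</personaldata>
```

Removing both name elements, or adding a second address, would make the document invalid.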

ATTLIST

Elements can have zero, one, or multiple attributes. The attribute specification can only appear within the opening tag.

<!ATTLIST elementName attributeName type indicator
                      attributeName type indicator
                      … >
  1. CDATA type attribute: The value of the attribute is a character string. The characters < and & must be written as &lt; and &amp;, and the quote character that delimits the value (' or ") must be escaped if it appears inside it:
<!ATTLIST address city CDATA #IMPLIED>
It would be valid: <address city="Uviéu">...</address>.
  2. enumeration type attribute: Restricts the attribute to a limited set of possible values. Values are separated by a vertical bar (|), are not enclosed in quotes, and are case sensitive. The list is followed by a default value, as here, or by one of the indicators #REQUIRED, #IMPLIED or #FIXED:
<!ATTLIST semaphore color (red | yellow | green) "red">

The color attribute can include one and only one of these allowed values. The default value of this attribute has been set to “red”.

  3. ID type attribute: The value of the attribute must be unique throughout the document. No element can have more than one ID attribute. An ID attribute can only be #IMPLIED or #REQUIRED, never #FIXED or given a default value. The value of an ID must be a valid XML name (consisting of letters, digits, hyphens, underscores and periods, and beginning with a letter or underscore):
<!ATTLIST employee identifier ID #REQUIRED>

It would be valid: <employee identifier="AA_125">Spider Woman</employee>.

  4. IDREF type attribute: The value of the attribute is the ID value of a related element in the same XML document:
<!ATTLIST dateIncome employee IDREF #REQUIRED>

It would be valid: <dateIncome employee="AA_125">12/5/2000</dateIncome>.

  5. IDREFS type attribute: The value of the attribute is a list of the ID values of the elements we want to relate, separated by blank spaces. Only IDs present in the same document can be referenced:
<!ELEMENT team EMPTY>
<!ATTLIST team members IDREFS #REQUIRED>

It would be valid: <team members="AA_125 AB_320 AA_210 AA_012" />.

  6. NMTOKEN type attribute: The value of the attribute must be a string made up of the characters allowed in XML names (letters, digits, hyphens, underscores, periods and colons, with no blank spaces), but it does not have to start with a letter or underscore:
<!ELEMENT river (name)>
<!ATTLIST river country NMTOKEN #REQUIRED>

It would not be valid (the value contains a blank space): <river country="United States"><name>Mississippi</name></river>.

It would be valid: <river country="USA"><name>Mississippi</name></river>.

  7. NMTOKENS type attribute: The value of the attribute is a list of one or more NMTOKEN values separated by blank spaces:
<!ELEMENT category (#PCDATA)>
<!ATTLIST category type NMTOKENS #REQUIRED>

It would be valid: <category type="chief purchasing manager 1:fixed contract" />.

  8. ENTITY and ENTITIES type attributes: The value of the attribute is the name of an unparsed entity declared in the DTD, or a list of such names separated by blank spaces:
<!NOTATION gif SYSTEM "image/gif">
<!ENTITY graphicSales_1 SYSTEM "graphic_1.gif" NDATA gif>
<!ATTLIST salesResult graphic ENTITY #IMPLIED>

It would be valid: <salesResult graphic="graphicSales_1"></salesResult>.

  9. NOTATION type attribute: The value of the attribute is the name of one of the notations declared in the DTD:
<!NOTATION gif SYSTEM "image/gif">
<!NOTATION jpg SYSTEM "image/jpeg">
<!ATTLIST photo type NOTATION ( gif | jpg ) #IMPLIED>

It would be valid: <photo type="jpg"></photo>.
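
The ID, IDREF and IDREFS types are easiest to understand together. The following sketch assumes a hypothetical company root element and a second, made-up employee:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE company [
  <!-- company is a hypothetical wrapper for this illustration -->
  <!ELEMENT company (employee+, team)>
  <!ELEMENT employee (#PCDATA)>
  <!ATTLIST employee identifier ID #REQUIRED>
  <!ELEMENT team EMPTY>
  <!ATTLIST team members IDREFS #REQUIRED>
]>
<company>
  <!-- Each identifier must be unique in the document -->
  <employee identifier="AA_125">Spider Woman</employee>
  <employee identifier="AB_320">Second Employee</employee>
  <!-- Every token in members must match an existing ID -->
  <team members="AA_125 AB_320"/>
</company>
```

A validating parser would reject the document if two employees shared an identifier, or if members named an ID that no element carries.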

Occurrence of attributes (indicators)

#REQUIRED: The attribute is mandatory, but the DTD gives no default value for it. It must be present in every occurrence of the element, although its value may be the empty string:

<!ATTLIST img alt CDATA #REQUIRED>

It would be valid: <img alt=""></img>.

#IMPLIED: Defines the attribute as optional. If the element has the attribute in the XML document, the value will be the one specified. Otherwise, its value will be undefined:

<!ATTLIST img width CDATA #IMPLIED>

It would be valid: <img width="256"></img> <img width=""></img> <img></img>.

#FIXED: The attribute always has the same fixed value, whether or not it appears in the element. If the attribute does not appear, the processor assumes the value given in the declaration (immediately after #FIXED, enclosed in quotes). If the attribute appears with any other value, a validation error occurs:

<!ATTLIST html xmlns CDATA #FIXED 'http://www.w3.org/1999/xhtml'>

Default value (predefined): In this case, if we do not define a value for the attribute, the processor will take the one we have assigned as its default value:

<!ATTLIST course color CDATA "blue">

It would be valid: <course color="blue"></course> <course color="red"></course> <course></course> … In the latter case, the parser will assign the blue value for that attribute.

REMEMBER: It is mandatory to use one of the indicators #REQUIRED, #IMPLIED, #FIXED or default value.
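
The four possibilities can be compared side by side in one declaration. In this sketch the gallery element and the format and border attributes are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gallery [
  <!ELEMENT gallery (img+)>
  <!ELEMENT img EMPTY>
  <!ATTLIST img
            alt    CDATA #REQUIRED
            width  CDATA #IMPLIED
            format CDATA #FIXED "png"
            border CDATA "0">
]>
<gallery>
  <!-- width is undefined; format is taken as "png" and border as "0" -->
  <img alt="Logo"/>
  <!-- All attributes given; format may only repeat the fixed value -->
  <img alt="" width="256" format="png" border="1"/>
</gallery>
```

Omitting alt, or writing format="gif", would be a validation error.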

ENTITY

Entities allow you to define constants in XML documents.

  • All entities are declared in a DTD and are referenced from XML documents or, in the case of parameter entities, from DTDs.
  • Internal entities get their replacement text from inside the DTD: the replacement text is specified in the entity declaration itself.
  • External entities get their replacement text from an external file.

There are different types of entities, and their syntax varies with the type: predefined entities, which are referenced in XML documents; parsed general entities (internal or external), which are also referenced in XML documents; and parameter entities (internal or external), which are referenced in DTDs.

Predefined entities

&lt; (<), &gt; (>), &amp; (&), &quot; (") and &apos; ('). They are recognized in element content and attribute values, but not inside CDATA sections.

Parsed general entities

Their content is XML and must be processed by the parser. They are declared in the DTD and referenced in the XML document, where each reference is replaced by the entity's value. The value of a parsed entity is known as its replacement content and can be text or XML markup. Once the reference has been replaced, the content becomes part of the document and is parsed like the rest of it.

An internal parsed general entity is declared in the DTD:

<!ENTITY entityName "replacement content">

And it is referenced in the XML document:

<element>&entityName;</element>

Example:

<!-- In the DTD: -->
<!ENTITY location "Strasbourg Street">

<!-- In the XML: -->
<address>
  <number>172</number>
  <street>&location;</street>
  <population>Buenos Aires</population>
</address>

An external parsed general entity that is private (located on our own system) is declared in the DTD as follows:

<!ENTITY entityName SYSTEM "URI">

Example:

<!-- In the file authors.txt: -->
Juan Manuel and Jose Ramon

<!-- In the XML: -->
<?xml version="1.0" ?>
<!DOCTYPE writers [
  <!ELEMENT writers (#PCDATA)>
  <!ENTITY authors SYSTEM "authors.txt">
]>
<writers>&authors;</writers>

An external parsed general entity that is public (publicly accessible) is declared in the DTD as follows:

<!ENTITY entityName PUBLIC "formal public identifier" "URI">

Example:

<!-- In the writer.txt file: -->
Miguel de Cervantes Saavedra

<!-- In the XML: -->
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE texts [
  <!ELEMENT texts (text)+>
  <!ELEMENT text (#PCDATA)>
  <!ENTITY writer PUBLIC "-//W3C//TEXT writer//EN"
                           "http://www.cc.com/dtd/writer.txt">
]>
<texts>
  <text>Don Quixote was written by &writer;.</text>
</texts>

Parameter Entities

Parameter entities are declared in the DTD and referenced in the DTD itself. References inside another declaration, as in the examples below, are only allowed in the external DTD subset, so the DTD should be defined in an external file. The syntax is the same as for general entities, but in the declaration the character % followed by a blank space is placed in front of the entity name, and in the reference the character % (without a blank space) takes the place of the & in front of the name.

The way to declare in the DTD an internal parameter entity is:

<!ENTITY % entityName "replacement content">

Example:

<!-- It is declared in the DTD: -->
<!ENTITY % otherValues "age CDATA #IMPLIED
                         weight CDATA #IMPLIED
                         height CDATA #REQUIRED">

<!-- It is referenced, also in the DTD: -->
<!ATTLIST person name CDATA #REQUIRED %otherValues;>

The way to declare in the DTD a private external parameter entity (located in our system) is:

<!ENTITY % entityName SYSTEM "URI">

Example:

<!-- In the file endStartDate.ent: -->
<!ENTITY % daysMonth "1|2|3|4|5|10|19|20|21|22|28|29|30|31">
<!ENTITY % monthsYear "January|February|March|June|July|August|September">
<!ENTITY % nextYears "2006 | 2007 | 2008 | 2009 | 2010">

<!-- It is declared in the DTD: -->
<!ENTITY % EndStartDate SYSTEM "endStartDate.ent">

<!-- It is referenced, also in the DTD: -->
%EndStartDate;
<!ELEMENT course (description, startDate, endDate)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT startDate EMPTY>
<!ATTLIST startDate
          day (%daysMonth;) #REQUIRED
          month (%monthsYear;) #REQUIRED
          year (%nextYears;) #REQUIRED>
<!ELEMENT endDate EMPTY>
<!ATTLIST endDate
          day (%daysMonth;) #REQUIRED
          month (%monthsYear;) #REQUIRED
          year (%nextYears;) #REQUIRED>

The way to declare in the DTD a public external parameter entity (publicly accessible) is:

<!ENTITY % entityName PUBLIC "formal public identifier" "URI">

Example:

<!-- In the file xhtml-lat1.ent: -->
<!ENTITY Aacute "&#193;">
<!ENTITY nbsp "&#160;">
...

<!-- It is declared in the DTD: -->
<!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN"
                           "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">

<!-- It is referenced, also in the DTD: -->
%HTMLlat1;

Conditional Sections

Sometimes it is useful to hide parts of a DTD, or to include or exclude declarations depending on a condition. Conditional sections can only appear in external DTDs, and they are most useful when combined with parameter entity references. There are two kinds, INCLUDE and IGNORE, which respectively include or ignore the declarations they enclose:

<![INCLUDE[visible declarations]]>
<!-- Example: -->
<![INCLUDE[<!ELEMENT description (#PCDATA)>]]>

<![IGNORE[declarations to hide]]>
<!-- Example: -->
<![IGNORE[<!ELEMENT password (#PCDATA)>]]>

Example:

<!ENTITY % long "INCLUDE">
<!ENTITY % short "IGNORE">

<![%long;[<!ELEMENT book (comment+, title, body, attachments?)>]]>
<![%short;[<!ELEMENT book (title, body, attachments?)>]]>

In this case the book element requires one or more comment elements as children. To make book reject the comment element, it is enough to swap the values of the two entities, setting %short to INCLUDE and %long to IGNORE:

<!ENTITY % short "INCLUDE">
<!ENTITY % long "IGNORE">
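
One common way to flip these switches without editing the DTD file is to redeclare the parameter entities in the document's internal subset. The internal subset is read before the external one, and the first declaration of an entity is the one that counts. A sketch, assuming the DTD is stored in a hypothetical file book.dtd and that the child elements hold plain text:

```xml
<!-- book.dtd (hypothetical file name) -->
<!ENTITY % long "INCLUDE">
<!ENTITY % short "IGNORE">

<![%long;[<!ELEMENT book (comment+, title, body, attachments?)>]]>
<![%short;[<!ELEMENT book (title, body, attachments?)>]]>

<!-- Assumed content models for the children -->
<!ELEMENT comment (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT attachments (#PCDATA)>
```

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE book SYSTEM "book.dtd" [
  <!-- Read first, so these values win over the ones in book.dtd -->
  <!ENTITY % long "IGNORE">
  <!ENTITY % short "INCLUDE">
]>
<book>
  <title>Notes</title>
  <body>Text of the book.</body>
</book>
```

With the two overriding declarations removed, the same document would be invalid, because the long content model requires at least one comment.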

XML Schema (XSD)

An XSD, unlike a DTD, is itself an XML document.

In XML Schema the schema declarations are always found in an external document.

In the XML document:

<?xml version="1.0" encoding="UTF-8" ?>
<rootElement xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:noNamespaceSchemaLocation="nameSchema.xsd">

In the XML schema:

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">
  [subelements]
</xs:schema>
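
Putting both halves together, a minimal sketch of a schema and a document that points to it could look as follows. The greeting element is hypothetical; the file name nameSchema.xsd is the one used above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- nameSchema.xsd -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">
  <xs:element name="greeting" type="xs:string"/>
</xs:schema>
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<greeting xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="nameSchema.xsd">Hello</greeting>
```

The xsi:noNamespaceSchemaLocation attribute is only a hint that tells the processor where to find the schema; it fits here because the schema declares no target namespace.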

Structure of an XSD

XML Schema has two types of component definitions: Simple and Complex.

  • Simple components only contain textual data (text strings, numbers, dates, etc.).
  • Complex components contain subelements or have attributes.

Attributes are always simple components.

When elements and attributes are declared in the schema, the data type must be specified for each of them, whether it is simple or complex.

ELEMENT

<xs:element name="elementName" type="elementType" />

ATTRIBUTE

<xs:attribute name="attributeName" type="attributeType" [use=""] />

Schema Construction Models

  • Plain design
  • Layout with reusable named types (more below)
  • Nested design (of Russian dolls)

Whichever construction model is followed, the order in which elements are declared in a schema is not significant and does not affect its operation.

Plain design

  • Simple elements (terminals) and attributes are declared first.
  • The complex elements are declared next: each declaration lists the subelements that make up the complex element, referring to the ones declared earlier (with the ref attribute).
  • The last complex element to be defined is the root element.
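
The three steps above can be sketched as a small schema. The book, title, author and isbn names are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- 1. Simple elements and attributes are declared first -->
  <xs:element name="title" type="xs:string"/>
  <xs:element name="author" type="xs:string"/>
  <xs:attribute name="isbn" type="xs:string"/>

  <!-- 2 and 3. The complex root element refers to them with ref -->
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="title"/>
        <xs:element ref="author" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute ref="isbn" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Because every declaration is global, any of them can be reused by reference elsewhere in the schema.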

Cardinality of elements

Cardinality defines the number of times an element can appear.

  • minOccurs attribute: Indicates the minimum number of times an element can appear. The value 0 means that the element is optional: it may or may not appear in the document. The value 1 means that it must appear at least once. The value can be any non-negative integer, 112 for example.
  • maxOccurs attribute: Indicates the maximum number of times an element can appear. Its value can be a non-negative integer or the term unbounded, which means that there is no maximum number of occurrences.

The default value of both attributes is 1. If both attributes are omitted, the element must appear exactly once.
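
As an illustration, the following sketch combines the usual cases in one content model. The order element and its children are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <!-- Both attributes omitted: exactly once -->
        <xs:element name="customer" type="xs:string"/>
        <!-- Optional: zero or one time -->
        <xs:element name="note" type="xs:string" minOccurs="0"/>
        <!-- One or more times, with no upper limit -->
        <xs:element name="item" type="xs:string" maxOccurs="unbounded"/>
        <!-- Between zero and three times -->
        <xs:element name="coupon" type="xs:string"
                    minOccurs="0" maxOccurs="3"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

These four cases roughly correspond to the DTD operators: no operator, ?, + and a bounded range that DTDs cannot express directly.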

Occurrence of attributes

Attributes defined for XML elements can appear once or not at all, but no other number of times.

  • required: Means required, mandatory.
  • optional: Means optional.
  • prohibited: Means that the attribute must not appear.

The default value of the use attribute is optional.

<xs:attribute name="code" type="xs:string" use="required" />

Default values and fixed values in simple elements and attributes

<xs:element name="price" type="xs:decimal" default="5" />
<xs:attribute name="currency" type="xs:NMTOKEN" default="euro" />

Default values of attributes are applied when the attributes are not present, and default values of elements are applied when the elements are empty.

<xs:element name="price" type="xs:decimal" fixed="5" />
<xs:attribute name="currency" type="xs:NMTOKEN" fixed="euro" />

The price element, if it appears, must have the value 5 (an empty price element is given that value), and the currency attribute, if it appears, must contain the value “euro”.

The concepts of a fixed value and a default value are mutually exclusive, and thus it is an error for a declaration to contain both attributes fixed and default.
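
The difference between the two default rules shows up in a sketch like this one, where product is a hypothetical wrapper element around the price and currency declarations:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="product">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="price" type="xs:decimal" default="5"/>
      </xs:sequence>
      <xs:attribute name="currency" type="xs:NMTOKEN" default="euro"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- price is present but empty and currency is absent, so a validating
     processor treats price as 5 and currency as "euro" -->
<product>
  <price/>
</product>
```

Note that the element default only applies to an element that is present and empty; it does not make a missing price element appear.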

Elements

Attribute | Values    | Observations
minOccurs | 0 to n    | Default: 1
maxOccurs | 0 to n    | Default: 1
          | unbounded | Unlimited number of times
fixed     | value     | The content of the element must be the specified value
default   | value     | If the element has no content, its content is the default value. If it has content, the default value is not applied

Attributes

Attribute | Values     | Observations
use       | optional   | Default value. The attribute may or may not appear.
          | required   | The attribute must appear.
          | prohibited | The attribute must not appear.
fixed     | value      | If the attribute appears, its value must be the fixed value.
default   | value      | If the attribute does not appear, its value is the default value. If the attribute appears, its value is the one given.

Data types

  • Predefined: These are the types built into the XML Schema specification. There are 44 predefined types (string, byte, integer, decimal…). See the other cheat sheet.
  • Constructed: These are types generated by the user based on a predefined type or on a previously constructed type.

Layout with reusable named types

Simple or complex data types are defined and identified by a name (they form templates). When declaring elements and attributes, you indicate that they are of one of the previously defined named types.

Simple data types

<xs:simpleType name="typeName">
  [restrictions]
</xs:simpleType>

<!-- Example: -->
<xs:simpleType name="nameType">
  <xs:restriction base="xs:string">
    <xs:maxLength value="32" />
  </xs:restriction>
</xs:simpleType>

Another simple non-atomic type: List

List types are composed of sequences of atomic types.

New list types can be created by derivation from existing atomic types. It is not allowed to create list types from existing list types or from complex types.

<!-- Let's remember how the definition of this type was -->
<xs:simpleType name="myInteger">
  <xs:restriction base="xs:integer">
    <xs:minInclusive value="10000" />
    <xs:maxInclusive value="99999" />
  </xs:restriction>
</xs:simpleType>

<!-- Here we define the list -->
<xs:simpleType name="listOfMyIntegers">
  <xs:list itemType="myInteger" />
</xs:simpleType>

<!-- We define the element -->
<xs:element name="myIntegers" type="listOfMyIntegers" />
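
In a document, the items of a list value are separated by white space. A sketch of an instance of the myIntegers element declared above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Three items, each within the 10000 to 99999 range of myInteger -->
<myIntegers>10000 54321 99999</myIntegers>
<!-- A value such as "10000 123" would be invalid: 123 is below minInclusive -->
```

Each item is validated separately against the item type, so a single out-of-range number invalidates the whole list.
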
Another simple non-atomic type: Union

A union type is a data type created from other data types that are declared in the memberTypes attribute, usually atomic types or list types. A valid value for a union type would be one that belongs to some of the types that form the union.

<xs:simpleType name="europaCod">
  <xs:union memberTypes="countrylist listOfMyIntegers" />
</xs:simpleType>

<!-- We define the element -->
<xs:element name="codigosEur" type="europaCod" />

Complex data types

<xs:complexType name="typeName">
  [composition]
</xs:complexType>

<!-- Example: -->
<xs:complexType name="itemType">
  <xs:sequence>
    <xs:element name="title" type="nameType" />
  </xs:sequence>
</xs:complexType>

Complex data type: Compositor sequence

The elements appear one after the other in a fixed order. A sequence is declared with the <xs:sequence> compositor. Each element of the sequence can appear a variable number of times, configurable with the minOccurs and maxOccurs attributes (both are 1 by default).

<xs:element name="article" >
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="title" />
      <xs:element ref="author" minOccurs="0" maxOccurs="unbounded" />
      <xs:element ref="numberwords" />
      <xs:element ref="text" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

Complex data type: Compositor choice

The <xs:choice> compositor declares elements that are alternatives to each other: only one of them is chosen. The chosen element can appear a variable number of times, configurable with the minOccurs and maxOccurs attributes (both are 1 by default).

<xs:element name="book">
  <xs:complexType>
    <xs:sequence>
      <xs:choice>
        <xs:element name="prolog" type="xs:string" />
        <xs:element name="preface" type="xs:string" />
        <xs:element name="introduction" type="xs:string" />
      </xs:choice>
      <xs:element name="title" type="xs:string"/>
      ...
    </xs:sequence>
  </xs:complexType>
</xs:element>

Complex data type: Compositor all

A third option for constraining the elements of a group is the <xs:all> compositor. It states that every element of the group can appear once or not at all, and that the elements can appear in any order.

XML Schema requires an all group to appear as the only child at the top level of the content model. Furthermore, the children of all must be individual elements, not groups, and no element may appear more than once; that is, the allowed values of minOccurs and maxOccurs are 0 and 1 (1 by default).

<xs:element name="OrderSheet">
  <xs:complexType>
    <xs:all>
      <xs:element name="shipTo" type="USaddress" />
      <xs:element name="billTo" type="USaddress" />
      <xs:element name="comment" type="xs:string" minOccurs="0" />
      <xs:element name="elements" type="Elements" />
    </xs:all>
    <xs:attribute name="orderdate" type="xs:date" />
  </xs:complexType>
</xs:element>

Groups of elements

<!-- Definition of a group of elements -->
<xs:group name="InfoMagazine">
  <xs:sequence>
    <xs:element name="name" type="xs:string"/>
    <xs:element name="exitdate" type="xs:string"/>
  </xs:sequence>
</xs:group>
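
A named group is used by referring to it from a content model. In this sketch the magazine element and its price child are hypothetical; the group is the one defined above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:group name="InfoMagazine">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="exitdate" type="xs:string"/>
    </xs:sequence>
  </xs:group>

  <xs:element name="magazine">
    <xs:complexType>
      <xs:sequence>
        <!-- Expands to the name and exitdate elements of the group -->
        <xs:group ref="InfoMagazine"/>
        <xs:element name="price" type="xs:decimal"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

A valid magazine element therefore contains name, exitdate and price, in that order.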

Attribute Groups

<xs:attribute name="productNumber" type="xs:ID" use="required" />
<!-- We add the weightKg attribute -->
<xs:attribute name="weightKg" type="xs:decimal" />
<!-- We add the sendBy attribute, whose type is a simple type derived by restriction -->
<xs:attribute name="sendBy">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:enumeration value="air" />
      <xs:enumeration value="land" />
    </xs:restriction>
  </xs:simpleType>
</xs:attribute>
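
To be reusable, attribute declarations like these are wrapped in an xs:attributeGroup and then referenced from a complex type. A sketch, in which the group name productAttributes and the product element are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical group name -->
  <xs:attributeGroup name="productAttributes">
    <xs:attribute name="productNumber" type="xs:ID" use="required"/>
    <xs:attribute name="weightKg" type="xs:decimal"/>
  </xs:attributeGroup>

  <xs:element name="product">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
      </xs:sequence>
      <!-- Attribute declarations come after the content model -->
      <xs:attributeGroup ref="productAttributes"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Any other complex type can reference the same group and receive the same attributes.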
