Processing XML 1.1 documents with XML Schema 1.0 processors

W3C Working Group Note 11 May 2005

This version:
http://www.w3.org/TR/2005/NOTE-xml11schema10-20050511
Latest version:
http://www.w3.org/TR/xml11schema10
Editor:
Henry S. Thompson, University of Edinburgh/W3C <ht@inf.ed.ac.uk>

This document is also available in these non-normative formats: http://www.w3.org/TR/2005/NOTE-xml11schema10-20050511/11sp.xml.


Abstract

XML Schema 1.0 did not anticipate new versions of XML, and mandated XML 1.0 documents as the starting point for schema-validity assessment. Some users and specifications would like to use XML Schema processors which process XML 1.1 documents, and some implementors of XML Schema processors would like to provide XML 1.1 support.

This Note suggests an implementation strategy for implementors to adopt to enable users and specifications to get such support in a consistent way. All aspects of XML Schema which are liable to re-interpretation as a result of changes in XML 1.1 are discussed.

An implementation of schema-validity assessment employing such a strategy is strictly speaking non-conformant to the current version of the XML Schema specification. The XML Schema WG none-the-less believes that interoperability will best be served by such non-conformant processors being made available to users, until such time as a subsequent version of XML Schema addressing this issue normatively is approved.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is a Working Group Note prepared by the W3C XML Schema Working Group, as part of the W3C XML Activity, and published on 11 May 2005. It describes methods of supporting XML 1.1 documents with schema processors designed to support XML Schema 1.0.

XML Schema 1.0 parts 1 and 2 refer normatively to XML 1.0 and make no explicit provision for support of later versions of the XML specification; this lack is sometimes advanced as a reason for W3C specifications which depend on XML Schema not to support XML 1.1. But there are strong reasons to encourage the wide adoption of XML 1.1, which is more successfully internationalized than XML 1.0. At the time this Note is published, the question of how best to support XML 1.1 in XML Schema is still open.

This Note offers strategies for supporting XML 1.1, based on the implementation experience of some members of the XML Schema Working Group. It is hoped that the techniques described here will be helpful to other implementors and to users. Equally, the Working Group hopes that this Note will elicit discussion in the larger XML community concerning the best way for the XML Schema Working Group to balance the competing demands of flexibility in references to other specifications, stability, and interoperability. This Note is published with the full consensus of the XML Schema Working Group.

Comments on this document and the issues it raises are welcome; please send them to www-xml-schema-comments@w3.org (publicly archived).

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This document may be updated, replaced or obsoleted by other documents at any time. The XML Schema Working Group does not currently expect to produce further versions or revisions of this document, but experience with the subject matter of this Note may lead to changes in the normative text of future versions of the XML Schema specification.

Table of Contents

1 Introduction
2 Survey of XML 1.1 challenges for XML Schema 1.0
3 First step towards XML 1.1: the parser
4 Recommended strategy: Move to 1.1-compatible type definitions
5 The details
6 Backward incompatibilities
7 Summary of Recommendations for Interoperability


1 Introduction

As published, the XML Schema specification references XML 1.0 and XML Namespaces 1.0 explicitly, and incorporates by reference certain key definitions, in particular those of the Char, Name, QName and S character classes. The contents of these classes have changed in XML 1.1 and XML Namespaces 1.1, so although nothing in the existing XML Schema specification specifically bars the processing of infosets produced by XML 1.1 conformant parsers, such infosets, if they exploit any of the relevant changes in XML 1.1, will not be accepted as valid by conformant XML Schema 1.0 processors.

The XML Schema WG has judged that any changes to the existing specification to support XML 1.1 go beyond what could be considered as errata, and so will have to wait for a new version of the specification. As this may take some time, this Note addresses the question of what should be done in the interim to best serve the XML community.

In the sections which follow, a non-normative strategy is set out suggesting a number of changes which processors implementing the XML Schema specification can make to enable sensible and interoperable support for XML 1.1. Any implementation of XML Schema employing such a strategy is strictly speaking non-conformant to the current version of the XML Schema specification. The XML Schema WG none-the-less believes that interoperability will best be served by the availability of such non-conformant processors until such time as a subsequent version of XML Schema addressing this issue normatively is approved.

2 Survey of XML 1.1 challenges for XML Schema 1.0

Consider the following four cases:

  1. C1 vs. C0 in content, e.g. #x83 vs. #x03

  2. Old vs. new name chars in element names, e.g. y (25th letter in English alphabet) vs. ij (25th letter in Dutch alphabet)

  3. Old vs. new name chars in ID-typed content, e.g. y vs. ij

  4. LF vs NEL in length-specified list-typed content

(ij here stands for the single character U+0133 (#x133), LATIN SMALL LIGATURE IJ, which is common in Dutch, e.g. in the word ijs == English ice-cream. It's a good example of something arbitrarily and irritatingly not allowed as a name character in XML 1.0 which is allowed as a name character in 1.1.)

In each of the above cases, the first alternative is OK and has the same behaviour with respect to Schema validation in both XML 1.0 and XML 1.1, whereas the second alternative either is not Schema-valid under the strict XML 1.0 interpretation (1-3) or might be expected to have different behaviour between XML 1.0 and XML 1.1 (4).

In other words, if you used a conformant XML Schema validator on the following four instances (Figure 1), using the same schema document (Figure 2) each time, all four would have validity problems.
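Case 1 can be illustrated with a minimal sketch of the Char productions from XML 1.0 and XML 1.1. The function names are illustrative only and are not taken from any particular processor. Note that XML 1.1 additionally requires control characters such as #x03 and #x83 to appear as character references rather than literally, which this sketch does not model.

```python
def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """XML 1.1 Char: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# The two alternatives of case 1: a C1 control and a C0 control.
for label, cp in [("C1 control #x83", 0x83), ("C0 control #x03", 0x03)]:
    print(label, "| XML 1.0 Char:", is_xml10_char(cp), "| XML 1.1 Char:", is_xml11_char(cp))
```

The C1 control is a Char under both versions, while the C0 control is a Char only under XML 1.1, which is why only the second alternative causes trouble.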

3 First step towards XML 1.1: the parser

The first obvious step for anyone considering modifying an existing XML Schema processor of any kind to allow XML 1.1 documents is to replace its front end. That front end is presumably currently an XML 1.0 parser, i.e. a parser which accepts only documents with a version='1.0' XML declaration (or none) and enforces XML 1.0 well-formedness. Its replacement is an XML 1.1 parser, i.e. one which enforces either XML 1.0 or XML 1.1 well-formedness, depending on the version stated in the XML declaration.
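The version-dependent behaviour of such a front end might be sketched as follows. This is only an illustration under simplifying assumptions: it works on an already decoded string, ignores byte order marks and encoding detection, and the names declared_version and select_rules are hypothetical rather than part of any real parser API. Rejecting versions other than 1.0 and 1.1 is a choice made for the sketch.

```python
import re

# Matches the start of an XML declaration and captures the quoted version value.
_XML_DECL = re.compile(r"""^<\?xml\s+version\s*=\s*(['"])([^'"]+)\1""")


def declared_version(document: str) -> str:
    """Return the version stated in the XML declaration, or '1.0' if there is none."""
    match = _XML_DECL.match(document)
    return match.group(2) if match else "1.0"


def select_rules(document: str) -> str:
    """Pick the well-formedness rules an XML 1.1 front end would enforce."""
    version = declared_version(document)
    if version == "1.1":
        return "XML 1.1 well-formedness"
    if version == "1.0":
        return "XML 1.0 well-formedness"
    # Assumption of this sketch: any other version is simply rejected.
    raise ValueError(f"unsupported XML version: {version}")


print(select_rules("<?xml version='1.1'?><doc/>"))
print(select_rules("<doc/>"))
```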

The resulting behaviour will be as follows:

                   XML 1.0 Declaration    XML 1.1 Declaration
                   Doc   Outcome          Doc   Outcome
XML 1.0 Content    A     OK               A     OK
                   B     OK               B     OK
                   C     OK               C     OK
                   D     OK               D     OK
XML 1.1 Content    A     X1               A     OK/**
                   B     X1               B     **
                   C     X2               C     **
                   D     X3               D     OK

Note that by "XML 1.0 Content" is meant documents exemplifying the first member of each of the four pairs of differences introduced above, and by "XML 1.1 Content" is meant documents exemplifying the second member thereof. The top two cells then require no explanation -- these are just the existing XML Schema processor, using an XML 1.1 parser front end, behaving correctly on data it already should be processing correctly.

The bottom two cells are the interesting ones. The bottom-left cell is characterised by what I'll call misaligned XML versions. Let's consider the outcomes here one at a time. Note that these cases cover not only what our putative XML Schema 1.0 processor with an XML 1.1 parser would do, but also what an unmodified 1.0/1.0 processor should do today.

A, B (misaligned versions): X1

These cases are (correctly) rejected as ill-formed by the front-end XML parser, because they break the 1.0 rules for CDATA content (A) and element names (B).

C (misaligned versions): X2

This case is (correctly) rejected as schema-invalid by the XML Schema processor -- a string with an ij in it is not an NCName per XML 1.0.

D (misaligned versions): X3

This case is (correctly) rejected as schema-invalid by the XML Schema processor -- a 'list' with only NEL separators is a single token when considered as XML 1.0 content.

Moving on to the final, lower-right, cell, this is of course where things get interesting:

A (aligned versions): OK/**

The behaviour of this case depends on an implementation choice. Some processors, which take their input only in the form of encoded character streams and always use an XML parser as a front end, depend on that front end to enforce the basic constraint that all xs:strings consist of XML 1.0 Chars. Other XML Schema processors, particularly those which also accept synthetic infosets as input, enforce that constraint explicitly. It follows that a processor of the first kind, simply by changing to use an XML 1.1 front-end, will thereby accept case A documents, but processors of the second kind will not, because they will still be explicitly checking instances of xs:string using its XML Schema 1.0 definition.

D (aligned versions): OK

This case is (correctly) accepted -- a 'list' with a NEL separator will have been normalized to have a space (#x20) separator by the XML 1.1 front-end parser, and so the XML Schema processor will find two tokens.

C (aligned versions): **

This case is (incorrectly) rejected as schema-invalid by the XML Schema processor -- because the ID type is derived from the Name type, which in turn has a pattern facet based on the XML 1.0 definition for Names, which does not allow the ij.

B (aligned versions): **

This case is actually very similar to the previous one, but with respect to a different document, that is, the schema document. That document is (incorrectly) rejected as schema-invalid by the XML Schema processor -- because the relevant element name turns up as the value of the name attribute on the xs:element element, and that attribute's type in the schema for schema documents is NCName, which is derived from the Name type, which in turn has a pattern facet based on the XML 1.0 definition for Names, which does not allow the ij.
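The two outcomes for case D (X3 with misaligned versions, OK with aligned versions) can be reproduced with a small sketch. It assumes the XML 1.1 line-end rules, under which NEL (#x85) and related sequences are translated to #xA before further processing; attribute-value normalization may then map that #xA to #x20. The function names are illustrative. The tokenizer splits only on the four XML Schema whitespace characters, because Python's argument-less str.split() would also split on NEL and hide the difference.

```python
import re

NEL = "\u0085"


def normalize_line_ends_xml11(text: str) -> str:
    """Translate XML 1.1 line-end sequences (#xD #xA, #xD #x85, #x85, #x2028, #xD) to #xA."""
    return re.sub("\r\n|\r\u0085|\u0085|\u2028|\r", "\n", text)


def schema_list_tokens(value: str) -> list:
    """Split a list-typed value on XML Schema whitespace only (#x20, #x9, #xA, #xD)."""
    return [token for token in re.split("[ \t\n\r]+", value) if token]


raw = "a" + NEL + "b"

# As XML 1.0 content, NEL is not whitespace: the 'list' is a single token.
print(len(schema_list_tokens(raw)))

# After XML 1.1 line-end normalization, the processor finds two tokens.
print(schema_list_tokens(normalize_line_ends_xml11(raw)))
```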

4 Recommended strategy: Move to 1.1-compatible type definitions

What does it mean to say the last two results are incorrect? It means that type definitions which enforce XML-1.0-appropriate constraints are being applied to self-identified XML 1.1 data.

The simplest resolution is to change the XML Schema processor itself so that the relevant built-in type definitions enforce the XML 1.1 constraints. This will make all the entries in the lower-right quadrant 'OK'.

5 The details

The XML Schema 1.0 type definitions which depend directly on XML 1.0 productions, as well as the type definitions which inherit from them, must use the XML 1.1 productions instead. The directly dependent types are xsd:Name, which depends on XML 1.0 Name; xsd:NMTOKEN, which depends on XML 1.0 Nmtoken; xsd:QName, which depends on XML 1.0 Letter, Digit, CombiningChar and Extender via XML Namespaces QName; and xsd:string, which depends on XML 1.0 Char. The inheriting types are xsd:NCName, xsd:ID, xsd:IDREF, xsd:IDREFS, xsd:ENTITY, xsd:ENTITIES, xsd:NMTOKENS, xsd:normalizedString, xsd:token and xsd:language.

This change will fix the B and C results by using the XML 1.1 definition of Name. For processors which don't depend on their XML front-end parser to check CDATA, it will also fix the incorrect result they get for the A example by using the XML 1.1 definition of Char.
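As an illustration of what such a changed type definition might look like, here is a minimal sketch of an NCName check built from the XML 1.1 NameStartChar and NameChar productions, with the colon excluded as in XML Namespaces 1.1. The names NCNAME_11 and is_ncname_11 are hypothetical; a real processor would wire the equivalent pattern into its built-in type definitions in its own way.

```python
import re

# XML 1.1 NameStartChar ranges, without ':' (NCName excludes the colon).
_NAME_START = (
    "A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)

# Additional characters allowed after the first one (XML 1.1 NameChar).
_NAME_EXTRA = "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

NCNAME_11 = re.compile(f"[{_NAME_START}][{_NAME_START}{_NAME_EXTRA}]*")


def is_ncname_11(value: str) -> bool:
    """True if value is an NCName under the XML 1.1 name productions."""
    return NCNAME_11.fullmatch(value) is not None


print(is_ncname_11("y"))          # plain ASCII name
print(is_ncname_11("\u0133s"))    # begins with U+0133, the character used in cases B and C
print(is_ncname_11("1abc"))       # a digit may not start a name
```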

6 Backward incompatibilities

The approach selected here isn't perfect. The unconditional switch to 1.1-appropriate type definitions means that version 1.0 XML documents with 1.1-only Name characters in e.g. ID-typed attributes will be valid, where an unmodified Schema 1.0 processor would find them invalid.

The immediate negative consequences of this are presumably small, since anyone already schema-validating their XML 1.0 documents will presumably have corrected any examples of this. But as and when processors implementing this Note are widespread, it may be that documents with such attribute type definitions and values will be created, identified as version 1.0 and validated by modified processors, only to be (correctly) rejected by unmodified processors. We judge the risk of this having serious negative consequences to be small enough to be discounted, but it is of course open to implementors to detect this case and issue a warning.
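One way an implementor might detect this case is sketched below. The function assess_name is hypothetical, and the two predicates passed to it are deliberately toy stand-ins, not the real XML 1.0 and XML 1.1 Name productions; a real processor would supply its own checkers.

```python
def assess_name(value, doc_version, is_name_10, is_name_11):
    """Validate against the 1.1 definition; warn when a 1.0 document relies on it.

    Returns a (valid, warning) pair, where warning is None or a message.
    """
    if not is_name_11(value):
        return False, None
    if doc_version == "1.0" and not is_name_10(value):
        return True, "value uses XML 1.1-only name characters in a version 1.0 document"
    return True, None


# Toy stand-ins for illustration only (NOT the real productions):
def toy_name_10(value):
    return value.isascii() and value.isalpha()


def toy_name_11(value):
    return value.isalpha()


print(assess_name("\u0133s", "1.0", toy_name_10, toy_name_11))
print(assess_name("\u0133s", "1.1", toy_name_10, toy_name_11))
print(assess_name("y", "1.0", toy_name_10, toy_name_11))
```

Keeping the verdict and the warning separate means the modified processor still reports the value as valid, as the strategy requires, while giving authors a chance to notice documents that unmodified processors would reject.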

The other weakness is with respect to cases where no front-end XML parser is involved, that is where schema validity assessment is carried out on what are sometimes called "synthetic infosets".

Under this proposal, enforcement of XML 1.0 conformance for element names and character content is the responsibility of the front-end parser. It follows that a synthetic infoset containing, for example, an element with an XML-1.1-only element name will never be rejected solely because of that name, even if its document information item has a [version] property with value 1.0.

Again we judge the likelihood of this causing a problem to be vanishingly small, particularly as any attempt to serialize such a synthetic infoset should raise an error.

7 Summary of Recommendations for Interoperability

To produce an XML-1.1-friendly version of an XML Schema 1.0 processor:

  1. Replace its XML 1.0 front-end parser with an XML 1.1 front-end parser;

  2. Change its implementations of the XML Schema types Name, NMTOKEN, QName and string, to use the relevant XML (Namespaces) 1.1 productions.