Corrigendum #9: Clarification About Noncharacters
Corrigendum | Effective Date | Applicable Versions | Fixed Version | Result Documented In:
Corrigendum #9: Clarification About Noncharacters | 2013-Jan-30 [134-C15] | 3.1.0 to 6.3.0 | 7.0.0 2014-June | Chapter 3, Conformance
Background
The formal wording of the definition of noncharacter
in the standard has led some implementers to interpret any presence
of a noncharacter code point in a Unicode string as causing that
string to be ill-formed, and thereby has led to inappropriate
over-rejection of some Unicode strings in APIs, components, or applications
that should handle (i.e., either process or pass through) all well-formed Unicode
strings.
Noncharacters in the Unicode Standard are intended for internal use
and have no standard interpretation when exchanged outside the context
of internal use. However, they are not illegal in interchange nor do
they cause ill-formed Unicode text. This has always been the intent
of the standard, as expressed by the Unicode Technical Committee. This is
necessary for the effective use of noncharacters, because anytime a
Unicode string crosses an API boundary, it is in effect being
"interchanged". Furthermore, for distributed software, it is
often very difficult to determine what constitutes an "internal" versus
an "external" context for any particular software process.
The real intent of noncharacters is that they are permanently
prohibited from being assigned standard, interchangeable meanings,
rather than that they are prohibited from occurring in Unicode
strings which happen to be interchanged.
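The distinction between noncharacters and ill-formed text can be illustrated with a short Python sketch. It assumes Python 3's strict built-in codecs as a stand-in for a well-formedness check; the helper names is_well_formed and passes_through are hypothetical and are not part of the standard. Noncharacters encode and decode cleanly in every encoding form, whereas an unpaired surrogate is rejected.

```python
# Illustrative sketch only: Python's strict codecs stand in for a
# well-formedness check. Helper names are hypothetical.

def is_well_formed(text: str) -> bool:
    """Return True if text can be encoded as well-formed UTF-8."""
    try:
        text.encode("utf-8")  # strict error handling is the default
    except UnicodeEncodeError:
        return False
    return True


def passes_through(text: str) -> bool:
    """Return True if text survives a round trip through each encoding form."""
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        if text.encode(encoding).decode(encoding) != text:
            return False
    return True


# A string containing noncharacters from each group named in D14.
with_noncharacters = "a\uFFFE\uFDD0\U0010FFFFb"

# An unpaired surrogate, which genuinely makes the text ill-formed.
with_lone_surrogate = "a\ud800b"

print(is_well_formed(with_noncharacters))   # noncharacters are accepted
print(passes_through(with_noncharacters))   # and are preserved unchanged
print(is_well_formed(with_lone_surrogate))  # the lone surrogate is rejected
```

A component that only passes strings through would therefore have no conformance reason to reject or replace the first string, although what the noncharacters mean to the sender remains private.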
Corrigendum #9 provides a means for implementations that openly interchange
noncharacters to claim conformance to versions of the standard in which
Definition D14 nominally prohibits such interchange. This corrigendum does not
affect the fact that when so interchanged, the intended semantics of noncharacters
may not be interpretable.
Changes to the Content of the Core Specification
Change D14 in Section 3.4, Characters and Encoding, as indicated (the phrase in square brackets is deleted):
Noncharacter: A code point that is permanently reserved for internal use
[and that should never be interchanged]. Noncharacters consist of the values
U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆)
and the values U+FDD0..U+FDEF.
Note that in Unicode 3.1.0 through Unicode 4.1.0, the definition in
question was labeled D7b, instead of D14.
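The ranges listed in the definition can be expressed as a simple predicate. The following Python sketch is one possible formulation; the names is_noncharacter and NONCHARACTERS are hypothetical and do not come from the standard. It treats a code point as a noncharacter if it lies in U+FDD0..U+FDEF or if it is one of the last two code points of any of the 17 planes.

```python
# Illustrative sketch of the ranges in the noncharacter definition.
# The names used here are hypothetical.

MAX_CODE_POINT = 0x10FFFF


def is_noncharacter(cp: int) -> bool:
    """Return True if cp is a noncharacter code point."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    # U+FDD0..U+FDEF: a contiguous block of 32 noncharacters.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF for n from 0 to 0x10: the low 16 bits are
    # FFFE or FFFF, which is what masking with 0xFFFE tests for.
    return (cp & 0xFFFE) == 0xFFFE


# Enumerate every noncharacter: 32 + 2 * 17 = 66 code points.
NONCHARACTERS = [cp for cp in range(MAX_CODE_POINT + 1) if is_noncharacter(cp)]

print(len(NONCHARACTERS))
print([f"U+{cp:04X}" for cp in NONCHARACTERS[:3]])
```

A predicate of this kind is suited to deciding whether a code point may be given a private, internal meaning; under this corrigendum it is not, by itself, grounds for rejecting a string.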
There is associated informative text in the Core Specification concerning
noncharacters. That text will also be clarified when the text of this
corrigendum is applied in a future revision of the Core Specification.