UAX #31: Identifier and Pattern Syntax

Unicode Standard Annex #31

Identifier and Pattern Syntax

Version 4.1.0
Authors Mark Davis (mark.davis@us.ibm.com)
Date 2005-03-25
This Version http://www.unicode.org/reports/tr31/tr31-5.html
Previous Version http://www.unicode.org/reports/tr31/tr31-4.html
Latest Version http://www.unicode.org/reports/tr31/
Revision 5


Summary

This document describes specifications for recommended defaults for the use of Unicode in the definitions of identifiers and in pattern-based syntax. It incorporates the Identifier section of Unicode 4.0 (somewhat reorganized) and a new section on the use of Unicode in patterns. As a part of the latter, it presents recommended new properties for addition to the Unicode Character Database. It also incorporates guidelines for use of normalization with identifiers (from UAX #15).

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version number of the Unicode Standard at the last point that the UAX document was updated.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].


1 Introduction

A common task facing an implementer of the Unicode Standard is the provision of a parsing and/or lexing engine for identifiers. To assist in the standard treatment of identifiers in Unicode character-based parsers, a set of specifications is provided here as a recommended default for the definition of identifier syntax. These guidelines are no more complex than current rules in the common programming languages, except that they include more characters of different types. This document also provides guidelines for the use of normalization and case-insensitivity with identifiers, expanding on a section that was originally in UAX #15: Unicode Normalization Forms [UAX15].

These specifications provide a definition of identifiers that is guaranteed to be backward compatible with each successive release of Unicode, but also makes available any appropriate new Unicode characters. Unicode properties are also provided for stable pattern syntax: syntax that is stable over future versions of the Unicode Standard. These can either be used alone or with the identifier characters.

The following types of code points are defined (the sizes of the boxes are not to scale):

Character Classes for Programming

  • Identifier Start Characters
  • Identifier Only-Continue Characters
  • Pattern Syntax Characters
  • Pattern Whitespace Characters
  • Unassigned Code Points
  • Other Assigned Code Points

The set consisting of both Identifier Start and Only-Continue characters is known as Identifier Characters, and also as Identifier Continue characters.

There are certain features that developers can depend on for stability:

In successive versions of Unicode, only the following changes are allowed, from one of the above classes to another:

Permitted Changes in Future Versions
From \ To                 Identifier Start  Identifier Only Continue  Other Assigned
Unassigned                +                 +                         +
Other Assigned            +                 +
Identifier Only Continue  +
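
The table can be read as a set of permitted (from, to) transitions. The following is a minimal sketch of that reading; the dictionary layout and the helper name is_permitted_change are illustrative assumptions and are not defined by this document.

```python
# Permitted class changes between successive Unicode versions, transcribed
# from the table "Permitted Changes in Future Versions".
# Keys are the current class; values are the classes a code point may move to.
PERMITTED_CHANGES = {
    "Unassigned": {"Identifier Start", "Identifier Only Continue", "Other Assigned"},
    "Other Assigned": {"Identifier Start", "Identifier Only Continue"},
    "Identifier Only Continue": {"Identifier Start"},
}


def is_permitted_change(old_class: str, new_class: str) -> bool:
    """Return True if a code point may stay in, or move between, these classes."""
    if old_class == new_class:
        return True  # staying in the same class is always allowed
    # Classes with no row in the table (for example Identifier Start) never change.
    return new_class in PERMITTED_CHANGES.get(old_class, set())
```

Under this reading, a code point that is once an Identifier Start character stays one, which is what makes existing identifiers remain valid.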

The Unicode Consortium has formally adopted a stability policy on identifiers. For more information, see [Stability].

Each programming language standard has its own identifier syntax; different programming languages have different conventions for the use of certain characters such as $, @, #, or _ in identifiers. To extend such a syntax to cover the full behavior of a Unicode implementation, implementers may combine those specific rules with the syntax and properties provided here.

That is, each programming language can define its identifier syntax as relative to the Unicode identifier syntax, such as saying that identifiers are defined by the Unicode properties, with the addition of "$". By addition or subtraction of a small set of language-specific characters, a programming language standard can easily track a growing repertoire of Unicode characters in a compatible way.

Similarly, each programming language can define white space characters or syntax characters relative to the Unicode pattern white space or syntax characters, with some specified set of additions or subtractions.

Systems that want to extend identifiers so as to encompass words used in natural languages may add characters identified in Section 4 Word Boundaries of [UAX29] with the property values Katakana, ALetter, and MidLetter, plus characters described in the notes at the end of that section.

Note that to preserve the disjoint nature of categories illustrated in the diagram "Character Classes for Programming", any character added to one of the categories must be subtracted from the others.

In some cases there are security implications that may require additional constraints on identifiers. For more information, see [UTR36].

1.1 Conformance

The following describes the possible ways that an implementation can claim conformance to this technical standard.

C1. An implementation claiming conformance to this specification at any Level shall identify the version of this specification and the version of the Unicode Standard.
 
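C2. An implementation claiming conformance to Level 1 of this specification shall describe which of the following it observes:

  • R1 Default Identifiers
  • R2 Alternative Identifiers
  • R3 Pattern Whitespace and Syntax Characters
  • R4 Normalized Identifiers
  • R5 Case-Insensitive Identifiers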

2 Default Identifier Syntax

The formal syntax provided here captures the general intent that an identifier consists of a string of characters beginning with a letter or an ideograph, and following with any number of letters, ideographs, digits, or underscores. It provides a definition of identifiers that is guaranteed to be backward compatible with each successive release of Unicode, but also adds any appropriate new Unicode characters.

D1. Default Identifier Syntax

<identifier> := <ID_Start> <ID_Continue>*

Identifiers are defined by the following sets of character categories from the Unicode Character Database.

Syntactic Classes for Identifiers
Properties   Alternates    General Description of Coverage
ID_Start     XID_Start     Uppercase letters, lowercase letters, titlecase letters, modifier letters, other letters, letter numbers, stability extensions
ID_Continue  XID_Continue  All of the above, plus nonspacing marks, spacing combining marks, decimal numbers, connector punctuations, stability extensions

ID_Continue characters are also known simply as Identifier Characters, since they are a superset of the ID_Start characters. The set of ID_Continue characters minus the ID_Start characters is known as ID_Only_Continue characters.
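
As a rough illustration of D1, the sketch below approximates ID_Start and ID_Continue using only the General Category values listed in the table. It is an approximation under stated assumptions: it ignores the stability extensions, it uses whatever Unicode version the Python runtime ships, and the function names are hypothetical.

```python
import unicodedata

# General Category values named in the coverage descriptions:
# Lu, Ll, Lt, Lm, Lo (letters) and Nl (letter numbers) for ID_Start;
# additionally Mn, Mc (marks), Nd (decimal numbers), Pc (connector punctuation).
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch: str) -> bool:
    """Approximate ID_Start by General Category (no stability extensions)."""
    return unicodedata.category(ch) in START_CATEGORIES


def is_id_continue(ch: str) -> bool:
    """Approximate ID_Continue by General Category (no stability extensions)."""
    return unicodedata.category(ch) in CONTINUE_CATEGORIES


def is_identifier(s: str) -> bool:
    """D1: <identifier> := <ID_Start> <ID_Continue>*"""
    return bool(s) and is_id_start(s[0]) and all(is_id_continue(c) for c in s[1:])
```

With this approximation, "a_b" is accepted because U+005F LOW LINE is connector punctuation, while "_a" is rejected because connector punctuation is not in the start set.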

The innovations in the identifier syntax to cover the Unicode Standard include the following:

2.1 Combining Marks

Combining marks are accounted for in identifier syntax: a composed character sequence consisting of a base character followed by any number of combining marks is valid in an identifier. Combining marks are required in the representation of many languages, and the conformance rules in Chapter 3, Conformance of [Unicode] require the interpretation of canonical-equivalent character sequences.

Enclosing combining marks (such as U+20DD..U+20E0) are excluded from the syntactic definition of ID_Continue, because the composite characters that result from their composition with letters are themselves not normally considered valid constituents of these identifiers.

2.2 Layout and Format Control Characters

Certain Unicode characters are used to control joining behavior, bidirectional ordering control, and alternative formats for display. These have the General Category value of Cf. Unlike space characters or other delimiters, they do not indicate word, line, or other unit boundaries.

While it is possible to ignore these characters in determining identifiers, the recommendation is to not ignore them, and not permit them in identifiers except in special cases. This is because of the possibility for confusion between two visually identical strings: see [UTR36]. Some possible exceptions are the ZWJ and ZWNJ in certain contexts, such as between certain characters in Indic words.
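
One way to act on this recommendation is to report, rather than silently drop, any character with General Category Cf found in a candidate identifier. The following is a minimal sketch; the helper name format_controls is an assumption, and any exceptions for ZWJ and ZWNJ in specific contexts would need additional logic that is not shown.

```python
import unicodedata


def format_controls(s: str) -> list[tuple[int, str]]:
    """Return (index, character name) for each General Category Cf character in s."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(s)
        if unicodedata.category(ch) == "Cf"
    ]
```

A parser could reject the identifier when the returned list is non-empty and include the reported positions in its diagnostic.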

2.3 Specific Character Adjustments

Specific identifier syntaxes can be treated as tailorings of the generic syntax based on character properties. For example, SQL identifiers allow an underscore as an identifier part, but not as an identifier start; C identifiers allow an underscore as either an identifier part or an identifier start. Specific languages may also want to exclude the characters that have a decomposition_type other than canonical or none, or to exclude some subset of those, such as those with a decomposition_type equal to font.
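
Such tailorings can be expressed as small additions to, or removals from, the generic classes. The sketch below is illustrative only: the factory make_identifier_test and its parameters are hypothetical, the generic classes are approximated by General Category, and the "SQL-like" and "C-like" variants model just the underscore rule described above, not the full grammars of those languages.

```python
import unicodedata

START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def make_identifier_test(extra_start="", extra_continue="", removed=""):
    """Build an identifier test from the generic classes plus a tailoring.

    extra_start:    characters added as identifier start (and continue)
    extra_continue: characters added as identifier continue only
    removed:        characters excluded from identifiers altogether
    """
    start_extra = frozenset(extra_start)
    continue_extra = frozenset(extra_continue) | start_extra
    excluded = frozenset(removed)

    def is_start(ch):
        return ch not in excluded and (
            unicodedata.category(ch) in START_CATEGORIES or ch in start_extra
        )

    def is_continue(ch):
        return ch not in excluded and (
            unicodedata.category(ch) in CONTINUE_CATEGORIES or ch in continue_extra
        )

    def is_identifier(s):
        return bool(s) and is_start(s[0]) and all(is_continue(c) for c in s[1:])

    return is_identifier


# Underscore as a part only: already the case in the generic classes.
sql_like = make_identifier_test()
# Underscore also permitted as the first character.
c_like = make_identifier_test(extra_start="_")
# The generic syntax "with the addition of $".
with_dollar = make_identifier_test(extra_start="$")
```

Because each variant is stated as a difference from the generic classes, it picks up newly added letters automatically when the underlying character data is updated.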

For programming language identifiers, normalization and case have a number of important implications. For a discussion of these issues, see Normalization and Case.

2.4 Backward Compatibility

Unicode General Category values are kept as stable as possible, but they can change across versions of the Unicode Standard. The bulk of the characters having a given value are determined by other properties, and the coverage expands in the future according to the assignment of those properties. In addition, the Other_ID_Start property adds a small list of characters that qualified as ID_Start characters in some previous version of Unicode solely on the basis of their General Category properties, but that no longer qualify in the current version. In Unicode 4.1.0, this list consists of four characters:

U+2118 SCRIPT CAPITAL P
U+212E ESTIMATED SYMBOL
U+309B KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

Similarly, the Other_ID_Continue property adds a small list of characters that qualified as ID_Continue characters in some previous version of Unicode solely on the basis of their General Category properties, but that no longer qualify in the current version. In Unicode 4.1.0, this list consists of nine characters:

U+1369 ETHIOPIC DIGIT ONE
...
U+1371 ETHIOPIC DIGIT NINE

The Other_ID_Start and Other_ID_Continue properties are thus designed to ensure that the Unicode identifier specification is backward compatible: Any sequence of characters that qualified as an identifier in some version of Unicode will continue to qualify as an identifier in future versions.
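
The effect of the two properties can be sketched by adding the listed characters to classes derived from General Category. The sets below contain only the Unicode 4.1.0 characters listed in this section; later versions of the UCD may list more, and the function names are assumptions made for illustration.

```python
import unicodedata

# Stability extensions as listed in this section for Unicode 4.1.0.
OTHER_ID_START = frozenset("\u2118\u212E\u309B\u309C")
OTHER_ID_CONTINUE = frozenset(chr(cp) for cp in range(0x1369, 0x1371 + 1))

START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch: str) -> bool:
    """General Category test plus the Other_ID_Start extension."""
    return unicodedata.category(ch) in START_CATEGORIES or ch in OTHER_ID_START


def is_id_continue(ch: str) -> bool:
    """ID_Start, the continue categories, or the Other_ID_Continue extension."""
    return (
        is_id_start(ch)
        or unicodedata.category(ch) in CONTINUE_CATEGORIES
        or ch in OTHER_ID_CONTINUE
    )
```

For example, U+1369 ETHIOPIC DIGIT ONE is accepted by is_id_continue only through the extension set, since its General Category alone no longer qualifies it.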

R1 Default Identifiers
  To meet this requirement, an implementation shall use definition D1 and the properties ID_Start and ID_Continue (or XID_Start and XID_Continue) to determine whether a string is an identifier or not.

Or, it shall declare that it uses a modification, and provide a precise list of characters that are added to or removed from the above properties, and/or provide a list of additional constraints on identifiers.
 

3 Alternative Identifier Syntax

The disadvantage of working with the syntactic classes defined above is the storage space needed for the detailed definitions, plus the fact that with each new version of the Unicode Standard new characters are added, which an existing parser would not be able to recognize. In other words, the recommendations based on that table are not upwardly compatible.

This problem can be addressed by turning the question around. Instead of defining the set of code points that are allowed, define a small, fixed set of code points that are reserved for syntactic use and allow everything else (including unassigned code points) as part of an identifier. All parsers written to this specification would behave the same way for all versions of the Unicode Standard, because the classification of code points is fixed forever.

The drawback of this method is that it allows “nonsense” to be part of identifiers because the concerns of lexical classification and of human intelligibility are separated. Human intelligibility can, however, be addressed by other means, such as usage guidelines that encourage a restriction to meaningful terms for identifiers. For an example of such guidelines, see the XML 1.1 specification by the W3C [XML1.1].

By increasing the set of disallowed characters, a reasonably intuitive recommendation for identifiers can be achieved. This approach uses the full specification of identifier classes, as of a particular version of the Unicode Standard, and permanently disallows any characters not recommended in that version for inclusion in identifiers. All code points unassigned as of that version would be allowed in identifiers, so that any future additions to the standard would already be accounted for. This approach ensures both upwardly compatible identifier stability and a reasonable division of characters into those that do and do not make human sense as part of identifiers.

Some additional extensions to the list of disallowed code points can be made to further constrain “unnatural” identifiers. For example, one could include unassigned code points in blocks of characters set aside for future encoding as symbols, such as mathematical operators.

With or without such fine-tuning, such a compromise approach still incurs the expense of implementing large lists of code points. While they no longer change over time, it is a matter of choice whether the benefit of enforcing somewhat word-like identifiers justifies their cost.

Alternatively, one can use the properties described below, and allow all sequences of characters to be identifiers that are neither pattern syntax nor pattern whitespace. This has the advantage of simplicity and small tables, but allows many more “unnatural” identifiers.

R2 Alternative Identifiers
  To meet this requirement, an implementation shall define identifiers to be any string of characters that contains neither Pattern_White_Space nor Pattern_Syntax characters.

Or, it shall declare that it uses a modification, and provide a precise list of characters that are added to or removed from the sets of code points defined by these properties.
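
A minimal sketch of R2 follows. It makes several assumptions that should be checked against the UCD: the Pattern_White_Space set is written out by hand, only the ASCII portion of Pattern_Syntax is included (the real property covers many more code points), the empty string is rejected as a design choice, and the names are hypothetical.

```python
# Pattern_White_Space, written out by hand (verify against PropList.txt).
PATTERN_WHITE_SPACE = frozenset(
    [chr(cp) for cp in range(0x0009, 0x000D + 1)]
    + ["\u0020", "\u0085", "\u200E", "\u200F", "\u2028", "\u2029"]
)

# ASSUMPTION: ASCII-only subset of Pattern_Syntax, namely every printable
# ASCII character that is neither alphanumeric nor the underscore.
ASCII_PATTERN_SYNTAX = frozenset(
    ch for ch in map(chr, range(0x21, 0x7F)) if not (ch.isalnum() or ch == "_")
)


def is_alternative_identifier(s: str) -> bool:
    """R2 sketch: no Pattern_White_Space and no (ASCII) Pattern_Syntax characters."""
    return bool(s) and not any(
        ch in PATTERN_WHITE_SPACE or ch in ASCII_PATTERN_SYNTAX for ch in s
    )
```

Note that this accepts strings such as "1x" and strings containing unassigned code points, which is the trade-off described above: small, fixed tables in exchange for less word-like identifiers.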
 

4 Pattern Syntax

There are many circumstances where software interprets patterns that are a mixture of literal characters, whitespace, and syntax characters. Examples include regular expressions, Java collation rules, Excel or ICU number formats, and many others. These patterns have been very limited in the past, and forced to use clumsy combinations of ASCII characters for their syntax. As Unicode becomes ubiquitous, some of these will start to use non-ASCII characters for their syntax: first as more readable optional alternatives, then eventually as the standard syntax.

For forward and backward compatibility, it is advantageous to have a fixed set of whitespace and syntax code points for use in patterns. This follows the recommendations that the Unicode Consortium made regarding completely stable identifiers, and the practice that is seen in XML 1.1 [XML1.1]. (In particular, the Unicode Consortium committed to not allocating characters suitable for identifiers in the range U+2190..U+2BFF, which is being used by XML 1.1.)

With a fixed set of whitespace and syntax code points, a pattern language can then have a policy requiring all possible syntax characters (even ones currently unused) to be quoted if they are literals. By using this policy, it preserves the freedom to extend the syntax in the future by using those characters. Past patterns on future systems will always work; future patterns on past systems will signal an error instead of silently producing the wrong results.

Example:

In version 1.3 of program X, '≈' is a reserved syntax character; that is, it does not perform an operation, but you have to quote it if it is meant literally. In version 1.4, '≈' gets a real meaning, for example, "uppercase the subsequent characters". In program X, '\' quotes the next character; that is, it causes the character to be treated as a literal instead of a syntax character.
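
The behavior of the hypothetical program X can be sketched as a small tokenizer. Everything here is an assumption made for illustration: the reserved set, the operator sets for versions 1.3 and 1.4, and the token format are invented, since program X is itself only an example.

```python
class PatternError(ValueError):
    """Raised for a reserved syntax character that is neither quoted nor defined."""


# ASSUMPTION: a tiny reserved set standing in for all syntax characters.
RESERVED = frozenset("\\*?[]≈")

# Operators that actually have a meaning in each hypothetical version.
V1_3_OPERATORS = frozenset("*?")
V1_4_OPERATORS = V1_3_OPERATORS | {"≈"}


def parse(pattern: str, operators: frozenset) -> list[tuple[str, str]]:
    """Split a pattern into ("literal", ch) and ("operator", ch) tokens."""
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # Backslash quotes the next character, whatever it is.
            if i + 1 >= len(pattern):
                raise PatternError("dangling backslash at end of pattern")
            tokens.append(("literal", pattern[i + 1]))
            i += 2
            continue
        if ch in operators:
            tokens.append(("operator", ch))
        elif ch in RESERVED:
            # Reserved but not yet meaningful: signal an error instead of guessing.
            raise PatternError(f"unquoted reserved character {ch!r} at index {i}")
        else:
            tokens.append(("literal", ch))
        i += 1
    return tokens
```

In this sketch the quoted pattern "a\≈b" parses identically with either operator set, while the unquoted "a≈b" is an operator use under the 1.4 set and an error under the 1.3 set.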

As of Unicode 4.1.0, there are two Unicode character properties that can be used for stable syntax: Pattern_White_Space and Pattern_Syntax. Particular pattern languages may, of course, override these recommendations (for example, adding or removing other characters for compatibility in ASCII).

For stability, the property values are absolutely invariant: they do not change with successive versions of Unicode. Of course, this does not limit the ability of the Unicode Standard to add more symbol or whitespace characters, but the syntax and whitespace characters recommended for use in patterns will not change.

When generating rules or patterns, all whitespace and syntax code points that are to be literals require quoting, using whatever quoting mechanism is available. For readability, it is recommended practice to quote or escape all literal whitespace and default ignorable code points as well.
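
A generator following this practice might look like the sketch below. It assumes a hypothetical pattern language in which a backslash quotes the next character and \uXXXX (or \UXXXXXXXX) denotes a literal code point. The syntax set is the ASCII subset only, and General Category Cf is used as a rough stand-in for the default ignorable code points, which are defined by a separate property.

```python
import unicodedata

# Pattern_White_Space, written out by hand (verify against PropList.txt).
PATTERN_WHITE_SPACE = frozenset(
    [chr(cp) for cp in range(0x0009, 0x000D + 1)]
    + ["\u0020", "\u0085", "\u200E", "\u200F", "\u2028", "\u2029"]
)

# ASSUMPTION: ASCII-only subset of Pattern_Syntax.
ASCII_PATTERN_SYNTAX = frozenset(
    ch for ch in map(chr, range(0x21, 0x7F)) if not (ch.isalnum() or ch == "_")
)


def hex_escape(ch: str) -> str:
    """Render a character as \\uXXXX, or \\UXXXXXXXX outside the BMP."""
    cp = ord(ch)
    return f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"


def quote_literal(text: str) -> str:
    """Quote text so that every character in it is read back as a literal."""
    out = []
    for ch in text:
        if ch in PATTERN_WHITE_SPACE or unicodedata.category(ch) == "Cf":
            out.append(hex_escape(ch))  # invisible characters: make them visible
        elif ch in ASCII_PATTERN_SYNTAX:
            out.append("\\" + ch)  # visible syntax characters: backslash-quote
        else:
            out.append(ch)
    return "".join(out)
```

Escaping invisible characters by code point rather than quoting them in place keeps the generated pattern readable, which is the motivation given above.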

Example: consider the following, where the items in angle brackets indicate literal characters.

Since <SPACE> is a Pattern_White_Space character, it requires quoting. Because <ZERO WIDTH SPACE> is a default ignorable character, it should also be quoted for readability. So if in this example \uXXXX is used for hex expression, but resolved before quoting, and single quotes are used for quoting, this might be expressed as:

R3 Pattern Whitespace and Syntax Characters
  To meet this requirement, an implementation shall use Pattern_White_Space characters as all and only those characters interpreted as whitespace in parsing, and shall use Pattern_Syntax characters as all and only those characters with syntactic use.

Or, it shall declare that it uses a modification, and provide a precise list of characters that are added to or removed from the sets of code points defined by these properties.

  • Note: all characters other than those defined by these properties would be available as identifiers or literals.

5 Normalization and Case

R4 Normalized Identifiers
  To meet this requirement, an implementation shall specify the normalization form, and shall provide a precise list of any characters that are excluded from normalization, and if the normalization form is NFKC, shall apply the modifications in NFKC Modifications given by the properties XID_Start and XID_Continue. Except for identifiers containing excluded characters, any two identifiers that have the same normalization form shall be treated as equivalent by the implementation.
R5 Case-Insensitive Identifiers
  To meet this requirement, an implementation shall specify either simple or full case folding, and adhere to the Unicode specification for that folding. Any two identifiers that have the same case-folded form shall be treated as equivalent by the implementation.

This section discusses issues that must be taken into account when considering normalization and case folding of identifiers in programming languages or scripting languages. Using normalization avoids many problems where apparently identical identifiers are not treated equivalently. Such problems can appear both during compilation and during linking, in particular across different programming languages. To avoid such problems, programming languages can normalize identifiers before storing or comparing them. Generally if the programming language has case-sensitive identifiers then Normalization Form C is appropriate, while if the programming language has case-insensitive identifiers then Normalization Form KC is more appropriate.
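
The following minimal sketch shows why normalizing before comparison matters, using Python's standard unicodedata module. The helper name normalized_identifier is an assumption; the choice of form is left to the caller, following the guidance above.

```python
import unicodedata


def normalized_identifier(name: str, form: str = "NFC") -> str:
    """Return the identifier in the given normalization form (for example NFC or NFKC)."""
    return unicodedata.normalize(form, name)


# Two spellings of the same visible identifier: precomposed and decomposed.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# Without normalization the strings differ; with NFC they compare equal.
differ_raw = precomposed != decomposed
equal_nfc = normalized_identifier(precomposed) == normalized_identifier(decomposed)

# NFKC additionally folds compatibility characters such as U+FB01 LATIN SMALL LIGATURE FI.
folded = normalized_identifier("\ufb01le", "NFKC")
```

A compiler or linker could store the normalized form as the symbol-table key so that both spellings resolve to the same symbol.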

Note: In mathematically-oriented programming languages which make distinctive use of the Mathematical Alphanumeric Symbols such as U+1D400 MATHEMATICAL BOLD CAPITAL A, an application of NFKC must filter characters to exclude characters with the property value decomposition_type=font. For related information, see UTR #30: Character Foldings.
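
One possible way to apply such a filter is sketched below: characters whose decomposition is tagged <font> are left untouched, and NFKC is applied to the runs between them. This is an assumption-laden illustration, not a prescribed algorithm; in particular, normalizing runs separately does not guarantee that the result as a whole is in a normalization form, and the function names are hypothetical.

```python
import unicodedata
from itertools import groupby


def has_font_decomposition(ch: str) -> bool:
    """True if the character's decomposition mapping is tagged <font>."""
    return unicodedata.decomposition(ch).startswith("<font>")


def nfkc_keeping_font_variants(s: str) -> str:
    """Apply NFKC to s, but keep characters with decomposition_type=font as they are."""
    parts = []
    for is_font, run in groupby(s, key=has_font_decomposition):
        text = "".join(run)
        parts.append(text if is_font else unicodedata.normalize("NFKC", text))
    return "".join(parts)
```

With plain NFKC, U+1D400 MATHEMATICAL BOLD CAPITAL A becomes "A"; with the filter it is preserved, while other compatibility characters such as U+FF21 FULLWIDTH LATIN CAPITAL LETTER A are still folded.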

If programming languages are using NFKC to fold differences between characters, then they should use the following modification of the identifier syntax from the Unicode Standard to deal with the idiosyncrasies of a small number of characters. These characters fall into three classes:

NFKC Modifications

  1. Middle Dot. Because most Catalan legacy data will be encoded in Latin-1, U+00B7 MIDDLE DOT needs to be allowed in ID_Continue. (If the programming language is using a dot as an operator, then U+2219 BULLET OPERATOR or U+22C5 DOT OPERATOR should be used instead. However, care should be taken when dealing with U+00B7 MIDDLE DOT, as many processes will assume its use as punctuation, rather than as a letter extender.)
  2. Combining-like characters. Certain characters are not formally combining characters, although they behave in most respects as if they were. Ideally, they should not be in ID_Start, but rather in ID_Continue, along with combining characters. In most cases, the mismatch does not cause a problem, but when these characters have compatibility decompositions, they can cause identifiers not to be closed under Normalization Form KC. In particular, the following four characters are to be in ID_Continue and not ID_Start:
    • 0E33 THAI CHARACTER SARA AM
    • 0EB3 LAO VOWEL SIGN AM
    • FF9E HALFWIDTH KATAKANA VOICED SOUND MARK
    • FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
  3. Irregularly decomposing characters. U+037A GREEK YPOGEGRAMMENI and certain Arabic presentation forms have irregular compatibility decompositions, and must be excluded from both ID_Start and ID_Continue. It is recommended that all Arabic presentation forms be excluded from identifiers in any event, although only a few of them must be excluded for normalization to guarantee identifier closure.
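
The need for these modifications can be observed directly. The sketch below approximates ID_Start by General Category (an assumption; it ignores stability extensions) and checks whether a character that qualifies as a start character still begins with a start character after NFKC. The helper name nfkc_breaks_start is hypothetical.

```python
import unicodedata

START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def is_start_by_category(ch: str) -> bool:
    return unicodedata.category(ch) in START_CATEGORIES


def nfkc_breaks_start(ch: str) -> bool:
    """True if ch is a start character by category but its NFKC form does not begin with one."""
    normalized = unicodedata.normalize("NFKC", ch)
    return is_start_by_category(ch) and not is_start_by_category(normalized[0])


# The four combining-like characters from class 2.
COMBINING_LIKE = "\u0E33\u0EB3\uFF9E\uFF9F"
broken = [f"U+{ord(c):04X}" for c in COMBINING_LIKE if nfkc_breaks_start(c)]

# Class 3: U+037A decomposes to a space followed by U+0345.
ypogegrammeni_nfkc = unicodedata.normalize("NFKC", "\u037A")
```

For instance, U+0E33 THAI CHARACTER SARA AM normalizes under NFKC to U+0E4D followed by U+0E32, and U+0E4D is a nonspacing mark, so an identifier starting with U+0E33 would not remain an identifier without the modification.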

With these amendments to the identifier syntax, all identifiers are closed under all four Normalization forms. This means that for any string S,

isIdentifier(S) implies

isIdentifier(toNFD(S))
isIdentifier(toNFC(S))
isIdentifier(toNFKD(S))
isIdentifier(toNFKC(S))

Identifiers are also closed under case operations (with one exception), so that for any string S,

 isIdentifier(S) implies

isIdentifier(toLowercase(S))
isIdentifier(toUppercase(S))
isIdentifier(toFoldedcase(S))

The one exception is U+0345 COMBINING GREEK YPOGEGRAMMENI. In the very unusual case that U+0345 is at the start of S, U+0345 is not in ID_Start, but its uppercase and case-folded versions are. In practice this is not a problem, because of the way normalization is used with identifiers.
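
The exception can be observed with Python's built-in case operations, as in the sketch below. ID_Start is approximated by General Category here, which is an assumption made for illustration.

```python
import unicodedata

START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def is_start_by_category(ch: str) -> bool:
    return unicodedata.category(ch) in START_CATEGORIES


ypogegrammeni = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI, a nonspacing mark

# The mark itself is not a start character...
mark_is_start = is_start_by_category(ypogegrammeni)
# ...but its uppercase (U+0399) and case-folded (U+03B9) forms are letters.
upper_is_start = is_start_by_category(ypogegrammeni.upper())
folded_is_start = is_start_by_category(ypogegrammeni.casefold())
```

So the case mappings turn a non-identifier into an identifier, the reverse direction of the closure property stated above.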

Note: Those programming languages with case-insensitive identifiers should use the case foldings described in Section 3.13 Default Case Operations of [Unicode] to produce a case-insensitive normalized form.
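
A comparison key for case-insensitive identifiers could be built as in the sketch below. It assumes full case folding as provided by Python's str.casefold, combined with NFKC; the function name and the exact order of operations are illustrative choices rather than a definition from this document.

```python
import unicodedata


def case_insensitive_key(identifier: str) -> str:
    """Return a key under which case and compatibility variants compare equal."""
    folded = unicodedata.normalize("NFKC", identifier).casefold()
    # Case folding can produce unnormalized text, so normalize once more.
    return unicodedata.normalize("NFKC", folded)
```

Two identifiers are then treated as equivalent when their keys are equal, for example "Straße" and "STRASSE" under full case folding.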

When source text is parsed for identifiers, the folding of distinctions (using case mapping or NFKC) must be delayed until after parsing has located the identifiers. Thus such folding of distinctions should not be applied to string literals or to comments in program source text.
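
The ordering described here (locate identifiers first, fold afterwards) can be sketched with a toy lexer. The grammar is an assumption made for illustration: double-quoted string literals without escapes, '#' line comments, and identifiers matched with the regular expression class \w, which only roughly approximates ID_Continue.

```python
import re
import unicodedata

# Alternatives are tried in order: string literal, comment, identifier.
TOKEN = re.compile(r'"[^"\n]*"|#[^\n]*|[^\W\d]\w*')


def fold_identifiers(source: str) -> str:
    """Apply NFKC to identifiers only, leaving string literals and comments untouched."""

    def replace(match: re.Match) -> str:
        text = match.group(0)
        if text[0] in '"#':
            return text  # string literal or comment: keep exactly as written
        return unicodedata.normalize("NFKC", text)

    return TOKEN.sub(replace, source)
```

Given a line whose identifier, string literal, and comment all contain fullwidth letters, only the identifier is folded to ASCII; the literal and the comment keep their original characters.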

The UCD provides support for handling case folding with normalization: the property FC_NFKC_Closure can be used in case folding, so that a case folding of an NFKC string is itself normalized. These properties, and the files containing them, are described in the UCD documentation [UCD].

Acknowledgements

Thanks to Eric Muller, Asmus Freytag, and Martin Duerst for feedback on this document.

References

[Feedback] Reporting Errors and Requesting Information Online
http://www.unicode.org/reporting.html
[Stability] Unicode Consortium Stability Policies
http://www.unicode.org/standard/stability_policy.html
[Reports] Unicode Technical Reports
http://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[UCD] Unicode Character Database.
http://www.unicode.org/ucd/
For an overview of the Unicode Character Database and a list of its associated files
[Unicode] The Unicode Standard
For the latest version see:
http://www.unicode.org/versions/latest/.
For the current version see: http://www.unicode.org/versions/Unicode4.1.0/.
For the last major version see: The Unicode Consortium. The Unicode Standard, Version 4.0. (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1).
[Unicode4.0] The Unicode Consortium. The Unicode Standard, Version 4.0. Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1.
[Unicode4.0.1] The Unicode Consortium. The Unicode Standard, Version 4.0.1, defined by:
The Unicode Standard, Version 4.0 (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1), as amended by Unicode 4.0.1 (http://www.unicode.org/versions/Unicode4.0.1/).
[UAX15] UAX #15: Unicode Normalization Forms
http://www.unicode.org/reports/tr15/
[UAX29] UAX #29: Text Boundaries
http://www.unicode.org/reports/tr29/
[UTR36] UTR #36: Security Considerations for the Implementation of Unicode and Related Technology
http://unicode.org/reports/tr36/
in draft state, as of the publication of this document
[Versions] Versions of the Unicode Standard
http://www.unicode.org/versions/
For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.
[XML1.1] Extensible Markup Language (XML) 1.1
http://www.w3.org/TR/xml11/

Modifications

The following summarizes modifications from previous revisions of this document.

5
  • Removed section 4.1, since the two properties have been accepted for Unicode 4.1.
  • Expanded introduction
  • Added information about stability, and tailoring for identifiers.
  • Added the list of characters in Other_ID_Continue.
  • Changed <identifier_continue> and <identifier_start> to just use the property names, to avoid confusion.
  • Included XID_Start and XID_Continue in R1 and elsewhere.
  • Added reference to UTR #36, and the phrase "or a list of additional constraints on identifiers" to R1.
  • Changed "Coverage" to "General Description of Coverage", since the UCD values are definitive.
  • Added clarifications in 2.4
  • Revamped 2.2 Layout and Format Control Characters
  • Minor editing
3
  • Made draft UAX
  • Incorporated Annex 7 from UAX #15
  • Added Other_ID_Continue for Unicode 4.1
  • Added conformance clauses
  • Changed <identifier_extend> to <identifier_continue> to better match the property name.
  • Some additional edits.
2
  • Modified Pattern White Space to remove compatibility characters
  • Added example explaining use of Pattern White Space
1
  • First version: incorporated section from Unicode 4.0 on Identifiers plus new section on patterns.