Unicode 5.0.0

Released: 2006 July 14

Version 5.0.0 has been superseded by the latest version of the Unicode Standard.

The Unicode Standard, Version 5.0

Version 5.0.0 of the Unicode Standard consists of the core specification (The Unicode Standard, Version 5.0), together with the delta and archival code charts for this version, the 5.0.0 Unicode Standard Annexes, and the 5.0.0 Unicode Character Database (UCD).

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

Version 5.0.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0)

A complete specification of the contributory files for Unicode 5.0.0 is found on the page Components for 5.0.0.


Contents of This Document

Online Edition
Overview
What's New in Version 5.0
New Characters
Unicode Character Database
Conformance
Unicode Standard Annexes
Errata

Online Edition

The text of The Unicode Standard, Version 5.0, as well as the delta and archival code charts, is available online via the navigation links on this page. The charts and the Unicode Standard Annexes may be printed, while the other files may be viewed but not printed. The Unicode 5.0 Web Bookmarks page has links to all sections of the online text.

Overview

Unicode 5.0.0 is a major version of the Unicode Standard and supersedes all previous versions.

Unicode 5.0 covers the full repertoire of ISO/IEC 10646:2003, including Amendments 1 and 2, which add characters required for some languages of India, for mathematicians, for minority languages, and for academic use.

What's New in Version 5.0

For the first time, the book provides the complete text of the standard, including all the Unicode Standard Annexes. The book is printed in a smaller, lighter, easier-to-use format. See also Note on Printed CJK Code Charts.

For stability of protocols on the Internet and elsewhere, Unicode 5.0 also makes changes to guarantee case-folding stability. Unicode 5.0 incorporates all the changes introduced in Unicode 4.1, including full interoperability with the most recent versions of GB 18030, JIS X 0213, and HKSCS, and support for stable identifiers and pattern syntax characters.

Unicode 5.0 revises and improves property values and behavioral specifications in areas such as character, word, line, and sentence segmentation, and tightens conformance requirements on Bidi implementations (used for Arabic and Hebrew). The text is significantly revised for clarity and completeness, especially for Unicode conformance.

The Unicode Standard is closely connected with other Unicode software globalization standards in such key areas as collation (used for sorting, searching, and matching), character set conversion, regular expressions, and the interchange and registration of locale data for the world's languages and local cultural conventions [CLDR]. It has been further significantly augmented by several new Unicode Technical Standards that provide recommendations and data to assist in secure implementation of Unicode, and to establish the registration mechanism for Ideographic Variation Sequences needed by the publishing industry for Chinese and Japanese.

Other major additions to Version 5.0 since Version 4.0 are discussed in the sections below.

New Characters

1,369 new character assignments were made to the Unicode Standard, Version 5.0 (over and above what was in Unicode 4.1.0). These additions include new characters for Cyrillic, Greek, Hebrew, Kannada, Latin, math, phonetic extensions, symbols, and five new scripts: Balinese, N’Ko, Phags-pa, Phoenician, and Sumero-Akkadian Cuneiform.

The new character additions were to both the BMP and the SMP (Plane 1). The following table shows the allocation of code points in Unicode 5.0.0. For more information on the specific characters, see the file DerivedAge.txt in the Unicode Character Database.

Type            Code points
Graphic              98,884
Format                  140
Control                  65
Private Use         137,468
Surrogate             2,048
Noncharacter             66
Reserved            875,441
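
The figures in the table can be cross-checked with a little arithmetic. The following Python sketch is illustrative and not part of the standard; the names ALLOCATION_5_0 and structural_counts are hypothetical. It assumes the fixed codespace of 17 planes of 65,536 code points each (U+0000..U+10FFFF) and the fixed ranges for surrogates, noncharacters, control codes, and private use, which are facts about the Unicode codespace rather than statements made on this page.

```python
# Allocation figures copied from the Unicode 5.0.0 table.
ALLOCATION_5_0 = {
    "Graphic": 98_884,
    "Format": 140,
    "Control": 65,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 875_441,
}

PLANES = 17
PLANE_SIZE = 0x10000
CODESPACE_SIZE = PLANES * PLANE_SIZE  # U+0000..U+10FFFF


def structural_counts():
    """Counts that follow from fixed ranges of the codespace."""
    return {
        # U+D800..U+DFFF
        "Surrogate": 0xDFFF - 0xD800 + 1,
        # U+FDD0..U+FDEF plus the last two code points of every plane
        "Noncharacter": (0xFDEF - 0xFDD0 + 1) + 2 * PLANES,
        # C0 controls, U+007F, and C1 controls
        "Control": (0x1F - 0x00 + 1) + 1 + (0x9F - 0x80 + 1),
        # U+E000..U+F8FF plus planes 15 and 16 minus their noncharacters
        "Private Use": (0xF8FF - 0xE000 + 1) + 2 * (PLANE_SIZE - 2),
    }


# The seven rows should account for every code point exactly once.
assert sum(ALLOCATION_5_0.values()) == CODESPACE_SIZE

# The structurally fixed rows should match the published figures.
for row, expected in structural_counts().items():
    assert ALLOCATION_5_0[row] == expected, row

print("table is consistent with a codespace of", CODESPACE_SIZE, "code points")
```

Only the Graphic, Format, and Reserved rows depend on the repertoire of a given version; the other four rows are determined by the layout of the codespace.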

The character repertoire corresponds to ISO/IEC 10646:2003 plus Amendment 1, Amendment 2, and four Sindhi characters from Amendment 3. For more details of character counts, see Appendix D, Changes from Unicode Version 4.0.

Unicode Character Database

The Unicode Character Database (UCD) was extended to cover the character repertoire additions, and new block definitions and script values were added. A number of other updates were made, as listed here:

  • Scripts. Unassigned code points were given a new Script property value of "Zzzz": this may require some change in code using this property. Three Mongolian punctuation marks and two archaic letters changed script value.
  • Case-Related Properties. To allow for the new policy on case-folding stability, lowercase variants of several characters were added, and the mappings for the uppercase variants changed.
  • Bidirectional Behavior. The list of characters with the Bidi_Mirrored property was made consistent for brackets and quotation marks, in preparation for new constraints on bidi mirroring. The Bidi_Class property for five archaic characters was changed to L.
  • Line Break. The Line_Break property of seven punctuation characters and two bracket characters was changed to Alphabetic (AL) to better match their expected behavior. Numerous characters for Southeast Asian scripts, which require complex contextual linebreaking, were changed to Complex_Context (SA).
  • New Properties. Normative_Name_Alias and the metaproperty, Deprecated, were added. The Jamo_Short_Name property was documented as a contributory property.
  • General Category. Seven archaic characters plus U+0294 LATIN LETTER GLOTTAL STOP changed categories.
  • Numeric Properties. The archaic character U+10341 GOTHIC LETTER NINETY was given the numeric value 90.
  • Unihan. The kIICore field was made a normative property, and three new provisional properties were added: kCheungBauer, kCheungBauerIndex, and kFourCornerCoverage. There were numerous additions to the kCangjie property.
  • Text Breaking. Grapheme_Link was deprecated as a property.

For more information, see the file UCD.html in the Unicode Character Database.
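
Several of the properties listed above can be inspected from Python's standard unicodedata module. This is a minimal sketch, not a reference implementation: the helper name describe is hypothetical, and the module reports the Unicode version bundled with the running interpreter (shown by unicodedata.unidata_version), which is later than 5.0.0, so values for other characters may differ from the 5.0.0 data. The module does not expose the Script or Line_Break properties.

```python
import unicodedata


def describe(code_point):
    """Return a few UCD properties for one code point as a dict."""
    ch = chr(code_point)
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),        # General_Category
        "bidi_class": unicodedata.bidirectional(ch),  # Bidi_Class
        "mirrored": bool(unicodedata.mirrored(ch)),   # Bidi_Mirrored
        "numeric": unicodedata.numeric(ch, None),     # Numeric_Value, if any
    }


print("UCD version bundled with this interpreter:", unicodedata.unidata_version)

# Characters mentioned in the list above, plus a bracket for Bidi_Mirrored.
for cp in (0x0294, 0x10341, 0x0028):
    print(f"U+{cp:04X}", describe(cp))
```

To examine the 5.0.0 values themselves, the data files of the 5.0.0 Unicode Character Database would have to be parsed directly.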

Conformance

Details regarding the conformance changes to the standard for Version 5.0 are specified in the text of the standard itself, including the Unicode Standard Annexes. The book will be available in the fourth quarter of 2006.

Chapter 3, Conformance, was substantially improved by incorporating much of the Unicode Property Model, enhancing the treatment of combining characters, and further clarifying canonical ordering behavior through the addition of clearly defined principles. Additionally, conformance clauses and definitions were renumbered for overall readability and clarity of the text. Significant clarifications or modifications to character behavior include those listed below:

  • Stability of Cased Letters. If uppercase characters are added in cased scripts, the corresponding lowercase characters will be added as well, so that case folding is stable.
  • Stability of Named Character Sequences. An initial provisional phase was incorporated into the process for defining Named Character Sequences, so that approved Named Character Sequences will be immutable.
  • Disunification of Diacritics. Criteria for disunifying diacritics were established.
  • Indic Scripts. Zero width joiner and zero width non-joiner can now be used to encourage or discourage ligation in Bengali; the sequence for Gurmukhi double vowels was determined, and the shaping of ra in Tamil was updated.
  • Combining Marks. The use of combining grapheme joiner with Latin script diacritics was clarified.
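
Case-folding stability matters in practice because protocols compare identifiers after folding. The following Python sketch shows caseless matching with str.casefold, which applies full case folding using the Unicode data bundled with the interpreter (not the 5.0.0 data). The function names caseless_equal and is_fold_stable are hypothetical and chosen for illustration only.

```python
def caseless_equal(a, b):
    """Compare two strings after full case folding."""
    return a.casefold() == b.casefold()


def is_fold_stable(s):
    """Check that folding an already folded string changes nothing."""
    folded = s.casefold()
    return folded.casefold() == folded


# The German sharp s folds to "ss", so these two spellings match.
print(caseless_equal("Straße", "STRASSE"))

# Folding a second time should leave the folded form unchanged.
print(is_fold_stable("Unicode Standard, Version 5.0"))
```

Because folded forms are what get stored and compared, a later change to the folding of an existing character could silently break matches; the stability policy described above is meant to rule that out.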

Unicode Standard Annexes

  • In UAX #9, "Bidirectional Algorithm," for better interoperability, the algorithm was modified to tighten up the conformance requirements for using mirrored glyphs for characters. Higher level protocols are discouraged, due to interoperability and security considerations. The definition of directional run was changed to be the same as level run, and the use of soft-hyphen with bidi text was clarified.
  • In UAX #14, "Line Breaking Properties," a number of rules were modified, the use of soft hyphen in cursive scripts was documented, the conformance clauses were restated and the algorithm was reorganized into tailorable and non-tailorable sections, and the normative status was made consistent with Chapter 3, Conformance. As a result of the restatement of conformance, the Line_Break property became normative.
  • In UAX #15, "Unicode Normalization Forms," the new Stream-Safe Text Format was added, allowing the use of normalization in protocols designed for streaming. The stability guarantees are described in more detail, with guidelines provided for guaranteeing process stability, and a new appendix listing precisely those character sequences that require special handling. Additional figures clarify the effects of normalization, and the types of characters affected.
  • In UAX #29, "Text Boundaries," the format of the rules was changed to make them much easier to implement without changing the results. The guidelines for how to use regex-style rules were revamped completely. A number of edge cases are also now handled properly, and information was added on the relation to identifiers, use of normalization, tailoring, application to spelling checkers, and how to use the supplied test data. Tailorings for text boundaries can now also be entered into the Unicode Common Locale Data Repository [CLDR].
  • UAX #31, "Identifier and Pattern Syntax," introduced profiles, and added notes on profiles of identifiers for natural languages and the use of spaces in identifiers.
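
As an illustration of the Stream-Safe Text Format mentioned for UAX #15, the sketch below checks whether a string would contain a run of more than 30 non-starters (characters with a nonzero canonical combining class) once normalized to NFKD. The limit of 30 and this formulation are taken from UAX #15 itself, not from this page; the names is_stream_safe, MAX_NONSTARTERS, and CGJ are hypothetical. The check uses the Unicode data bundled with the Python interpreter and is a simplified sketch rather than a conformant implementation of the full process described in the annex.

```python
import unicodedata

MAX_NONSTARTERS = 30  # limit given in UAX #15 (assumption: not stated on this page)
CGJ = "\u034F"        # COMBINING GRAPHEME JOINER, combining class 0


def is_stream_safe(text):
    """Return True if no run of non-starters in NFKD(text) exceeds the limit."""
    run = 0
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch) != 0:
            run += 1
            if run > MAX_NONSTARTERS:
                return False
        else:
            # A starter (combining class 0) ends the current run.
            run = 0
    return True


acute = "\u0301"  # COMBINING ACUTE ACCENT, a non-starter

print(is_stream_safe("a" + acute * 30))
print(is_stream_safe("a" + acute * 31))
# A combining grapheme joiner splits an overlong run into two shorter ones.
print(is_stream_safe("a" + acute * 30 + CGJ + acute * 30))
```

Bounding the run length lets a streaming normalizer work with a small fixed buffer, since it never has to hold more than a limited number of combining marks before it can emit output.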

Errata

Errata incorporated into Unicode 5.0 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 5.0, see the list of current Updates and Errata.