Unicode 5.0.0
Version 5.0.0 has been superseded by the latest version of the Unicode Standard.
Version 5.0.0 of the Unicode Standard consists of the core
specification (The Unicode Standard,
Version 5.0), together with
the delta and archival code charts for this version, the 5.0.0 Unicode Standard Annexes,
and the 5.0.0 Unicode Character Database (UCD).
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies normative and informative data for
implementers to allow them to implement the Unicode Standard.
Version 5.0.0 of the Unicode Standard
should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 5.0.0,
defined by: The Unicode Standard, Version 5.0 (Boston,
MA, Addison-Wesley, 2007. ISBN 0-321-48091-0)
A complete specification of the contributory files for Unicode
5.0.0 is found on the page
Components for 5.0.0.
Contents of This Document
Online Edition
Overview
What's New in Version 5.0
New Characters
Unicode Character Database
Conformance
Unicode Standard Annexes
Errata
Online Edition
The text of The Unicode Standard, Version 5.0, as well as
the delta and archival code charts,
is available online via the navigation links on this page.
The charts and the Unicode Standard Annexes may be printed, while
the other files may be viewed but not printed. The
Unicode 5.0 Web Bookmarks page has links to all sections of the
online text.
Overview
Unicode 5.0.0 is a major version
of the Unicode Standard and supersedes all previous versions.
Unicode 5.0 covers the full repertoire of ISO/IEC 10646:2003, including
Amendments 1 and 2, which add characters required for some
languages of India, for mathematicians, for minority languages,
and for academic use.
What's New in Version 5.0
For the first time, the book provides the complete text of the
standard, including all the Unicode Standard Annexes. The book is printed in a smaller, lighter, easier-to-use format.
See also
Note on Printed CJK Code Charts.
For
stability of protocols on the Internet and elsewhere,
Unicode 5.0 also makes changes to guarantee case-folding stability. Unicode 5.0
incorporates all the changes introduced in Unicode 4.1, including full interoperability with
the most recent versions of GB 18030, JIS X 0213, and HKSCS,
and support for stable identifiers and pattern syntax characters.
Unicode 5.0 revises and improves property values and behavioral
specifications in areas such as character, word, line, and sentence
segmentation, and tightens conformance requirements on Bidi
implementations (used for Arabic and Hebrew). The text is
significantly revised for clarity and completeness, especially
for Unicode conformance.
The Unicode Standard is closely connected with other Unicode software
globalization standards in such key areas as collation (used for sorting,
searching, and matching), character set conversion, regular expressions, and the
interchange and registration of locale data for the world's languages and local cultural conventions
[CLDR]. It has been further
significantly augmented by several new Unicode Technical Standards that provide recommendations and data to assist in secure
implementation of Unicode, and to establish the registration mechanism for Ideographic Variation Sequences needed by the
publishing industry for Chinese and Japanese.
Other major additions to Version 5.0 since Version 4.0 are
discussed in the sections below.
New Characters
1,369 new character assignments were made to the Unicode
Standard, Version 5.0 (over and above what was in Unicode 4.1.0).
These additions include new characters for Cyrillic, Greek, Hebrew,
Kannada, Latin, math, phonetic
extensions, symbols, and five new scripts: Balinese, N’Ko, Phags-pa,
Phoenician, and Sumero-Akkadian Cuneiform.
The new character additions were to both the BMP and the SMP
(Plane 1). The following table shows the allocation of code points in Unicode
5.0.0. For more information on the specific characters, see the file
DerivedAge.txt in the
Unicode Character Database.
Graphic | 98,884
Format | 140
Control | 65
Private Use | 137,468
Surrogate | 2,048
Noncharacter | 66
Reserved | 875,441
The character repertoire corresponds to ISO/IEC 10646:2003 plus
Amendment 1, Amendment 2, and four Sindhi characters from Amendment
3. For
more details of character counts, see Appendix
D, Changes from Unicode Version 4.0.
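As a quick sanity check on the allocation table above, the seven counts should account for every code point in the codespace (U+0000 through U+10FFFF, that is, 17 planes of 65,536 code points each). The following is a minimal sketch; the dictionary name and layout are illustrative choices, not part of the standard.

```python
# Code point counts by type, copied from the Unicode 5.0.0 allocation table.
ALLOCATION = {
    "Graphic": 98_884,
    "Format": 140,
    "Control": 65,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 875_441,
}

# The Unicode codespace is U+0000..U+10FFFF: 17 planes of 0x10000 code points.
CODESPACE_SIZE = 17 * 0x10000

total = sum(ALLOCATION.values())

# Every code point falls into exactly one row, so the rows should sum
# to the size of the codespace.
print(f"sum of table rows: {total:,}")
print(f"codespace size:    {CODESPACE_SIZE:,}")
print("table is complete:", total == CODESPACE_SIZE)
```

The same check can be repeated against the counts for any other version, since the size of the codespace does not change between versions.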
Unicode Character Database
The Unicode Character Database (UCD) was extended to cover the
character repertoire additions, and new block definitions and script
values were added. A number of other updates were made, as listed
here:
- Scripts.
Unassigned code points were given a new Script property value of
"Zzzz": this may require some change in code using this property. Three Mongolian punctuation marks and two archaic letters changed script value.
- Case-Related Properties.
To allow for the new policy on case-folding stability, lowercase
variants of several characters were added, and the mappings for the
uppercase variants changed.
- Bidirectional Behavior.
The list of characters with the Bidi_Mirrored property was made consistent
for brackets and quotation marks, in preparation for new constraints on bidi
mirroring. The Bidi_Class property for five archaic characters was changed
to L.
- Line Break.
The Line_Break property of seven punctuation characters and two bracket characters
was changed to Alphabetic (AL) to better match their expected
behavior. Numerous characters for Southeast Asian scripts, which
require complex contextual linebreaking, were changed to Complex_Context (SA).
- New Properties.
Normative_Name_Alias and the metaproperty, Deprecated, were
added. The Jamo_Short_Name property was documented as a contributory
property.
- General Category.
Seven archaic characters plus U+0294 LATIN LETTER
GLOTTAL STOP changed categories.
- Numeric Properties. The archaic character U+10341 GOTHIC LETTER NINETY was given the numeric value 90.
- Unihan.
The kIICore field was made a normative property, and three new
provisional properties were added: kCheungBauer, kCheungBauerIndex, and
kFourCornerCoverage. There were numerous additions to the kCangjie
property.
- Text Breaking.
Grapheme_Link was deprecated as a property.
For more information, see the file
UCD.html in the
Unicode
Character Database.
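A few of the property values mentioned above can be inspected with Python's standard `unicodedata` module. This is only a sketch: Python bundles a later UCD version than 5.0.0 (reported by `unicodedata.unidata_version`), so it shows the current values rather than a 5.0.0 snapshot, and the characters queried are picked purely for illustration.

```python
import unicodedata

# Python ships a UCD newer than 5.0.0; print which one for reference.
print("UCD version bundled with Python:", unicodedata.unidata_version)

glottal_stop = "\u0294"       # U+0294 LATIN LETTER GLOTTAL STOP
gothic_ninety = "\U00010341"  # U+10341 GOTHIC LETTER NINETY

# General_Category of U+0294, one of the characters whose category changed.
category = unicodedata.category(glottal_stop)

# Numeric value and name of U+10341.
numeric = unicodedata.numeric(gothic_ninety)
name = unicodedata.name(gothic_ninety)

# Bidi_Mirrored and Bidi_Class for an ordinary bracket.
mirrored = unicodedata.mirrored("(")
bidi_class = unicodedata.bidirectional("(")

print("U+0294 General_Category:", category)
print("U+10341 name / numeric value:", name, "/", numeric)
print("'(' Bidi_Mirrored / Bidi_Class:", mirrored, "/", bidi_class)
```

Properties such as Script and Line_Break are not exposed by `unicodedata`; they would have to be read from the UCD data files directly.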
Conformance
Details regarding the conformance changes to
the standard for Version 5.0 are specified in the text
of the standard itself, including the Unicode Standard
Annexes. As noted above, the book will be available in the fourth quarter of 2006.
Chapter 3, Conformance, was substantially improved by
incorporating much of the Unicode Property Model, enhancing the
treatment of combining characters, and further clarifying canonical
ordering behavior through the addition of clearly defined
principles. Additionally, conformance clauses and definitions were
renumbered for overall readability and clarity of the text.
Significant clarifications or modifications to character behavior
include those listed below:
- Stability of Cased Letters.
If uppercase characters are added in cased scripts, the corresponding
lowercase characters will be added as well, so that case folding is stable.
- Stability of Named Character Sequences.
An initial provisional phase was incorporated into the process for defining Named Character Sequences, so that approved Named Character
Sequences will be immutable.
- Disunification of Diacritics.
Criteria for disunifying diacritics were established.
- Indic Scripts.
Zero width joiner and zero width non-joiner can now be used to encourage or discourage ligation in Bengali; the sequence for Gurmukhi double vowels was determined, and the shaping of ra in Tamil was updated.
- Combining Marks.
The use of combining grapheme joiner with Latin script diacritics was clarified.
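The practical effect of case-folding stability is that caseless matching built on case folding keeps giving the same answer as new characters are added. The sketch below uses Python's built-in `str.casefold()`; the helper name `caseless_equal` is hypothetical, and the pair U+0241/U+0242 (capital and small glottal stop) is chosen here as an illustration of an uppercase letter paired with its lowercase counterpart, not taken from the text above.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using full Unicode case folding."""
    return a.casefold() == b.casefold()


# Full case folding maps the German sharp s to "ss".
print(caseless_equal("Straße", "STRASSE"))

# Illustrative cased pair: U+0241 LATIN CAPITAL LETTER GLOTTAL STOP and
# U+0242 LATIN SMALL LETTER GLOTTAL STOP.
capital_glottal = "\u0241"
small_glottal = "\u0242"

print(capital_glottal.lower() == small_glottal)
print(caseless_equal(capital_glottal, small_glottal))
```

This helper ignores normalization; strings that differ only in canonically equivalent forms would also need to be normalized before comparison.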
Unicode Standard Annexes
- In UAX #9, "Bidirectional Algorithm," for better interoperability, the algorithm was modified to tighten up the conformance requirements for using mirrored glyphs for characters. Higher-level protocols are discouraged, due to interoperability and security considerations. The definition of directional run was changed to be the same as level run, and the use of soft hyphen with bidi text was clarified.
- In UAX #14, "Line Breaking Properties," a number of rules were modified, the use of soft hyphen in cursive scripts was documented, the conformance clauses were restated and the algorithm was reorganized into tailorable and non-tailorable sections, and the normative status was made consistent with Chapter 3, Conformance. As a result of the restatement of conformance, the Line_Break property became normative.
- In UAX #15, "Unicode Normalization Forms," the new Stream-Safe Text Format was added, allowing the use of normalization in protocols designed for streaming. The stability guarantees are described in more detail, with guidelines provided for guaranteeing process stability, and a new appendix lists precisely those character sequences that require special handling. Additional figures clarify the effects of normalization and the types of characters affected.
- In UAX #29, "Text Boundaries," the format of the rules was changed to make them much easier to implement -- without changing the results. The guidelines for how to use regex-style rules were revamped completely. A number of edge cases are also now handled properly, and information was added on the relation to identifiers, use of normalization, tailoring, application to spelling checkers, and how to use the supplied test data. Tailorings for text boundaries can now also be entered into the Unicode Common Locale Data Repository [CLDR].
- UAX #31, "Identifier and Pattern Syntax," introduced profiles, and added notes on profiles of identifiers for natural languages and the use of spaces in identifiers.
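To make the Stream-Safe Text Format from UAX #15 more concrete: a string is stream-safe if, once normalized to NFKD, it contains no run of more than 30 consecutive non-starters (characters with a non-zero canonical combining class). The following is a minimal checker written under that reading of the definition; the function names are hypothetical, and it only tests a string, it does not perform the repair step that UAX #15 describes for converting text to the format.

```python
import unicodedata

# Maximum run of non-starters allowed by the Stream-Safe Text Format.
MAX_NON_STARTERS = 30


def longest_non_starter_run(text: str) -> int:
    """Return the longest run of non-starters in the NFKD form of text."""
    longest = 0
    current = 0
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch) != 0:
            # Non-zero combining class: this character is a non-starter.
            current += 1
            longest = max(longest, current)
        else:
            # A starter ends the current run.
            current = 0
    return longest


def is_stream_safe(text: str) -> bool:
    """Check the stream-safe condition on the whole string."""
    return longest_non_starter_run(text) <= MAX_NON_STARTERS


print(is_stream_safe("e\u0301"))             # one combining acute accent
print(is_stream_safe("a" + "\u0301" * 31))   # 31 accents on one base letter
```

The bound is what lets a streaming normalizer work with a fixed-size buffer: it never has to hold more than a bounded number of combining marks before it can emit output.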
Errata
Errata incorporated into Unicode 5.0 are listed by date in
a separate table. For corrigenda and errata after the release of Unicode 5.0, see the list of current
Updates and Errata.