Unicode 4.0.0

Version 4.0.0 has been superseded by the latest version of the Unicode Standard.

The Unicode Standard, Version 4.0

Version 4.0.0 of the Unicode Standard consists of the core specification, The Unicode Standard, Version 4.0, the delta and archival code charts for this version, the Unicode Standard Annexes, and the Unicode Character Database (UCD).

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

Version 4.0.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 4.0.0, defined by: The Unicode Standard, Version 4.0 (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1)

A complete specification of the contributory files for Unicode 4.0.0 is found on the page Components for Version 4.0.0.


Online Edition

The text of The Unicode Standard, Version 4.0, as well as the delta and archival code charts, is available online via the navigation links on this page. These files may not be printed. The Unicode 4.0 Web Bookmarks page has links to all sections of the online text.

Overview

Unicode 4.0.0 is a major version of the Unicode Standard. The text of the standard has been extensively rewritten to improve its structure and clarity.

Major additions to Version 4.0 since Version 3.0 include:

  • major changes to the introductory and conformance chapters, and extensive revisions to the discussion of punctuation, symbols, and format characters
  • extensive additions of CJK characters to cover dictionaries and historic usage
  • many new symbols for mathematical and technical publication
  • many individual characters, such as currency symbols, added to other scripts, including Indic, Khmer, Latin, Greek, Arabic, and Syriac
  • substantially improved specification of conformance requirements, incorporating the character encoding model
  • encoding of supplementary characters
  • formalized policies for stability of the standard
  • clarification of semantics of special characters, including the byte order mark
  • major expansion of Unicode Character Database properties and of specifications for text boundaries and casing
  • more minority scripts, including Limbu, Tai Le, Osmanya, and Philippine scripts
  • more historic scripts, including Linear B, Cypriot, and Ugaritic
  • tightened definition of encoding terms, including UTF-32
  • substantial improvements to the script descriptions, particularly for Indic scripts and Khmer.

New Characters

1,226 new character assignments were made in the Unicode Standard, Version 4.0 (over and above what was in Unicode 3.2). These additions include currency symbols, additional Latin and Cyrillic characters, the Limbu and Tai Le scripts, Yijing Hexagram symbols, Khmer symbols, Linear B syllables and ideograms, Cypriot, Ugaritic, and a new block of variation selectors (especially for future CJK variants). Double diacritic characters were added for dictionary use.

These new characters extend the set of modern currency symbols and provide greater coverage of minority and historical scripts. The following table shows the allocation of code points in Unicode 4.0.0. For more information on the specific characters, see the file DerivedAge.txt in the Unicode Character Database.

Type of code point      Count
Graphic                96,248
Format                    134
Control                    65
Private Use           137,468
Surrogate               2,048
Noncharacter               66
Reserved              878,083

The character repertoire corresponds to ISO/IEC 10646:2003. For more details of character counts, see Appendix D, Changes from Unicode Version 3.0.
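
As a quick consistency check on the allocation table above, the seven counts should add up to the size of the whole codespace (U+0000..U+10FFFF). The sketch below is illustrative only: the dictionary and variable names are made up for this example, and the counts are copied from the table.

```python
# Code point counts by type, copied from the Unicode 4.0.0 allocation table.
ALLOCATION = {
    "Graphic": 96_248,
    "Format": 134,
    "Control": 65,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 878_083,
}

# The codespace is U+0000..U+10FFFF: 17 planes of 65,536 code points each.
CODESPACE_SIZE = 17 * 0x10000

total = sum(ALLOCATION.values())

# Three rows follow from fixed ranges rather than from the repertoire.
surrogates = 0xDFFF - 0xD800 + 1                 # U+D800..U+DFFF
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * 17   # U+FDD0..U+FDEF, plus the
                                                 # last two code points of
                                                 # each of the 17 planes
private_use = (0xF8FF - 0xE000 + 1) + 2 * (0x10000 - 2)  # BMP private use
                                                 # area, plus planes 15 and 16
                                                 # minus their noncharacters

print(total == CODESPACE_SIZE)
print(surrogates == ALLOCATION["Surrogate"])
print(noncharacters == ALLOCATION["Noncharacter"])
print(private_use == ALLOCATION["Private Use"])
```

Only the Graphic, Format, and Reserved rows change as characters are added; the other rows are fixed by the structure of the codespace.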

Unicode Character Database

Unicode Version 4.0.0 introduced the concept of provisional properties, clarified the relationships between properties, and provided precisely defined fallback properties for characters not explicitly defined in the data files. The documentation was coalesced into UCD.html, with a combined list of Properties.

Other property changes include:

  • Prefix Format Control. U+06DD arabic end of ayah and U+070F syriac abbreviation mark were reclassified and have significantly different behavior as prefix format control characters. The new characters U+0600..U+0603 were given this behavior as well.
  • New Properties. The Hangul Syllable Type and identifier Other_ID_Start properties were added. The Unicode Radical Stroke property was classified as informative; all other Unihan properties were classified as provisional. PropertyValueAliases also adds block names.
  • Numeric Properties. CJK numeric values were added; the properties Decimal Number (Nd) and the Numeric Type decimal digit were aligned in value.
  • Default Ignorables. The Hangul Filler characters, U+00AD soft hyphen, CGJ, and ZWS were added.
  • Soft Hyphen. U+00AD soft hyphen was also changed to General Category Cf. Its semantics were clarified: it marks a position for hyphenation, rather than being itself a hyphen character. (The Hyphen property itself was stabilized, and thus not changed to reflect this.)
  • Modifier Letters. The General Category of U+02B9..U+02BA and U+02C6..U+02CF was changed to Lm.
  • Grapheme_Extend. The halfwidth katakana marks, and most combining marks (except as needed for canonical equivalence) were removed.
  • Mongolian Vowel Separator. U+180E mongolian vowel separator was changed to General Category Zs.
  • Deprecated Characters. Two Khmer characters, U+17A3 khmer independent vowel qaq and U+17D3 khmer sign bathamasat, were deprecated. Four others are strongly discouraged.
  • Enclosing combining marks. The scope has been defined more clearly.
  • ZWJ. The semantics with cursive scripts has been revised.
  • Normalization Corrections. There were corrections for the characters U+2F868, U+2F874, U+2F91F, U+2F95F, and U+2F9BF.

For more information, see the file UCD.html in the Unicode Character Database.
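
Some of the General Category values described above can be spot-checked with Python's standard `unicodedata` module. This is a sketch under one important assumption: `unicodedata` reflects whichever Unicode version ships with the Python build (see `unicodedata.unidata_version`), not Version 4.0.0, so only values that are believed to be unchanged in later versions are listed. The `SAMPLES` table and `category_mismatches` helper are hypothetical names for this example.

```python
import unicodedata

# Expected General Category values, taken from the property changes above.
# Caveat: unicodedata uses the Unicode version bundled with Python, which is
# later than 4.0.0; U+180E is deliberately left out for that reason.
SAMPLES = {
    0x00AD: "Cf",  # soft hyphen
    0x06DD: "Cf",  # arabic end of ayah (prefix format control)
    0x070F: "Cf",  # syriac abbreviation mark (prefix format control)
    0x02B9: "Lm",  # first code point of U+02B9..U+02BA
    0x02C6: "Lm",  # first code point of U+02C6..U+02CF
}

def category_mismatches(samples):
    """Return {code point: (expected, actual)} for every disagreement."""
    mismatches = {}
    for code_point, expected in samples.items():
        actual = unicodedata.category(chr(code_point))
        if actual != expected:
            mismatches[code_point] = (expected, actual)
    return mismatches

print(unicodedata.unidata_version)
print(category_mismatches(SAMPLES))
```

An empty dictionary means the bundled data agrees with the values listed here.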

Conformance

Chapter 3 was substantially improved by incorporating the Unicode Character Encoding Model, resulting in fully specified definitions and conformance requirements for UTF-8, UTF-16, and UTF-32. As part of this, the related concept of a Unicode string is defined: a sequence of code units for internal processing that is not necessarily valid in a Unicode encoding form.
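
To make the three encoding forms concrete, the sketch below encodes one supplementary character in each of them and shows a string that cannot be encoded. It is an illustration, not the standard's own definition: the helper names are hypothetical, the output is the big-endian byte serialization rather than abstract code units, and a Python `str` (a sequence of code points that may contain lone surrogates) is used only as a stand-in for the idea of a string that is not well-formed.

```python
def encoding_forms(code_point):
    """Return the hex bytes of one code point in UTF-8, UTF-16, and UTF-32.

    UTF-16 and UTF-32 are shown in big-endian byte order without a byte
    order mark.
    """
    ch = chr(code_point)
    return {
        "UTF-8": ch.encode("utf-8").hex(),
        "UTF-16": ch.encode("utf-16-be").hex(),
        "UTF-32": ch.encode("utf-32-be").hex(),
    }

def is_encodable(text):
    """True if the string can be converted to UTF-8 without error."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True

# U+10400 is a supplementary character, so UTF-16 needs a surrogate pair.
forms = encoding_forms(0x10400)
print(forms)

# A lone surrogate code point can sit in a Python string, but it has no
# valid representation in any of the three encoding forms.
print(is_encodable("\ud800"))
```

For U+10400 this gives four bytes in UTF-8 (`f0909080`), a surrogate pair in UTF-16 (`d801dc00`), and a single 32-bit unit in UTF-32 (`00010400`).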

Clearer terminology was introduced for code point assignments, including the seven main categories given in the table above. The conformance status of UAXes, UTSes, and UTRs was also clarified. In addition:

  • Identifiers. A structure for ensuring backwards-compatible programming language identifiers was introduced using the new property Other_ID_Start. There is also an alternate definition for complete stability of identifiers.
  • Bidi. The bidi algorithm was updated and moved to UAX #9 (see below).
  • Line Breaking and Boundaries. U+00AD soft hyphen was reclassified. Text boundaries were clarified.
  • Case Folding. The text from UAX #21, “Case Mappings,” was incorporated and updated for case folding and other new properties. The definition of titlecase uses word boundaries, and there is a clearer definition of string functions:
    • isUpper(), isLower(), isTitle(), isFold()
    • toUpper(), toLower(), toTitle(), toFold()
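
The string functions named above can be approximated with Python's built-in string methods. This is a rough sketch under stated assumptions: each `is_*` function is taken to mean "the string is unchanged by the matching `to_*` function", no normalization step is applied, and `str.title()` uses Python's own simple notion of word boundaries rather than the boundaries the standard specifies. The function names mirror the list above but are otherwise hypothetical.

```python
def to_upper(s):
    return s.upper()

def to_lower(s):
    return s.lower()

def to_title(s):
    # Python's word detection here is simpler than Unicode word boundaries.
    return s.title()

def to_fold(s):
    # str.casefold() applies full case folding, e.g. "ß" becomes "ss".
    return s.casefold()

# Assumption: isX(s) holds exactly when toX(s) leaves s unchanged.
def is_upper(s):
    return to_upper(s) == s

def is_lower(s):
    return to_lower(s) == s

def is_title(s):
    return to_title(s) == s

def is_fold(s):
    return to_fold(s) == s

print(to_upper("straße"))   # full case mapping can change the length
print(to_fold("Straße"))
print(is_fold("Straße"), is_fold("strasse"))
```

Case folding is intended for caseless comparison, so two strings are usually compared as `to_fold(a) == to_fold(b)` rather than by lowercasing both.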

Unicode Standard Annexes

The following Unicode Standard Annex was added:

  • UAX #29: Text Boundaries
    • Now contains information on text boundary conditions formerly published in Chapter 5 of The Unicode Standard, Version 3.0.
    • Provides default definitions for grapheme cluster ('user character'), word, and sentence boundaries
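
The idea of a grapheme cluster can be illustrated with a deliberately crude approximation. The sketch below is not the default algorithm from UAX #29; it applies only one simplified rule, assumed for this example: never break before a nonspacing or enclosing combining mark (General Category Mn or Me). The function name is hypothetical.

```python
import unicodedata

def rough_grapheme_clusters(text):
    """Split text so that combining marks stay attached to their base.

    Simplification: a character whose General Category is Mn or Me is
    appended to the preceding cluster; every other character starts a new
    one. Real grapheme cluster boundaries involve more rules than this.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "e" followed by U+0301 COMBINING ACUTE ACCENT is one user character.
sample = "e\u0301a"
print(len(sample), len(rough_grapheme_clusters(sample)))
```

Here the string has three code points but only two clusters, which is the distinction between code points and user characters that the annex addresses.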

The following Unicode Standard Annexes were updated:

  • UAX #9: The Bidirectional Algorithm
    • Now contains information on the bidirectional algorithm formerly published in Chapter 3 of The Unicode Standard, Version 3.0.
    • Canonical equivalence is now preserved (a data change, not an algorithm change)
    • Shaping is done after reordering, but not across directional boundaries
    • There were clarifications of ZWJ, ZWNJ, and intermediate level processing
  • UAX #14: Line Breaking Properties
    • Negative numbers and dates with hyphens will not break across lines
    • Word-Joiner will link any characters (except hard line breaks)
    • The behavior of soft hyphen is clarified (it marks an opportunity for breaking, not specific graphic appearance)
    • The rules for GL are relaxed: SP and ZW override GL
    • There are new property values: NL, WJ
  • UAX #15: Unicode Normalization Forms
    • There is a description of Stable Code Points, and the notation NFC(x) and isNFC(x)
    • Annex 12: Corrigenda was rewritten for clarity, and to describe the use of Normalization Corrections.
    • Annex 13: Canonical Equivalence was added
  • UAX #11: East Asian Width
    • Extended the range for the default property value to 30000–3FFFD.
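
The NFC(x) and isNFC(x) notation mentioned under UAX #15 above can be sketched with Python's `unicodedata.normalize`. The two helper functions are hypothetical wrappers named after that notation, and the results reflect the Unicode data bundled with the Python build rather than Version 4.0.0.

```python
import unicodedata

def nfc(x):
    """NFC(x): the Normalization Form C version of the string x."""
    return unicodedata.normalize("NFC", x)

def is_nfc(x):
    """isNFC(x): true when x is already in Normalization Form C."""
    return nfc(x) == x

decomposed = "e\u0301"   # "e" + U+0301 COMBINING ACUTE ACCENT
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE

print(nfc(decomposed) == composed)
print(is_nfc(composed), is_nfc(decomposed))
```

The two strings are canonically equivalent, so they compare equal once both are passed through `nfc`.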

The following Unicode Technical Report was upgraded in status to a Unicode Standard Annex:

  • UAX #24: Script Names
    • Added notes on the stability of Q names, the usage of Mn, Me characters, and scripts with regard to spoofing.
    • Added Braille.

The following Standard Annexes were superseded as a result of their incorporation into the text of the Version 4.0.0 core specification:

  • UAX #13: Unicode Newline Guidelines
  • UAX #19: UTF-32
  • UAX #21: Case Mappings
  • UAX #27: Unicode 3.1
  • UAX #28: Unicode 3.2

Errata

Errata incorporated into Unicode 4.0 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 4.0, see the list of current Updates and Errata.