Unicode® 7.0.0
Released: 2014 June 16 (Announcement)
Version 7.0.0 has been superseded by the latest version of the Unicode Standard.
This page summarizes the important changes for the Unicode Standard, Version 7.0.0.
This version supersedes all previous versions of the Unicode Standard.
A. Summary
B. Technical Overview
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Changes in the Unicode Character Database
G. Changes in the Unicode Standard Annexes
H. Changes in Synchronized Unicode Technical Standards
M. Implications for Migration
A. Summary
Unicode 7.0 adds a total of 2,834 characters, encompassing 23 new scripts and many new symbols, as well as character additions to many existing scripts. Notable character additions include the following:
- Two newly adopted currency symbols: the manat, used in Azerbaijan, and the ruble, used in Russia and other countries
- Pictographic symbols (including many emoji), geometric symbols, arrows, and ornaments originating from the Wingdings and Webdings sets
- Twenty-three new lesser-used and historic scripts extending support for written languages of North America, China, India, other Asian countries, and Africa
- Letters used in Teuthonista and other transcriptional systems, and a new notational set, Duployan
Other important updates in Unicode Version 7.0 include:
- Significant reorganization of the chapters and layout of the core specification, and a new page size tailored for easy viewing on e-readers and other mobile devices
- Alignment with updates to the Unicode Bidirectional Algorithm
- Further clarification of the case pair stability policy, and a new stability policy for Numeric_Type
- Significant updates to Unihan with the addition of nearly 3,000 new Cantonese pronunciation entries
- Major enhancements to the Indic script properties that lay the foundation for improved, more interoperable display of these scripts
Synchronization
Two other important Unicode specifications are maintained in synchrony with the Unicode Standard, and include updates for the repertoire additions made in Version 7.0, as well as other modifications:
- UTS #10, Unicode Collation Algorithm
- UTS #46, Unicode IDNA Compatibility Processing
This version of the Unicode Standard is synchronized with ISO/IEC 10646:2012, plus Amendments 1 and 2. Additionally, it includes the accelerated publication of U+20BD RUBLE SIGN.
See Sections D through H below for additional details regarding the changes in this version of
the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.
B. Technical Overview
Version 7.0 of the Unicode Standard consists of the core specification (download),
the delta and archival code charts for this version, the Unicode Standard Annexes, and
the Unicode Character Database (UCD).
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies normative and informative data for
implementers to allow them to implement the Unicode Standard.
A complete specification of the contributory files for Unicode
7.0 is found on the page Components for 7.0.0.
That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.
The navigation bar on the left of this page provides links to
the core specification as a single file, to its individual chapters, and to
the appendices.
It also links to the code charts, the radical-stroke indices to CJK
ideographs, the Unicode Standard Annexes, and the data files for Version 7.0 of the Unicode Character Database.
Version 7.0.0 of the Unicode Standard
should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 7.0.0, (Mountain View, CA: The Unicode Consortium,
2014. ISBN 978-1-936213-09-2)
http://www.unicode.org/versions/Unicode7.0.0/
The terms “Version 7.0” or “Unicode 7.0” are abbreviations for the full version reference, Version 7.0.0.
The citation and permalink for the latest published version of the Unicode Standard are:
The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/
Several sets of code charts are available. They serve different
purposes:
- The latest set of code charts for the Unicode Standard is available online. Those charts are always the most current code charts available, and may be updated at any time. The charts are organized by scripts and blocks for easy reference. An online index by character name is also provided.
For Unicode 7.0.0 in particular, two additional sets of code chart pages are provided:
- A set of delta code charts showing the
blocks in which characters were added for Unicode 7.0.0. The new characters are visually highlighted in the charts.
- A set of archival code charts that represent
the entire set of characters, names and representative glyphs at the time of publication of Unicode 7.0.0.
The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.
Errata incorporated into Unicode 7.0 are listed by date in
a separate table. For corrigenda and errata after the release of Unicode 7.0, see the list of current
Updates and Errata.
C. Stability Policy Update
- The case pair stability policy has been augmented with further clarification.
- A property value stability policy has been added for Numeric_Type=Digit.
D. Textual Changes and Character Additions
The block descriptions in the core spec were reorganized significantly. Twenty-three new
scripts were added with accompanying new block descriptions:
Bassa Vah | Mahajani | Pahawh Hmong
Caucasian Albanian | Manichaean | Palmyrene
Duployan | Mende Kikakui | Pau Cin Hau
Elbasan | Modi | Psalter Pahlavi
Grantha | Mro | Siddham
Khojki | Nabataean | Tirhuta
Khudawadi | Old North Arabian | Warang Citi
Linear A | Old Permic |
With Version 7.0, support for lesser-used languages was extended worldwide, including:
- Arabic additions for languages of Pakistan and for the African languages Berber and Fulfulde
- Cyrillic additions for languages of Russia
- Myanmar additions for the Tai Laing, Shan Pali, and Shwe Palaung languages
Letters used in Teuthonista and other transcriptional systems and a new notational set, Duployan, used for writing certain shorthands and Native American languages were added. Many symbols originating from the Wingdings and Webdings sets were also added, as well as more emoji and other pictographic symbols.
Changes in the Unicode Standard Annexes are listed in Section G.
Character Assignment Overview
327 characters have been added to the BMP, while 2,507 characters have been added to Plane 1. Most character additions are in new blocks, but there are also character additions to a number of existing blocks.
New Blocks
The newly-defined blocks in Version 7.0 are:
Range | Block Name
1AB0..1AFF | Combining Diacritical Marks Extended
A9E0..A9FF | Myanmar Extended-B
AB30..AB6F | Latin Extended-E
102E0..102FF | Coptic Epact Numbers
10350..1037F | Old Permic
10500..1052F | Elbasan
10530..1056F | Caucasian Albanian
10600..1077F | Linear A
10860..1087F | Palmyrene
10880..108AF | Nabataean
10A80..10A9F | Old North Arabian
10AC0..10AFF | Manichaean
10B80..10BAF | Psalter Pahlavi
11150..1117F | Mahajani
111E0..111FF | Sinhala Archaic Numbers
11200..1124F | Khojki
112B0..112FF | Khudawadi
11300..1137F | Grantha
11480..114DF | Tirhuta
11580..115FF | Siddham
11600..1165F | Modi
118A0..118FF | Warang Citi
11AC0..11AFF | Pau Cin Hau
16A40..16A6F | Mro
16AD0..16AFF | Bassa Vah
16B00..16B8F | Pahawh Hmong
1BC00..1BC9F | Duployan
1BCA0..1BCAF | Shorthand Format Controls
1E800..1E8DF | Mende Kikakui
1F650..1F67F | Ornamental Dingbats
1F780..1F7FF | Geometric Shapes Extended
1F800..1F8FF | Supplemental Arrows-C
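As an illustration of how the block ranges listed above might be consumed, the following Python sketch looks up whether a code point falls in one of the new blocks by binary search over the range starts. Only a subset of the ranges is copied in, and the names NEW_BLOCKS and new_block_of are hypothetical, chosen for this example.

```python
from bisect import bisect_right

# A subset of the Version 7.0 block ranges listed above, as
# (first, last, name) tuples sorted by first code point.
NEW_BLOCKS = [
    (0x1AB0, 0x1AFF, "Combining Diacritical Marks Extended"),
    (0xA9E0, 0xA9FF, "Myanmar Extended-B"),
    (0xAB30, 0xAB6F, "Latin Extended-E"),
    (0x10600, 0x1077F, "Linear A"),
    (0x118A0, 0x118FF, "Warang Citi"),
    (0x1BC00, 0x1BC9F, "Duployan"),
    (0x1F800, 0x1F8FF, "Supplemental Arrows-C"),
]
_STARTS = [first for first, _, _ in NEW_BLOCKS]


def new_block_of(cp):
    """Return the name of the listed block containing cp, or None."""
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0:
        first, last, name = NEW_BLOCKS[i]
        if cp <= last:
            return name
    return None


print(new_block_of(0x1F800))
print(new_block_of(0x0041))
```

A production implementation would more likely load every range from Blocks.txt in the Unicode Character Database rather than hard-code a subset.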
E. Conformance Changes
- Minor changes were made to reflect updates to the Bidirectional Algorithm in Version 6.3 of the Unicode Standard.
- Corrigendum #9 was applied to D14 (Noncharacter).
- The changes from Version 6.3 of the Unicode Standard were incorporated in D136 (Case-ignorable) in the updated core specification.
F. Changes in the Unicode Character Database
The detailed listing of all changes to the contributory data files of the Unicode Character Database
for Version 7.0 can be found in
UAX #44, Unicode Character Database.
The changes listed there include character additions and property revisions to existing characters that will affect implementations.
Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in
Section M.
There were several changes to Unihan data, including the addition of nearly 3,000 new Cantonese pronunciation entries, significant modification to the syntax for kIICore, and the relocation of kRSUnicode and kCompatibilityVariant to Unihan_IRGSources.txt.
Major enhancements were made to the Indic script properties. New property values were added to enable
a more algorithmic approach to rendering Indic scripts. These include values for joining behavior,
new classes for numbers, and a further division of the syllabic categories of viramas and rephas.
With these enhancements, the default rendering for newly added Indic scripts can be significantly improved.
Other updates include changes to the derivations of the Alphabetic and Case_Ignorable properties, and a number of updates to the Script and Script_Extensions property assignments. Also, the conventions for defining default property values for ranges of code points using “@missing” directives were regularized.
G. Changes in the Unicode Standard Annexes
In Version 7.0, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section
of each UAX, linked directly from the following list of UAXes.
Unicode Standard Annex | Changes
UAX #9 Unicode Bidirectional Algorithm | No significant changes in this version.
UAX #11 East Asian Width | No significant changes in this version.
UAX #14 Unicode Line Breaking Algorithm | No significant changes in this version.
UAX #15 Unicode Normalization Forms | Corrected note for Table 3, Notational Conventions.
UAX #24 Unicode Script Property | No significant changes in this version.
UAX #29 Unicode Text Segmentation | Added U+AA7D MYANMAR SIGN TAI LAING TONE-5 to the exception list for SpacingMark in Table 2, Grapheme_Cluster_Break Property Values. Added a note to clarify that Format and Extend characters are not joined to separators like LF, as well as a note about the fact that words can span a sentence break in Section 5.1 Default Sentence Boundary Specification.
UAX #31 Unicode Identifier and Pattern Syntax | Added many new scripts to Table 4, Candidate Characters for Exclusion from Identifiers. The text on natural-language identifiers was changed to have a stronger recommendation for including the exception characters, and to include the Catalan MIDDLE DOT.
UAX #34 Unicode Named Character Sequences | Added definitions for Unicode namespace and the Unicode namespace for character names. Major rewrite of Section 4, Names.
UAX #38 Unicode Han Database (Unihan) | The syntax for the kIICore field has been changed. The kCompatibilityVariant and kRSUnicode fields have been moved to Unihan_IRGSources.txt.
UAX #41 Common References for Unicode Standard Annexes | No significant changes in this version.
UAX #42 Unicode Character Database in XML | Added the value 7.0 for the age attribute, and new values for the attributes blk, jg, sc, kIICore, kIRG_GSource, and InSC.
UAX #44 Unicode Character Database | Updated the derivation of the Alphabetic property and of the Case_Ignorable property. Simplified the discussion of @missing in Section 4.2.10 @missing Conventions, to reflect the revised conventions in the UCD data files, which eliminated special edge cases. Corrected statement about aliases for provisional properties in Section 5.8 Property and Property Value Aliases.
UAX #45 U-Source Ideographs | Clarified meaning of status field.
H. Changes in Synchronized Unicode Technical Standards
There are also significant revisions in the Unicode Technical Standards whose
versions are synchronized with the Unicode Standard. The most important of these changes are listed below.
For the full details of all changes, see the Modifications section
of each UTS, linked directly from the following list of UTSes.
Unicode Technical Standard | Changes
UTS #10 Unicode Collation Algorithm | Changed the text to discuss collation weights more generically, with fewer references to the 16-bit weights used in the DUCET. Section 6.3.2, Large Values for Secondary or Tertiary Weights, was merged into Section 6.2, Large Weight Values.
UTS #46 Unicode IDNA Compatibility Processing | Updated statistics for 7.0.0 in Table 4, IDNA Comparisons. Section 4 has been modified to clarify the input and results for each major step in the algorithm. In Section 5 IDNA Mapping Table, added a new value for field 3, XV8, with example. In Section 8.1 Format, made the definition of NV8 consistent with Section 5 IDNA Mapping Table.
M. Implications for Migration
A significant number of changes in Unicode 7.0 may affect implementations
that are upgrading to Version 7.0 from earlier versions of the standard. The most
important of these are listed and explained here, to help focus on the issues most
likely to cause unexpected trouble during upgrades.
Script-related Changes
Version 7.0 adds many new scripts, so implementations which process script data should
be carefully checked. In particular:
- The large number of additional scripts (23) may cause overflows for implementations
which make hard-coded assumptions about the number of scripts in the standard. As of
Unicode 7.0, there are now 127 values of the Script property, which may break
implementations that have stored Script property values in bit fields or a signed byte.
- There have been significant additions and changes to the Script_Extensions property.
Implementations which use the scx property values should check the new data carefully,
especially for common-use characters which may be shared across several scripts.
- The Script property value of U+061C ARABIC LETTER MARK (ALM) was changed from Arabic to Common,
for consistency with the similar directional controls U+200E LEFT-TO-RIGHT MARK (LRM)
and U+200F RIGHT-TO-LEFT MARK (RLM). This change may not be expected for a character
located in the Arabic block.
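One way to guard against hard-coded assumptions about the number of scripts is to count the distinct Script values at data-load time and compare that count with the capacity of the storage type. The Python sketch below assumes input lines in the style of Scripts.txt ("range ; value # comment"). The sample lines, the function names, and the choice of a 7-bit field are illustrative assumptions, not part of the standard.

```python
import re

# Sample lines written in the style of Scripts.txt (illustrative only).
SAMPLE = """\
# Scripts-style sample data
0000..001F    ; Common # control characters
0041..005A    ; Latin # basic uppercase letters
061C          ; Common # ARABIC LETTER MARK
118A0..118DF  ; Warang_Citi # cased letters
"""

# A code point or range, a semicolon, then the property value.
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def script_values(text):
    """Collect the distinct Script values used in Scripts-style data."""
    values = set()
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            values.add(m.group(3))
    return values


def headroom(count, bits):
    """How many more distinct values fit in an unsigned field of `bits` bits."""
    return 2 ** bits - count


print(sorted(script_values(SAMPLE)))
# The count of 127 Script values is taken from the text above.
print(headroom(127, 7), headroom(127, 8))
```

Running such a check against the real data file on each upgrade turns a silent overflow into an explicit failure.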
Rendering Issues
A number of the newly added scripts, and in particular, Manichaean and Psalter Pahlavi,
have complex shaping behavior. For those two scripts, additional values related to
joining behavior appear in ArabicShaping.txt, which may not be expected. In particular:
- New Joining_Group values have been defined for Manichaean.
- Two Manichaean letters have received the unusual Joining_Type value of L, which
formerly had only been used for one Phags-pa letter.
- The provisional properties defined in IndicSyllabicCategory.txt and IndicMatraCategory.txt
were significantly overhauled. As part of the changes, the InSC property was further
subdivided, with many new values added. These values are relevant to implementations of
complex rendering in many Indic scripts and may impact implementations that were
making use of these provisional properties.
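A parser for ArabicShaping.txt that tolerates new Joining_Group names and handles every Joining_Type value, including L, might look like the following Python sketch. It assumes the four-field layout of that file (code point; schematic name; joining type; joining group). The sample lines are illustrative and should be checked against the actual data file, and the function names are hypothetical.

```python
# The six Joining_Type values that may appear in the third field.
KNOWN_JOINING_TYPES = {"R", "L", "D", "C", "U", "T"}

# Sample lines in the style of ArabicShaping.txt (illustrative only).
SAMPLE = [
    "# code point; schematic name; joining type; joining group",
    "0628; BEH; D; BEH",
    "A872; PHAGS-PA SUPERFIXED LETTER RA; L; No_Joining_Group",
    "10AC0; MANICHAEAN ALEPH; R; MANICHAEAN ALEPH",
]


def parse_shaping_line(line):
    """Parse one data line; return None for comments and blank lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    if len(fields) != 4:
        raise ValueError(f"unexpected field count: {line!r}")
    cp, name, joining_type, joining_group = fields
    if joining_type not in KNOWN_JOINING_TYPES:
        raise ValueError(f"unknown Joining_Type: {joining_type!r}")
    # The joining group is kept as an open-ended string, so group names
    # added in a later version do not break the parser.
    return int(cp, 16), name, joining_type, joining_group


def left_joining(lines):
    """Return the code points whose Joining_Type is L."""
    records = (parse_shaping_line(line) for line in lines)
    return [r[0] for r in records if r is not None and r[2] == "L"]


print([f"U+{cp:04X}" for cp in left_joining(SAMPLE)])
```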
Casing-related Changes
In addition to the usual scattering of new case pairs added for the Latin and Cyrillic
scripts, there are noteworthy changes which impact casing behavior:
- Several uppercase letters were added for Latin letters which formerly had no
uppercase counterpart. In addition, an uppercase counterpart was added for the Greek
letter yot.
- Three ranges of enclosed capital Latin alphabetic symbols, U+1F130..U+1F149,
U+1F150..U+1F169, and U+1F170..U+1F189, were assigned the contributory property
Other_Uppercase and thus, by derivation, also the properties Uppercase, Alphabetic,
and the corresponding values of various text segmentation properties. This was
done to bring these relatively recently encoded alphabetic symbols into line
with similar sets of circled alphabetic symbols that have long been present in the standard.
- One of the newly encoded scripts, Warang Citi, is bicameral. The appearance of
a newly encoded bicameral script on Plane 1, with the attendant need for case mapping
and case folding, may break certain assumptions baked into implementations of
casing tables.
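A quick way to test whether a casing implementation handles cased letters outside the BMP is to round-trip one such letter. The Python sketch below uses U+118A0 WARANG CITI CAPITAL LETTER NGAA as its example of a cased Plane 1 character. It assumes the interpreter's bundled Unicode data is Version 7.0 or later, which unicodedata.unidata_version reports. The helper name case_pair is hypothetical.

```python
import unicodedata


def case_pair(ch):
    """Return (original, lowercased, lowercased-then-uppercased)."""
    lower = ch.lower()
    return ch, lower, lower.upper()


CAPITAL_NGAA = "\U000118A0"  # WARANG CITI CAPITAL LETTER NGAA

upper, lower, round_trip = case_pair(CAPITAL_NGAA)
print("Unicode data version:", unicodedata.unidata_version)
print(f"U+{ord(upper):04X} -> U+{ord(lower):04X} -> U+{ord(round_trip):04X}")

# Both members of the pair lie above U+FFFF, so each needs two
# UTF-16 code units (a surrogate pair) in 16-bit-based casing tables.
print(len(lower.encode("utf-16-le")) // 2)
```

Implementations that index case tables by a single 16-bit code unit would need to be revisited for pairs like this one.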
Segmentation-related Changes
Segmentation-related changes to existing property values were deliberately kept to a minimum
for Version 7.0, and for the most part reflect just minor corrections to relatively rare
characters. However, there was one significant set of changes impacting two fairly
salient punctuation marks used in Arabic:
- The General_Category and Line_Break properties for U+FD3E ORNATE LEFT PARENTHESIS
and U+FD3F ORNATE RIGHT PARENTHESIS were swapped: General_Category Ps ↔ Pe;
Line_Break OP ↔ CL. This change was made because these two characters are the
sole exceptions among paired bracket punctuation marks for bidirectional mirroring.
They do not mirror in a bidirectional context, and are instead entered and edited
based on their visual appearance. The property changes bring the General_Category
and Line_Break values of these characters into line with their visual interpretation. Note that these
changes do not affect the behavior of these characters for bidirectional layout
per the Unicode Bidirectional Algorithm, but the changes may otherwise be unexpected
for implementations of these characters.
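The swapped values can be inspected with Python's unicodedata module, as in the sketch below. It assumes the interpreter's bundled Unicode data is Version 7.0 or later. For comparison it also queries unicodedata.ucd_3_2_0, the Unicode 3.2 snapshot that the module exposes. The helper name bracket_properties is hypothetical.

```python
import unicodedata

ORNATE_LEFT = "\uFD3E"   # ORNATE LEFT PARENTHESIS
ORNATE_RIGHT = "\uFD3F"  # ORNATE RIGHT PARENTHESIS


def bracket_properties(ch, db=unicodedata):
    """Return (General_Category, Bidi_Mirrored flag) from the given database."""
    return db.category(ch), db.mirrored(ch)


for ch in (ORNATE_LEFT, ORNATE_RIGHT):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        "current:", bracket_properties(ch),
        "3.2 snapshot:", bracket_properties(ch, unicodedata.ucd_3_2_0),
    )
```

Code that pairs brackets by General_Category alone (Ps opens, Pe closes) will see these two characters in the opposite roles after the upgrade.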
CJK Changes
- All of the nearly 10,000 values for kIICore were changed from "2.1" to other values like "AGTHKM". (See UAX #38 for details.) Implementations that treat the presence of kIICore effectively as a boolean value may not need any change. However, implementations that test for a value explicitly, such as with regular expressions like [:kIICore=2.1:], would need to be updated.
- There were nearly 3,000 additions of readings for kCantonese. The simple addition of readings should not cause problems for most implementations, but some may hit internal boundaries due to the number of changes.
- There were 4 changes to kMandarin, for the characters 㐵, 𠦌, 掠 and 略. These updated kMandarin values may cause pinyin readings to change
for implementations which use kMandarin to define them, and may also cause sort
order changes for collations based on pinyin ordering. This applies,
for example, to implementations using CLDR.
- A kTraditionalVariant value of 鿁 was added for 䜤.
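The difference between a presence test and an explicit value test for kIICore can be sketched in Python as follows. The sketch assumes the tab-separated Unihan line layout (code point, field name, value). The sample record pairs the "AGTHKM" value quoted above with an arbitrarily chosen code point, and all function names are hypothetical.

```python
# One sample record in the tab-separated Unihan layout (illustrative only).
SAMPLE = [
    "# code point <tab> field <tab> value",
    "U+4E00\tkIICore\tAGTHKM",
]


def parse_unihan_line(line):
    """Parse 'U+XXXX<tab>field<tab>value'; return None for comments and blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    code, field, value = line.rstrip("\n").split("\t", 2)
    return int(code[2:], 16), field, value


def load_field(lines, field):
    """Map code point -> value for a single Unihan field."""
    table = {}
    for line in lines:
        record = parse_unihan_line(line)
        if record is not None and record[1] == field:
            table[record[0]] = record[2]
    return table


def in_iicore(table, cp):
    """Presence test: unaffected by the change in value syntax."""
    return cp in table


def in_iicore_legacy(table, cp):
    """Explicit value test: stops matching once values are no longer '2.1'."""
    return table.get(cp) == "2.1"


iicore = load_field(SAMPLE, "kIICore")
print(in_iicore(iicore, 0x4E00), in_iicore_legacy(iicore, 0x4E00))
```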
UCD File Format Changes
In general, the format of UCD data files is unchanged for Version 7.0.
However, there were some minor updates which may impact some parsers.
- The handling of @missing comment lines was modified somewhat. This may impact
some parsers which choose to extract @missing information from the data files.
For the most part, the changes were to minimize the number of distinct formats
for these @missing lines and to make their locations more predictable and regular.
The file most affected by these changes is PropertyValueAliases.txt.
For more details, see
UAX #44, Unicode Character Database.
- Parsers of NamesList.txt should note that subhead fields in that data file
now contain non-ASCII characters, which was not the case in earlier versions.
See NamesList.html in the UCD for details about the format of NamesList.txt.
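For parsers that extract @missing information, a single pattern that accepts a code point range followed by one or more semicolon-separated fields can cover both the two-field and three-field forms. The Python sketch below is a minimal example under that assumption. The regular expression and the function name parse_missing are illustrative, and UAX #44 remains the authority on the exact conventions.

```python
import re

# "# @missing: FIRST..LAST; field[; field...]"
MISSING = re.compile(
    r"^#\s*@missing:\s*([0-9A-F]{4,6})\.\.([0-9A-F]{4,6})\s*;\s*(.+?)\s*$"
)


def parse_missing(line):
    """Return (first, last, fields) for an @missing line, else None."""
    m = MISSING.match(line)
    if m is None:
        return None
    first, last = int(m.group(1), 16), int(m.group(2), 16)
    fields = [f.strip() for f in m.group(3).split(";")]
    return first, last, fields


# Form used inside a single-property data file: range; default value.
print(parse_missing("# @missing: 0000..10FFFF; Unknown"))
# Form used in PropertyValueAliases.txt: range; property; default value.
print(parse_missing("# @missing: 0000..10FFFF; Script; Unknown"))
```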