Unicode® 10.0.0
Version 10.0.0 has been superseded by the latest version of the Unicode Standard.
This page summarizes the important changes for the Unicode Standard, Version 10.0.0.
This version supersedes all previous versions of the Unicode Standard.
A. Summary
B. Technical Overview
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Changes in the Unicode Character Database
G. Changes in the Unicode Standard Annexes
H. Changes in Synchronized Unicode Technical Standards
M. Implications for Migration
A. Summary
Unicode 10.0 adds 8,518 characters, for a total of 136,690 characters.
These additions include 4 new scripts,
for a total of 139 scripts, as well as 56 new emoji characters.
The new scripts and characters in Version 10.0 add support for lesser-used languages and unique written requirements worldwide, including:
- Masaram Gondi, used to write Gondi in Central and Southeast India
- Nüshu, used by women in China to write poetry and other discourses until the late twentieth century
- Soyombo and Zanabazar Square, used in historic Buddhist texts to write Sanskrit, Tibetan, and Mongolian
- Syriac letters used for writing Suriyani Malayalam, also known as Garshuni and as Syriac Malayalam
- Gujarati signs used for the transliteration of the Arabic script into Gujarati by Ismaili Khoja communities
- A set of 285 Hentaigana characters used in Japan (historic variants of Hiragana characters)
- CJK Extension F (7,473 Han characters)
Important symbol additions include:
- Bitcoin sign
- 56 emoji characters (full list)
- A set of Typicon marks and symbols
For statistics regarding emoji associated with Unicode 10.0,
see Emoji Counts.
Synchronization
Several other important Unicode specifications have been updated for Version 10.0.
The following three Unicode Technical Standards are versioned in
synchrony with the Unicode Standard, because their data files cover the same repertoire.
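All have been updated to Version 10.0:
- UTS #10, Unicode Collation Algorithm
- UTS #39, Unicode Security Mechanisms
- UTS #46, Unicode IDNA Compatibility Processing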
Additionally, Version 10.0 of the Unicode Standard makes use of the
emoji-related data and behavior specified in Version 5.0 of UTS #51:
- UTS #51, Unicode Emoji
Some of the changes in Version 10.0 and associated Unicode Technical Standards
may require modifications
to implementations. For more information, see the migration and modification sections of UTS #10, UTS #39, UTS #46, and UTS #51.
This version of the Unicode Standard is also synchronized with ISO/IEC 10646:2017, fifth edition, plus the following additions from Amendment 1 to the fifth edition:
- 56 emoji characters
- 285 hentaigana
- 3 additional Zanabazar Square characters
See Sections D through H below for additional details regarding the changes in this version of
the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.
B. Technical Overview
Version 10.0 of the Unicode Standard consists of:
- The core specification
- The code charts (delta and archival) for this version
- The Unicode Standard Annexes
- The Unicode Character Database (UCD)
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies normative and informative data for
implementers to allow them to implement the Unicode Standard.
The core specification is available as a single PDF for viewing (12 MB).
Links to individual chapters and appendices of the core specification are also available
in the navigation bar on the left of this page.
Several sets of code charts are available. They serve different
purposes:
- The latest set of code charts for the Unicode Standard is available online. Those charts are always the most current code charts available, and may be updated at any time. The charts are organized by scripts and blocks for easy reference. An online index by character name is also provided.
For Unicode 10.0.0 in particular, two additional sets of code chart pages are provided:
- A set of delta code charts showing the
new blocks and any blocks in which characters were added for Unicode 10.0.0. The new characters are visually highlighted in the charts.
- A set of archival code charts that represents
the entire set of characters, names and representative glyphs at the time of publication of Unicode 10.0.0.
The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.
Links to the individual
Unicode Standard Annexes are available in
the navigation bar on the left of this page. The list of significant changes
in the content of the Unicode Standard Annexes for Version 10.0 can be found
in Section G below.
Data files for Version 10.0 of
the Unicode Character Database are available. The ReadMe.txt in that directory provides a roadmap
to the functions of the various subdirectories.
Zipped versions of the UCD
for bulk download are available, as well.
Version 10.0.0 of the Unicode Standard
should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 10.0.0, (Mountain View, CA: The Unicode Consortium,
2017. ISBN 978-1-936213-16-0)
http://www.unicode.org/versions/Unicode10.0.0/
The terms “Version 10.0” or “Unicode 10.0” are abbreviations for the full version reference, Version 10.0.0.
The citation and permalink for the latest published version of the Unicode Standard are:
The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/
A complete specification of the contributory files for Unicode
10.0 is found on the page Components for 10.0.0.
That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.
Errata incorporated into Unicode 10.0 are listed by date in
a separate table. For corrigenda and errata after the release of Unicode 10.0, see the list of current
Updates and Errata.
C. Stability Policy Update
There were no significant changes to the Stability Policy of the core specification between Unicode 9.0 and Unicode 10.0.
D. Textual Changes and Character Additions
Four new scripts were added with accompanying new block descriptions:
Script | Number of Characters |
Masaram Gondi | 75 |
Nushu | 396 |
Soyombo | 80 |
Zanabazar Square | 72 |
Changes in the Unicode Standard Annexes are listed in Section G.
Character Assignment Overview
8,518 characters have been added.
Most character additions are in new blocks, but there are also character additions to a number of existing blocks. For details, see
Delta Code Charts.
E. Conformance Changes
A formal definition of "block" has been added to the Conformance chapter
of the core specification for Unicode 10.0 as D10b.
F. Changes in the Unicode Character Database
The detailed listing of all changes to the contributory data files of the Unicode Character Database
for Version 10.0 can be found in
UAX #44, Unicode Character Database.
The changes listed there include character additions and property revisions to existing characters that will affect implementations.
Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in
Section M.
G. Changes in the Unicode Standard Annexes
In Version 10.0, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section
of each UAX, linked directly from the following list of UAXes.
Unicode Standard Annex | Changes |
UAX #9 Unicode Bidirectional Algorithm | Clarified the equivalence between directional formatting characters and HTML5 markup, pointing out the differences from HTML4.0. Updated the table in Section 2.7, Markup and Formatting Characters with explicit directional formatting characters and equivalent CSS. |
UAX #11 East Asian Width | Referred to the new Regional_Indicator property. Updated references to UTS #51, Unicode Emoji, and terminology derived from that UTS. |
UAX #14 Unicode Line Breaking Algorithm | Removed Section 7, Pair Table Based Implementation, and other references to it. Strengthened the recommendation to use tailorings based on CLDR rules and emoji properties, for improved line breaking behavior of emoji zwj sequences. Made corrections to descriptions of ID and NS classes. |
UAX #15 Unicode Normalization Forms | No significant changes in this version. |
UAX #24 Unicode Script Property | No significant changes in this version. |
UAX #29 Unicode Text Segmentation | Strengthened the recommendation to use tailorings based on CLDR rules and emoji properties, for improved segmentation behavior of emoji zwj sequences. Changed the derivation of the Word_Break property value ALetter to include a set of 35 phonetic modifiers, to prevent word boundaries between those characters and alphabetic letters. |
UAX #31 Unicode Identifier and Pattern Syntax | Withdrew the table of aspirational use scripts, moving the contents to the table of limited use scripts, and added a note explaining the reason. |
UAX #34 Unicode Named Character Sequences | No significant changes in this version. |
UAX #38 Unicode Han Database (Unihan) | Updated the regular expression for the kIRG_HSource field, updated terminology to reflect the difference between the IRG's U-source and the UTC-source, and added references to the CJK Unified Ideographs Extension F block. |
UAX #41 Common References for Unicode Standard Annexes | Updated all references for Unicode 10.0. |
UAX #42 Unicode Character Database in XML | Added new code point attributes, values, and patterns. |
UAX #44 Unicode Character Database | Updated the description of the Name property value. Updated the discussion of immutable properties and the list of those properties in Table 19. Added new Section 5.13 Property APIs. Added discussion of new data file DerivedName.txt to Section 5.4, Derived Extracted Properties. Added new Section 2.1.3, Properties Dependent on External Specifications to discuss the dependency of UCD segmentation properties on the non-UCD emoji properties. Added new Section 5.14, Character Age to further explain the details of the Age property and its derivation. |
UAX #45 U-Source Ideographs | Updated terminology to reflect the difference between the IRG's U-source and the UTC-source. Updates to contents and status values. |
UAX #50 Unicode Vertical Text Layout | Newly added as an annex in 10.0, converted from an earlier, approved UTR. |
H. Changes in Synchronized Unicode Technical Standards
There are also significant revisions in the Unicode Technical Standards whose
versions are synchronized with the Unicode Standard. The most important of these changes are listed below.
For the full details of all changes, see the Modifications section
of each UTS, linked directly from the following list of UTSes.
Unicode Technical Standard | Changes |
UTS #10 Unicode Collation Algorithm | The specification underwent a major rewrite to add formal definitions and to clarify the statement of the main algorithm. The rewrite did not change the algorithm itself or the expected results for any given input data and version level of DUCET. Added Nüshu to the list of siniform ideographic scripts given implicit primary weights similar to Han ideographs. |
UTS #39 Unicode Security Mechanisms | Removed references to aspirational use scripts because that category has been merged with limited use scripts. That change impacted the results from Section 5.2, Restriction-Level Detection, for the five affected scripts. Extensively reformulated the text in Section 4, Confusable Detection and Section 5, Detection Mechanisms, for clarity and precision. Removed subparts 4 through 6 of conformance clause C2. |
UTS #46 Unicode IDNA Compatibility Processing | Added three new parameters which allow implementations to reflect current practice in browsers: CheckHyphens, CheckBidi, and CheckJoiners. Updated the counts in Table 4, IDNA Comparisons for Version 10.0, and improved the explanation of the divergence from IDNA2008. |
M. Implications for Migration
A significant number of changes in Unicode 10.0 may affect implementations that are upgrading to Version 10.0 from earlier versions of the standard. The most important of these are listed and explained here, to help focus on the issues most likely to cause unexpected trouble during upgrades.
Script-related Changes
Version 10.0 adds four new scripts, so implementations which process script data
should be carefully checked. Some of these scripts have particular attributes
which may cause issues for implementations.
Zanabazar Square and Soyombo are complex, historic abugidas. They were modeled
on Tibetan, and used to write Mongolian, Tibetan, and Sanskrit. The implementation
of these scripts poses challenges, in particular for rendering.
Masaram Gondi is another newly added complex script, inspired by the Brahmi model, but
with its own, distinct rendering issues.
A large collection of Japanese hentaigana has been added. These are effectively
historic variants of Hiragana syllables. However, they are encoded neither with
normative decompositions nor as variation sequences. For collation,
hentaigana syllables do not have the same default weights as the standard Hiragana syllables
they are equivalent to. Instead, they are sorted in a separate range following
all the standard Hiragana syllables.
Shaping Issues
The letters in the Syriac Supplement block, added for Malayalam Garshuni,
include one which can be found with different joining behavior in different sources.
Thus, U+0868 SYRIAC LETTER MALAYALAM LLA can sometimes be found joining on both sides and
sometimes joining only on the right. To help implementations handle both situations,
U+0868 was assigned the Joining_Type property value Dual_Joining. In the cases where
U+0868 needs to be treated as right-joining, U+200C ZERO WIDTH NON-JOINER should be
used to prevent joining to the visual left of the letter.
The alternative approach of assigning a Joining_Type property value of Right_Joining would
have been more onerous for implementations, incurring additional ligatures and contextual forms.
The choice of Joining_Type for U+0868 is also not new: an example of a character with
similar classification and joining behavior is U+10AC0 MANICHAEAN LETTER ALEPH.
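The use of U+200C described above can be sketched in Python as follows. This is a minimal illustration, not a shaping implementation: the helper name `force_right_joining` is hypothetical, and the sketch assumes that, because Syriac is written right to left, the letter to the visual left of U+0868 is the one that follows it in logical order.

```python
MALAYALAM_LLA = "\u0868"  # SYRIAC LETTER MALAYALAM LLA (Joining_Type = Dual_Joining)
ZWNJ = "\u200C"           # ZERO WIDTH NON-JOINER


def force_right_joining(text: str) -> str:
    """Insert ZWNJ after each U+0868 that is not already followed by one.

    Hypothetical helper: it only edits the code point sequence; the actual
    joining behavior is still decided by the rendering engine.
    """
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if ch == MALAYALAM_LLA and text[i + 1:i + 2] != ZWNJ:
            out.append(ZWNJ)
    return "".join(out)


sample = "\u0712\u0868\u0712"  # BETH, MALAYALAM LLA, BETH
print([f"U+{ord(c):04X}" for c in force_right_joining(sample)])
```

Text that should keep the default dual-joining behavior is simply left unchanged.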
Several previously encoded Tai Tham characters and one Javanese mark,
U+A9BF JAVANESE CONSONANT SIGN CAKRA, changed their classification in
Indic_Syllabic_Category and Indic_Positional_Category, respectively.
Shaping implementations that use Indic properties should be aware of the changes,
as they may affect the rendering of the affected characters.
Six Gujarati nonspacing combining marks used for transliteration of Arabic were
added at the end of the Gujarati block: U+0AFA GUJARATI SIGN SUKUN ..
U+0AFF GUJARATI SIGN TWO-CIRCLE NUKTA ABOVE. Some of those marks may occur in
combinations with a single base letter. For example, a nukta or shadda may appear
in combination with sukun over the same letter, and the two marks are usually strung horizontally.
Implementations should handle such sequences so as to avoid unintended visual overlapping.
Similar treatment should be given to U+0AFD GUJARATI SIGN THREE-DOT NUKTA ABOVE when
it appears in combination with a vowel sign that extends above the same base consonant.
Segmentation-related Changes
A set of 35 phonetic modifiers, which includes U+02D7 MODIFIER LETTER MINUS SIGN, was
assigned the Word_Break property value ALetter. As a result of this change, there will no
longer be word boundaries between alphabetic letters and adjacent phonetic modifiers from that set.
This behavior is consistent with the way other IPA modifiers, such as
U+02D0 MODIFIER LETTER TRIANGULAR COLON, attach to letters in word segments.
Implementations of text segmentation will find fewer word boundaries
in the affected sequences. Such sequences are, however, rare edge cases in
standard language orthographies, and are mostly found in specialized transcription
systems.
Note that the reclassification of the 35 phonetic modifiers was done by direct inclusion
in the set of characters with the Word_Break property value ALetter,
without changing their General_Category property values, which continue to
be Modifier_Symbol (gc = Sk).
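One practical consequence is that an implementation which approximates ALetter from General_Category alone will not pick up this change. The following Python sketch illustrates the point with the standard `unicodedata` module; the function `is_letter_by_gc` is a hypothetical stand-in for such an approximation, not part of any specification.

```python
import unicodedata


def is_letter_by_gc(ch: str) -> bool:
    """Naive approximation: treat only General_Category L* as word letters."""
    return unicodedata.category(ch).startswith("L")


minus = "\u02D7"  # MODIFIER LETTER MINUS SIGN, gc = Sk
colon = "\u02D0"  # MODIFIER LETTER TRIANGULAR COLON, gc = Lm

for ch in (minus, colon):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_letter_by_gc(ch))
```

U+02D7 is rejected by the naive test even though its Word_Break value is now ALetter, so implementations should read the Word_Break values from the UCD data rather than infer them from General_Category.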
The UCD properties for line breaking and text segmentation have dependencies
on properties of emoji characters specified in Version 5.0 of UTS #51, Unicode Emoji,
such as the binary
properties Emoji and Emoji_Modifier_Base. Implementations should be aware of
changes in line breaking and text segmentation behavior for some of the emoji
symbols in Unicode 10.0, as a result of emoji data changes in UTS #51 Version 5.0.
(Some of those changes had been introduced in UTR #51 Version 4.0 and carried
forward in UTS #51 Version 5.0.)
For line breaking, the characters that appear as bases in valid emoji modifier
sequences as of Version 5.0 of UTS #51, Unicode Emoji, were assigned the Line_Break property
value E_Base—a change from the previous value Ideographic. This applies to
five previously encoded emoji symbols (U+1F3C2, U+1F3C7, U+1F3CC, U+1F574,
and U+1F6CC), as well as to 16 of the 56 newly encoded emoji symbols.
According to the Unicode Line Breaking Algorithm, line breaks are prevented
between E_Base characters and emoji modifiers for skin tone.
Conversely, two previously encoded emoji symbols (U+1F91D and U+1F93C)
changed their Line_Break property value from E_Base to Ideographic, because
they no longer appear in valid emoji modifier sequences as of UTS #51 Version 5.0.
That change leads to the introduction of line breaking opportunities after
those two characters.
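The effect of these two changes can be illustrated with a deliberately simplified Python sketch. It models only the single rule that forbids a break between an E_Base character and an emoji modifier, and it hard-codes the five code points named above instead of reading LineBreak.txt; the names `E_BASE_SAMPLE` and `break_allowed` are hypothetical.

```python
# Sample data: the five previously encoded symbols that became E_Base.
# A real implementation would load Line_Break values from LineBreak.txt.
E_BASE_SAMPLE = {0x1F3C2, 0x1F3C7, 0x1F3CC, 0x1F574, 0x1F6CC}
EMOJI_MODIFIERS = range(0x1F3FB, 0x1F400)  # U+1F3FB..U+1F3FF skin tone modifiers


def break_allowed(before: str, after: str) -> bool:
    """Toy rule: no break between an E_Base character and an emoji modifier.

    Every other pair is reported as breakable, which is a simplification of
    the full Unicode Line Breaking Algorithm.
    """
    return not (ord(before) in E_BASE_SAMPLE and ord(after) in EMOJI_MODIFIERS)


snowboarder, handshake, medium_tone = "\U0001F3C2", "\U0001F91D", "\U0001F3FD"
print(break_allowed(snowboarder, medium_tone))  # False: the pair stays together
print(break_allowed(handshake, medium_tone))    # True: U+1F91D is no longer E_Base
```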
For text segmentation, three symbols which have long been in the standard, U+2640 FEMALE SIGN,
U+2642 MALE SIGN, and U+2695 STAFF OF AESCULAPIUS, were assigned the
value Glue_After_Zwj for both their Grapheme_Cluster_Break and Word_Break properties.
The change reflects the new use of those symbols in valid emoji zwj sequences for
genders and roles; the change prevents grapheme cluster and word boundaries between a ZWJ
character and each of those symbols.
Other emoji symbols, some existing and some newly encoded, were assigned the
Grapheme_Cluster_Break and Word_Break property values E_Base or Glue_After_Zwj,
to prevent grapheme cluster and word boundaries around them in emoji sequences.
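As a rough illustration of the Glue_After_Zwj behavior, the Python sketch below checks whether a boundary would be allowed immediately after a ZWJ. It is an assumption-laden toy: only the three symbols named above are included, the function name `boundary_after_zwj` is hypothetical, and the sample sequence (U+1F469, U+200D, U+2695, U+FE0F) is taken from the emoji data rather than from this page.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Only the three long-standing symbols named above; the complete set of
# Glue_After_Zwj characters comes from the segmentation property data files.
GLUE_AFTER_ZWJ_SAMPLE = {"\u2640", "\u2642", "\u2695"}


def boundary_after_zwj(following: str) -> bool:
    """Toy rule: no boundary between ZWJ and a Glue_After_Zwj character."""
    return following not in GLUE_AFTER_ZWJ_SAMPLE


# Assumed example of an emoji zwj sequence for a role.
sequence = "\U0001F469\u200D\u2695\uFE0F"
zwj_index = sequence.index(ZWJ)
print(boundary_after_zwj(sequence[zwj_index + 1]))  # False: sequence is kept whole
```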
CJK/Unihan Changes
Unicode 10.0 introduces the new CJK Unified Ideographs Extension F block,
as well as 21 new ideographs at the end of the CJK Unified Ideographs block.
Implementations often have
hard-coded ranges for CJK ideographs, so they should be checked carefully to
ensure they pick up the new end range (U+9FEA) for the CJK Unified Ideographs block,
as well as the range for the new CJK Extension F. For the latter, UnicodeData.txt provides
a range of code points using the established syntax for large ranges of characters with
algorithmically derived names, with the identifiers <CJK Ideograph Extension F, First>
and <CJK Ideograph Extension F, Last>.
CJK Extension F contains mostly rare characters, but also includes a number of characters
used in personal names and place names that are important for government specifications,
particularly in Japan.
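One way to avoid hard-coded ranges is to derive them from the First/Last range markers in UnicodeData.txt. The following Python sketch shows the idea under stated assumptions: the sample lines are reproduced from memory in the UnicodeData.txt field layout and should be checked against the actual 10.0 file, and the function name `parse_ranges` is hypothetical.

```python
# Sample lines in UnicodeData.txt format (assumed; verify against the real file).
SAMPLE = """\
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FEA;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
2CEB0;<CJK Ideograph Extension F, First>;Lo;0;L;;;;;N;;;;;
2EBE0;<CJK Ideograph Extension F, Last>;Lo;0;L;;;;;N;;;;;
"""

FIRST, LAST = ", First>", ", Last>"


def parse_ranges(lines):
    """Return {label: (first_code_point, last_code_point)} for range pairs."""
    ranges = {}
    pending = None  # (label, first code point) awaiting its Last line
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 2:
            continue
        code_point, name = int(fields[0], 16), fields[1]
        if name.endswith(FIRST):
            pending = (name[1:-len(FIRST)], code_point)
        elif name.endswith(LAST) and pending is not None:
            label = name[1:-len(LAST)]
            if label != pending[0]:
                raise ValueError(f"unmatched range markers: {pending[0]} / {label}")
            ranges[label] = (pending[1], code_point)
            pending = None
    return ranges


for label, (first, last) in parse_ranges(SAMPLE.splitlines()).items():
    print(f"{label}: U+{first:04X}..U+{last:04X} ({last - first + 1} code points)")
```

With ranges derived this way, a new block such as CJK Extension F is picked up when the data file is updated, without a code change.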
Standardized Variation Sequences
There have been significant changes to StandardizedVariants.txt and to the
documentation of variation sequences involving emoji, which are now known more specifically as
emoji presentation sequences and text presentation sequences.
All of the emoji and text presentation sequences were moved from the UCD file
StandardizedVariants.txt to the UTS #51 data file emoji-variation-sequences.txt.
The latter is a new data file accompanying Version 5.0 of UTS #51, Unicode Emoji,
whose emoji character repertoire corresponds to Unicode 10.0.
New emoji and text presentation sequences are also included in emoji-variation-sequences.txt.
Implementations should be prepared to consume such sequence data from the new file and,
in general, to use Unicode Emoji Version 5.0 data (or later) in conjunction with UCD 10.0 data.
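A parser for the new file might look like the following Python sketch. The sample lines are assumptions about the layout of emoji-variation-sequences.txt (space-separated code points, a semicolon-delimited style field, and a trailing `#` comment) and should be checked against the published file; `parse_variation_sequences` is a hypothetical name.

```python
# Assumed sample of the emoji-variation-sequences.txt line format.
SAMPLE = """\
# emoji-variation-sequences.txt (illustrative excerpt)
0023 FE0E  ; text style;  # (1.1) NUMBER SIGN
0023 FE0F  ; emoji style; # (1.1) NUMBER SIGN
"""


def parse_variation_sequences(lines):
    """Return a list of (sequence string, style description) pairs."""
    result = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        sequence = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        result.append((sequence, fields[1]))
    return result


for sequence, style in parse_variation_sequences(SAMPLE.splitlines()):
    print(" ".join(f"U+{ord(c):04X}" for c in sequence), "->", style)
```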
Other changes in StandardizedVariants.txt include corrections to the labels of a
few Mongolian standardized variation sequences, but without changes to the actual
character sequences. These changes are reflected in the Unicode code charts.
The documentation file StandardizedVariants.html has been removed
altogether from the UCD, as its function has been superseded by other documentation.
Representative glyphs for the standardized variation sequences are still shown
in the Unicode code charts, but emoji and text presentation sequences
are now displayed in the emoji charts, instead.
New Data Files Added to the UCD
Several new data files have been added to the UCD. Implementations which parse
the UCD files may need to be adjusted, depending on whether they require this
new data or not.
NushuSources.txt. This file contains normative information on the source references
for Nüshu characters. The file format is similar to the format of the Unihan data
files and TangutSources.txt. Implementations which support that format for Unihan or
Tangut data should be able to add support for Nüshu data in a similar manner.
VerticalOrientation.txt. Starting with Version 10.0.0 of the Unicode Standard, this
data file, which lists the Vertical_Orientation property values, is formally included
in the Unicode Character Database. The file format has not changed, but certain lines
of data have been updated for consistency with other UCD files.
DerivedName.txt (in the "extracted/" subdirectory). This file provides a complete
listing of the formal Name
property values of characters. In the case of algorithmically derived names,
only those names that follow a simple pattern of a prefix followed by a code
point value are abbreviated. The names of Hangul syllable characters,
as well as all other character names, are listed individually.
Implementations can use this file to conveniently retrieve the formal character
names instead of independently deriving the names.
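A lookup built on DerivedName.txt could be sketched in Python as shown below. The sample lines assume a `code point(s) ; name` layout in which abbreviated ranges end in `*` as a placeholder for the code point value; that layout, the sample entries, and the names `load_names` and `name_of` are assumptions to verify against the actual file.

```python
# Assumed sample of the DerivedName.txt line format.
SAMPLE = """\
0020          ; SPACE
3400..4DB5    ; CJK UNIFIED IDEOGRAPH-*
AC00          ; HANGUL SYLLABLE GA
"""


def load_names(lines):
    """Split the data into individually listed names and abbreviated ranges."""
    singles, ranges = {}, []
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        code_points, name = (part.strip() for part in data.split(";", 1))
        if ".." in code_points:
            first, last = (int(x, 16) for x in code_points.split(".."))
            ranges.append((first, last, name))
        else:
            singles[int(code_points, 16)] = name
    return singles, ranges


def name_of(code_point, singles, ranges):
    """Return the formal name, expanding '*' to the hexadecimal code point."""
    if code_point in singles:
        return singles[code_point]
    for first, last, pattern in ranges:
        if first <= code_point <= last:
            return pattern.replace("*", f"{code_point:04X}")
    return None


singles, ranges = load_names(SAMPLE.splitlines())
print(name_of(0x3400, singles, ranges))
print(name_of(0xAC00, singles, ranges))
```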
New Properties
The enumerated property Vertical_Orientation has been incorporated in the UCD,
as part of the change in status of UTR #50 to UAX #50, Unicode Vertical Text Layout.
See VerticalOrientation.txt, noted above.
A new normative binary property Regional_Indicator has been introduced.
This property is referenced in the line breaking and text segmentation algorithms,
to assist in the determination of correct text boundaries around emoji flag sequences.
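The role of Regional_Indicator in boundary determination can be illustrated with a small Python sketch that groups consecutive regional indicator symbols into pairs. It assumes the property covers U+1F1E6..U+1F1FF and ignores every other segmentation rule; `split_flags` is a hypothetical helper, not the algorithm defined in the annexes.

```python
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z


def is_regional_indicator(ch: str) -> bool:
    return RI_FIRST <= ord(ch) <= RI_LAST


def split_flags(text: str):
    """Pair up consecutive regional indicators; all other characters stand alone."""
    clusters = []
    i = 0
    while i < len(text):
        if (is_regional_indicator(text[i]) and i + 1 < len(text)
                and is_regional_indicator(text[i + 1])):
            clusters.append(text[i:i + 2])
            i += 2
        else:
            clusters.append(text[i])
            i += 1
    return clusters


# Regional indicators F, R, D, E: two flag sequences back to back.
two_flags = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA"
print(len(split_flags(two_flags)))  # 2
```

Pairing from the start of each run is what keeps a boundary from falling in the middle of a flag when several flags are adjacent.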
Code Charts
There are numerous changes in the representative glyphs, some backed by
explicit errata.
There are also glyph changes in the text presentation of a number of emoji and emoticons.
Some of those changes reflect an attempt to make the text presentation glyphs for
emoji converge on common practice among vendors for the emoji presentation glyphs.
Such glyph changes are highlighted in violet in the
delta charts for Version 10.0.