Unicode® 15.0.0
Version 15.0.0 has been superseded by the latest version of the Unicode Standard.
This page summarizes the important changes for the Unicode Standard, Version 15.0.0.
This version supersedes all previous versions of the Unicode Standard.
A. Summary
B. Technical Overview
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Changes in the Unicode Character Database
G. Changes in the Unicode Standard Annexes
H. Changes in Synchronized Unicode Technical Standards
M. Implications for Migration
Unicode 15.0 adds 4,489 characters,
for a total of 149,186 characters.
These additions include 2 new scripts,
for a total of 161 scripts, along with 20 new emoji characters, and 4,193 CJK (Chinese, Japanese, and Korean) ideographs.
The new scripts and characters in Version 15.0 add support for lesser-used languages
and unique written requirements worldwide, including numerous symbol additions.
Funds from the
Adopt-a-Character
program provided support for some of these additions.
The new scripts and characters include:
- Nag Mundari, a modern script used to write Mundari, a language spoken in India
- A Kannada character used to write Konkani, Awadhi, and Havyaka Kannada in India
- Kaktovik numerals, devised by speakers of Iñupiaq in Kaktovik, Alaska for the counting systems of the Inuit and Yupik languages
Popular symbol additions:
- 20 emoji characters, including hair pick, maracas, jellyfish, khanda, and pink heart. For complete statistics regarding all emoji as of
Unicode 15.0, see
Emoji Counts.
For more information about emoji additions in version 15.0, including new
emoji ZWJ sequences and emoji modifier sequences, see
Emoji Recently Added, v15.0.
Other symbol and notational additions include:
- The nine pointed white star symbol, used by members of the Bahá’í faith
- Eight symbols for celestial bodies, used by astronomers and astrologers
- Twenty-nine additional Egyptian hieroglyph format controls, which will enable Egyptologists to better represent texts
Support for other languages and scholarly work worldwide includes:
- Kawi, a historical script found in Southeast Asia, used to write Old Javanese and other languages
- Three additional characters for the Arabic script to support Quranic marks used in Turkey
- One new Lao sign used to write Lao Pali
- Three Khojki characters found in handwritten and printed documents
- Ten Devanagari characters used to represent auspicious signs found in inscriptions and manuscripts
- Six Latin letters used in Malayalam transliteration
- Sixty-three Cyrillic modifier letters used in phonetic transcription
- One additional Egyptian hieroglyph
Updates to the CJK blocks add:
- 4,192 ideographs in the new CJK Unified Ideographs Extension H block
- One ideograph in the CJK Unified Ideographs Extension C block
Support for CJK unified ideographs was enhanced in Version 15.0
by significant corrections and improvements to the Unihan database.
Changes to the Unihan database include updated source lists,
regular expressions, and new and updated fields.
See UAX #38,
Unicode Han Database (Unihan) for more information on the updates.
Important chart font updates include:
- A set of updated glyphs for Egyptian hieroglyphs, in addition to standardized variation sequences to support rotated glyphs found in texts
- Improved glyphs for Unified Canadian Aboriginal Syllabics, which provide better support for Carrier and other languages
- A new Wancho font, with improved and simplified shapes
Synchronization
Several other important Unicode specifications have been updated for Version 15.0.
The following four Unicode Technical Standards are versioned in
synchrony with the Unicode Standard, because their data files cover the same repertoire.
All have been updated to Version 15.0:
- UTS #10, Unicode Collation Algorithm
- UTS #39, Unicode Security Mechanisms
- UTS #46, Unicode IDNA Compatibility Processing
- UTS #51, Unicode Emoji
Some of the changes in Version 15.0 and associated Unicode Technical Standards
may require modifications
to implementations. For more information, see the migration and modification sections of
UTS #10, UTS #39, UTS #46, and UTS #51.
See Sections D through H below for additional details regarding the changes in this version of
the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.
Version 15.0 of the Unicode Standard consists of:
- The core specification
- The code charts (delta and archival) for this version
- The Unicode Standard Annexes
- The Unicode Character Database (UCD)
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies normative and informative data for
implementers to allow them to implement the Unicode Standard.
The core specification is available as
a single PDF for viewing.
(14 MB)
Links are also available
in the navigation bar on the left of this page to access
individual chapters and appendices
of the core specification.
Several sets of code charts are available. They serve different
purposes:
- The latest set of code charts for
the Unicode Standard is available online. Those charts are always the most current
code charts available, and may be updated at any time. The charts are organized by
scripts and blocks for easy reference.
An online index by character name
is also provided. The Tableaux des caractères
provides a French translation of these latest code charts.
For Unicode 15.0.0 in particular, two additional sets of code chart pages are provided:
- A set of delta code charts showing the
new blocks and any blocks in which characters were added for Unicode 15.0.0. The new characters are visually highlighted in the charts.
- A set of archival code charts that represents
the entire set of characters, names and representative glyphs at the time of publication of Unicode 15.0.0.
A French translation of the archival code charts is also available for this version.
The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.
Links to the individual
Unicode Standard Annexes are available in
the navigation bar on the left of this page. The list of significant changes
in the content of the Unicode Standard Annexes for Version 15.0 can be found
in Section G below.
Data files for Version 15.0 of
the Unicode Character Database are available. The ReadMe.txt in that directory provides a roadmap
to the functions of the various subdirectories.
Zipped versions of the UCD
for bulk download are available, as well.
Version 15.0.0 of the Unicode Standard
should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 15.0.0, (Mountain View, CA: The Unicode Consortium,
2022. ISBN 978-1-936213-32-0)
https://www.unicode.org/versions/Unicode15.0.0/
The terms “Version 15.0” or “Unicode 15.0” are abbreviations for the full version reference, Version 15.0.0.
The citation and permalink for the latest published version of the Unicode Standard are:
The Unicode Consortium. The Unicode Standard.
https://www.unicode.org/versions/latest/
A complete specification of the contributory files for Unicode
15.0 is found on the page Components for 15.0.0.
That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.
Errata incorporated into Unicode 15.0 are listed by date in
a separate table. For corrigenda and errata after the release of Unicode 15.0, see the list of current
Updates and Errata.
The Alias Stability policy of the Unicode Character Encoding Stability Policies
was updated between Versions 14.0 and 15.0. In addition to guaranteeing
that no property alias or property value alias will ever be removed from the
standard, it also now guarantees that the exact spelling of a property
alias or property value alias will never change. This has already long been
the UTC practice for maintaining these aliases, but the additional guarantee is intended
to assist in keeping regular expressions which refer to Unicode property values valid and stable.
A new Property Domain Stability policy has been added to the
Unicode Character Encoding Stability Policies as of Version 15.0. That
stability policy guarantees that any existing property of characters
can never be turned into a property of strings and that any
existing property of strings can never be turned into a property
of characters.
Two new
scripts were added with accompanying new block descriptions:
Script | Number of Characters
Kawi | 86
Nag Mundari | 42
Changes in the Unicode Standard Annexes are listed in Section G.
Character Assignment Overview
4,489 characters have been added.
Most character additions are in new blocks, but there are also character additions to a number of existing blocks. For details, see delta code charts.
New Blocks
The newly-defined blocks in Version 15.0 are:
There are no significant new conformance requirements in Unicode 15.0.
The detailed listing of all changes to the contributory data files of the Unicode Character Database
for Version 15.0 can be found in
UAX #44, Unicode Character Database.
The changes listed there include character additions and property revisions to existing characters that will affect implementations.
Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in
Section M.
In Version 15.0, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section
of each UAX, linked directly from the following list of UAXes.
Unicode Standard Annex | Changes
UAX #9 Unicode Bidirectional Algorithm | The text under UAX9-C2 was amended to emphasize that higher-level protocols should be used to mitigate misleading bidirectional ordering of source code, including potential spoofing attacks. An extended example of use of the higher-level protocol HL4 for program text was added in Section 4.3.2, HL Example 2 for Program Text.
UAX #11 East Asian Width | No significant changes in this version.
UAX #14 Unicode Line Breaking Algorithm | An outdated note regarding special behavior of U+23B6 was removed from Section 5.1, Description of Line Breaking Properties (Quotation).
UAX #15 Unicode Normalization Forms | The text in Section 5.1, Composition Exclusion Types was updated.
UAX #24 Unicode Script Property | No significant changes in this version.
UAX #29 Unicode Text Segmentation | No significant changes in this version.
UAX #31 Unicode Identifier and Pattern Syntax | The text now clarifies that contextual restrictions on ZWJ and ZWNJ are applicable only if the default identifier syntax is customized to add those characters. Important guidance on profiles for default identifiers is presented in UAX31-R1. The text now clarifies that requirement UAX31-R3 Pattern_White_Space and Pattern_Syntax Characters is applicable not only to pattern syntaxes, but also to programming languages. In particular, some Pattern_White_Space characters are relevant to issues of bidirectional ordering and potential spoofing attacks. The two new scripts for Unicode 15.0 were added to the Excluded Scripts table.
UAX #34 Unicode Named Character Sequences | A further clarification was added about medial hyphen in UAX34-R3. The explanation of the Unicode namespace for character names was extended in UAX34-D3.
UAX #38 Unicode Han Database (Unihan) | Information about CJK Extension H and the single-character extension to CJK Extension C was added. The sources and syntax were updated for kIRG_GSource and kIRG_TSource. The syntax was updated for several fields dealing with variants. A new field, kAlternateTotalStrokes, was added. Several new sections dealing with details of sources were added to the text.
UAX #41 Common References for Unicode Standard Annexes | All references were updated for Unicode 15.0.
UAX #42 Unicode Character Database in XML | New code point attributes, values, and patterns were added for Unicode 15.0.
UAX #44 Unicode Character Database | The documentation was updated to describe the changes to the UCD for Version 15.0.
UAX #45 U-Source Ideographs | The status "ExtH" was added for the new CJK Extension H block, and the status values for the existing CJK ideograph blocks were improved. A new section was added to the text, describing the Ideographic Description Sequence field in USourceData.txt.
UAX #50 Unicode Vertical Text Layout | A short section was added discussing the limits of the applicability of the Vertical_Orientation property when dealing with right-to-left scripts.
There are also significant revisions in the Unicode Technical Standards whose
versions are synchronized with the Unicode Standard. The most important of these changes are listed below.
For the full details of all changes, see the Modifications section
of each UTS, linked directly from the following list of UTSes.
Unicode Technical Standard | Changes
UTS #10 Unicode Collation Algorithm | No significant changes in this version.
UTS #39 Unicode Security Mechanisms | The zero width joiner (ZWJ) and zero width non-joiner (ZWNJ) characters are changed from Identifier_Status=Allowed to Identifier_Status=Restricted; they are therefore no longer allowed by the General Security Profile by default. Implementations of the General Profile for Identifiers that need to retain ZWJ and ZWNJ should declare that they use a modification of the profile per Section 2, Conformance, and should ensure that they implement the restrictions described in Section 3.1.1, Joining Controls.
UTS #46 Unicode IDNA Compatibility Processing | A note was added to Section 4.2, ToASCII regarding the empty label for the DNS root. New data files were added, to define the IDNA Derived Property (for this version and all earlier versions back to Unicode 6.1).
UTS #51 Unicode Emoji | The definition of emoji_zwj_element was updated. The emoji flag sequence definition was updated to better align with the discussion in Annex B, Valid Emoji Flag Sequences. The rules in Section 1.4.9, EBNF and Regex were updated. The text in Section 2.7.1, Emoji and Text Presentation Selectors was updated to clarify the behavior of the text presentation selector on emoji ZWJ sequences.
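The UTS #39 change to ZWJ and ZWNJ can be illustrated with a small check. This is a minimal sketch only, not an implementation of the General Security Profile: it looks at nothing but the two joining controls, and the function names and the `allow_join_controls` flag are hypothetical names introduced for illustration.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER
JOIN_CONTROLS = {ZWNJ, ZWJ}


def find_join_controls(identifier):
    """Return (index, 'U+XXXX') for every ZWJ or ZWNJ in the identifier."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(identifier)
        if ch in JOIN_CONTROLS
    ]


def passes_join_control_check(identifier, allow_join_controls=False):
    """Check only the joining-control aspect of an identifier.

    With allow_join_controls=False (modeling the 15.0 default, where both
    characters are Identifier_Status=Restricted), any ZWJ or ZWNJ fails.
    With allow_join_controls=True (a declared profile modification), this
    sketch accepts them unconditionally; a real implementation would also
    need the contextual restrictions of UTS #39 Section 3.1.1.
    """
    if allow_join_controls:
        return True
    return not find_join_controls(identifier)
```

An implementation that previously accepted such identifiers would see `passes_join_control_check("na\u200Dme")` change from accepted to rejected unless it opts in explicitly.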
There are a significant number of changes in Unicode 15.0 which may impact implementations upgrading
to Version 15.0 from earlier versions of the standard. The most important of these are listed
and explained here, to help focus on the issues most likely to cause unexpected trouble during upgrades.
Script-related Changes
Two new scripts have been added in Unicode 15.0.0. Some of these scripts have
particular attributes which may cause issues for implementations. The more
important of these attributes are summarized here.
- Kawi is a Brahmic script with complex rendering rules. See the original proposal
documentation in L2/20-284
for an extensive discussion. Note also that the UTC recommendation for
handling linebreaking in Kawi is to follow Western linebreaking rules,
depending on use of spaces in text, rather than depending on dictionary
lookup rules.
Numeric Property Issues
- Two new sets of decimal digits have been added, for the Kawi and Nag
Mundari scripts.
Implementations of digits will need to take those
into account.
- Kaktovik numerals have been added. This is another vigesimal
number system, similar in structure to Mayan numerals.
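One way to take the new digits into account without depending on the Unicode version of a runtime's built-in tables is to keep an explicit table of zero digits. The sketch below is illustrative only. The code points for KAWI DIGIT ZERO (U+11F50), NAG MUNDARI DIGIT ZERO (U+1E4F0), and KAKTOVIK NUMERAL ZERO (U+1D2C0) are not given on this page and are assumptions that should be confirmed against UnicodeData.txt; all function names are hypothetical.

```python
# Zero digits of the two decimal digit sets added in Unicode 15.0.
# Assumed code points; confirm against UnicodeData.txt before relying on them.
DECIMAL_ZEROS = {
    "Kawi": 0x11F50,
    "Nag Mundari": 0x1E4F0,
}

# Kaktovik numerals are vigesimal: 20 consecutive digits, zero through nineteen.
KAKTOVIK_ZERO = 0x1D2C0  # assumed code point


def decimal_value(ch):
    """Return 0-9 for a Kawi or Nag Mundari digit, else None."""
    cp = ord(ch)
    for zero in DECIMAL_ZEROS.values():
        if zero <= cp <= zero + 9:
            return cp - zero
    return None


def parse_decimal(text):
    """Parse a string of digits from the table above into an int.

    This sketch does not reject strings that mix the two scripts.
    """
    total = 0
    for ch in text:
        value = decimal_value(ch)
        if value is None:
            raise ValueError(f"not a supported digit: U+{ord(ch):04X}")
        total = total * 10 + value
    return total


def to_kaktovik(n):
    """Encode a non-negative int as a string of base-20 Kaktovik numerals."""
    if n < 0:
        raise ValueError("negative numbers are not handled in this sketch")
    digits = []
    while True:
        n, d = divmod(n, 20)
        digits.append(chr(KAKTOVIK_ZERO + d))
        if n == 0:
            break
    return "".join(reversed(digits))
```

Because Kaktovik numerals are base 20, the value 20 is written with two digits (one, zero), and 399 with two copies of the digit nineteen.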
Multiple @missing Lines in UCD Property Files
Starting with Version 15.0, some data files in the UCD may contain multiple
@missing lines defined for the same property. This is currently the case
for DerivedBidiClass.txt, DerivedEastAsianWidth.txt, and DerivedLineBreak.txt.
The effect of this change on implementations that parse the UCD
data files is a bit subtle. There are basically three categories to
take into account when considering migration issues:
- UCD file parsers which completely ignore the @missing lines
and which have been depending on hard-coded ranges for
all default values will not be impacted by this change. However,
such parsers may be in the minority, because they are always
impacted whenever a default property assignment range is changed
for a release. (See below for the change in default Bidi_Class values
for unassigned characters in the newly defined Arabic Extended-C block
in Unicode 15.0.)
- UCD file parsers which completely ignore the @missing lines
but which have been depending on the derived extracted
UCD data files such as DerivedBidiClass.txt to parse the correct
default property values for all unassigned code points will
be impacted by this change. Such parsers will either have to be
updated to use hard-coded ranges or to interpret the multiple @missing
lines correctly, as the unassigned code point values are no longer
listed explicitly in DerivedBidiClass.txt (and similar data files).
- UCD file parsers which do interpret the @missing lines
may be impacted by this change. If they have been treating
@missing lines exactly like the data lines in the file, overriding
defined ranges as they process each line, they should be unaffected.
Such a parsing strategy will simply end up processing more @missing line ranges than before,
but will produce identical results. However, parsers which special
case the @missing lines and/or which expect only a single @missing line
to occur, may need to be updated to get correct results.
See UAX #44 Section 4.2.10, @missing Conventions for more details.
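The third strategy, treating @missing lines exactly like data lines and letting later lines override earlier ones, might look like the following sketch. The sample text is illustrative and is not copied from DerivedBidiClass.txt; the helper names and the small alias table are hypothetical. The sketch assumes that @missing lines precede the data lines they provide defaults for, and that a real parser would load value aliases from PropertyValueAliases.txt.

```python
import re

# Long-to-short aliases for the few Bidi_Class values used in the sample.
ALIASES = {"Left_To_Right": "L", "Right_To_Left": "R", "Arabic_Letter": "AL"}

LINE_RE = re.compile(
    r"^(?:#\s*@missing:\s*)?"                    # optional @missing prefix
    r"([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?"   # code point or range
    r"\s*;\s*(\w+)"                              # property value
)

# Illustrative sample only, in the general shape of a derived UCD file.
SAMPLE = """\
# @missing: 0000..10FFFF; Left_To_Right
# @missing: 0600..07BF; Arabic_Letter
# @missing: 10EC0..10EFF; Arabic_Letter
0041..005A    ; L # LATIN CAPITAL LETTER A..Z
0627          ; AL # ARABIC LETTER ALEF
"""


def parse_ranges(text):
    """Collect (start, end, value) from @missing and data lines, in file order."""
    ranges = []
    for raw in text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # blank line or ordinary comment
        start = int(m.group(1), 16)
        end = int(m.group(2), 16) if m.group(2) else start
        value = ALIASES.get(m.group(3), m.group(3))
        ranges.append((start, end, value))
    return ranges


def lookup(ranges, cp):
    """Return the value from the last line whose range covers cp."""
    for start, end, value in reversed(ranges):
        if start <= cp <= end:
            return value
    return None
```

With this approach, `lookup(parse_ranges(SAMPLE), 0x10EC5)` resolves through the narrower @missing line rather than the file-wide default, and no special handling is needed when a file gains additional @missing lines. A linear scan is used here for clarity; a production parser would more likely build a sorted range table.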
Other Property Issues
- The new Arabic block, Arabic Extended-C, defaults the entire range
of code points in the block, 10EC0..10EFF to Bidi_Class=AL. This is
a change from Unicode 14.0, in which that unassigned range defaulted
to Bidi_Class=R.
- In addition to the new blocks, one existing block had a slight adjustment to its
end range. The Egyptian Hieroglyph Format Controls block range was extended by two columns to end at U+1345F, instead of U+1343F.
Implementations should be checked carefully for any hard-coded assumptions about
the end ranges of existing blocks.
CJK/Unihan Changes
- A new provisional property, kAlternateTotalStrokes, has been added to Unihan. This property supplements the existing informative kTotalStrokes property with the total number of strokes for ideographs other than those with G and T source identifiers.
- Nearly 50,000 additions to the kKangXi property were derived from the kIRG_GSource and kIRGKangXi properties.
- There are large changes and additions in the values for the kDefinition, kSimplifiedVariant, kTraditionalVariant, kSemanticVariant, and kSpecializedSemanticVariant properties.
- The kCihaiT property has been moved from the Unihan_DictionaryLikeData.txt file
to the Unihan_DictionaryIndices.txt file. Parsers that assume that particular
Unihan properties are included in particular parts of the Unihan database files
will need to be updated.
- The ending value for the range of CJK Unified Ideographs in Extension C was incremented. Because implementations often hard-code
ideographic ranges to short-cut lookups and reduce table sizes, it is
especially important that implementers pay close attention to the
implications of range changes for Version 15.0.0. This extension bumps up the end
range of the encoded ideographs by one code point within the block:
- 1 code point for Extension C: ending at U+2B739
See UAX
#38, Unicode Han Database (Unihan) for further details on these changes,
especially Section 4.2, Listing
by Date of Addition to the Unicode Standard, and Section 4.3, Listing by
Location within Unihan.zip.
UAX #38 also has updated regex values for numerous
Unihan properties.
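For implementations that hard-code ideographic ranges, the change can be sketched as a version-keyed table. The sketch below is a rough illustration: the Extension C end point for 15.0 (U+2B739) comes from the text above, while the Extension C start (U+2A700) and the Extension H range (U+31350..U+323AF) are assumptions to confirm against Blocks.txt and the code charts. The names are hypothetical.

```python
# (first, last) code points, inclusive.
# Assumed values except the 15.0 end of Extension C; confirm before use.
EXT_C_14 = (0x2A700, 0x2B738)  # Extension C as of Unicode 14.0
EXT_C_15 = (0x2A700, 0x2B739)  # Extension C as of Unicode 15.0
EXT_H_15 = (0x31350, 0x323AF)  # Extension H, new in Unicode 15.0


def size(rng):
    """Number of code points in an inclusive (first, last) range."""
    first, last = rng
    return last - first + 1


def contains(rng, cp):
    """True if cp falls inside the inclusive range."""
    first, last = rng
    return first <= cp <= last


# The counts line up with the summary figures for this version:
# 4,192 ideographs in Extension H plus 1 in Extension C gives 4,193.
added = size(EXT_H_15) + (size(EXT_C_15) - size(EXT_C_14))
```

A lookup that still uses the 14.0 table would report U+2B739 as outside Extension C, which is the kind of silent misclassification the range change can cause.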
IDNA Changes
- The file IdnaTestV2.txt now escapes certain characters using the \uXXXX and \x{XXXX} conventions. This was already documented in the file header, and the same escaping conventions were used in the earlier IdnaTest.txt file.
- New data files have been added, listing the IDNA Derived Property for
all versions of the Unicode Standard beginning with Version 6.1.
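A test harness reading IdnaTestV2.txt needs to undo the two escape forms before comparing strings. The following is a minimal sketch, assuming that \uXXXX always carries exactly four hexadecimal digits and that \x{XXXX} carries one to six; the file header remains the authority on the exact syntax, and the function name is hypothetical.

```python
import re

# Matches \uXXXX (four hex digits) or \x{X...} (one to six hex digits).
_ESCAPE_RE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\x\{([0-9A-Fa-f]{1,6})\}")


def unescape_field(field):
    """Replace \\uXXXX and \\x{XXXX} escapes with the characters they denote."""
    def _sub(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _ESCAPE_RE.sub(_sub, field)
```

Text without escapes passes through unchanged, so the function can be applied to every field regardless of whether the file version uses escaping.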
Emoji Changes
- 20 new emoji characters have been added. However, in addition
to those individual characters, many new emoji sequences have been
recognized, as well. Implementations supporting emoji should be
checked to reflect changes in
UTS #51, Unicode Emoji
and all of its associated data files.