Unicode® 15.1.0
Version 15.1.0 has been superseded by the latest version of the Unicode Standard.
This page summarizes the important changes for the Unicode Standard, Version 15.1.0.
This version supersedes all previous versions of the Unicode Standard.
A. Summary
B. Technical Overview
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Changes in the Unicode Character Database
G. Changes in the Unicode Standard Annexes
H. Changes in Synchronized Unicode Technical Standards
M. Implications for Migration
Unicode 15.1 adds 627 characters,
for a total of 149,813 characters.
There are several significant themes for this release of the Unicode Standard.
- The repertoire addition consists almost entirely of urgently needed CJK ideographs,
synchronized with planned additions to the Chinese national standard, GB 18030. The
remaining additions to the repertoire extend the set of ideographic description
characters, to better enable description of unusual CJK ideographs.
- Major updates were made to UAX #9, Unicode Bidirectional Algorithm, UAX #31,
Unicode Identifiers and Syntax, and UTS #39, Unicode Security Mechanisms, to
coordinate with the publication of an important new Unicode Technical Standard: UTS #55,
Unicode Source Code Handling.
- Segmentation rule changes, most notably:
  - Support was added to line breaking (UAX #14, Unicode Line Breaking Algorithm) for orthographic syllables in a number of South and Southeast Asian writing systems.
  - Grapheme cluster breaking (UAX #29, Unicode Text Segmentation) has adopted the aksara cluster behavior for six scripts. That cluster breaking behavior had previously been widely available via CLDR and ICU.
  - These changes involved significant character property updates.
Synchronization
Several other important Unicode specifications have been updated for Version 15.1.
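The following four Unicode Technical Standards are versioned in
synchrony with the Unicode Standard, because their data files cover the same repertoire.
All have been updated to Version 15.1:
- UTS #10, Unicode Collation Algorithm
- UTS #39, Unicode Security Mechanisms
- UTS #46, Unicode IDNA Compatibility Processing
- UTS #51, Unicode Emoji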
Some of the changes in Version 15.1 and associated Unicode Technical Standards
may require modifications
to implementations. For more information, see the migration and modification sections of
UTS #10, UTS #39, UTS #46, and UTS #51.
See Sections D through H below for additional details regarding the changes in this version of
the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.
Version 15.1 of the Unicode Standard consists of:
- The core specification (unchanged from Version 15.0)
- The code charts (delta and archival) for this version
- The Unicode Standard Annexes
- The Unicode Character Database (UCD)
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies normative and informative data for
implementers to allow them to implement the Unicode Standard.
The core specification is available as
a single PDF for viewing (14 MB).
Links are also available
in the navigation bar on the left of this page to access
individual chapters and appendices
of the core specification.
Several sets of code charts are available. They serve different
purposes:
- The latest set of code charts for
the Unicode Standard is available online. Those charts are always the most current
code charts available, and may be updated at any time. The charts are organized by
scripts and blocks for easy reference.
An online index by character name
is also provided.
For Unicode 15.1.0 in particular, two additional sets of code chart pages are provided:
- A set of delta code charts showing the
new blocks and any blocks in which characters were added for Unicode 15.1.0. The new characters are visually highlighted in the charts.
- A set of archival code charts that represents
the entire set of characters, names and representative glyphs at the time of publication of Unicode 15.1.0.
The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.
Links to the individual
Unicode Standard Annexes are available in
the navigation bar on the left of this page. The list of significant changes
in the content of the Unicode Standard Annexes for Version 15.1 can be found
in Section G below.
Data files
for Version 15.1 of
the Unicode Character Database are available. The ReadMe.txt in that directory provides a roadmap
to the functions of the various subdirectories.
Zipped versions of the UCD
for bulk download are available, as well.
Version 15.1.0 of the Unicode Standard
should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 15.1.0, (South San Francisco, CA: The Unicode Consortium,
2023. ISBN 978-1-936213-33-7)
https://www.unicode.org/versions/Unicode15.1.0/
The terms “Version 15.1” or “Unicode 15.1” are abbreviations for the full version reference, Version 15.1.0.
The citation and permalink for the latest published version of the Unicode Standard are:
The Unicode Consortium. The Unicode Standard.
https://www.unicode.org/versions/latest/
A complete specification of the contributory files for Unicode
15.1 is found on the page Components for 15.1.0.
That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.
Errata incorporated into Unicode 15.1 are listed by date in
a separate table. For corrigenda and errata after the release of Unicode 15.1, see the list of current
Updates and Errata.
The Case Folding Stability policy has been extended with an explicit statement of the
stability of case folding as applicable to toNFKC_Casefold(S) between versions of
the Unicode Standard. A clarification has been added regarding the subtle distinction
between toNFKC_Casefold(S) and toCasefold(toNFKC(S)).
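One observable difference between the two operations can be sketched in Python. The snippet computes only the naive composition toCasefold(toNFKC(S)) with the standard library. Python has no built-in toNFKC_Casefold, so that side of the comparison is described in a comment rather than computed. The helper name naive_nfkc_casefold is hypothetical.

```python
import unicodedata


def naive_nfkc_casefold(s):
    """toCasefold(toNFKC(S)): normalize to NFKC, then apply full case folding."""
    return unicodedata.normalize("NFKC", s).casefold()


SOFT_HYPHEN = "\u00AD"  # a default ignorable code point

folded = naive_nfkc_casefold("Co" + SOFT_HYPHEN + "OP")

# The soft hyphen survives the naive composition. The NFKC_Casefold
# property, by contrast, maps default ignorable code points to nothing,
# so toNFKC_Casefold would drop it.
print([f"U+{ord(c):04X}" for c in folded])
```

Identifier comparison code that relies on the naive composition may therefore treat two strings as different where toNFKC_Casefold treats them as equal.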
Changes in the Unicode Standard Annexes are listed in Section G.
Character Assignment Overview
627 characters have been added.
For details, see the delta code charts.
New Blocks
There is one newly-defined block in Version 15.1:
Range | Block Name
2EBF0..2EE5F | CJK Unified Ideographs Extension I
The block for CJK Unified Ideographs Extension I was placed
near the end of Plane 2, immediately after Extension F, instead of
on Plane 3 after Extension H, in order to make best use of the allocation
space available on Plane 2.
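The effect of this placement on range-based code can be illustrated with a short sketch. Only the Extension I range comes from this page; the ranges for Extensions F, G, and H are quoted from memory of Blocks.txt and should be verified against the UCD. The names CJK_EXTENSION_BLOCKS, plane, and extension_of are hypothetical.

```python
# Block ranges (first, last). Extension I is taken from the table above;
# F, G, and H are assumptions to be checked against Blocks.txt.
CJK_EXTENSION_BLOCKS = {
    "F": (0x2CEB0, 0x2EBEF),
    "G": (0x30000, 0x3134F),
    "H": (0x31350, 0x323AF),
    "I": (0x2EBF0, 0x2EE5F),
}


def plane(cp):
    """Return the plane number of a code point (65,536 code points per plane)."""
    return cp >> 16


def extension_of(cp):
    """Return the extension letter whose block contains cp, or None."""
    for letter, (first, last) in CJK_EXTENSION_BLOCKS.items():
        if first <= cp <= last:
            return letter
    return None


# Sorting by starting code point does not give alphabetical order.
by_code_point = sorted(CJK_EXTENSION_BLOCKS, key=lambda k: CJK_EXTENSION_BLOCKS[k][0])
print(by_code_point)                      # ['F', 'I', 'G', 'H']
print(plane(0x2EBF0), plane(0x31350))     # 2 3
```

A lookup keyed on explicit ranges, as above, keeps working when a later extension is allocated in an earlier gap.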
There are no new conformance requirements for the core specification in Unicode 15.1.
However, the conformance clauses in several Unicode Standard Annexes
and Unicode Technical Standards have been reorganized and split in some
cases to make it easier to exactly specify conformance to tailored versions
of some Unicode algorithms. UAX #29 has added new conformance clauses.
The detailed listing of all changes to the contributory data files of the Unicode Character Database
for Version 15.1 can be found in
UAX #44, Unicode Character Database.
The changes listed there include character additions and property revisions to existing characters that will affect implementations.
Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in
Section M.
In Version 15.1, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section
of each UAX, linked directly from the following list of UAXes.
Unicode Standard Annex | Changes
UAX #9 Unicode Bidirectional Algorithm | Significant clarification was added to the text regarding BD16 and the interaction of control flow between W4, W5, and W6. The use of sos and the treatment of AN/EN with brackets in N0 were also clarified. The text regarding retaining BNs and explicit formatting characters was updated. A major example of the use of HL4 for URLs was added in Section 4.3.3, and a reference to the new UTS #55 was added in Section 4.3.2.
UAX #11 East Asian Width | No significant changes in this version.
UAX #14 Unicode Line Breaking Algorithm | Support was added for line breaking at orthographic syllable boundaries, including the introduction of five new line breaking classes for characters. Rule LB15 was split into LB15a and LB15b, to improve the handling of French style quotation marks. A clearer characterization of allowed tailorings was added to Section 8.1. Various other clarifications and small updates to the text and examples were also made.
UAX #15 Unicode Normalization Forms | No significant changes in this version.
UAX #24 Unicode Script Property | No significant changes in this version.
UAX #29 Unicode Text Segmentation | Explicit conformance rules for each type of segmentation were added to the Conformance section. Support for orthographic syllable breaking was added in a new rule GB9c. The definition of "crlf" was updated in the table of Regex Definitions. Multiple changes were made to the table of Word_Break Property Values. A note was added in Section 3.1.1 clarifying that each emoji sequence constitutes a single grapheme cluster.
UAX #31 Unicode Identifiers and Syntax | This UAX was retitled to better reflect its scope. Multiple changes were made to the section of Default Identifiers, including the removal of UAX31-R1a, Restricted Format Characters. A significant example was added to UAX31-R1b, Stable Identifiers. Section 4 was completely rewritten, separating the discussion of whitespace and of syntax. The section on limited contexts for joining controls was moved out of this annex and into UTS #39. Section 7 was added, with three new standard profiles: mathematical compatibility notation, emoji, and default ignorable exclusion.
UAX #34 Unicode Named Character Sequences | No significant changes in this version.
UAX #38 Unicode Han Database (Unihan) | Documentation was added for CJK Unified Ideographs Extension I and for 6 new provisional properties. 7 existing provisional properties were removed. The syntax, list of sources, and/or descriptions were updated for the kIRG_GSource, kIRG_KSource, and kIRG_KPSource properties. Syntax and descriptions were also updated for several other properties, including kRSUnicode.
UAX #41 Common References for Unicode Standard Annexes | All references were updated for Unicode 15.1.
UAX #42 Unicode Character Database in XML | New code point attributes, values, and patterns were added for Unicode 15.1.
UAX #44 Unicode Character Database | The documentation was updated to describe the changes to the UCD for Version 15.1.
UAX #45 U-Source Ideographs | A new Section 3 was added, documenting the ranges of U-source ideographs that were added in each version of the Unicode Standard. The N, V, W, and X status values were updated to the more descriptive FutureWS, Variant, Rejected, and NoAction, respectively. The now-obsolete UK-2015 and WS-2017 status values were removed.
UAX #50 Unicode Vertical Text Layout | No significant changes in this version.
There are also significant revisions in the Unicode Technical Standards whose
versions are synchronized with the Unicode Standard. The most important of these changes are listed below.
For the full details of all changes, see the Modifications section
of each UTS, linked directly from the following list of UTSes.
Unicode Technical Standard | Changes
UTS #10 Unicode Collation Algorithm | No significant changes in this version.
UTS #39 Unicode Security Mechanisms | The definition and discussion of the contexts for joining controls was moved from UAX #31 into this UTS. The definition of confusability was updated to take default ignorable code points into account. A new confusability relation suitable for identifiers containing bidirectional text was added.
UTS #46 Unicode IDNA Compatibility Processing | Transitional processing of Deviation characters has been deprecated. All major implementations now use nontransitional processing. Step 7 in Section 6 was changed to no longer check for NFD validity; this changed three characters from disallowed_STD3_valid to valid. In nontransitional processing, U+1E9E capital sharp s (ẞ) now maps to U+00DF small sharp s (ß).
UTS #51 Unicode Emoji | A short discussion of the interactions of emoji with computer language syntaxes was added. Minor updates were also made to account for new emoji sequences added in Version 15.1.
There are a significant number of changes in Unicode 15.1 which may impact implementations upgrading
to Version 15.1 from earlier versions of the standard. The most important of these are listed
and explained here, to help focus on the issues most likely to cause unexpected trouble during upgrades.
Script-related Changes
Because of the limited scope of new repertoire for Version 15.1, there are
no migration issues of note specifically tied to various scripts, other than
the Han script (see below).
General Character Property Issues
- There are 5 new ideographic description characters. These extend the
syntax of ideographic description sequences.
- Two of the new ideographic description characters function
as unary operators, which necessitated introduction of a new binary property: IDS_Unary_Operator.
- There are two new properties, ID_Compat_Math_Start and
ID_Compat_Math_Continue, for the new
Mathematical Compatibility Notation Profile in UAX #31.
- There is a new property NFKC_Simple_Casefold which establishes
another normalization form like NFKC_Casefold does. The new one
uses Simple_Case_Folding mappings rather than full Case_Folding
mappings. This is intended for use in systems that support
case-insensitive identifiers based on simple (1:1) case folding
mappings.
- Five new values have been added to the Line_Break property, in support
of new orthographic line breaking rules for a significant number of South and Southeast Asian
scripts.
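The practical difference between simple and full case folding, which motivates NFKC_Simple_Casefold, can be seen in a small sketch. Python's str.casefold() applies full case folding only; the standard library has no simple case folding, so the snippet merely shows where full folding stops being 1:1. The helper name full_fold_changes_length is hypothetical.

```python
def full_fold_changes_length(ch):
    """Return True if full case folding maps one character to several."""
    return len(ch.casefold()) != len(ch)


# U+0041 folds 1:1; U+00DF and U+FB01 expand under full case folding.
for ch in ["A", "\u00DF", "\uFB01"]:
    print(f"U+{ord(ch):04X}", repr(ch.casefold()), full_fold_changes_length(ch))
```

Systems that need folded identifiers to keep a one-to-one correspondence with the original characters are the intended users of the simple variant.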
Segmentation
There is a new grapheme cluster segmentation rule GB9c in UAX #29 which refers to a new enumerated property Indic_Conjunct_Break. The list of scripts affected by this rule is expected to
expand in subsequent versions of
the Unicode Standard. (Note that this outcome differs from the preliminary solution discussed during
the beta review for Version 15.1.0, which used macros instead of a new property in the
statement of rule GB9c.)
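The shape of rule GB9c can be illustrated with a minimal sketch. The rule as coded here (a consonant, then any mix of extenders and linkers containing at least one linker, then no break before the next consonant) is paraphrased from UAX #29 and should be checked against the annex. The property table is a tiny hand-picked Devanagari excerpt rather than the UCD data, and the names INCB, incb, and gb9c_forbids_break are hypothetical.

```python
# Hand-picked Indic_Conjunct_Break values (assumed excerpt, not the full data).
INCB = {
    "\u0915": "Consonant",  # DEVANAGARI LETTER KA
    "\u0937": "Consonant",  # DEVANAGARI LETTER SSA
    "\u094D": "Linker",     # DEVANAGARI SIGN VIRAMA
    "\u093C": "Extend",     # DEVANAGARI SIGN NUKTA
    "\u200D": "Extend",     # ZERO WIDTH JOINER
}


def incb(ch):
    """Look up the Indic_Conjunct_Break value, defaulting to 'None'."""
    return INCB.get(ch, "None")


def gb9c_forbids_break(text, i):
    """Return True if the sketched GB9c forbids a break before text[i]."""
    if i <= 0 or i >= len(text) or incb(text[i]) != "Consonant":
        return False
    j = i - 1
    seen_linker = False
    # Walk back over extenders and linkers, remembering whether a linker occurred.
    while j >= 0 and incb(text[j]) in ("Extend", "Linker"):
        if incb(text[j]) == "Linker":
            seen_linker = True
        j -= 1
    return seen_linker and j >= 0 and incb(text[j]) == "Consonant"


conjunct = "\u0915\u094D\u0937"   # KA + VIRAMA + SSA
print(gb9c_forbids_break(conjunct, 2))          # True: no break inside the conjunct
print(gb9c_forbids_break("\u0915\u0937", 1))    # False: no linker between consonants
```

This covers only GB9c; a full grapheme cluster segmenter also needs the remaining rules of UAX #29.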
There is a new line breaking rule LB28a in UAX #14, to prevent breaks inside orthographic syllables
of Brahmic scripts. That new rule uses the new Line_Break property values. It also includes the
use of a dotted circle in its regular expressions. The dotted circle is a literal character; that is, it matches U+25CC ◌ DOTTED CIRCLE.
Numeric Property Issues
- There is one large new value in extracted/DerivedNumericValues.txt: 10000000000000000 (for U+4EAC)
- U+5146 has two kPrimaryNumeric values: 1000000, 1000000000000
- U+79ED has two kPrimaryNumeric values: 1000000000, 1000000000000
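Both issues can be checked with a short sketch: a parser that accepts more than one kPrimaryNumeric value, and a width check for the new large value. The sketch assumes the usual tab-separated Unihan line layout with space-separated values in the last field; the function name parse_primary_numeric is hypothetical.

```python
def parse_primary_numeric(line):
    """Parse one kPrimaryNumeric line into (code point, list of integer values)."""
    code_point, prop, value = line.rstrip("\n").split("\t")
    if prop != "kPrimaryNumeric":
        raise ValueError(f"unexpected property: {prop}")
    # A code point may carry several values, so always return a list.
    return code_point, [int(v) for v in value.split()]


print(parse_primary_numeric("U+5146\tkPrimaryNumeric\t1000000 1000000000000"))

# The new value for U+4EAC needs more than 32 bits and exceeds 2**53,
# the limit below which every integer is exact in a 64-bit float.
LARGEST = 10_000_000_000_000_000
print(LARGEST > 2**31 - 1, LARGEST > 2**53, LARGEST <= 2**63 - 1)
```

Code that stores numeric values in 32-bit integers, or that assumes exactly one value per code point, is the most likely to need changes.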
CJK/Unihan Changes
- A new CJK unified ideograph block, Extension I, has been added, with
622 characters in the range U+2EBF0..U+2EE5D. Implementers should check
carefully for any hard-coded assumptions about CJK ranges.
To keep the CJK block ranges as compact as possible, Extension I has
been added to Plane 2, instead of directly after Extension H on Plane 3.
Implementers should also check that their code does not assume that CJK extensions
all occur in alphabetic order by the extension letter.
- Some kRSUnicode values now include double-apostrophe radicals, sometimes as the only values for a code point.
- Seven old provisional properties have been removed.
- Six new provisional properties have been added.
See UAX
#38, Unicode Han Database (Unihan) for further details on these changes,
especially Section 4.2, Listing
by Date of Addition to the Unicode Standard, and Section 4.3, Listing by
Location within Unihan.zip.
UAX #38 also has updated regex values for numerous
Unihan properties. For the double-apostrophe radicals, see:
UTS #46 (IDNA) Changes
- Transitional processing (see conformance clause C1) has now been deprecated
in UTS #46, Unicode IDNA Compatibility Processing.
- In nontransitional processing, U+1E9E capital sharp s (ẞ) now maps to U+00DF small sharp s (ß), so that domain names with either input character always match. Until Unicode 15.0, capital sharp s mapped to "ss", which is the same as the mapping for small sharp s in transitional processing.
- U+2260 (≠), U+226E (≮), and U+226F (≯) are now unconditionally valid, rather than disallowed_STD3_valid.
- There are a couple of additional, minor changes to the validity criteria. See the UTS #46 Modifications section for details.
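The sharp s change and the three newly valid characters can be illustrated with a toy sketch. The two mapping tables cover only U+1E9E and are stand-ins for the real UTS #46 mapping data; lowercasing every other character with str.lower() is a simplification, and the names SHARP_S_MAP_15_0, SHARP_S_MAP_15_1, and toy_map are hypothetical.

```python
import unicodedata

# Toy nontransitional mappings for capital sharp s only (assumed stand-ins
# for the real mapping table).
SHARP_S_MAP_15_0 = {"\u1E9E": "ss"}       # Unicode 15.0: mapped to "ss"
SHARP_S_MAP_15_1 = {"\u1E9E": "\u00DF"}   # Unicode 15.1: maps to small sharp s


def toy_map(label, table):
    """Apply the toy table, lowercasing every character it does not cover."""
    return "".join(table.get(ch, ch.lower()) for ch in label)


upper, lower = "STRA\u1E9EE", "stra\u00DFe"
print(toy_map(upper, SHARP_S_MAP_15_0) == toy_map(lower, SHARP_S_MAP_15_0))  # False
print(toy_map(upper, SHARP_S_MAP_15_1) == toy_map(lower, SHARP_S_MAP_15_1))  # True

# The three newly valid characters are stable under NFC, but their NFD
# forms start with the ASCII characters '=', '<', and '>'.
for ch in "\u2260\u226E\u226F":
    nfd = unicodedata.normalize("NFD", ch)
    print(f"U+{ord(ch):04X}", [f"U+{ord(c):04X}" for c in nfd])
```

Under the 15.1 mapping both spellings yield the same label, which is the matching behavior described in the list above.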
Changes to Code Charts
- The code charts for the main CJK Unified Ideographs block (U+4E00) have
an updated format that uses 7 columns for source glyphs, instead of 6. The
KP source glyphs have been explicitly added to the code charts.
- The font used for the representative glyphs of the Alchemical Symbols
block has been updated.
Collation-related Changes
There has been an update to DUCET regarding the weighting of quotation
marks. Various single quotation marks are now weighted as secondary variants
of U+0027 (') APOSTROPHE, and various double quotation marks are now weighted
as secondary variants of U+0022 (") QUOTATION MARK. U+05F3 (׳) HEBREW PUNCTUATION GERESH
is also weighted as a secondary variant of U+0027, and U+05F4 (״) HEBREW PUNCTUATION
GERSHAYIM is weighted as a secondary variant of U+0022. This change enables
better behavior of geresh and gershayim for searching and sorting, and brings
UCA more in line with the CLDR tailorings for quotation marks, geresh, and
gershayim.
Emoji Changes
There are no new emoji characters in Unicode 15.1, but 118 new RGI emoji ZWJ
sequences and 17 presentation sequences have been added to the overall
emoji repertoire. For details, see the Unicode 15.1 emoji charts and Emoji Recently Added, v15.1.