Unicode 4.1.0
Version 4.1.0 has been superseded by the latest version of the Unicode Standard.
Version 4.1.0 of the Unicode Standard consists of the core
specification, The Unicode Standard,
Version 4.0, as amended by
Unicode 4.0.1 and
further amended by this specification,
the delta and archival code charts for this version, the Unicode Standard Annexes,
and the Unicode Character Database (UCD).
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies normative and informative data for
implementers to allow them to implement the Unicode Standard.
Version 4.1.0 of the Unicode Standard
should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 4.1.0,
defined by: The Unicode Standard, Version 4.0 (Boston, MA,
Addison-Wesley, 2003. ISBN 0-321-18578-1),
as amended by Unicode 4.0.1
(http://www.unicode.org/versions/Unicode4.0.1/)
and by Unicode 4.1.0
(http://www.unicode.org/versions/Unicode4.1.0/).
A complete specification of the contributory files for Unicode
4.1.0 is found on
the page
Components for Version 4.1.0.
Contents of This Document
Online Edition
Overview
Notable Changes From Unicode 4.0.1 to Unicode 4.1.0
Conformance Changes to the Standard
Other Changes to the Standard
Superseded Sections
Unicode Character Database
Errata Corrected in This Version
Script Additions
Significant Character Additions
Online Edition
The text of The Unicode Standard, Version 4.0, as well as the
delta and archival code charts, is available online via the navigation links
on this page. These files may not be printed. The
Unicode 4.0
Web Bookmarks page has links to all sections of the online text.
Overview
Unicode 4.1.0 is a minor version of the Unicode Standard. 1273 new characters have been added. This document
provides information about those additional characters, as well as
further clarifications of text of the standard. In addition it covers
accumulated corrigenda and errata to the text.
There are significant changes to many of the Unicode Standard
Annexes which are part of Unicode 4.1.0. Each annex has a
modification section listing the changes in that annex.
Notable Changes From Unicode 4.0.1 to Unicode 4.1.0
- Addition of 1273 new characters to the standard, including those to
complete roundtrip mapping of the HKSCS and GB 18030 standards, five new
currency signs, some characters for Indic and Korean, and eight new scripts.
(The exact list of additions can be seen in DerivedAge.txt, in the age=4.1
section.)
- Change in the end of the CJK Unified Ideographs range from U+9FA5 to U+9FBB,
with the addition of some Han characters. The boundaries of such ranges are
sometimes hardcoded in software, in which case the hardcoded value needs to
be changed.
- New Unicode Standard Annexes:
UAX #31, Identifier and Pattern Syntax
and UAX #34, Unicode Named Character Sequences, and significant changes to
other Unicode Standard Annexes.
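The change to the end of the CJK Unified Ideographs range noted above is the kind of boundary that tends to be hardcoded. The following is a minimal Python sketch, assuming a hypothetical lookup table keyed by version string so that the range end lives in one place. The names CJK_UNIFIED_END and is_cjk_unified_ideograph are illustrative and not part of the standard, and the sketch covers only the main block starting at U+4E00, not the extension blocks.

```python
# Hypothetical table: last code point of the CJK Unified Ideographs range,
# keyed by Unicode version (values taken from the text above).
CJK_UNIFIED_START = 0x4E00
CJK_UNIFIED_END = {"4.0.1": 0x9FA5, "4.1.0": 0x9FBB}


def is_cjk_unified_ideograph(cp, version="4.1.0"):
    """Return True if cp lies in the main CJK Unified Ideographs range."""
    return CJK_UNIFIED_START <= cp <= CJK_UNIFIED_END[version]
```

With this arrangement a code point such as U+9FBB is accepted under the 4.1.0 entry and rejected under the 4.0.1 entry, without touching the range test itself.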
In addition to the repertoire additions, there have been a number of significant
changes to the Unicode Character Database files and the
properties in them. In particular:
- Three new properties, Grapheme_Cluster_Break, Sentence_Break, and Word_Break, have been added in support of
UAX #29, Text Boundaries. Their enumeration can be found in new data files, located in the new
"auxiliary" subdirectory of the UCD. Also in the "auxiliary" subdirectory are the test data files and HTML break charts associated with UAX #29.
- The new property Other_ID_Continue has been added to support identifier stability. It is enumerated in PropList.txt and is used in the derivation of other identifier-related properties.
- Two new properties, Pattern_Syntax and Pattern_White_Space, have been added in support of
UAX #31, Identifier and Pattern Syntax. Their enumeration can be found in PropList.txt.
- The bidi properties of a few compatibility equivalents of characters whose bidi classes changed for Unicode 4.0.1 have been harmonized.
- The case mapping contexts defined in SpecialCasing.txt have been updated and now override Table 3-13. Context Specification for Casing on p. 89 of The Unicode Standard, Version 4.0.
These changes are described below in the section Modifications to Default Case Operations.
- Alphabetic is now a superset of Lowercase and Uppercase for compatibility
with POSIX-style character classes.
- A new data file NamedSequences.txt has been added in conjunction with
UAX #34, Named Character
Sequences. This data file defines specific names for some significant Unicode
character sequences, giving their USI (Unicode Sequence Identifiers) values.
- The linebreak properties of Runic, Indic, Mongolian, Tibetan
punctuation, and Hangul have been revised to better match their
expected behavior. (See UAX #14: Line Breaking Properties.)
The following complete scripts have been added in Unicode
4.1.0:
- New Tai Lue (U+1980..U+19DF)
- Buginese (U+1A00..U+1A1F)
- Glagolitic (U+2C00..U+2C5F)
- Coptic (U+2C80..U+2CFF)
- Tifinagh (U+2D30..U+2D7F)
- Syloti Nagri (U+A800..U+A82F)
- Old Persian (U+103A0..U+103DF)
- Kharoshthi (U+10A00..U+10A5F)
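As an illustration of how the ranges listed above might be consulted, the following Python sketch stores them in a small table and looks up a code point. The names NEW_SCRIPT_BLOCKS and new_script_block are hypothetical; the ranges are the ones given in the list, and nothing here reads the Unicode Character Database.

```python
# Ranges of the scripts added in Unicode 4.1.0, as listed above.
NEW_SCRIPT_BLOCKS = [
    (0x1980, 0x19DF, "New Tai Lue"),
    (0x1A00, 0x1A1F, "Buginese"),
    (0x2C00, 0x2C5F, "Glagolitic"),
    (0x2C80, 0x2CFF, "Coptic"),
    (0x2D30, 0x2D7F, "Tifinagh"),
    (0xA800, 0xA82F, "Syloti Nagri"),
    (0x103A0, 0x103DF, "Old Persian"),
    (0x10A00, 0x10A5F, "Kharoshthi"),
]


def new_script_block(cp):
    """Return the name of the 4.1.0 script range containing cp, or None."""
    for start, end, name in NEW_SCRIPT_BLOCKS:
        if start <= cp <= end:
            return name
    return None
```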
Two scripts have been disunified or reorganized:
- Coptic is now considered a separate script from Greek. This
differs from prior documentation in the standard. A new Coptic
block has been added, including characters for Old Coptic. It should be noted, however, that the 14 Coptic
letters derived from Demotic, which had already been encoded in the Greek and Coptic
block, are unchanged, and need to be included in any complete
implementation of Coptic.
- The Nuskhuri forms of Khutsuri Georgian have been added in
a new Georgian Supplement block (U+2D00..U+2D2F). Those
characters are now to be taken as the lowercase pairs of
the Asomtavruli Georgian encoded at U+10A0..U+10C5. This
introduction of case pairs for Khutsuri is a change from
the previous documentation about Georgian in the standard.
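The Georgian case pairing described above can be observed with Python's built-in case mapping. This is a sketch only: str.lower() and str.upper() use whatever version of the Unicode Character Database the interpreter ships with (reported by unicodedata.unidata_version), which is later than 4.1.0, so it illustrates the pairing rather than verifying the 4.1.0 data files. The constant OFFSET is an observation about the two ranges, not a value defined by the standard.

```python
import unicodedata

# UCD version bundled with this interpreter (later than 4.1.0).
print("UCD version in use:", unicodedata.unidata_version)

# Asomtavruli letters U+10A0..U+10C5 pair with Nuskhuri letters in the
# Georgian Supplement block, which starts at U+2D00.
OFFSET = 0x2D00 - 0x10A0

asomtavruli = [chr(cp) for cp in range(0x10A0, 0x10C6)]
nuskhuri = [ch.lower() for ch in asomtavruli]

# True when every letter maps to the code point at the same offset.
pairs_are_regular = all(
    ord(lo) == ord(up) + OFFSET for up, lo in zip(asomtavruli, nuskhuri)
)
```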
Beyond the addition of entire scripts, there
have been very significant extensions to the repertoire
for the Arabic script and the Ethiopic script. A large number of additional
Latin characters have been added as phonetic extensions to
support various orthographic conventions for minority
languages. There are also significant additions of Greek
symbols and punctuation to support specialist representation
of ancient Greek materials. Several small sets have been added to CJK Unified
Ideographs and to associated blocks.
A few characters have been added to supplement Hebrew, in particular for
support of Biblical Hebrew text representation.
U+060B AFGHANI SIGN has been added. While some glyph variants of
this character do occur, the form shown in the code charts is that
approved by the Ministry of Finance of the Afghanistan government.
U+09CE BENGALI LETTER KHANDA TA has been added. This will necessitate
adjustment of Bengali script implementations. In Unicode 4.1, recommendations
for the representation of Khanda-Ta in Bengali differ from those documented in
Version 4.0.1 and earlier.
Conformance Changes to the Standard
Modifications to Default Case Operations
The following amends Section 3.13, Default Case Operations, on p. 89-90 of
The Unicode Standard,
Version 4.0.
Add after D47:
D47a A character C is defined to be
case-ignorable
if C has the Unicode Property Word_Break=MidLetter as defined in Unicode Standard
Annex #29, "Text Boundaries"; or the General Category of C is Nonspacing Mark (Mn), Enclosing Mark (Me), Format Control
(Cf), Letter Modifier (Lm), or Symbol Modifier (Sk).
D47b A case-ignorable sequence is a sequence of zero or more
case-ignorable characters.
Replace Table 3-13, Context Specification by the following:
A description of each context is followed by the equivalent regular expression(s)
describing the context before C and the context after C, or both. The regular expression uses the syntax of
Unicode Technical Standard
#18, "Unicode Regular Expressions", with one addition: "!" means that the expression does not match. All regular
expressions below are case-sensitive.
Table 3-13. Context Specification for Casing
Context: Final_Sigma
Description: C is preceded by a sequence consisting of a cased letter and a case-ignorable sequence, and C is not followed by a sequence consisting of a case-ignorable sequence and then a cased letter.
Before C: \p{cased} (\p{case-ignorable})*
After C: ! ( (\p{case-ignorable})* \p{cased} )
Context: After_Soft_Dotted
Description: There is a Soft_Dotted character before C, with no intervening character of combining class 0 or 230 (ABOVE).
Before C: [\p{Soft_Dotted}] ([^\p{cc=230} \p{cc=0}])*
Context: More_Above
Description: C is followed by a character of combining class 230 (ABOVE), with no intervening character of combining class 0.
After C: [^\p{cc=0}]* [\p{cc=230}]
Context: Before_Dot
Description: C is followed by combining dot above (U+0307). Any sequence of characters with a combining class that is neither 0 nor 230 may intervene between the current character and the combining dot above.
After C: ([^\p{cc=230} \p{cc=0}])* [\u0307]
Context: After_I
Description: There is an uppercase I before C, and there is no intervening character of combining class 230 (ABOVE) or 0.
Before C: [I] ([^\p{cc=230} \p{cc=0}])*
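The Final_Sigma context can be sketched in Python as follows. Several simplifications are assumptions made for illustration: the standard library does not expose Word_Break, so ASSUMED_MIDLETTER is a small hand-picked stand-in for Word_Break=MidLetter; "cased" is approximated by the General Categories Lu, Ll, and Lt; and general categories come from the UCD version bundled with the interpreter, not from 4.1.0. The function names are hypothetical.

```python
import unicodedata

# Assumption: a tiny stand-in for the Word_Break=MidLetter set.
ASSUMED_MIDLETTER = {"\u0027", "\u00B7", "\u2019", "\u2027"}
CASE_IGNORABLE_GC = {"Mn", "Me", "Cf", "Lm", "Sk"}


def is_case_ignorable(ch):
    return ch in ASSUMED_MIDLETTER or unicodedata.category(ch) in CASE_IGNORABLE_GC


def is_cased(ch):
    # Approximation of "cased letter" by General Category.
    return unicodedata.category(ch) in {"Lu", "Ll", "Lt"}


def is_final_sigma(text, i):
    """Test the Final_Sigma context for the character at index i."""
    # Before C: \p{cased} (\p{case-ignorable})*
    j = i - 1
    while j >= 0 and is_case_ignorable(text[j]):
        j -= 1
    if j < 0 or not is_cased(text[j]):
        return False
    # After C: ! ( (\p{case-ignorable})* \p{cased} )
    k = i + 1
    while k < len(text) and is_case_ignorable(text[k]):
        k += 1
    return not (k < len(text) and is_cased(text[k]))


def lowercase_with_final_sigma(text):
    """Lowercase text, choosing final or non-final sigma for U+03A3."""
    out = []
    for i, ch in enumerate(text):
        if ch == "\u03A3":
            out.append("\u03C2" if is_final_sigma(text, i) else "\u03C3")
        else:
            out.append(ch.lower())
    return "".join(out)
```

Because the approximated cased and case-ignorable sets do not overlap, skipping case-ignorable characters greedily gives the same answer as the regular expressions would.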
Clarification of Decomposition Mappings
In order to ensure, as intended, that decomposition mappings for each
version of the standard derive from the Unicode Character Database for that
version of the standard, the phrases in D18, D20, and D23 reading "according
to the decomposition mappings found in the names list of Section 16.1,
Character Names List" are changed to "according to the decomposition mappings
found in the Unicode Character Database".
Other Changes to the Standard
Change in status of recommendation of SPACE as a base for display of nonspacing marks.
The UTC has decided that U+0020 SPACE is no longer recommended as
a suitable base character for display of isolated nonspacing
marks. Instead, U+00A0 NO-BREAK SPACE is the preferred base
character for this function.
The explanatory text of The Unicode Standard Version 4.0, page 46, "Spacing Clones of European
Diacritical Marks" is updated to read as follows:
Nonspacing combining marks used by the Unicode Standard may be
exhibited in apparent isolation by applying them to U+00A0 NO-BREAK
SPACE. This convention might be employed, for example, when talking
about the combining mark itself as a mark, rather than using it in its
normal way in text applied as an accent to a base letter or in other
combinations.
Prior to Version 4.1 of the Unicode Standard,
the standard also recommended the use of U+0020 SPACE for
display of isolated combining marks. This is no longer
recommended because of potential
conflicts with the handling of sequences of U+0020 space characters in
such contexts as XML.
The Unicode Standard separately encodes clones of many common European
diacritical marks, primarily for compatibility with existing character
set standards. These cloned accents and diacritics are spacing
characters, and can be used to display the mark in isolation, without
application to a no-break space. They are cross-referenced to the
corresponding combining mark in the names list in Chapter 16, Code
Charts. For example, U+02D8 BREVE is cross-referenced to U+0306
COMBINING BREVE. Most of these spacing clones also have compatibility
decomposition mappings involving U+0020 SPACE, but implementers should
be cautious in making use of those decomposition mappings because of
the complications that can result from replacing a spacing character
with a space + combining mark sequence.
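A short Python sketch of the two approaches described above: building an isolated mark on U+00A0 NO-BREAK SPACE, and inspecting the compatibility decomposition of a spacing clone. It relies on the standard unicodedata module, whose data comes from the UCD version bundled with the interpreter rather than from 4.1.0; the variable names are illustrative.

```python
import unicodedata

NBSP = "\u00A0"             # NO-BREAK SPACE
COMBINING_BREVE = "\u0306"  # COMBINING BREVE
SPACING_BREVE = "\u02D8"    # BREVE (spacing clone)

# Preferred way to show the combining mark in apparent isolation.
isolated_breve = NBSP + COMBINING_BREVE

# The spacing clone has a compatibility decomposition involving U+0020 SPACE,
# which is why replacing it by its decomposition calls for caution.
clone_decomposition = unicodedata.normalize("NFKD", SPACING_BREVE)
decomposes_to_space_plus_mark = clone_decomposition == "\u0020" + COMBINING_BREVE
```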
See UAX #14: Line Breaking Properties for corresponding changes.
Change in equivalence for NO-BREAK SPACE
The Unicode Standard, Version 4.0, p. 387 states:
U+00A0 NO-BREAK SPACE behaves like the following coded character
sequence: U+FEFF ZERO WIDTH NO-BREAK SPACE + U+0020 SPACE +
U+FEFF ZERO WIDTH NO-BREAK SPACE.
That sentence is stricken from the text of the Unicode Standard,
Version 4.1.0, because it is incorrect. The behavior in bidirectional
text layout is not identical for these sequences (see
UAX #9: The
Bidirectional Algorithm). For linebreaking, there are differences
with respect to a following SPACE character (see
UAX #14:
Line Breaking Properties). In addition, the use of U+FEFF for
word-joining has been deprecated in favor of U+2060 WORD JOINER.
Use of CGJ to Prevent Reordering
The following modifies the section headed Combining Grapheme
Joiner in Section 15.2, Layout Controls on page 392 of
The Unicode Standard,
Version 4.0.
Replace this text on page 392:
U+034F COMBINING GRAPHEME JOINER is used to indicate that
adjacent characters are to be treated as a unit for the purposes of
language-sensitive collation and searching. In language-sensitive collation
and searching, the combining grapheme joiner should be ignored unless it
specifically occurs within a tailored collation element mapping. Thus it is
given a completely ignorable collation element in the default collation
table, like NULL (see Unicode Technical Standard #10, "Unicode Collation
Algorithm," and also ISO/IEC 14651). However, it can be entered into the
tailoring rules for any given language, using the tailoring capabilities of
the collation standards.
by the following text:
U+034F COMBINING GRAPHEME JOINER is used to affect the
collation of adjacent characters for purposes of
language-sensitive collation and searching, and to distinguish
sequences that would otherwise be canonically equivalent.
Formally, the combining grapheme joiner is not a format
control character, but rather a combining mark. It has
the General Category value gc=Mn and the canonical combining class
value ccc=0. These property assignments result in the
following behavior, which can be useful in certain
circumstances.
The presence of a combining grapheme
joiner in the midst of a combining character sequence does
not interrupt the combining character sequence; a process
which is accumulating and processing all the characters
of a combining character sequence would include a
combining grapheme joiner as part of that sequence. (This
differs from the behavior for most format control characters,
whose presence would interrupt a combining character sequence.)
However, because the combining grapheme joiner has a combining class of 0,
canonical reordering will not reorder any adjacent combining marks around a
combining grapheme joiner. (See the definition of canonical reordering in
Section 3.11, Canonical Reordering Behavior in Unicode 4.0.) In turn, this
means that insertion of a combining grapheme joiner
between two combining marks will prevent normalization
from switching the position of those two combining marks,
regardless of their own combining classes.
This side-effect
of the character properties of the combining grapheme
joiner, together with the fact that the combining grapheme
joiner has no visible glyph and no other format effect on
neighboring characters, can be taken advantage of in those
exceptional circumstances where two alternative orderings
of a sequence of combining marks must be distinguished for
some processing or rendering purpose and where normalization
would otherwise eliminate the distinction between the two
sequences.
For example, this is one way to avoid the less-than-optimal
assignment of fixed-position combining classes to certain
Hebrew accents and marks which do in fact interact typographically
and for which accent order distinctions need to be maintained
for analytic and text representational purposes. In
particular:
<lamed, patah, hiriq, finalmem>
is canonically equivalent to:
<lamed, hiriq, patah, finalmem>
because the canonical combining classes of U+05B4 HEBREW POINT
HIRIQ and U+05B7 HEBREW POINT PATAH are distinct. However, if
an application wishes to make a distinction between a patah
following hiriq and a patah preceding a hiriq, the following
sequence would
not be canonically equivalent to the first two sequences cited:
<lamed, patah, CGJ, hiriq, finalmem>
The presence of the ccc=0 combining grapheme joiner blocks the
reordering of hiriq before patah by canonical reordering. That
allows the two sequences to be reliably distinguished, whether
for display or for other processing.
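The Hebrew example above can be reproduced with Python's unicodedata.normalize. This is a sketch under the assumption that the interpreter's bundled UCD (later than 4.1.0) assigns the same combining classes to these characters; the variable names are illustrative.

```python
import unicodedata

LAMED, FINAL_MEM = "\u05DC", "\u05DD"
HIRIQ, PATAH = "\u05B4", "\u05B7"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER


def nfd(s):
    return unicodedata.normalize("NFD", s)


patah_first = LAMED + PATAH + HIRIQ + FINAL_MEM
hiriq_first = LAMED + HIRIQ + PATAH + FINAL_MEM
patah_cgj_hiriq = LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM

# The first two sequences are canonically equivalent: canonical reordering
# places the mark with the lower combining class first.
equivalent = nfd(patah_first) == nfd(hiriq_first)

# With CGJ (combining class 0) between the marks, no reordering takes place.
cgj_preserved = nfd(patah_cgj_hiriq) == patah_cgj_hiriq
```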
The Unicode Collation Algorithm involves the normalization of
Unicode text strings before collation weighting. The combining
grapheme joiner is ordinarily ignored in collation key weighting
in the UCA, but if, as in this case, it is used to block the
reordering of combining marks in a string, its effect can be
to invert the order of secondary key weights associated with
those combining marks. Because of this, the two strings would
have distinct keys, making it possible to treat them distinctly in
searching and sorting without having to further tailor either
the combining grapheme joiner or the combining marks themselves.
The CGJ can also be used to prevent the formation of contractions in the
Unicode Collation Algorithm. Thus, for example, while "ch" is sorted as a
single unit in a tailored Slovak collation, the sequence <c, CGJ, h>
will sort as a 'c' followed by an 'h'. This can also be used in German, for example,
to force 'ü' to be sorted as 'u' + umlaut (using <u, CGJ, umlaut>), even where a dictionary sort is
being used. This also happens without having to further tailor either the
combining grapheme joiner or the sequence.
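The effect of CGJ on a contraction can be illustrated with a toy sort key in Python. This is not the Unicode Collation Algorithm: it is a deliberately simplified sketch that handles only lowercase ASCII letters, treats "ch" as a single unit ordered after "h" (an assumed tailoring in the spirit of the Slovak example), and ignores CGJ apart from its blocking effect. All names are hypothetical.

```python
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

# Toy tailoring: lowercase ASCII letters, with the contraction "ch"
# ordered as one unit immediately after "h".
UNITS = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
WEIGHT = {unit: rank for rank, unit in enumerate(UNITS)}


def toy_sort_key(word):
    """Build a list of weights, forming the "ch" contraction where possible."""
    key = []
    i = 0
    while i < len(word):
        if word[i] == CGJ:
            i += 1                      # ignorable, but it has split "c" from "h"
        elif word.startswith("ch", i):
            key.append(WEIGHT["ch"])    # contraction: one collation unit
            i += 2
        else:
            key.append(WEIGHT[word[i]])
            i += 1
    return key


words = ["chata", "cena", "hora", "c" + CGJ + "hata"]
ordered = sorted(words, key=toy_sort_key)
```

In this toy ordering the word containing <c, CGJ, h> sorts among the words beginning with "c", while the plain "ch" word sorts after the words beginning with "h".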
Of course, sequences of characters which include the combining grapheme
joiner may also be given tailored weights. Thus the sequence <c, CGJ, h> could be weighted completely differently from either the contraction
"ch" or how "c" and "h" would have sorted without the contraction. However,
this application of CGJ is not recommended. For more information on the use
of CGJ with sorting, matching, and searching, see UTS #10: Unicode Collation
Algorithm, Version 4.1.0.
Meteg
The following clarifying text regarding the control of positioning of
the meteg in Hebrew, U+05BD HEBREW POINT METEG, should be
added to Section 8.1, Hebrew, p. 194 of The Unicode Standard, Version
4.0.
The basic recommendations for the control of positioning of
meteg established in Version 4.1 are as follows:
U+034F COMBINING GRAPHEME JOINER can be used within a
vowel-meteg sequence to preserve an ordering distinction under
normalization.
So, for instance, to display meteg to the left (after, for a
right-to-left script) of the vowel point sheva,
U+05B0 HEBREW POINT SHEVA, the following sequence can be used:
<sheva, meteg>
Because these marks are canonically ordered, this
sequence is preserved under normalization. Then, to display
meteg to the right of the sheva, the following sequence can
be used:
<meteg, CGJ, sheva>
A further complication arises for combinations of meteg with hataf
vowels. Authors who want to ensure left-position versus
medial-position
display of meteg with hataf vowels across all font implementations
may use joiner characters to distinguish these cases.
Thus, the following encoded representations can be used for different
positioning of meteg with a hataf vowel, such as hataf patah:
left-positioned meteg: <hataf patah, ZWNJ, meteg>
medially-positioned meteg: <hataf patah, ZWJ, meteg>
right-positioned meteg: <meteg, CGJ, hataf patah>
In no case is use of ZWNJ, ZWJ, or CGJ required for
representation of meteg. These recommendations are simply provided for
interoperability in those instances where authors wish to
preserve specific positional information regarding the layout
of a meteg in text.
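The meteg sequences above can be checked for stability under normalization with a short Python sketch. It assumes the combining classes in the interpreter's bundled UCD (later than 4.1.0) match those in effect here, and it omits the base letter for brevity, since canonical reordering of the marks does not depend on it. The names are illustrative.

```python
import unicodedata

SHEVA, HATAF_PATAH, METEG = "\u05B0", "\u05B2", "\u05BD"
CGJ, ZWNJ, ZWJ = "\u034F", "\u200C", "\u200D"


def nfd(s):
    return unicodedata.normalize("NFD", s)


# Without CGJ, <meteg, sheva> is reordered to <sheva, meteg>.
reordered = nfd(METEG + SHEVA) == SHEVA + METEG

# The recommended sequences survive normalization unchanged.
recommended = [
    SHEVA + METEG,               # meteg to the left of sheva
    METEG + CGJ + SHEVA,         # meteg to the right of sheva
    HATAF_PATAH + ZWNJ + METEG,  # left-positioned meteg
    HATAF_PATAH + ZWJ + METEG,   # medially-positioned meteg
    METEG + CGJ + HATAF_PATAH,   # right-positioned meteg
]
all_stable = all(nfd(seq) == seq for seq in recommended)
```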
Rendering of Thai Combining Marks
Thai tone marks are a type of combining mark displayed above an associated
base character; they have a combining class of 107. Other Thai combining marks displayed above
— in particular vowels — have a
combining class of 0. This assignment of combining classes is insufficient to
fully characterize the typographic
interaction between those marks.
For the purpose of rendering, the Thai combining marks above (U+0E31, U+0E34..U+0E37,
U+0E47..U+0E4E) should be displayed outward from the base character they modify, in
the order in which they appear in the text. In particular, a sequence containing <U+0E48
THAI CHARACTER MAI EK, U+0E4D THAI CHARACTER NIKHAHIT> should be displayed with the
nikhahit above the mai ek, and a sequence containing <U+0E4D THAI CHARACTER NIKHAHIT,
U+0E48 THAI CHARACTER MAI EK> should be displayed with the mai ek above the nikhahit.
This does not preclude input processors from helping the user by pointing out
or correcting typing mistakes, perhaps taking into account the language. For example,
because the
string <mai ek, nikhahit> is not useful for the Thai language and is likely a typing
mistake, an input processor could reject it or correct it to <nikhahit, mai ek>.
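A hypothetical input-processor correction of the kind just described might look like the following Python sketch. The function name correct_thai_typing is an assumption, and the sketch handles only the single pair named in the text. The combining-class lookups use the interpreter's bundled UCD, which is later than 4.1.0.

```python
import unicodedata

MAI_EK = "\u0E48"    # THAI CHARACTER MAI EK (tone mark)
NIKHAHIT = "\u0E4D"  # THAI CHARACTER NIKHAHIT

# Canonical combining classes of the two marks.
tone_mark_class = unicodedata.combining(MAI_EK)
nikhahit_class = unicodedata.combining(NIKHAHIT)

# Nikhahit has combining class 0, so normalization never swaps the two marks;
# both orders remain distinct and any correction must happen at input time.
orders_stay_distinct = (
    unicodedata.normalize("NFC", MAI_EK + NIKHAHIT)
    != unicodedata.normalize("NFC", NIKHAHIT + MAI_EK)
)


def correct_thai_typing(text):
    """Rewrite <mai ek, nikhahit> as <nikhahit, mai ek>."""
    return text.replace(MAI_EK + NIKHAHIT, NIKHAHIT + MAI_EK)
```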
When the character U+0E33 THAI CHARACTER SARA AM follows one or more tone marks (U+0E48 .. U+0E4B),
the nikhahit that is part of the sara am should be displayed below those tone marks. In particular,
a sequence containing <U+0E48 THAI CHARACTER MAI EK, U+0E33 THAI CHARACTER SARA AM>
should be displayed with the mai ek above the nikhahit.
Superseded Sections
Unicode Character Database
The complete Unicode Character Database
files for this version are available in the
4.1.0 directory.
For more detailed information about the changes in the Unicode
Character Database, see the file
UCD.html in the Unicode Character
Database.
Errata Corrected in This Version
Errata corrected in this version are listed by date in
a separate table. For corrigenda and errata after the release of Unicode 4.1.0, see the list of current
Updates and Errata.
Script Additions
New Tai Lue: U+1980 - U+19DF
The New Tai Lue script, also known as Xishuang Banna Dai, is used
mainly in southern China. The script was developed in the twentieth
century as an orthographic simplification of the historic Lanna script
used to write the Tai Lue language.
New Tai Lue differs from Lanna in that it regularizes the consonant
repertoire, simplifies the writing of consonant clusters and
syllable-final consonants, and uses only spacing vowel signs, which
appear before or after the consonants they modify. By contrast, Lanna
uses both spacing vowel signs and nonspacing vowel signs which appear
above or below the consonants they modify. All vowel signs in New Tai
Lue are considered combining characters and follow their base
consonants in the text stream. Where a syllable is composed of a vowel
sign to the left and a vowel sign or tone mark on the right of the
consonant, a sequence of characters is used, in the order consonant +
vowel + tone mark.
A virama or killer character is not used to create conjunct consonants
in New Tai Lue, because clusters of consonants do not regularly occur.
New Tai Lue has a limited set of final consonants, which are modified
with a hook showing that the inherent vowel is killed.
Similar to the Thai and Lao scripts, New Tai Lue consonant letters
come in pairs that denote two tonal registers. The tone of a syllable
is indicated by the combination of the tonal register of the consonant
letter plus a tone mark written at the end of the syllable.
Buginese: U+1A00 - U+1A1F
The Buginese script is used on the island of Sulawesi, mainly
in the southwest. A variety of traditional literature has been
printed in it. The script is one of the easternmost
of the Brahmi scripts and is perhaps related to Javanese. It
bears some affinity to Tagalog, and it does not traditionally
record final consonants. The Buginese language, an Austronesian
language with a rich traditional literature, is one of the
foremost languages of Indonesia. The script was previously also
used to write the Makassar, Bimanese, and Madurese languages.
Glagolitic: U+2C00 - U+2C5F
Glagolitic, from the Slavic root "glagol" meaning "word", is an
alphabet considered to have been devised by St. Cyril in the ninth
century CE, for his translation of the Scriptures and liturgical books
into Slavonic. Glagolitic was eventually supplanted by the alphabet
now known as Cyrillic, which probably arose in late ninth-century
Bulgaria. In parts of Croatia where a vernacular liturgy was used,
Glagolitic continued in use until modern times; in these areas
Glagolitic is still occasionally used as a decorative alphabet.
Like Cyrillic, the Glagolitic script is written in linear sequence
from left to right with no contextual modification of the letterforms.
Glagolitic is treated as a separate alphabet from
Cyrillic because of its historical primacy, and because the letter
shapes in the two alphabets are completely dissimilar: the one can in
no sense be regarded as a variant of the other.
Glagolitic itself exists in two styles, known as round and square.
Round Glagolitic is the original style
and more geographically widespread; square Glagolitic was used
in Croatia from the thirteenth century. The letterforms used
in the charts are round Glagolitic.
Coptic: U+2C80 - U+2CFF
Coptic is now considered a separate script from Greek in
the Unicode Standard. This differs from prior documentation
in the standard, for which Coptic was considered to be
a stylistic variant of Greek, to be implemented by a
font shift.
Starting with Unicode Version 4.1.0, a separate Coptic
script block has been added at U+2C80..U+2CFF. The block
contains the common Coptic alphabet, but also contains
extensions needed for Old Coptic and dialectal usage of
the Coptic script. It also contains Coptic-specific symbols
and punctuation.
The long-encoded 14 Coptic letters derived from Demotic,
encoded in the range U+03E2..U+03EF in the Greek and Coptic
block, are also considered part of the Coptic script, and
should be included in any complete implementation of the
script.
Any implementations of Coptic predating Unicode Version 4.1.0
should be carefully checked, since use of Greek characters
with Coptic-style fonts is no longer recommended for
Coptic data.
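For implementations being checked against this change, a membership test for the Coptic script has to cover both ranges mentioned above. A minimal Python sketch, with the hypothetical names COPTIC_RANGES and is_coptic, and with the ranges taken directly from the text:

```python
# Code point ranges that make up the Coptic script as described above.
COPTIC_RANGES = [
    (0x03E2, 0x03EF),  # 14 letters derived from Demotic, Greek and Coptic block
    (0x2C80, 0x2CFF),  # Coptic block added in Unicode 4.1.0
]


def is_coptic(cp):
    """Return True if cp falls in either Coptic range listed above."""
    return any(start <= cp <= end for start, end in COPTIC_RANGES)
```

A Greek letter such as U+03B1 is not matched, which reflects the recommendation against representing Coptic data with Greek characters.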
Tifinagh: U+2D30 - U+2D7F
The Tifinagh script is used by around 20 million people in
Morocco for writing Berber languages including Tarifite,
Tamazighe, and Tachelhite. The teaching of Berber, written in
Tifinagh, will be generalized and compulsory in Morocco. It is
scheduled to be taught in all public schools by 2008.
Historically the script has been used in several variant
traditions along the Mediterranean coast from Kabylia to Morocco
and the Canary Islands, the Constantinois and Aurès regions, as
well as in Tunisia.
Syloti Nagri: U+A800 - U+A82F
Syloti Nagri is a lesser-known Brahmi-derived script used
for writing the Sylheti language. Sylheti is an Indo-European
language spoken by some 5 million speakers in the Barak Valley
region of northeast Bangladesh and southeast Assam (India).
Sylheti has commonly been regarded as a dialect of Bengali, with
which it shares a high proportion of vocabulary. The Syloti
Nagri script has 27 consonant letters with an inherent vowel of
/o/, and 5 independent vowel letters. There are five dependent
vowel signs which are attached to a consonant letter. Included
in the encoding are several script-specific punctuation marks.
Old Persian: U+103A0 - U+103DF
Old Persian is found in a number of inscriptions in the Old Persian
language dating from the Achaemenid Empire. It is an alphabetic writing
system with some syllabic aspects. While the shapes of some Old Persian
letters may look similar to signs in Sumero-Akkadian Cuneiform, it is
clear that only one of them was borrowed from Sumero-Akkadian Cuneiform.
Scholars today agree that the character inventory of Old Persian was
newly-invented for the purpose of providing monumental inscriptions of
the Achaemenid king, Darius I, by about 525 BCE.
Old Persian is written from left to right. The repertoire
contains 36 signs which represent consonants, vowels or
sequences of single consonants plus vowels, a set of five
numbers, one word divider, and eight ideograms.
Kharoshthi: U+10A00 - U+10A5F
The Kharoshthi script was used historically to write Gāndhārī and Sanskrit
as well as various mixed dialects. Kharoshthi is an Indic script of the abugida
type. However, unlike other Indic scripts, it is written from right to left.
The Kharoshthi script was initially deciphered around the middle of the
nineteenth century by James Prinsep and others who worked from short Greek
and Kharoshthi inscriptions on the coins of the Indo-Greek and Indo-Scythian
kings. The decipherment has been refined over the last 150 years as more
material has come to light. Representation of Kharoshthi in the Unicode
code charts uses forms based on manuscripts of the first century CE.
Kharoshthi can be implemented using the rules of the Unicode bidirectional
algorithm. In Kharoshthi both letters and digits are written from right to
left. Rendering requirements for Kharoshthi are similar to those for Devanagari.
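The right-to-left behavior of Kharoshthi letters and digits shows up in their Bidi_Class values, which Python exposes through unicodedata.bidirectional. The sketch below reads the interpreter's bundled UCD (later than 4.1.0) and compares a Kharoshthi digit with an Arabic-Indic digit for contrast; the variable names are illustrative.

```python
import unicodedata

KHAROSHTHI_LETTER_A = "\U00010A00"
KHAROSHTHI_DIGIT_ONE = "\U00010A40"
ARABIC_INDIC_DIGIT_ONE = "\u0661"

# Both the Kharoshthi letter and the Kharoshthi digit are strong right-to-left.
letter_class = unicodedata.bidirectional(KHAROSHTHI_LETTER_A)
digit_class = unicodedata.bidirectional(KHAROSHTHI_DIGIT_ONE)

# For contrast, Arabic-Indic digits have the separate class AN (Arabic Number).
arabic_digit_class = unicodedata.bidirectional(ARABIC_INDIC_DIGIT_ONE)
```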
Significant Character Additions
In addition to encodings of entirely new scripts in
Unicode Version 4.1.0, there have been other significant
additions to the character repertoire. In some instances,
these consist of major or minor extensions of existing
scripts, and in other instances consist of specialized
sets of punctuation, modifier letters or other symbols.
These additions are sorted by category and explained in
the sections below.
Arabic Supplement: U+0750 - U+077F
Unicode 4.1 adds 30 additional extended Arabic letters mainly for the
languages used in Northern and Western Africa, such as Fulfulde,
Hausa, Songhoy and Wolof. In the second half of the twentieth century,
the use of the Arabic script was actively promoted for these
languages. Characters used for other languages are annotated in the
character names list. Additional vowel marks used with these languages
are found in the main Arabic block.
Ethiopic Extensions: U+1380 - U+139F, U+2D80 - U+2DDF
The Ethiopic script is used for a large number of languages
and dialects in Ethiopia, and in some instances has been
extended significantly beyond the set of characters used
for major languages such as Amharic and Tigré. Unicode Version
4.1.0 adds two blocks of extensions to the Ethiopic script:
Ethiopic Supplement U+1380..U+139F and Ethiopic Extended
U+2D80..U+2DDF. Those extensions cover such languages as
Me'en, Blin, and Sebatbeit, which use many additional
characters. Several other characters have been added to the
main Ethiopic script block in the range U+1200..U+137F,
including one additional Ethiopic punctuation mark, and a
combining mark used to indicate gemination.
In the Ethiopic Supplement block there is also a new set of
tonal marks. These are used in multiline scored layout,
and as for other musical (an)notational systems of this type,
require a higher-level protocol to enable proper rendering.
Additions for Biblical Hebrew
Five new Hebrew characters have been added in Unicode 4.1 for special
usage in Biblical Hebrew text:
U+05A2 HEBREW ACCENT ATNAH HAFUKH
U+05BA HEBREW POINT HOLAM HASER FOR VAV
U+05C5 HEBREW MARK LOWER DOT
U+05C6 HEBREW PUNCTUATION NUN HAFUKHA
U+05C7 HEBREW POINT QAMATS QATAN
In some older versions of Biblical text, a distinction is made between
the accents U+05A2 HEBREW ACCENT ATNAH HAFUKH and U+05AA HEBREW ACCENT
YERAH BEN YOMO. Many editions from the last few centuries do not retain
this distinction, using only yerah ben yomo, but some users in recent
decades have begun to re-introduce this distinction. Similarly, a number of
publishers of Biblical or other religious texts have introduced a typographic distinction for
the vowel point qamats corresponding to two different readings. The
original letterform used for one reading is referred to as qamats or qamats
gadol; the new letterform for the other reading is qamats qatan. It is
important to note that not all users of Biblical Hebrew use atnah hafukh
and qamats qatan. If the distinction between accents atnah hafukh and yerah
ben yomo is not made, then only U+05AA HEBREW ACCENT YERAH BEN YOMO is
used. If the distinction between vowels qamats gadol and qamats qatan is
not made, then only U+05B8 HEBREW POINT QAMATS is used. Implementations
that support Hebrew accents and vowel points may not necessarily support
the special-usage characters U+05A2 HEBREW ACCENT ATNAH HAFUKH and U+05C7
HEBREW POINT QAMATS QATAN.
The vowel point holam represents the vowel phoneme /o/. The consonant
letter vav represents the consonant phoneme /w/, but in some words is used
to represent a vowel, /o/. When the point holam is used on vav, the
combination usually represents the vowel /o/, but in a very small number of
cases represents the consonant-vowel combination /wo/. A typographic
distinction is made between these two in many versions of Biblical text. In
most cases, in which vav + holam together represents the vowel /o/, the
point holam is centered above the vav and referred to as holam male. In the
less frequent cases, in which the vav represents the consonant /w/, some
versions show the point holam positioned above left. This is referred to as
holam haser. The character U+05BA HEBREW POINT HOLAM HASER FOR VAV is
intended for use as holam haser only in those cases where a distinction is
needed. When the distinction is made, the character U+05B9 HEBREW POINT
HOLAM is used to represent the point holam male on vav. U+05BA HEBREW POINT
HOLAM HASER FOR VAV is intended for use only on vav; results of combining
this character with other base characters are not defined. Not all users
distinguish between the two forms of holam, and not all implementations can
be assumed to support U+05BA HEBREW POINT HOLAM HASER FOR VAV.
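For text that must pass through an implementation that does not make these distinctions, the fallbacks described above can be sketched as a simple fold. This is a minimal illustration, not something the standard prescribes: the names `FOLD_SPECIAL_HEBREW` and `fold_special_hebrew` are hypothetical, and folding holam haser to U+05B9 is an assumption drawn from the statement that U+05B9 is the general holam. The fold is lossy, so it should only be applied where the distinction is deliberately being discarded.

```python
# Hypothetical fold table: special-usage character -> general character.
# The names here are illustrative; they do not come from the Unicode Standard.
FOLD_SPECIAL_HEBREW = {
    0x05A2: 0x05AA,  # HEBREW ACCENT ATNAH HAFUKH -> HEBREW ACCENT YERAH BEN YOMO
    0x05C7: 0x05B8,  # HEBREW POINT QAMATS QATAN  -> HEBREW POINT QAMATS
    0x05BA: 0x05B9,  # HEBREW POINT HOLAM HASER FOR VAV -> HEBREW POINT HOLAM (assumed fallback)
}


def fold_special_hebrew(text: str) -> str:
    """Replace the special-usage marks with their general counterparts.

    All other characters, including the base letters, are left untouched.
    """
    return text.translate(FOLD_SPECIAL_HEBREW)


# Example: vav + holam haser becomes vav + holam.
folded = fold_special_hebrew("\u05D5\u05BA")
```

The puncta U+05C4 and U+05C5 and the nun hafukha U+05C6 are deliberately left out of the table, since the text gives no general character to fall back to.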
In the Hebrew Bible, dots are written in various places above or below
the base letters that are distinct from the vowel points and accents. These
are referred to by scholars as puncta extraordinaria, and there are two
kinds. The upper punctum is the more common of the two, and has been
encoded since Unicode 2.0 as U+05C4 HEBREW MARK UPPER DOT. The lower
punctum is used only in one verse of the Bible, Psalm 27:13, and has been added
in Unicode 4.1 as U+05C5 HEBREW MARK LOWER DOT. The puncta generally differ in
appearance from dots that occur above letters used to represent numbers; the
number dots should be represented using U+0307 COMBINING DOT ABOVE and U+0308
COMBINING DIAERESIS.
The nun hafukha is a special symbol that appears to have been used for
scribal annotations, though its exact functions are uncertain. It is used a
total of nine times in the Hebrew Bible, although not all versions include
it, and there are variations in the exact locations in which it is used.
There is also variation in the glyph used: it often has the appearance of a
rotated or reversed nun, and is very often called inverted nun; it may
also appear similar to a half tet or have some other form.
Bengali Khanda Ta
In Bengali a dead consonant TA shows up as U+09CE BENGALI
LETTER KHANDA TA in all contexts except where it is immediately
followed by one of the consonants TA, THA, NA, BA, MA, YA, or
RA. Khanda-Ta cannot bear a vowel matra or combine with a
following consonant to form a conjunct aksara. It can
form a conjunct aksara only with a preceding dead
consonant RA, with the latter showing up as a REPH placed on the
Khanda Ta.
Previous versions of the Unicode Standard recommended that
Khanda-Ta be encoded as TA + VIRAMA + ZWJ. Instead, U+09CE BENGALI
LETTER KHANDA TA should be used explicitly in new text, but users
are cautioned that instances of the old encoding may exist.
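Because both encodings may occur in existing data, a process that compares or searches Bengali text may want to map the old sequence to the new character first. The following is a minimal sketch, assuming the old sequence is exactly U+09A4 U+09CD U+200D; the function name `modernize_khanda_ta` is hypothetical.

```python
# Old recommendation: TA + VIRAMA + ZWJ.
LEGACY_KHANDA_TA = "\u09A4\u09CD\u200D"
# Unicode 4.1: the explicitly encoded BENGALI LETTER KHANDA TA.
KHANDA_TA = "\u09CE"


def modernize_khanda_ta(text: str) -> str:
    """Replace each legacy TA + VIRAMA + ZWJ sequence with U+09CE.

    A TA + VIRAMA that is not followed by ZWJ is left alone, because it
    may be the first half of an ordinary conjunct.
    """
    return text.replace(LEGACY_KHANDA_TA, KHANDA_TA)


# Example: a word-final legacy Khanda-Ta after the letter U+0989.
converted = modernize_khanda_ta("\u0989\u09A4\u09CD\u200D")
```

Matching only the full three-character sequence is the conservative choice: it avoids touching dead TA in the contexts where the text above says it does not appear as Khanda-Ta.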
Phonetic Extensions: U+1D6C - U+1DBF
Unicode 4.1 adds a significant number of characters used
for phonetic transcription and phonetically-based
orthographies. The characters in the range U+1D6C - U+1D7F
complete the previously existing Phonetic Extensions block.
A new Phonetic Extensions Supplement block has also been
added, with the range U+1D80 - U+1DBF.
The phonetic extensions for Unicode 4.1 are derived from a wide
variety of sources, including many technical orthographies
developed by SIL linguists, as well as older historic sources.
Of particular note, all attested phonetic characters showing
struckthrough tildes, struckthrough bars, and retroflex or
palatal hooks attached to the basic letter have been
separately encoded in the blocks for phonetic extensions.
Although separate combining marks exist in the Unicode Standard
for overstruck diacritics and attached retroflex or
palatal hooks, earlier encoded IPA letters such as
U+0268 LATIN SMALL LETTER I WITH STROKE or U+026D LATIN SMALL
LETTER L WITH RETROFLEX HOOK have never
been given decomposition mappings in the standard. For
consistency, all newly encoded characters are handled
analogously to the existing, more common characters of this type,
and are not given decomposition mappings.
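The absence of decomposition mappings can be checked against the Unicode Character Database. The sketch below uses Python's standard `unicodedata` module, which reflects whatever UCD version ships with the interpreter rather than 4.1.0 specifically; the helper name `has_decomposition` is hypothetical, and U+1D6C and U+1D7D are chosen here only as examples of the newly encoded letters.

```python
import unicodedata


def has_decomposition(ch: str) -> bool:
    """Return True if the UCD lists any decomposition mapping for ch."""
    return unicodedata.decomposition(ch) != ""


# Two earlier IPA letters named in the text, plus two letters from the
# range that completes the Phonetic Extensions block.
samples = ["\u0268", "\u026D", "\u1D6C", "\u1D7D"]

for ch in samples:
    # Normalization leaves these characters unchanged, since there is
    # nothing to decompose.
    unchanged = unicodedata.normalize("NFD", ch) == ch
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"decomposition={has_decomposition(ch)}, NFD unchanged={unchanged}")
```

One practical consequence is that a sequence such as a base letter followed by a combining overlay is not canonically equivalent to the corresponding precomposed phonetic letter, so normalization will not unify the two spellings.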
The Phonetic Extensions Supplement block also contains 37
superscript modifier letters. These complement the much
more commonly used superscript modifier letters found in
the Spacing Modifier Letters block.
U+1D77 LATIN SMALL LETTER TURNED G and U+1D78 MODIFIER LETTER
CYRILLIC EN are used in Caucasian linguistics. U+1D79 LATIN
SMALL LETTER INSULAR G is used in older Irish phonetic notation.
It should be distinguished from a mere Gaelic-style glyph
for U+0067 LATIN SMALL LETTER G.
U+1D7A LATIN SMALL LETTER TH
WITH STRIKETHROUGH is a digraphic notation commonly found
in some English-language dictionaries, representing the
voiceless (inter)dental fricative, as in thin.
While this character is clearly a digraph, the obligatory
strikethrough across two letters distinguishes it from
a "th" digraph per se, and there is no mechanism involving
combining marks which can easily be used to represent it.
A common alternative glyphic form for U+1D7A uses a
horizontal bar to strike through the two letters, instead
of a diagonal stroke.
Modifier Tone Letters: U+A700 - U+A71F
The Modifier Tone Letters block contains modifier
letters used in various schemes for marking tones. These
supplement the more commonly used tone marks and tone letters
found in the Spacing Modifier Letters block (U+02B0 - U+02FF).
The characters in the range U+A700 - U+A707 are corner
tone marks used in the transcription of Chinese. They were
invented by Bridgman and Wells Williams in the 1830s. They
have little current use, but are seen in a number of old
Chinese sources.
The tone letters in the range U+A708 - U+A716 complement the
basic set of IPA tone letters (U+02E5 - U+02E9), and are also
used in the representation of Chinese tones, for the most
part. The dotted tone letters are used to represent short
("stopped") tones. The left-stem tone letters are mirror
images of the IPA tone letters, and like those tone letters,
can be ligated in sequences of two or three tone letters to
represent contour tones. Left-stem versus right-stem tone
letters are sometimes used contrastively to distinguish between
tonemic and tonetic transcription, or to show the effects of
tonal sandhi.
Combining Diacritical Marks Supplement: U+1DC0 - U+1DFF
This block is the supplement to the Combining Diacritical
Marks block in the range U+0300 - U+036F. It contains
lesser-used combining diacritical marks.
U+1DC0 COMBINING DOTTED GRAVE ACCENT and U+1DC1 COMBINING
DOTTED ACUTE ACCENT are marks occasionally seen in some
Greek texts. They are variant representations of the
accent combinations, dialytika varia and dialytika oxia,
respectively. They are, however, encoded separately because
they cannot be reliably formed by regular stacking rules
involving U+0308 COMBINING DIAERESIS and U+0300 COMBINING
GRAVE ACCENT or U+0301 COMBINING ACUTE ACCENT.
U+1DC3 COMBINING SUSPENSION MARK is a combining mark specifically
used in Glagolitic. It is not to be confused with a combining
breve.
Editorial Marks for Biblical Text Annotation
The Greek text of the New Testament exists in a large number of
manuscripts with many textual variants. The most widely used critical
edition of the New Testament, the Nestle-Aland edition published by
the United Bible Societies (UBS), introduced a set of editorial
characters which are regularly used in a number of journals and other
publications. As a result, these editorial marks have become the
recognized method of annotating the New Testament, and have been
encoded in Unicode 4.1 in the range U+2E00..U+2E0D.
CJK Additions
Characters have been added to complete roundtrip mapping support for
HKSCS and GB 18030. Some of these characters can be found in a new CJK
Basic Strokes block (U+31C0..U+31EF), in a new Vertical Forms
block (U+FE10..U+FE1F), and as a range extension to CJK Unified
Ideographs (U+9FA6..U+9FBB). Other new characters are found in symbol
blocks (U+23DA..U+23DB). Parsers and other code may need to adjust for
the change of the end of the CJK Unified Ideographs range from U+9FA5
to U+9FBB.
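Code that hard-codes the end of the range is the typical place where this adjustment is needed. The following is a minimal sketch of a range test with the end point made explicit; the function and constant names are hypothetical, and the start of the range, U+4E00, is taken from the code charts rather than from this text. Later versions of the standard extend the range again, so the end point is kept as a parameter.

```python
CJK_UNIFIED_START = 0x4E00    # start of the CJK Unified Ideographs block
CJK_UNIFIED_END_4_0 = 0x9FA5  # last assigned ideograph before Unicode 4.1
CJK_UNIFIED_END_4_1 = 0x9FBB  # last assigned ideograph as of Unicode 4.1


def is_cjk_unified_ideograph(ch: str, end: int = CJK_UNIFIED_END_4_1) -> bool:
    """Test whether ch lies in the main CJK Unified Ideographs range.

    Only the range U+4E00..end is covered; Extension A, Extension B, and
    the compatibility ideographs are outside the scope of this sketch.
    """
    return CJK_UNIFIED_START <= ord(ch) <= end


# U+9FA6 is the first of the characters added by the range extension.
accepted_now = is_cjk_unified_ideograph("\u9FA6")
accepted_before = is_cjk_unified_ideograph("\u9FA6", end=CJK_UNIFIED_END_4_0)
```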
Characters in the CJK Basic Strokes block are single-stroke
components of CJK ideographs. The first characters assigned to
this block are 16 HKSCS-2001 characters.
A new collection of 106 CJK compatibility ideographs has
been added to support roundtrip mapping to the DPRK
standard.
Ancient Greek Additions
Ancient Greek Numbers: U+10140-U+1018F
Many symbols have been added to Unicode 4.1 to
enable the complete coverage of Ancient Greek acrophonic
numeric representation. This includes all known dialectal
variants. In addition, a set of Ancient Greek papyrological
numbers has been added.
Ancient Greek Editorial Marks
Ancient Greek scribes generally wrote in continuous uppercase letters
without separating letters into words. On occasion the scribe added punctuation
to indicate the end of a sentence or a change of speaker, or to
separate words. Editorial and punctuation characters appear
abundantly in surviving papyri and have been rendered in modern
typography when possible, often exhibiting considerable glyphic
variation. A number of these editorial marks are encoded in the range
U+2E0E..U+2E16.
Ancient Greek Musical Notation: U+1D200 - U+1D24F
Ancient Greek had complete sets of vocal and instrumental
notation symbols. These were based on Greek letters —
comparable to the modern usage of the Latin letters
A through G to refer to notes of the Western musical
scale. However, rather than using a sharp and flat
notation to indicate semitones, or casing and other
diacritics to indicate distinct octaves, the Ancient
Greek system extended the basic Greek alphabet by rotating
and flipping letterforms in various ways, and by adding
a few more symbols not directly based on a letter.
Ancient Greek musical notation had a separate system
for vocal notation and for instrumental notation;
each has a traditional catalog numbering system used
by modern scholars of Ancient Greek. In the Unicode Standard,
the two systems are unified against each other and
against the basic Greek alphabet, based on shape. Thus,
if a note is to be represented for the vocal notation
system by a Greek letterform, not rotated or flipped,
then the corresponding letter from the Greek alphabet
in the Greek and Coptic block should be used instead,
using an appropriate font to match the archaic letterforms
used in the notational system.
If a symbol is used in both the vocal notation system
and the instrumental notation system, its Unicode
character name is based on the vocal notation system
catalog number. Thus U+1D20D GREEK VOCAL NOTATION SYMBOL-14
has a glyph based on an inverted capital lambda. In the
vocal notation system, it represents the first sharp of B,
and in the instrumental notation system, it represents
the first sharp of d'. Since it is used in both systems,
its name is based on its sequence in the vocal notation
system, rather than its sequence in the instrumental
notation system. The character names list in the Unicode
Character Database is fully annotated with the functions
of the symbols for each system.
The combining marks encoded in the range U+1D242 - U+1D244
are placed over the vocal or instrumental notation symbols
and are used to indicate metrical qualities.
Georgian Nuskhuri: U+2D00 - U+2D2F
The Georgian script form Nuskhuri was added in Unicode 4.1.
The Georgian
script has two related forms. The ecclesiastical form, Khutsuri, has an
uppercase, inscriptional form, called Asomtavruli, and a lowercase,
cursive, manuscript form called Nuskhuri. The modern, ordinary form,
Mkhedruli, is caseless. Prior to Unicode 4.1, secular (Mkhedruli) and ecclesiastical
(Khutsuri) styles of Georgian were considered font styles. Both
Mkhedruli text and Nuskhuri text were represented using the character
range U+10D0..U+10F8. Beginning with Unicode 4.1, Nuskhuri is
separately represented using the new Georgian Supplement block,
U+2D00..U+2D2F, and the characters in the range U+10D0..
U+10F8 should be restricted to use for Mkhedruli text. Case mappings
are now provided between the two Khutsuri forms: Asomtavruli and
Nuskhuri.
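The new case mappings can be observed through any library that exposes the Unicode Character Database. The sketch below uses Python's built-in string methods, which follow the UCD version bundled with the interpreter rather than 4.1.0 specifically. The helper name `asomtavruli_to_nuskhuri` is hypothetical, and the Asomtavruli range U+10A0..U+10C5 is taken from the code charts, not from this text.

```python
ASOMTAVRULI_FIRST = 0x10A0  # GEORGIAN CAPITAL LETTER AN
ASOMTAVRULI_LAST = 0x10C5   # GEORGIAN CAPITAL LETTER HOE


def asomtavruli_to_nuskhuri(text: str) -> str:
    """Lowercase only the Asomtavruli letters, leaving all else as is.

    Restricting the mapping to one range keeps Mkhedruli text and any
    non-Georgian characters in the string untouched.
    """
    return "".join(
        ch.lower() if ASOMTAVRULI_FIRST <= ord(ch) <= ASOMTAVRULI_LAST else ch
        for ch in text
    )


# The first Asomtavruli letter maps to the first letter of the new
# Georgian Supplement block, and the mapping round-trips.
nuskhuri_an = asomtavruli_to_nuskhuri("\u10A0")
round_trip = nuskhuri_an.upper()
```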
In addition, three Mkhedruli characters which are used in the
transcription of some East Caucasian languages were added.