Technical Reports |
Version | 6.2.0 |
Editors | Mark Davis (markdavis@google.com), Ken Whistler (ken@unicode.org), Markus Scherer |
Date | 2012-08-30 |
This Version | http://www.unicode.org/reports/tr10/tr10-26.html |
Previous Version | http://www.unicode.org/reports/tr10/tr10-24.html |
Latest Version | http://www.unicode.org/reports/tr10/ |
Latest Proposed Update | http://www.unicode.org/reports/tr10/proposed.html |
Revision | 26 |
This report is the specification of the Unicode Collation Algorithm (UCA), which details how to compare two Unicode strings while remaining conformant to the requirements of the Unicode Standard. The UCA also supplies the Default Unicode Collation Element Table (DUCET) as the data specifying the default collation order for all Unicode characters.
This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
Collation is the general term for the process and function of determining the sorting order of strings of characters. It is a key function in computer systems; whenever a list of strings is presented to users, they are likely to want it in a sorted order so that they can easily and reliably find individual strings. Thus it is widely used in user interfaces. It is also crucial for databases, both in sorting records and in selecting sets of records with fields within given bounds.
Collation varies according to language and culture: Germans, French and Swedes sort the same characters differently. It may also vary by specific application: even within the same language, dictionaries may sort differently than phonebooks or book indices. For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character. Collation can also be customized according to user preference, such as ignoring punctuation or not, putting uppercase before lowercase (or vice versa), and so on. Linguistically correct searching needs to use the same mechanisms: just as "v" and "w" sort as if they were the same base letter in Swedish, a loose search should pick up words with either one of them.
Collation implementations must deal with the complex linguistic conventions for ordering text in specific languages, and provide for common customizations based on user preferences. Furthermore, algorithms that allow for good performance are crucial for any collation mechanisms to be accepted in the marketplace.
Table 1 shows some examples of cases where sort order differs by language, usage, or another customization.
Language | Swedish: | z < ö |
German: | ö < z | |
Usage | German Dictionary: | of < öf |
German Telephone: | öf < of | |
Customizations | Upper-First | A < a |
Lower-First | a < A |
Languages vary regarding which types of comparisons to use (and in which order they are to be applied), and in what constitutes a fundamental element for sorting. For example, Swedish treats ä as an individual letter, sorting it after z in the alphabet; German, however, sorts it either like ae or like other accented forms of a, thus following a. In Slovak, the digraph ch sorts as if it were a separate letter after h. Examples from other languages and scripts abound. Languages whose writing systems use uppercase and lowercase typically ignore the differences in case, unless there are no other differences in the text.
It is important to ensure that collation meets user expectations as fully as possible. For example, in the majority of Latin languages, ø sorts as an accented variant of o, meaning that most users would expect ø alongside o. However, a few languages, such as Norwegian and Danish, sort ø as a unique element after z. Sorting "Søren" after "Sylt" in a long list, as would be expected in Norwegian or Danish, will cause problems if the user expects ø as a variant of o. A user will look for "Søren" between "Sorem" and "Soret", not see it in the selection, and assume the string is missing, confused because it was sorted in a completely different location. In matching, the same can occur, which can cause significant problems for software customers; for example, in a database selection the user may not realize what records are missing. See Section 1.5, Other Applications of Collation.
With Unicode applications widely deployed, multilingual data is the rule, not the exception. Furthermore, it is increasingly common to see users with many different sorting expectations accessing the data. For example, a French company with customers all over Europe will include names from many different languages. If a Swedish employee at this French company accesses the data from a Swedish company location, the customer names need to show up in the order that meets this employee's expectations—that is, in a Swedish order—even though there will be many different accented characters that do not normally appear in Swedish text.
For scripts and characters not used in a particular language, explicit rules may not exist. For example, Swedish and French have clearly specified, distinct rules for sorting ä (either after z or as an accented character with a secondary difference from a), but neither defines the ordering of characters such as Ж, ש, ♫, ∞, ◊, or ⌂.
To address the complexities of language-sensitive sorting, a multilevel comparison algorithm is employed. In comparing two words, the most important feature is the identity of the base letters—for example, the difference between an A and a B. Accent differences are typically ignored, if the base letters differ. Case differences (uppercase versus lowercase), are typically ignored, if the base letters or their accents differ. Treatment of punctuation varies. In some situations a punctuation character is treated like a base letter. In other situations, it should be ignored if there are any base, accent, or case differences. There may also be a final, tie-breaking level (called an identical level), whereby if there are no other differences at all in the string, the (normalized) code point order is used.
Level | Description | Examples |
---|---|---|
L1 | Base characters | role < roles < rule |
L2 | Accents | role < rôle < roles |
L3 | Case/Variants | role < Role < rôle |
L4 | Punctuation | role < “role” < Role |
Ln | Identical | role < ro□le < “role” |
The examples in Table 2 are in English; the description of the levels may correspond to different writing system features in other languages. In each example, for levels L2 through Ln, the differences on that level (indicated by the underlined characters) are swamped by the stronger-level differences (indicated by the blue text). For example, the L2 example shows that difference between an o and an accented ô is swamped by an L1 difference (the presence or absence of an s). In the last example, the □ represents a format character, which is otherwise completely ignorable.
The primary level (L1) is for the basic sorting of the text, and the non-primary levels (L2..Ln) are for adjusting string weights for other linguistic elements in the writing system that are important to users in ordering, but less important than the order of the basic sorting. In practice, fewer levels may be needed, depending on user preferences or customizations.
Many people expect the characters in their language to be in the "correct" order in the Unicode code charts. Because collation varies by language and not just by script, it is not possible to arrange the encoding for characters so that simple binary string comparison produces the desired collation order for all languages. Because multi-level sorting is a requirement, it is not even possible to arrange the encoding for characters so that simple binary string comparison produces the desired collation order for any particular language. Separate data tables are required for correct sorting order. For more information on tailorings for different languages, see [CLDR].
The basic principle to remember is: The position of characters in the Unicode code charts does not specify their sort order.
There are many cases in Unicode where two sequences of characters are canonically equivalent: the sequences represent essentially the same text, but with different actual sequences. For more information, see [UAX15].
Sequences that are canonically equivalent must sort the same. Table 3 gives some examples of canonically equivalent sequences. For example, the angstrom sign was encoded for compatibility, and is canonically equivalent to an A-ring. The latter is also equivalent to the decomposed sequence of A plus the combining ring character. The order of certain combining marks is also irrelevant in many cases, so such sequences must also be sorted the same, as shown in the second example. The third example shows a composed character that can be decomposed in four different ways, all of which are canonically equivalent.
1 | Å | U+212B ANGSTROM SIGN |
Å | U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE | |
A ◌̊ | U+0041 LATIN CAPITAL LETTER A, U+030A COMBINING RING ABOVE | |
2 | x ◌̛ ◌̣ | U+0078 LATIN SMALL LETTER X, U+031B COMBINING HORN, U+0323 COMBINING DOT BELOW |
x ◌̣ ◌̛ | U+0078 LATIN SMALL LETTER X, U+0323 COMBINING DOT BELOW, U+031B COMBINING HORN | |
3 | ự | U+1EF1 LATIN SMALL LETTER U WITH HORN AND DOT BELOW |
ụ◌̛ | U+1EE5 LATIN SMALL LETTER U WITH DOT BELOW, U+031B COMBINING HORN | |
u ◌̛ ◌̣ | U+0075 LATIN SMALL LETTER U, U+031B COMBINING HORN, U+0323 COMBINING DOT BELOW | |
ư ◌̣ | U+01B0 LATIN SMALL LETTER U WITH HORN, U+0323 COMBINING DOT BELOW | |
u ◌̣ ◌̛ | U+0075 LATIN SMALL LETTER U, U+0323 COMBINING DOT BELOW, U+031B COMBINING HORN |
There are additional complications in certain languages, where the comparison is context sensitive and depends on more than just single characters compared directly against one another, as shown in Table 4.
The first example of such a complication consists of contractions, where two (or more) characters sort as if they were a single base letter. In the table below, CH acts like a single letter sorted after C.
The second example consists of expansions, where a single character sorts as if it were a sequence of two (or more) characters. In the table below, an Œ ligature sorts as if it were the sequence of O + E.
Both contractions and expansions can be combined: that is, two (or more) characters may sort as if they were a different sequence of two (or more) characters. In the third example, for Japanese, a length mark sorts like the vowel of the previous syllable: as an A after KA and as an I after KI.
Contractions | H < Z, but CH > CZ |
---|---|
Expansions | OE < Œ < OF |
Both | カー < カイ, but キー > キイ |
Some languages have additional oddities in the way they sort. Normally, all differences in sorting are assessed from the start to the end of the string. If all of the base letters are the same, the first accent difference determines the final order. In row 1 of Table 5, the first accent difference is on the o, so that is what determines the order. In Canadian French, however, it is the last accent difference that determines the order, as shown in row 2.
Normal Accent Ordering | cote < coté < côte < côté |
---|---|
Backward Accent Ordering | cote < côte < coté < côté |
In practice, there are additional features of collation that users need to control. These are expressed in user-interfaces and eventually in APIs. Other customizations or user preferences include the following:
b < ב < β < б [Latin < Hebrew < Greek < Cyrillic] versus
β < b < б < ב [Greek < Latin < Cyrillic < Hebrew]
Attempting to achieve this effect by introducing an extra strength level before the first (primary) level would give incorrect ordering results for strings which mix characters of more than one script.
Phonetic sorting of Han characters requires use of either a lookup dictionary of words or, more typically, special construction of programs or databases to maintain an associated phonetic spelling for the words in the text.
The same principles about collation behavior apply to realms beyond sorting. In particular, searching should behave consistently with sorting. For example, if v and w are treated as identical base letters in Swedish sorting, then they should also be treated the same for searching. The ability to set the maximal strength level is very important for searching.
Selection is the process of using the comparisons between the endpoints of a range, as when using a SELECT command in a database query. It is crucial that the range returned be correct according to the user's expectations. For example, if a German businessman making a database selection to sum up revenue in each of of the cities from O... to P... for planning purposes does not realize that all cities starting with Ö were excluded because the query selection was using a Swedish collation, he will be one very unhappy customer.
A sequence of characters considered a unit in collation, such as ch in Slovak, represents a collation grapheme cluster. For applications of this concept, see Unicode Technical Standard #18, "Unicode Regular Expressions" [UTS18]. For more information on grapheme clusters, see Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29].
Sort keys may need to be merged. For example, the simplest way to sort a database according to two fields is to sort field by field, sequentially. This gives the results in column one in Table 6. (The examples in this table are ordered using the Shifted option for handling variable collation elements such as the space character; see Section 3.6.2 Variable Weighting for details.) All the levels in Field 1 are compared first, and then all the levels in Field 2. The problem with this approach is that high-level differences in the second field are swamped by minute differences in the first field, which results in unexpected ordering for the first names.
Sequential | Weak First | Merged | ||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
F1L1, F1L2, F1L3, F2L1, F2L2, F2L3 |
F1L1, F2L1, F2L2, F2L3 |
F1L1, F2L1, F1L2, F2L2, F1L3, F2L3 |
||||||||||||||||||||||||||||||||||||
|
|
|
A second way to do the sorting is to ignore all but base-level differences in the sorting of the first field. This gives the results in the second column. The first names are all in the right order, but the problem is now that the first field is not correctly ordered except by the base character level.
The correct way to sort two fields is to merge the fields, as shown in the "Merged" column. Using this technique, all differences in the fields are taken into account, and the levels are considered uniformly. Accents in all fields are ignored if there are any base character differences in any of the field, and case in all fields is ignored if there are accent or base character differences in any of the fields.
Collation is one of the most performance-critical features in a system. Consider the number of comparison operations that are involved in sorting or searching large databases, for example. Most production implementations will use a number of optimizations to speed up string comparison.
Strings are often preprocessed into sort keys, so that multiple comparisons operations are much faster. With this mechanism, a collation engine generates a sort key from any given string. The binary comparison of two sort keys yields the same result (less, equal, or greater) as the collation engine would return for a comparison of the original strings. Thus, for a given collation C and any two strings A and B:
A ≤ B according to C if and only if sortkey(C, A) ≤ sortkey(C, B)
However, simple string comparison is faster for any individual comparison, because the generation of a sort key requires processing an entire string, while differences in most string comparisons are found before all the characters are processed. Typically, there is a considerable difference in performance, with simple string comparison being about 5 to 10 times faster than generating sort keys and then using a binary comparison.
Sort keys, on the other hand, can be much faster for multiple comparisons. Because binary comparison is much faster than string comparison, it is faster to use sort keys whenever there will be more than about 10 comparisons per string, if the system can afford the storage.
There are a number of common expectations about and misperceptions of collation. This section points out many things that collation is not and cannot be.
Collation is not aligned with character sets or repertoires of characters.
Swedish and German share most of the same characters, for example, but have very different sorting orders.
Collation is not code point (binary) order.
A simple example of this is the fact that capital Z comes before lowercase a in the code charts. As noted earlier, beginners may complain that a particular Unicode character is “not in the right place in the code chart.” That is a misunderstanding of the role of the character encoding in collation. While the Unicode Standard does not gratuitously place characters such that the binary ordering is odd, the only way to get the linguistically-correct order is to use a language-sensitive collation, not a binary ordering.
Collation is not a property of strings.
In a list of cities, with each city correctly tagged with its language, a German user will expect to see all of the cities sorted according to German order, and will not expect to see a word with ö appear after z, simply because the city has a Swedish name. As in the earlier example, it is crucially important that if a German businessman makes a database selection, such as to sum up revenue in each of of the cities from O... to P... for planning purposes, cities starting with Ö not be excluded.
Collation order is not preserved under concatenation or substring operations, in general.
For example, the fact that x is less than y does not mean that x + z is less than y + z, because characters may form contractions across the substring or concatenation boundaries. In summary:
x < y does not imply that xz < yz
x < y does not imply that zx < zy
xz < yz does not imply that x < y
zx < zy does not imply that x < y
Collation order is not preserved when comparing sort keys generated from different collation sequences.
Remember that sort keys are a preprocessing of strings according to a given set of collation features. Different features result in different binary sequences. For example, if there are two collations, F and G, where F is a French collation, and G is a German phonebook ordering, then:
- A ≤ B according to F if and only if sortkey(F, A) ≤ sortkey(F, B), and
- A ≤ B according to G if and only if sortkey(G, A) ≤ sortkey(G, B)
- The relation between sortkey(F, A) and sortkey(G, B) says nothing about whether A ≤ B according to F, or whether A ≤ B according to G.
Collation order is not a stable sort.
Stability is a property of a sort algorithm, not of a collation sequence. For more information, see Section 3.8, Stability.
Collation order is not fixed.
Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes; and finally, new characters added to the Unicode Standard will interleave with the previously-defined ones. This means that collations must be carefully versioned.
The Unicode Collation Algorithm (UCA) details how to compare two Unicode strings while remaining conformant to the requirements of the Unicode Standard. This standard includes the Default Unicode Collation Element Table (DUCET), which is data specifying the default collation order for all Unicode characters, and the CLDR root collation element table that is based on the DUCET. This table is designed so that it can be tailored to meet the requirements of different languages and customizations.
Briefly stated, the Unicode Collation Algorithm takes an input Unicode string and a Collation Element Table, containing mapping data for characters. It produces a sort key, which is an array of unsigned 16-bit integers. Two or more sort keys so produced can then be binary-compared to give the correct comparison between the strings for which they were generated.
The Unicode Collation Algorithm assumes multiple-level key weighting, along the lines widely implemented in IBM technology, and as described in the Canadian sorting standard [CanStd] and the International String Ordering standard [ISO14651].
By default, the algorithm makes use of three fully-customizable levels. For the Latin script, these levels correspond roughly to:
A final level may be used for tie-breaking between strings not otherwise distinguished.
This design allows implementations to produce culturally acceptable collation, with a minimal burden on memory requirements and performance. In particular, it is possible to construct Collation Element Tables that use 32 bits of collation data for most characters.
Implementations of the Unicode Collation Algorithm are not limited to supporting only three levels. They are free to support a fully customizable 4th level (or more levels), as long as they can produce the same results as the basic algorithm, given the right Collation Element Tables. For example, an application which uses the algorithm, but which must treat some collection of special characters as ignorable at the first three levels and must have those specials collate in non-Unicode order (for example to emulate an existing EBCDIC-based collation), may choose to have a fully customizable 4th level. The downside of this choice is that such an application will require more storage, both for the Collation Element Table and in constructed sort keys.
The Collation Element Table may be tailored to produce particular culturally required orderings for different languages or locales. As in the algorithm itself, the tailoring can provide full customization for three (or more) levels.
The algorithm is designed to satisfy the following goals:
Given the standard ordering and the tailoring for any particular language, any two companies or individuals—with their own proprietary implementations—can take any arbitrary Unicode input and produce exactly the same ordering of two strings. In addition, when given an appropriate tailoring this algorithm can pass the Canadian and ISO 14651 benchmarks ([CanStd], [ISO14651]).
Note: The Default Unicode Collation Element Table does not explicitly list weights for all assigned Unicode characters. However, the algorithm is well defined over all Unicode code points. See Section 7.1.2, Unassigned and Other Code Points.
The Default Unicode Collation Element Table (DUCET) explicitly does not provide for the following features:
The feature of linguistic applicability deserves further discussion. DUCET does not and cannot actually provide linguistically correct sorting for every language without further tailoring. That would be impossible, due to conflicting requirements for ordering different languages that share the same script. It is not even possible in the specialized cases where a script may be predominantly used by a single language, because of the limitations of the DUCET table design and because of the requirement to minimize implementation overhead for all users of DUCET.
Instead, the goal of DUCET is to provide a reasonable default ordering for all scripts that are not tailored. Any characters used in the language of primary interest for collation are expected to be tailored to meet all the appropriate linguistic requirements for that language. For example, for a user interested primarily in the Malayalam language, DUCET would be tailored to get all details correct for the expected Malayalam collation order, while leaving other characters (Greek, Cyrillic, Han, and so forth) in the default order, because the order of those other characters is not of primary concern. Conversely, a user interested primarily in the Greek language would use a Greek-specific tailoring, while leaving the Malayalam (and other) characters in their default order in the table.
The Unicode Collation Algorithm does not restrict the many different ways in which implementations can compare strings. However, any Unicode-conformant implementation that purports to implement the Unicode Collation Algorithm must do so as described in this document.
A conformance test for the UCA is available in [Tests10].
The algorithm is a logical specification. Implementations are free to change any part of the algorithm as long as any two strings compared by the implementation are ordered the same as they would be by the algorithm as specified. Implementations may also use a different format for the data in the Collation Element Table. The sort key is a logical intermediate object: if an implementation produces the same results in comparison of strings, the sort keys can differ in format from what is specified in this document. (See Section 6, Implementation Notes.)
The conformance requirements of the Unicode Collation Algorithm are as follows:
C1 | Given a well-formed Unicode Collation Element
Table, a conformant implementation shall replicate the same comparisons
of strings as those produced by Section 4,
Main Algorithm.
In particular, a conformant implementation must be able to compare any two canonical-equivalent strings as being equal, for all Unicode characters supported by that implementation. If a conformant implementation compares strings in a legacy character set, it must provide the same results as if those strings had been transcoded to Unicode. |
C2 | A conformant implementation shall support at
least three levels of collation.
A conformant implementation is only required to implement three levels. However, it may implement four (or more) levels if desired. |
C3 | A conformant implementation that supports any
of the following features: backward levels, variable weighting, and semi-stability (S3.10),
shall do so in accordance with this specification.
A conformant implementation is not required to support these features; however, if it does, it must interpret them properly. If an implementation intends to support the Canadian standard [CanStd] then it should implement a backwards secondary level. |
C4 | An implementation that claims to conform to the UCA must specify the UCA version it conforms to.
The version number of this document is synchronized with the version of the Unicode Standard which specifies the repertoire covered. |
C5 | An implementation claiming conformance to Matching and Searching according to UTS #10, shall meet the requirements described in Section 8, Searching and Matching. |
C6 | An implementation claiming conformance to standard
UCA parametric tailoring shall do so in accordance with the specifications
Section 5, Tailoring.
An implementation claiming such conformance does not have to support all of the parameter attributes and values; however, those it claims to support must behave as specified. |
A Collation Element Table contains a mapping from one (or more) characters to one (or more) collation elements, where a collation element is an ordered list of three or more weights (non-negative integers). (All code points not explicitly mentioned in the mapping are given an implicit weight: see Section 7, Weight Derivation).
Note: Implementations can produce the same result using various representations of weights. See Section 6, Implementation Notes.
Unless otherwise noted, all weights used in the example collation elements in this document are in hexadecimal format. The specific weight values shown are illustrative only; they may not match the weights in the latest Default Unicode Collation Element Table [Allkeys].
The first weight is called the Level 1 or primary weight; the second is called the Level 2 or secondary weight; the third is called the Level 3 or tertiary weight; the fourth is called the Level 4 or quaternary weight, and so on. For a collation element X, these can be abbreviated as X1, X2, X3, X4, and so on.
Given two collation elements X and Y, this document uses the notation in Table 7 and Table 8.
Notation | Reading | Meaning |
---|---|---|
X =1 Y | X is primary equal to Y | X1 = Y1 |
X =2 Y | X is secondary equal to Y | X2 = Y2 and X =1 Y |
X =3 Y | X is tertiary equal to Y | X3 = Y3 and X =2 Y |
X =4 Y | X is quaternary equal to Y | X4 = Y4 and X =3 Y |
Notation | Reading | Meaning |
---|---|---|
X <1 Y | X is primary less than Y | X1 < Y1 |
X <2 Y | X is secondary less than Y | X <1 Y or (X =1 Y and X2 < Y2) |
X <3 Y | X is tertiary less than Y | X <2 Y or (X =2 Y and X3 < Y3) |
X <4 Y | X is quaternary less than Y | X <3 Y or (X =3 Y and X4 < Y4) |
Other operations are given their customary definitions in terms of the above. That is:
This notation for collation elements is also adapted to refer to ordering between strings, as shown in Table 9, where A and B refer to two strings.
Notation | Meaning |
---|---|
A <2 B | A is less than B, and there is a primary or secondary difference between them |
A <2 B and A=1 B | A is less than B, but there is only a secondary difference between them |
A ≡ B | A and B are equivalent (equal at all levels) according to a given Collation Element Table |
A = B | A and B are bit-for-bit identical |
Where only plain text ASCII characters are available the fallback notation in Table 10 may be used.
Notation | Fallback |
---|---|
X <n Y | X <[n] Y |
Xn | X[n] |
X ≤n Y | X <=[n] Y |
A ≡ B | A =[a] B |
If a weight is 0000, then that collation element is ignorable at that level: the weight at that level is not taken into account in sorting. A Level N ignorable is a collation element that is ignorable at level N but not at level N+1. Thus:
D1. A primary collation element is a collation element that is not ignorable at Level 1.
D2. A secondary collation element is a collation element that is ignorable at Level 1, but not at Level 2.
D3. A tertiary collation element is ignorable at Levels 1 and 2, but not Level 3.
D4. A quaternary collation element is ignorable at Levels 1, 2, and 3 but not Level 4.
D5. A completely ignorable collation element is ignorable at all levels (except the identical level).
D6. An ignorable collation element is ignorable at Level 1.
For a given Collation Element Table, MINn is the least weight in any collation element at level n, and MAXn is the maximum weight in any collation element at level n.
There are three kinds of collation element mappings used in the discussion below. These are defined as follows:
D7. A simple mapping maps one Unicode character to one collation element.
D8. An expansion maps one Unicode character to a sequence of collation elements.
D9. A contraction maps a sequence of Unicode characters to a sequence of (one or more) collation elements.
Most of the mappings in a collation element table are simple: they consist of the mapping of a single character to a single collation element.
The following list shows several simple mappings that are used in the examples illustrating the algorithm.
Character |
Collation Element |
Name |
---|---|---|
0300 "`" |
[0000.0021.0002] |
COMBINING GRAVE ACCENT |
0061 "a" |
[06D9.0020.0002] |
LATIN SMALL LETTER A |
0062 "b" |
[06EE.0020.0002] |
LATIN SMALL LETTER B |
0063 "c" |
[0706.0020.0002] |
LATIN SMALL LETTER C |
0043 "C" |
[0706.0020.0008] |
LATIN CAPITAL LETTER C |
0064 "d" |
[0712.0020.0002] |
LATIN SMALL LETTER D |
The mapping from characters to collation elements may not always be a simple mapping from one character to one collation element. In general, the mapping may be from one to many, from many to one, or from many to many.
The Latin letter æ is treated as a primary equivalent to an <a e>sequence, such as in the following example:
Character |
Collation Element |
Name |
---|---|---|
00E6 |
[15D5.0020.0004][0000.0139.0004][1632.0020.001F] |
LATIN SMALL LETTER AE; "æ" |
In this example, the collation element [15D5.0020.0004]
gives the
primary weight for a, and the collation element [1632.0020.001F]
gives the primary weight for e.
Similarly, where ch is treated as a single letter, as for instance in traditional Spanish, it is represented as a mapping from two characters to a single collation element, such as in the following example:
Character |
Collation Element |
Name |
---|---|---|
0063 |
[0707.0020.0002] |
LATIN SMALL LETTER C, |
In this example, the collation element [0707.0020.0002] has a primary value one greater than the primary value for the letter c by itself, so that the sequence ch will collate after c and before d. This example shows the result of a tailoring of collation elements to weight sequences of letters as a single unit.
Characters in a contraction can be made to sort as separate characters by inserting, someplace within the contraction, a starter that maps to a completely ignorable collation element. There are two characters, soft hyphen and U+034F COMBINING GRAPHEME JOINER, that are particularly useful for this purpose. These can be used to separate contractions that would normally be weighted as units, such as Slovak ch or Danish aa. Section 5.3, Use of Combining Grapheme Joiner.
Contractions that end with non-starter characters (those with Combining_Character_Class≠0) are known as discontiguous contractions. For example, suppose that there is a contraction of <a, combining ring above>, as in Danish where this sorts as after "z". If the input text contains the sequence <a, combining dot below, combining ring above>, then the contraction still needs to be detected. This is required by the rearrangement of the combining marks:
<a, combining dot below, combining ring above>
≡
<a, combining ring above, combining dot below>.
That is, discontiguous contractions must be detected in input text whenever the final sequence of non-starter characters could be rearranged so as to make a contiguous matching sequence that is canonically equivalent. In the formal algorithm this is handled by rule Rule S2.1. For information on non-starters, see [UAX15].
In some cases a sequence of two or more characters is mapped to a sequence of two or more collation elements. For example, this technique is used in the Default Unicode Collation Element Table [Allkeys] to handle weighting of rearranged sequences of Thai or Lao left-side-vowel + consonant. See Section 3.5, Rearrangement.
Both many-to-many mappings and many-to-one mappings are referred to as contractions in the discussion of the Unicode Collation Algorithm, even though many-to-many mappings often do not actually shorten anything. The key issue for implementations is that for both many-to-one mappings and many-to-many mappings, the weighting algorithm must first identify a sequence of characters in the input string and "contract" them together as a unit for weight lookup in the table. The identified unit may then be mapped to any number of collation elements. Contractions pose particular issues for implementations, because all eligible contraction targets must be identified first, before the application of simple mappings, so that processing for simple mappings does not bleed away the context needed to correctly identify the contractions.
Certain characters may both expand and contract. See Section 1.3, Contextual Sensitivity.
In some languages (notably Canadian French), accents are sorted from the back of the string to the front of the string. This behavior is not marked in the Default Unicode Collation Element Table, but may occur in tailored tables. In such a case, the collation elements for the accents and their base characters are marked as being backwards at Level 2.
Certain characters, such as the Thai vowels เ through ไ (and related vowels in the Lao and Tai Viet scripts of Southeast Asia), are not represented in strings in logical order. The exact list of such characters is given by the Logical_Order_Exception property in the Unicode Character Database [UAX44]. For collation, they are rearranged by swapping them with the following character before further processing, because logically they belong afterward. This is done by providing these sequences as many-to-many mappings in the Collation Element Table.
The Default Unicode Collation Element Table is provided in [Allkeys]. This table provides a mapping from characters to collation elements for all the explicitly weighted characters. The mapping lists characters in the order that they are weighted. Any code points that are not explicitly mentioned in this table are given a derived collation element, as described in Section 7, Weight Derivation.
The Default Unicode Collation Element Table does not aim to provide precisely correct ordering for each language and script; tailoring is required for correct language handling in almost all cases. The goal is instead to have all the other characters, those that are not tailored, show up in a reasonable order. This is particularly true for contractions, because contractions can result in larger tables and significant performance degradation. Contractions are required in tailorings, but their use is kept to a minimum in the Default Unicode Collation Element Table to enhance performance.
In the Default Unicode Collation Element Table, contractions are necessary where a canonical decomposable character requires a distinct primary weight in the table, so that the canonical-equivalent character sequences are given the same weights. For example, Indic two-part vowels have primary weights as units, and their canonical-equivalent sequence of vowel parts must be given the same primary weight by means of a contraction entry in the table. The same applies to a number of precomposed Cyrillic characters with diacritic marks and to a small number of Arabic letters with madda or hamza marks.
Contractions are also entered in the table for Thai, Lao, and Tai Viet logical order exception vowels. Because these scripts all have five vowels that are represented in strings in visual order, the vowels cannot simply be weighted by their representation order in strings. One option is to preprocess relevant strings to identify and reorder all logical order exception vowels around the following consonant. That approach was used in Version 4.0 and earlier of the UCA. Starting with Version 4.1 of the UCA, contractions for the relevant combinations of vowel+consonant have been entered in the Default Unicode Collation Element Table instead.
Those are the only classes of contractions allowed in the Default Unicode Collation Element Table. Generic contractions of the sort needed to handle digraphs such as "ch" in Spanish or Czech sorting, should be dealt with in tailorings to the default table—because they often vary in ordering from language to language, and because every contraction entered into the default table has a significant implementation cost for all applications of the default table, even those which may not be particularly concerned with the affected script. See the Unicode Common Locale Data Repository [CLDR] for extensive tailorings of the DUCET for various languages, including those requiring contractions.
The Default Unicode Collation Element Table is constructed to be consistent with the Unicode Normalization algorithm, and to respect the Unicode character properties. It is not, however, merely algorithmically derivable based on considerations of canonical equivalence and an inspection of character properties, because the assignment of levels also takes into account characteristics of particular scripts. For example, the combining marks generally have secondary collation elements; however, the Indic combining vowels are given non-zero Level 1 weights, because they are as significant in sorting as the consonants.
Any character may have variant forms or applied accents which affect collation. Thus, for FULL STOP there are three compatibility variants: a fullwidth form, a compatibility form, and a small form. These get different tertiary weights accordingly. For more information on how the table was constructed, see Section 7.2, Tertiary Weight Table.
Table 11 summarizes the overall ordering of the collation elements in the Default Unicode Collation Element Table. The collation elements are ordered by primary, secondary, tertiary, and Unicode value weights, with primary, secondary, and tertiary weights for variables blanked (replaced by "0000"). Entries in the table which contain a sequence of collation elements have a multi-level ordering applied: comparing the primary weights first, then the secondary weights, and so on. This construction of the table makes it easy to see the order in which characters would be collated.
The weightings in the table are grouped by major categories. For example, whitespace characters come before punctuation, and symbols come before numbers. These groupings allow for programmatic reordering of scripts and other characters of interest, without table modification. For example, numbers can be reordered to be after letters instead of before. For more information, see the Unicode Common Locale Data Repository [CLDR].
Values | Type | Examples of Characters |
---|---|---|
X1, X2, X3 = 0 | completely ignorable and quaternary collation elements | - Control codes - Format characters - Hebrew points - Tibetan signs - Arabic tatweel ... |
X1, X2 = 0; X3 ≠ 0 |
tertiary collation elements | None in DUCET; could be in tailorings |
X1 = 0; X2, X3 ≠ 0 |
secondary collation elements | - Most nonspacing marks - Some letters and combining marks |
X1, X2, X3 ≠ 0 | primary collation elements | |
variable | - Whitespace (White_Space=True) - Punctuation (General_Category=Punctuation) - General symbols (General_Category=Letter Modifier or Symbol, but not Currency Symbol) |
|
regular | - General symbols (General_Category=Letter Modifier; certain characters, such as U+02D0 ː
MODIFIER LETTER TRIANGULAR COLON) - Currency symbols (General_Category=Currency Symbol) - Numbers (General_Category=Number) - Latin - Greek ... |
|
implicit | - CJK Unified Ideographs from the URO and CJK Compatibility blocks - CJK Extensions A, B, C, ... - Unassigned and others given implicit weights |
|
trailing | None in DUCET; could be in tailorings |
There are a number of exceptions in the grouping of characters in DUCET, where for various reasons characters are grouped in different categories. Examples are provided below for each type of exception.
Note that the [CLDR] root collation tailors the DUCET. The differences are described in the documentation file CollationAuxiliary.html (see Section 9, Data Files).
For most languages, some degree of tailoring is required to match user expectations. For more information, see Section 5, Tailoring.
Each of the files consists of a version line followed by optional lines for parametric attributes, and a series of entries, all separated by newlines. A '#' or '%' and any following characters on a line are comments. Whitespace between literals is ignored. The following is an extended BNF description of the format, where "x+" indicates one or more x's, "x*" indicates zero or more x's, "x?" indicates zero or one x, <char> is a hexadecimal Unicode code point value, and <weight> is a hexadecimal collation weight value.
<collationElementTable> := <version> <entry>+
The version line is of the form:
<version> := '@version' <major>.<minor>.<variant> <eol>
Each entry is a mapping from character(s) to collation element(s), and is of the following form:
<entry> := <charList> ';' <collElement>+ <eol> <charList> := <char>+ <collElement> := "[" <alt> <weight> "." <weight> "." <weight> ("." <weight>)? "]" <alt> := "*" | "."
Every collation element in the table should have the same number of fields.
Here are some selected entries taken from a particular version of the data file. (It may not match the actual values in the current data file.)
0020 ; [0209.0020.0002.0020] % SPACE 02DA ; [0209.002B.0002.02DA] % RING ABOVE 0041 ; [06D9.0020.0008.0041] % LATIN CAPITAL LETTER A 3373 ; [06D9.0020.0017.0041] [08C0.0020.0017.0055] % SQUARE AU 00C5 ; [06D9.002B.0008.00C5] % LATIN CAPITAL LETTER A WITH RING ABOVE 212B ; [06D9.002B.0008.212B] % ANGSTROM SIGN 0042 ; [06EE.0020.0008.0042] % LATIN CAPITAL LETTER B 0043 ; [0706.0020.0008.0043] % LATIN CAPITAL LETTER C 0106 ; [0706.0022.0008.0106] % LATIN CAPITAL LETTER C WITH ACUTE 0044 ; [0712.0020.0008.0044] % LATIN CAPITAL LETTER D
Although this document describes collation elements as three levels, the DUCET contains a fourth level (as in [0712.0020.0008.0044]), which can be used to approximate Step S3.10. Note, however, that this approximation is not very good, the UCA does not use this fourth level of data, and this data is not related to the fourth level introduced by variable handling and thus leads to confusion. So implementations should only use it at their own risk.
This DUCET fourth level is computable: In most cases the fourth level is equal to the code point itself. For composite characters which map to a sequence of collation elements, the fourth level for each collation element is based on the decomposition of the character. For further details see Section 7.3, Fourth-Level Weight Assignments. For more information on the use of the fourth level and stable sorts, see Section 3.8, Stability.
Implementations can also add more customizable levels, as discussed in Section 2, Conformance. For example, an implementation might want to handle the standard Unicode Collation, but also be capable of emulating an EBCDIC multi-level ordering (having a fourth-level EBCDIC binary order).
Collation elements that are marked with an asterisk in a Unicode Collation Element Table are known as variable collation elements.
Character |
Collation Element |
Name |
---|---|---|
0020 " " | [*0209.0020.0002] | SPACE |
Based on the setting of the variable weighting tag, collation elements can be either treated as quaternary collation elements or not. When they are treated as quaternary collation elements, any sequence of ignorable collation elements that immediately follows the variable collation element is also affected.
There are five possible options for variable weighted characters:
Type | L4 | Examples |
---|---|---|
L1, L2, L3 = 0 | 0000 | null [0000.0000.0000.0000] |
L1=0, L3 ≠ 0, following a Variable |
0000 | combining grave [0000.0000.0000.0000] |
L1 ≠ 0, Variable |
old L1 | space [0000.0000.0000.0209] |
L1 = 0, L3 ≠ 0, not following a Variable |
FFFF | combining grave [0000.0035.0002.FFFF] |
L1 ≠ 0, not Variable |
FFFF | Capital A [06D9.0020.0008.FFFF] |
The variants of the shifted option provide for improved orderings when the variable collation elements are ignorable, while still only requiring three fields to be stored in memory for each collation element. Those options result in somewhat longer sort keys, although they can be compressed (see Section 6.1, Reducing Sort Key Lengths and Section 6.3, Reducing Table Sizes).
Table 13 shows the differences between orderings using the different options for variable collation elements. In this example, sample strings differ by the third character: a letter, space, '-' hyphen-minus (002D), or '-' hyphen (2010); followed by an uppercase/lowercase distinction.
Non- ignorable |
Blanked | Shifted | IgnoreSP | Shift- Trimmed |
---|---|---|---|---|
de luge de Luge de-luge de-Luge de-luge de-Luge death deluge deLuge demark |
death de luge de-luge deluge de-luge de Luge de-Luge deLuge de-Luge demark |
death de luge de-luge de-luge deluge de Luge de-Luge de-Luge deLuge demark |
death de luge de-luge de-luge deluge de Luge de-Luge de-Luge deLuge demark |
death deluge de luge de-luge de-luge deLuge de Luge de-Luge de-Luge demark |
☠happy ☠sad ♡happy ♡sad |
☠happy ♡happy ☠sad ♡sad |
☠happy ♡happy ☠sad ♡sad |
☠happy ☠sad ♡happy ♡sad |
☠happy ♡happy ☠sad ♡sad |
The following points out some salient features of each of the columns in Table 13.
Primaries for variable collation elements are not interleaved with other primary weights. This allows for more compact storage of memory tables. Rather than using a bit per collation element to determine whether the collation element is variable, the implementation only needs to store the maximum primary value for all the variable elements. All collation elements with primary weights from 1 to that maximum are variables; all other collation elements are not.
In the Default Unicode Collation Element Table and in typical tailorings, most unaccented letters differ in the primary weights, but have secondary weights (such as a1) equal to MIN2. The secondary collation elements will have secondary weights greater than MIN2. Characters that are compatibility or case variants will have equal primary and secondary weights (for example, a1 = A1 and a2 = A2), but have different tertiary weights (for example, a3 < A3). The unmarked characters will a3) equal to MIN3.
This use of secondary and tertiary weights does not guarantee that the meaning of a secondary or tertiary weight is uniform across tables. For example, in a tailoring a capital A and katakana ta could both have a tertiary weight of 3.
The DUCET is not entirely well-formed. It does not include two contraction mappings required for well-formedness condition 5:
0FB2 0F71 ; CE(0FB2) CE(0F71) 0FB3 0F71 ; CE(0FB3) CE(0F71)
However, adding just these two contractions would disturb the default sort order for Tibetan. In order to also preserve the sort order for Tibetan, the following eight contractions would have to be added as well:
0FB2 0F71 0F72 ; CE(0FB2) CE(0F71 0F72) 0FB2 0F73 ; CE(0FB2) CE(0F71 0F72) 0FB2 0F71 0F74 ; CE(0FB2) CE(0F71 0F74) 0FB2 0F75 ; CE(0FB2) CE(0F71 0F74) 0FB3 0F71 0F72 ; CE(0FB3) CE(0F71 0F72) 0FB3 0F73 ; CE(0FB3) CE(0F71 0F72) 0FB3 0F71 0F74 ; CE(0FB3) CE(0F71 0F74) 0FB3 0F75 ; CE(0FB3) CE(0F71 0F74)
The [CLDR] root collation adds all ten of these contractions.
A well-formed Collation Element Table meets the following well-formedness conditions:
WF1. Except in special cases detailed in Section 6.2, Large Weight Values, no collation element can have a zero weight at Level N and a non-zero weight at Level N-1.
WF2. Secondary weights of secondary collation elements must be strictly greater than secondary weights of all primary collation elements. Tertiary weights of tertiary collation elements must be strictly greater than tertiary weights of all primary and secondary collation elements.
WF3. No variable collation element has an ignorable primary weight.
WF4. For all variable collation elements U, V, if there is a collation element W such that U1 ≤ W1 and W1 ≤ V1, then W is also variable.
WF5. If a table contains a contraction consisting of a sequence of N code points, with N > 2 and the last code point being a non-starter, then the table must also contain a contraction consisting of the sequence of the first N-1 code points.
The notion of stability in sorting often causes confusion when discussing collation.
A stable sort is one where two records with a field that compares as equal will retain their order if sorted according to that field. This is a property of the sorting algorithm, not of the comparison mechanism. For example, a bubble sort is stable, while a Quicksort is not. This is a useful property, but cannot be accomplished by modifications to the comparison mechanism or tailorings.
A deterministic comparison is different. It is a comparison in which strings that are not canonical equivalents will not be judged to be equal. This is a property of the comparison, not of the sorting algorithm. This is not a particularly useful property—its implementation also requires extra processing in string comparison or an extra level in sort keys, and thus may degrade performance to little purpose. However, if a deterministic comparison is required, the specified mechanism is to append the NFD form of the original string after the sort key, in Section 4.3, Form Sort Key. See also Appendix A, Deterministic Sorting.
The fourth-level weights in the Default Collation Element Table can be used to provide an approximation of a deterministic comparison.
A deterministic comparison is also sometimes referred to as a stable (or semi-stable) comparison. Those terms are not to be preferred, because they tend to be confused with stable sort.
Neither a stable sort nor a deterministic comparison refers to the stability of the Default Collation Element Table itself. The contents of that table will remain unchanged in any particular version of the UCA. However, the contents may change between successive versions of the UCA as new characters are added, or more information is obtained about existing characters.
Implementers should be aware that using different versions of the UCA or different versions of the Unicode Standard could result in different collation results of their data. There are numerous ways collation data could vary across versions, for example:
Any of these reasons could necessitate a change between versions with regards to collation weights for code points. It is therefore important that the implementers specify the version of the UCA, as well as the version of the Unicode Standard under which their data is sorted.
The policies which the UTC uses to guide decisions about the collation weight assignments made for newly assigned characters are enumerated in the UCA Default Table Criteria for New Characters. In addition, there are policies which constrain the timing and type of changes which are allowed for the DUCET table between versions of the UCA. Those policies are enumerated in Change Management for the Unicode Collation Algorithm.
The main algorithm has four steps. First is to normalize each input string, second is to produce an array of collation elements for each string, and third is to produce a sort key for each string from the collation elements. Two sort keys can then be compared with a binary comparison; the result is the ordering for the original strings.
Step 1. Produce a normalized form of each input string, applying S1.1.
S1.1 Use the Unicode canonical algorithm to decompose characters according to the canonical mappings. That is, put the string into Normalization Form D (see [UAX15]).
Step 2. The collation element array is built by sequencing through the normalized form, applying S2.1 through S2.6.
Normalized String | Collation Element Array |
---|---|
ca´b | [0706.0020.0002], [06D9.0020.0002], [0000.0021.0002], [06EE.0020.0002] |
S2.1 Find the longest initial substring S at each point that has a match in the table.
S2.1.1 If there are any non-starters following S, process each non-starter C.
S2.1.2 If C is not blocked from S, find if S + C has a match in the table.
Note: A non-starter in a string is called blocked if there is another non-starter of the same canonical combining class or zero between it and the last character of canonical combining class 0.
S2.1.3 If there is a match, replace S by S + C, and remove C.
S2.2 Fetch the corresponding collation element(s) from the table if there is a match. If there is no match, synthesize a weight as described in Section 7.1, Derived Collation Elements.
S2.3 Process collation elements according to the variable-weight setting, as described in Section 3.6.2, Variable Weighting.
S2.4 Append the collation element(s) to the collation element array.
S2.5 Proceed to the next point in the string (past S).
S2.6 Loop until the end of the string is reached.
Note: The extra non-starter C needs to be considered in Step 2.1.1 because otherwise irrelevant characters could interfere with matches in the table. For example, suppose that the contraction <a, combining_ring> (= å) is ordered after z. If a string consists of the three characters <a, combining_ring, combining_cedilla>, then the normalized form is <a, combining_cedilla, combining_ring>, which separates the a from the combining_ring. Without considering the extra non-starter, this string would compare incorrectly as after a and not after z.
If the desired ordering treats <a, combining_cedilla> as a contraction which should take precedence over <a, combining_ring>, then an additional mapping for the combination <a, combining_ring, combining_cedilla> can be introduced to produce this effect.
For conformance to Unicode canonical equivalence, only unblocked non-starters are matched in Step 2.1.2. For example, <a, combining_macron, combining_ring> would compare as after a-macron, and not after z. Additional mappings can be added to customize behavior.
Also note that the Algorithm employs two distinct contraction matching methods:
- Step 2.1 “Find the longest initial substring S” is a contiguous, longest-match method. In particular, if must support matching of a contraction ABC even if there is not also a contraction AB. Thus, an implementation that incrementally matches a lengthening initial substring must be able to handle partial matches like for AB.
- Steps 2.1.1 “process each non-starter C” and 2.1.2 “find if S + C has a match in the table”, where one or more intermediate non-starters may be skipped (making it discontiguous), extends a contraction match by one code point at a time to find the next match. In particular, if C is a non-starter and if the table had a mapping for ABC but not one for AB, then a discontiguous-contraction match on text ABMC (with M being a skippable non-starter) would never be found. Well-formedness condition 5 requires the presence of the prefix contraction AB.
- In either case, the prefix contraction AB cannot be added to the table automatically because it would yield the wrong order for text ABD if there is a contraction BD.
Step 3. The sort key is formed by successively appending all non-zero weights from the collation element array. The weights are appended from each level in turn, from 1 to 3. (Backwards weights are inserted in reverse order.)
Collation Element Array | Sort Key |
---|---|
[0706.0020.0002], [06D9.0020.0002], [0000.0021.0002], [06EE.0020.0002] | 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 |
An implementation may allow the maximum level to be set to a smaller level than the available levels in the collation element array. For example, if the maximum level is set to 2, then level 3 and higher weights are not appended to the sort key. Thus any differences at levels 3 and higher will be ignored, leveling any such differences in string comparison.
Here is a more detailed statement of the algorithm:
S3.1 For each weight level L in the collation element array from 1 to the maximum level,
S3.2 If L is not 1, append a level separator
Note:The level separator is zero (0000), which is guaranteed to be lower than any weight in the resulting sort key. This guarantees that when two strings of unequal length are compared, where the shorter string is a prefix of the longer string, the longer string is always sorted after the shorter—in the absence of special features like contractions. For example: "abc" < "abcX" where "X" can be any character(s).
S3.3 If the collation element table is forwards at level L,
S3.4 For each collation element CE in the array
S3.5 Append CEL to the sort key if CEL is non-zero.
S3.6 Else the collation table is backwards at level L, so
S3.7 Form a list of all the non-zero CEL values.
S3.8 Reverse that list
S3.9 Append the CEL values from that list to the sort key.
S3.10 If a semi-stable sort is required, then after all the level weights have been added, append a copy of the NFD version of the original string. This strength level is called the identical level, and this feature is called semi-stability. (See also Appendix A, Deterministic Sorting.)
Step 4. Compare the sort keys for each of the input strings, using a binary comparison. This means that:
String | Sort Key | |
---|---|---|
1 | cab | 0706 06D9 06EE 0000 0020 0020 0020 0000 0002 0002 0002 |
2 | Cab | 0706 06D9 06EE 0000 0020 0020 0020 0000 0008 0002 0002 |
3 | cáb | 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 |
4 | dab | 0712 06D9 06EE 0000 0020 0020 0020 0000 0002 0002 0002 |
In Figure 3, "cab" <3 "Cab" <2 "cáb" <1 "dab". The differences that produce the ordering are shown by the bold underlined items:
While forming sort keys, zero weights are omitted. If collation elements were not well-formed according to conditions 1 and 2, the ordering of collation elements could be incorrectly reflected in the sort key. The following examples illustrate this.
Suppose well-formedness condition 1 were broken, and secondary weights of the Latin characters were zero (ignorable) and that (as normal) the primary weights of case-variants are equal: that is, a1 = A1. Then the following incorrect keys would be generated:
Order | String | Normalized | Sort Key |
---|---|---|---|
1 | "áe" | a, acute, e | a1 e1 0000 acute2 0000 a3 acute3 e3... |
2 | "Aé" | A, e, acute | a1 e1 0000 acute2 0000 A3 acute3 e3... |
Because the secondary weights for a, A, and e are lost in forming the sort key, the relative order of the acute is also lost, resulting in an incorrect ordering based solely on the case of A versus a. With well-formed weights, this does not happen, and the following correct ordering is obtained:
Order | String | Normalized | Sort Key |
---|---|---|---|
1 | "Aé" | A, e, acute | a1 e1 0000 a2 e2 acute2 0000 a3 acute3 e3... |
2 | "áe" | a, acute, e | a1 e1 0000 a2 acute2 e2 0000 A3 acute3 e3... |
However, there are circumstances—typically in expansions—where higher-level weights in collation elements can be zeroed (resulting in ill-formed collation elements) without consequence (see Section 6.2, Large Weight Values). Implementations are free to do this as long as they produce the same result as with well-formed tables.
Suppose on the other hand, well-formedness condition 2 were broken. Let there be a tailoring of 'b' as a secondary difference from 'a' resulting in the following collation elements where the one for 'b' is ill-formed.
0300 ; [.0000.0035.0002.0300] # (DUCET) COMBINING GRAVE ACCENT 0061 ; [.15EF.0020.0002.0061] # (DUCET) LATIN SMALL LETTER A 0062 ; [.15EF.0040.0002.0061] # (tailored) LATIN SMALL LETTER B
Then the following incorrect ordering would result: "aa" < "àa" < "ab" — The secondary difference on the second character (b) trumps the accent on the first character (à).
A correct tailoring would give 'b' a secondary weight lower than that of any secondary collation element, for example: (assuming the DUCET did not use secondary weight 0021 for any secondary collation element)
0300 ; [.0000.0035.0002.0300] # (DUCET) COMBINING GRAVE ACCENT 0061 ; [.15EF.0020.0002.0061] # (DUCET) LATIN SMALL LETTER A 0062 ; [.15EF.0021.0002.0061] # (tailored) LATIN SMALL LETTER B
Then the following correct ordering would result: "aa" < "ab" < "àa"
Tailoring is any well-defined syntax that takes the Default Unicode Collation Element Table and produces another well-formed Unicode Collation Element Table. This syntax can provide linguistically-accurate collation, if desired. Such syntax will usually allow the following:
Reordering any character (or contraction) with respect to others in the standard ordering. The reordering can represent a Level 1 difference, Level 2 difference, Level 3 difference, or identity (in levels 1 to 3). Because such reordering includes sequences, arbitrary multiple mappings can be specified.
For best interoperability, it is recommended that tailorings for particular locales (or languages) make use of the tables provided in the Unicode Common Locale Data Repository [CLDR].
For an example of a tailoring syntax, see Section 5.2, Tailoring Example.
Parametric tailoring, if supported, is specified using a set of attribute-value pairs that specify a particular kind of behavior relative to the UCA. The standard parameter names (attributes) and their possible values are listed in Unicode Locale Data Markup Language (LDML) [UTS35] in the table Collation Settings (Section 5.14.3, Setting Options).
The default values for collation parameters specified by the UCA algorithm may differ from the LDML defaults given in the LDML table Collation Settings. The table indicates both default values. For example, the UCA default for alternate handling is shifted, while the general default in LDML is non-ignorable. Also, defaults in CLDR data may vary by locale. For example, normalization is turned off in most CLDR locales (those that don't normally use multiple accents). The default for strength in UCA is tertiary; it can be changed for different locales in CLDR.
When a locale or language identifier is specified for tailoring of the UCA, the identifier uses the syntax from [UTS35], Section 3, Unicode Language and Locale Identifiers. Unless otherwise specified, tailoring by locale uses the tables from the Unicode Common Locale Data Repository [CLDR].
Unicode [CLDR] provides a powerful tailoring syntax in [UTS35], as well as tailoring data for many locales. The tailorings are based on the DUCET table, although there are some general tailorings always applied to the DUCET in CLDR 1.9. This XML format maps to a simpler representation in ICU, shown in Table 14. A simpler version of this syntax is also used in Java, although at the time of this writing, Java does not implement the UCA.
Syntax | Description |
---|---|
& y < x | Make x primary-greater than y |
& y << x | Make x secondary-greater than y |
& y <<< x | Make x tertiary-greater than y |
& y = x | Make x equal to y |
Either x or y in this syntax can represent more than one character, to handle contractions and expansions.
Entries for tailoring can be abbreviated in a number of ways:
These abbreviations can be applied successively, so the examples shown in Table 15 are equivalent in ordering.
ICU Syntax | DUCET Syntax |
---|---|
a <<< A << à <<< À < b <<< B |
0061 ; [0001.0001.0001] % a 0040 ; [0001.0001.0002] % A 00E0 ; [0001.0002.0001] % à 00C0 ; [0001.0002.0002] % À 0042 ; [0002.0001.0001] % b 0062 ; [0002.0001.0002] % B |
The syntax has many other capabilities: for more information, see [UTS35] and [ICUCollator].
The Unicode Collation Algorithm involves the normalization of Unicode text strings before collation weighting. U+034F COMBINING GRAPHEME JOINER (CGJ) is ordinarily ignored in collation key weighting in the UCA, but it can be used to block the reordering of combining marks in a string as described in [Unicode]. In that case, its effect can be to invert the order of secondary key weights associated with those combining marks. Because of this, the two strings would have distinct keys, making it possible to treat them distinctly in searching and sorting without having to further tailor either the combining grapheme joiner or the combining marks.
The CGJ can also be used to prevent the formation of contractions in the Unicode Collation Algorithm. Thus, for example, while ch is sorted as a single unit in a tailored Slovak collation, the sequence <c, CGJ, h> will sort as a c followed by an h. This can also be used in German, for example, to force ü to be sorted as u + umlaut (thus u <2 ü), even where a dictionary sort is being used (which would sort ue <3 ü). This happens without having to further tailor either the combining grapheme joiner or the sequence.
Note: As in a few other cases in the Unicode Standard, the name of the CGJ can be misleading—the usage above is in some sense the inverse of "joining".
Sequences of characters which include the combining grapheme joiner or other completely ignorable characters may also be given tailored weights. Thus the sequence <c, CGJ, h> could be weighted completely differently from either the contraction "ch" or the sequence "c" followed by "h" without the contraction. However, this application of CGJ is not recommended, because it would produce effects much different than the normal usage above, which is to simply interrupt contractions.
In addition to tailoring, some implementations may choose to preprocess the text for special purposes. Once such preprocessing is done, the standard algorithm can be applied.
Examples include:
Such preprocessing is outside of the scope of this document.
As noted above for efficiency, implementations may vary from this logical algorithm as long as they produce the same result. The following items discuss various techniques that can be used for reducing sort key length, reducing table sizes, customizing for additional environments, searching, and other topics.
The following discuss methods of reducing sort key lengths. If these methods are applied to all of the sort keys produced by an implementation, they can result in significantly shorter and more efficient sort keys while retaining the same ordering.
Level separators are not needed between two levels in the sort key, if the weights are properly chosen. For example, if all L3 weights are less than all L2 weights, then no level separator is needed between them. If there is a fourth level, then the separator before it needs to be retained.
The following example shows a sort key with these level separators removed.
String | Technique(s) Applied | Sort Key |
---|---|---|
càb | none | 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 |
càb | 1 | 0706 06D9 06EE 0020 0020 0021 0020 0002 0002 0002 0002 |
While this technique is relatively easy to implement, it can interfere with other compression methods.
The L2 and L3 weights commonly are small values. Where that condition occurs for all possible values, they can then be represented as single 8-bit quantities.
The following example modifies the first example with both these changes (and grouping by bytes). Note that the separator has to remain after the primary weight when combining these techniques. If any separators are retained (such as before the fourth level), they need to have the same width as the previous level.
String | Technique(s) Applied | Sort Key |
---|---|---|
càb | none | 07 06 06 D9 06 EE 00 00 00 20 00 20 00 21 00 20 00 00 00 02 00 02 00 02 00 02 |
càb | 1, 2 | 07 06 06 D9 06 EE 00 00 20 20 21 20 02 02 02 02 |
The sort key can be represented as an array of different quantities depending on the machine architecture. For example, comparisons as arrays of unsigned 32-bit quantities may be much faster on some machines. When using arrays of unsigned 32-bit quantities, the original is to be padded with trailing (not leading) zeros as necessary.
String | Technique(s) Applied | Sort Key |
---|---|---|
càb | 1, 2 | 07 06 06 D9 06 EE 00 00 20 20 21 20 02 02 02 02 |
càb | 1, 2, 3 | 070606D9 06EE0000 20202120 02020202 |
Generally sort keys do not differ much in the secondary or tertiary weights, which tends to result in keys with a lot of repetition. This also occurs with quaternary weights generated with the shifted parameter. By the structure of the collation element tables, there are also many weights that are never assigned at a given level in the sort key. One can take advantage of these regularities in these sequences to compact the length—while retaining the same sort sequence—by using the following technique. (There are other techniques that can also be used.)
This is a logical statement of the process; the actual implementation can be much faster and performed as the sort key is being generated.
In the example shown in Figure 4, the low weights are 01, 02; the high weights are FE, FF; and the common weight is 77.
Original Weights | Compressed Weights |
---|---|
01 02 77 01 77 02 77 77 01 77 77 02 77 77 77 01 77 77 77 02 ... 77 77 77 FE 77 77 77 FF 77 77 FE 77 77 FF 77 FE 77 FF FE FF |
01 02 03 01 03 02 04 01 04 02 05 01 05 02 ... FB FE FB FF FC FE FC FF FD FE FD FF FE FF |
This process results in keys that are never longer than the original, are generally much shorter, and result in the same comparisons.
If a collation sequence requires more than
65,535 weight values (or 65,024 values where zero bytes are avoided),
this requirement can still be accommodated by assigning multiple collation elements to
single characters. For example, suppose that 50,000 supplementary
private-use characters are used
in a particular implementation, and that these are to be sorted after
a character whose primary weight is X
. In such cases, the second CE ("continuation") does not have to be well formed.
Simply assign them all dual collation elements of the following form:
[(X+1).zzzz.wwww], [yyyy.0000.0000]
If there is an element with the primary weight (X+1)
,
then it also needs to be converted into a dual collation element.
The private-use characters will then sort properly with respect to each other and the rest of the characters. The first collation element of this dual collation element pair is one of the instances in which ill-formed collation elements are allowed. The second collation element of each of these pairs is well-formed, and the first element only occurs in combination with them. In this way, ordering is preserved with respect to other, non-paired collation elements.
Another example of the continuation technique appears in the DUCET:
2F00 ; [FB40.0020.0004][CE00.0000.0000] # KANGXI RADICAL ONE
The data tables required for collation of the entire Unicode repertoire can be quite sizable. This section discusses ways to significantly reduce the table size in memory. These recommendations have very important implications for implementations.
Whenever collation elements have different primary weights, the ordering of their secondary weights is immaterial. Thus all of the secondaries that share a single primary can be renumbered to a contiguous range without affecting the resulting order. The same technique can be applied to tertiary weights.
Although the tertiary weights for the Default Unicode Collation Element Table can fit within one byte, any particular tailored table could conceivably end up with tertiary weights that exceed what can be contained in a single byte. Secondary weights in the Default Unicode Collation Element Table already far exceed what can be represented by a single byte, ranging from 0020 to 019A as of UCA Version 6.0.0. However, the same technique used for large weight values can also be used for implementations that do not want to handle more than 00FF values for a particular weight.
For example, the Java collation implementation only stored 8-bit quantities in level 2 and level 3. However, characters can be given L2 or L3 weights with greater values by using a series of two collation elements. For example, if characters require 2,000 weights at L2, then 248 characters can be given single keys, while 1,752 (2000 - 248) can be given two collation keys of the form [yyyy.00zz.00ww] [0000.00nn.0001].
The 248 in this example can be chosen to be the higher frequency characters, while there would need to be eight distinct zz values to cover the remaining characters. These zz values must only be used with dual collation elements.
Because all canonically decomposable characters are decomposed in Step 1.1, no collation elements need to be supplied for them. The DUCET has over 2,000 of these, but they can all be dropped with no change to the ordering (it does omit the 11,172 Hangul syllables).
The collation elements for the Han characters (unless tailored) are algorithmically derived; no collation elements need to be stored for them either.
This means that only a small fraction of the total number of Unicode characters need to have an explicit collation element. This can cut down the memory storage considerably.
In addition, most characters with compatibility decompositions can have collation elements computed at runtime to save space, duplicating the work that was done to compute the Default Unicode Collation Element Table. This can provide important savings in memory space. The process works as follows.
1. Derive the compatibility decomposition. For example,
2475 PARENTHESIZED DIGIT TWO => 0028, 0032, 0029
2. Look up the collation, discarding completely ignorables. For example,
0028 [023D.0020.0002] % LEFT PARENTHESIS 0032 [06C8.0020.0002] % DIGIT TWO 0029 [023E.0020.0002] % RIGHT PARENTHESIS
3. Set the L3 values according to the table in Section 7.2, Tertiary Weight Table. If the last L3 value is 0004, and if the last element of the decomposition represents a non-alphabetic character, then reset that last L3 value to MAX (which in the default table is 001F). For example,
0028 [023D.0020.0004] % LEFT PARENTHESIS 0032 [06C8.0020.0004] % DIGIT TWO 0029 [023E.0020.001F] % RIGHT PARENTHESIS
4. Concatenate the result to produce the sequence of collation elements that the character maps to. For example,
2475 [023D.0020.0004] [06C8.0020.0004] [023E.0020.001F]
Some characters cannot be computed in this way. They must be filtered out of the default table and given specific values. For example, the long s has a secondary difference, not a tertiary.
0073 [17D9.0020.0002] # LATIN SMALL LETTER S 017F [17D9.0020.0004][0000.013A.0004] # LATIN SMALL LETTER LONG S
If characters are not fully supported by an implementation, then their code points can be treated as if they were unassigned. This allows them to be algorithmically constructed from code point values instead of including them in a table. This can significantly reduce the size of the required tables. See Section 7.1, Derived Collation Elements for more information.
Applying the above techniques, an implementation can thus safely pack all of the data for a collation element into a single 32-bit quantity: 16 for the primary, 8 for the secondary and 8 for the tertiary. Then applying techniques such as the Two-Stage table approach described in "Multistage Tables" in Section 5.1, Transcoding to Other Standards of [Unicode], the mapping table from characters to collation elements can both fast and small. For an example of how this can be done, see Section 6.10, Flat File Example.
If the resulting sort key is to be a C-string, then zero bytes must be avoided. This can be done by:
x → 010116 + (x / 255)*256 + (x % 255)
Where the values are limited to 8-bit quantities (as discussed above), zero bytes are even more easily avoided by just using 01 as the level separator (where one is necessary), and mapping weights by:
x → 01 + x
Characters with canonical decompositions do not require mappings to collation elements, because S1.1 maps them to collation elements based upon their decompositions. However, they may be given mappings to collation elements anyway. The weights in those collation elements must be computed in such a way that they will sort in the same relative location as if the characters were decomposed using Normalization Form D. Including these mappings allows an implementation handling a restricted repertoire of supported characters to compare strings correctly without performing the normalization in S1.1 of the algorithm. It is recommended that implementations correctly sort all strings that are in the format known as "Fast C or D form" (FCD) even if normalization is off, because this permits more efficient sorting for locales whose customary characters do not use multiple combining marks. For more information on FCD, see [UTN5].
In some languages, it is common to sort lowercase before uppercase; in other languages this is reversed. Often this is more dependent on the individual concerned, and is not standard across a single language. It is strongly recommended that implementations provide parameterization that allows uppercase to be sorted before lowercase, and provides information as to the standard (if any) for particular countries. For more information, see Section 5.14.13, Case Parameters in [UTS35].
Implementations do not actually have to produce full sort keys. Collation elements can be incrementally generated as needed from two strings, and compared with an algorithm that produces the same results as sort keys would have. The choice of algorithm depends on the number of comparisons between the same strings.
However, it is very tricky to produce an incremental comparison that produces correct results. For example, some implementations have not even been transitive! Be sure to test any code for incremental comparison thoroughly.
Sort keys from two different tailored collations cannot be compared, because the weights may end up being rearranged arbitrarily. To catch this case, implementations can produce a hash value from the collation data, and prepend it to the sort key. Except in extremely rare circumstances, this will distinguish the sort keys. The implementation then has the opportunity to signal an error.
A collation ordering determines a collation grapheme cluster (also known as a collation grapheme or collation character), which is a sequence of characters that is treated as a primary unit by the ordering. For example, ch is a collation grapheme for a traditional Spanish ordering. These are generally contractions, but may include additional ignorable characters.
Roughly speaking, a collation grapheme cluster is the longest substring whose corresponding collation elements start with a non-zero primary weight, and contain as few other collation elements with non-zero primary weights as possible. In some cases, collation grapheme clusters may be degenerate: they may have collation elements that do not contain a non-zero weight, or they may have no non-zero weights at all.
For example, consider a collation for language in which "ch" is treated as a contraction, and "à" as an expansion. The expansion for à contains collation weights corresponding to combining-grave + "a" (but in an unusual order). In that case, the string <`ab`ch`à> would have the following clusters:
To find the collation grapheme cluster boundaries in a string, the following algorithm can be used:
For information on the use of collation graphemes, see [UTS18].
This section contains a sample flat-file binary layout and sample code for collation data. It is included only for illustration. Data in the format of Table 16 is used to generate collation elements from characters, either going forward or backward, and detect the start of a contraction. The backward generation is for searching backward or Boyer-Moore-style searching; the contraction detection is for random access.
In the file representation, ints are 32 bit values, shorts are 16 bits, bytes are 8 bits. Negatives are two's-complement. For alignment, the ends of all arrays are padded out to multiples of 32 bits. The signature determines endianness. The locale uses an ASCII representation for the Java locale: a 2-byte ISO language code, optionally followed by '_' and 2-byte ISO country code, followed optionally by a series of variant tags separated by '_'; any unused bytes are zero.
Data | Comment | |
---|---|---|
int signature; | Constant 0x636F6C74 , used also for big-endian detection |
|
int tableVersion; | Version of the table format | |
int dataVersion; | Version of the table data | |
byte[32] locale; | Target locale (if any) | |
int flags; | Bit01 = 1 if Backward secondary. Other values are reserved. | |
int limitVariable; | Every CE below this value that has a non-zero primary is variable. Because variables are not interleaved, this does not need to be stored on a per-character basis. | |
int maxCharsPerCE; | Maximum number of characters that are part of a contraction | |
int maxCEsPerChar; | Maximum number of collation elements that are generated by an expansion | |
int indexOffset; | Offset to index table | |
int collationElementsOffset; | Offset to main data table | |
int expansionsOffset; | Offset to expansion table | |
int contractionMatchOffset; | Offset to contraction match table | |
int contractionResultOffset; | Offset to contraction values table | |
int nonInitialsOffset; | Offset to non-initials table. These are used for random access. | |
int[10] reserved; | Reserved | |
int indexLength; | Length of following table | |
int[] index; | Index for high-byte (trie) table. Contains
offsets into Collation Elements. Data is accessed by:ce = collationElements[index[char>>8]+char&0xFF] |
|
int collationElementsLength; | Length of following table | |
int[] collationElements; | Each element is either a real collation element, an expansionsOffset, or a contractionsOffset. See below for more information. | |
int expansionsLength; | Length of following table | |
int[] expansions; | The expansionOffsets in the collationElements table point into sublists in this table. Each list is terminated by FFFFFFFF. | |
int contractionMatchesLength; | Length of following table | |
short[] contractionMatches; | The contractionOffsets in the collationElements table point into sublists in this table. Each sublist is of the following format: | |
short backwardsOffset; | When processing backward, offset to true contractions table | |
short length; | Number of chars in list to search | |
short[] charsToMatch; | Characters in sorted order | |
int contractionCEsLength; | Length of following table | |
int[] contractionCEs; | List of CEs. Each corresponds to a position in the contractionChars table. The one corresponding to the length in a sublist is the bail-out; this specifies what to do if a match is not found. | |
int nonInitialsLength; | Length of following table | |
short[] nonInitials; | List of characters (in sorted order) that can be non-initials in contractions. That is, if "ch" is a contraction, then "h" is in this list. If "abcd" is a contraction, then "b", "c", and "d" are in the list. |
An alternative structure would have the offsets be byte offsets from the start of the table, instead of indexes into the arrays. That would limit the size of the table, but use fewer machine instructions.
The following is a pseudo code using this table for the required operations. Although using Java syntax in general, the code example uses arrays, which are more familiar to users of C and C++. The code is presented for illustration only; it is not a complete statement of the algorithm. To make it easier to follow the logic, the pseudo-code does not handle supplementary code points or discontiguous contractions.
char[] input; // input buffer (i) int inputPos; // position in input buffer (io) int[] output; // output buffer (o) int outputPos; // position in output buffer (io) /** * Reads characters from input, writes collation elements in output */ void getCollationElements() { char c = input[inputPos++]; int ce = collationElements[index[c>>8] + c&0xFF]; processCE(ce); } /** * Normally just returns ce. However, special forms indicate that * the ce is actually an expansion, or that we have to search * to see if the character was part of a contraction. * Expansions use */ void processCE(int ce) { if (ce < 0xFFE00000) { output[outputPos++] = ce; } else if (ce >= 0xFFF00000) { copyExpansions(ce & 0xFFFFF); } else { searchContractions(ce & 0xFFFFF); } } /** * Search through a contraction sublist to see if there is a match. * Because the list is sorted, we can exit if our value is too high. * Because we have a length, we could implement this as a * binary search, although we do not right now. * If we do find a match, we need to recurse. That's how "abc" would * be handled. * If we fail, we return the non-matching case. That can be an expansion * itself (it would never be a contraction). */ void searchContractions(int offset) { offset++; // skip backwardsoffset int goal = input[inputPos++]; int length = contractionMatches[offset]; int limit = offset + 1 + length; for (int i = offset + 1; i < limit; ++i) { int cc = contractionMatches[i]; if (cc > goal) { // definitely failed processCE(contractionCEs[offset]); break; } else if (cc == goal) { // found match processCE(contractionCEs[i]); break; } } } /** * Copy the expansion collation elements up to the terminator. * Do not use 00000000 as a terminator, because that may be a valid CE. * These elements do not recurse. */ void copyExpansions (int offset) { int ce = expansions[offset++]; while (ce != 0xFFFFFFFF) { output[outputPos++] = ce; ce = expansions[offset++]; } }
This section describes the generation of the Default Unicode Collation Element Table (DUCET), and the assignment of weights to code points that are not explicitly mentioned in that table. The assignment of weights uses information derived from the Unicode Character Database [UAX44].
CJK ideographs and Hangul syllables are not explicitly mentioned in the default table. CJK ideographs are mapped to collation elements that are derived from their Unicode code point value as described in Section 7.1.3, Implicit Weights. For a discussion of derived collation elements for Hangul syllables and other issues related to the collation of Korean, see Section 7.1.5, Hangul Collation.
Unicode strings sometimes contain ill-formed code unit sequences. Such ill-formed sequences must not be interpreted as valid Unicode characters. See Section 3.2, Conformance Requirements in [Unicode]. For example, expressed in UTF-32, a Unicode string might contain a 32-bit value corresponding to an surrogate code point (General_Category Cs) or an out-of range value (< 0 or > 10FFFF), or a UTF-8 string might contain misconverted byte values that cannot be interpreted. Implementations of the Unicode Collation Algorithm may choose to treat such ill-formed code unit sequences as error conditions and respond appropriately, such as by throwing an exception.
An implementation of the Unicode Collation Algorithm may also choose not to treat ill-formed sequences as an error condition, but instead to give them explicit weights. This strategy provides for determinant comparison results for Unicode strings, even when they contain ill-formed sequences. However, to avoid security issues when using this strategy, ill-formed code sequences should not be given an ignorable primary weight. There are two recommended approaches, based on how these ill-formed sequences are typically handled by character set converters. The first approach is to weight each maximal ill-formed subsequence as if it were U+FFFD REPLACEMENT CHARACTER. (For more information about maximal ill-formed subsequences, see Section 3.9, Unicode Encoding Forms in [Unicode].) A second approach is to generate an implicit weight for any surrogate code point as if it were an unassigned code point, using the method of Section 7.1.3, Implicit Weights.
Each unassigned code point and each other code point that is not explicitly mentioned in the table is mapped to a sequence of two collation elements as described in Section 7.1.3, Implicit Weights.
This section describes how a code point is mapped to an implicit weight. The result of this process consists of collation elements that are sorted in code point order, that do not collide with any explicit values in the table, and that can be placed anywhere (for example, at BASE) with respect to the explicit collation element mappings. By default, implicit mappings are given higher weights than all explicit collation elements (except those with decompositions to characters with implicit weights).
To derive the collation elements, the value of the code point is used to calculate two numbers, by bit shifting and bit masking. The bit operations are chosen so that the resultant numbers have the desired ranges for constructing implicit weights. The first number is calculated by taking the code point expressed as a 32-bit binary integer CP and bit shifting it right by 15 bits. Because code points range from U+0000 to U+10FFFF, the result will be a number in the range 0 to 2116 (= 3310). This number is then added to the special value BASE.
AAAA = BASE + (CP >> 15);
Now mask off the bottom 15 bits of CP. OR a 1 into bit 15, so that the resultant value is non-zero.
BBBB = (CP & 0x7FFF) | 0x8000;
AAAA and BBBB are interpreted as unsigned 16-bit integers. The implicit weight mapping given to the code point is then constructed as:
[AAAA.0020.0002][BBBB.0000.0000]
If a fourth or higher weights are used, then the same pattern is followed for those weights. They are set to a non-zero value in the first collation element and zero in the second. (Because all distinct code points have a different AAAA/BBBB combination, the exact non-zero value does not matter.)
The value for BASE depends on the type of character. The first BASE value is for the core Han Unified Ideographs. The second BASE value is for all other Unified Han ideographs. In both of these cases, compatibility decomposables are excluded, because they are otherwise handled in the UCA. Unassigned code points are also excluded from these first two BASE values. The final BASE value is for all other code points, including unassigned code points.
Base | Applicable Ranges |
---|---|
FB40 | Unified_Ideograph=True AND ((Block=CJK_Unified_Ideograph) OR (Block=CJK_Compatibility_Ideographs)) In regex notation: [\p{unified_ideograph}&[\p{Block=CJK_Unified_Ideographs}\p{Block=CJK_Compatibility_Ideographs}]] |
FB80 | Unified_Ideograph=True AND NOT ((Block=CJK_Unified_Ideograph) OR (Block=CJK_Compatibility_Ideographs)) In regex notation: [\p{unified ideograph}-[\p{Block=CJK_Unified_Ideographs}\p{Block=CJK_Compatibility_Ideographs}]] |
FBC0 | Any other code point |
These results make AAAA (in each case) larger than any explicit primary weight; thus the implicit weights will not collide with explicit weights. It is not generally necessary to tailor these values to be within the range of explicit weights. However if this is done, the explicit primary weights must be shifted so that none are between each of the BASE values and BASE + 34.
The range of 1,024 primary weights from FC00 to FFFF is available for use as trailing weights.
In many writing systems, the convention for collation is to order by syllables (or other units similar to syllables). In most cases a good approximation to syllabic ordering can be obtained in the UCA by weighting initial elements of syllables in the appropriate primary order, followed by medial elements (such as vowels), followed by final elements, if any. The default weights for the UCA in the DUCET are assigned according to this general principle for many scripts. This approach handles syllables within a given script fairly well, but unexpected results can occur when syllables of different lengths are adjacent to characters with higher primary weights, as illustrated in the following example:
Case 1 | Case 2 | ||||||||
---|---|---|---|---|---|---|---|---|---|
|
|
In this example, the symbols {G}, {A}, and {K} represent letters in a script where syllables (or other sequences of characters) are sorted as units. By proper choice of weights for the individual letters, the syllables can be ordered correctly. However, the weights of the following characters may cause syllables of different lengths to change order. Thus {G}{A}{K} comes after {G}{A} in Case 1, but in Case 2, it comes before. That is, the order of these two syllables would be reversed when each is followed by a CJK ideograph, with a high primary weight: in this case, U+4E8B (事).
This unexpected behavior can be avoided by using trailing weights to tailor the non-initial letters in such syllables. The trailing weights, by design, have higher values than the primary weights for characters in all scripts, including the implicit weights used for CJK ideographs. Thus in the example, if {K} is tailored with a trailing weight, it would have a higher weight than any CJK ideograph, and as a result, the relative order of the two syllables {G}{A}{K} and {G}{A} would not be affected by the presence of a CJK ideograph following either syllable.
The Hangul script for Korean is in a rather unique position, because of its large number of precomposed syllable characters, and because those precomposed characters are the normal (NFC) form of interchanged text. For Hangul syllables to sort correctly, either the DUCET table must be tailored or both the UCA algorithm and the table must be tailored. The essential problem results from the fact that Hangul syllables can also be represented with a sequence of conjoining jamo characters and because syllables represented that way may be of different lengths, with or without a trailing consonant jamo. That introduces the trailing weights problem, as discussed in Section 7.1.4, Trailing Weights. This section describes several approaches which implementations may take for tailoring to deal with the trailing weights problem for Hangul.
Note: The Unicode Technical Committee recognizes that it would be preferable if a single "best" approach could be standardized and incorporated as part of the specification of the UCA algorithm and the DUCET table. However, picking a solution requires working out a common approach to the problem with the ISO SC2 OWG-Sort group, which takes considerable time. In the meantime, implementations can choose among the various approaches discussed here, when faced with the need to order Korean data correctly.
The following discussion makes use of definitions and abbreviations from Section 3.12, Conjoining Jamo Behavior in [Unicode]. In addition, a special symbol (Ⓣ) is introduced to indicate a terminator weight. For convenience in reference, these conventions are summarized here:
Description | Abbr. | Weight |
---|---|---|
Leading consonant | L | WL |
Vowel | V | WV |
Trailing consonant | T | WT |
Terminator weight | - | Ⓣ |
Simple Method
The specification of the Unicode Collation Algorithm requires that Hangul syllables be decomposed. However, if the weight table is tailored so that the primary weights for Hangul jamo are adjusted, then the Hangul syllables can be left as single code points and be treated in much the same way as CJK ideographs. The adjustment is specified as follows:
The net effect of such a tailoring is to provide a Hangul collation which is approximately equivalent to one of the more complex methods specified below. This may be sufficient in environments where individual jamo are not generally expected.
Three more complex and complete methods are spelled out below. First the nature of the tailoring is described. Then each method is exemplified, showing the implications for the relative weighting of jamo and illustrating how each method produces correct results.
Each of these three methods can correctly represent the ordering of all Hangul syllables, both for modern Korean and for Old Korean. However, there are implementation trade-offs between them. These trade-offs can have a significant impact on the acceptability of a particular implementation. For example, substantially longer sort keys will cause serious performance degradations and database index bloat. Some of the pros and cons of each method are mentioned in the discussion of each example. Note that if the repertoire of supported Hangul syllables is limited to those required for modern Korean (those of the form LV or LVT), then each of these methods becomes simpler to implement.
Data Method
For example, if L1 has a primary weight of 555, and L2 has a primary weight of 559, then the sequence L1L2 would be treated as a contraction and be given a primary weight chosen from the range 556 to 558.
Terminator Method
The details of the algorithm for parsing Hangul data into standard Korean syllable blocks can be found in Section 8, Hangul Syllable Boundary Determination of [UAX29]
Interleaving Method
The interleaving method requires tailoring both the DUCET table and the way the algorithm handles Korean text.
Generate a tailored weight table by assigned an explicit primary weight to each precomposed Hangul syllable character, with a 1-weight gap between each one. (See Section 6.2, Large Weight Values.)
Separately define a small, internal table of jamo weights. This internal table of jamo weights is separate from the tailored weight table, and is only used when processing standard Korean syllable blocks. Define this table as follows:
When processing a string to assign collation weights, whenever a substring of jamo and/or precomposed Hangul syllables in encountered, break it into standard Korean syllable blocks. For each syllable identified, assign a weight as follows:
Data Method Example
The data method provides for the following order of weights, where the Xb are all the scripts sorted before Hangul, and the Xa are all those sorted after.
Xb | L | Xa | T | V |
This ordering gives the right results among the following:
Chars | Weights | Comments | |||
---|---|---|---|---|---|
L1V1Xa | WL1 | WV1 | WXa | ||
L1V1L... | WL1 | WV1 | WLn | ... | |
L1V1Xb | WL1 | WV1 | WXb | ||
L1V1T1 | WL1 | WV1 | WT1 | Works because WT > all WX and WL | |
L1V1V2 | WL1 | WV1 | WV2 | Works because WV > all WT | |
L1L2V1 | WL1L2 | WV1 | Works if L1L2 is a contraction |
The disadvantages of the data method are that the weights for T and V are separated from those of L, which can cause problems for sort key compression, and that a combination of LL that is outside the contraction table will not sort properly.
Terminator Method Example
The terminator method would assign the following weights:
Ⓣ | Xb | T | V | L | Xa |
This ordering gives the right results among the following:
Chars | Weights | Comments | ||||
---|---|---|---|---|---|---|
L1V1Xa | WL1 | WV1 | Ⓣ | WXa | ||
L1V1Ln... | WL1 | WV1 | Ⓣ | WLn | ... | |
L1V1Xb | WL1 | WV1 | Ⓣ | WXb | ||
L1V1T1 | WL1 | WV1 | WT1 | Ⓣ | Works because WT > all WX and Ⓣ | |
L1V1V2 | WL1 | WV1 | WV2 | Ⓣ | Works because WV > all WT | |
L1L2V1 | WL1 | WL2 | WV1 | Ⓣ | Works because WL > all WV |
The disadvantages of the terminator method are that an extra weight is added to all Hangul syllables, increasing the length of sort keys by roughly 40%, and the fact that the terminator weight is non-contiguous can disable sort key compression.
Interleaving Method Example
The interleaving method provides for the following assignment of weights. Wn represents the weight of a Hangul syllable, and Wn' is the weight of the gap right after it. The L, V, T weights will only occur after a W, and thus can be considered part of an entire weight.
Xb | W | Xa |
byte weights:
Ⓣ | T | V | L |
This ordering gives the right results among the following:
Chars | Weights | Comments | ||
---|---|---|---|---|
L1V1Xa | Wn | Xa | ||
L1V1Ln... | Wn | Wk | ... | The Ln will start another syllable |
L1V1Xb | Wn | Xb | ||
L1V1T1 | Wm | Works because Wm > Wn | ||
L1V1V2 | Wm'L1V1V2Ⓣ | Works because Wm' > Wm | ||
L1L2V1 | Wm'L1L2V1Ⓣ | Works because the byte weight for L2 > all V |
The interleaving method is somewhat more complex than the others, but produces the shortest sort keys for all of the precomposed Hangul syllables, so for normal text it will have the shortest sort keys. If there were a large percentage of ancient Hangul syllables, the sort keys would be longer than other methods.
Characters are given tertiary weights according to Table 18. The Decomposition Type is from the Unicode Character Database [UAX44]. The Case or Kana Subtype entry refers either to a case distinction or to a specific list of characters. The weights are from MIN = 2 to MAX = 1F16, excluding 7, which is not used for historical reasons. The Samples show some minimal values that are distinguished by the different weights. All values are distinguished. The samples have empty cells when there are no (visible) values showing a distinction.
Decomposition Type | Case or Kana Subtype | Weight | Samples | |||||
---|---|---|---|---|---|---|---|---|
NONE |
0x0002 |
i | ب | ) | mw | 1⁄2 | X | |
<wide> |
0x0003 |
i | ||||||
<compat> |
0x0004 |
ⅰ | ||||||
<font> |
0x0005 |
ℹ | ||||||
<circle> |
0x0006 |
ⓘ | ||||||
!unused! |
0x0007 |
|||||||
NONE |
Uppercase | 0x0008 |
I | MW | ||||
<wide> |
Uppercase | 0x0009 |
I | ) | ||||
<compat> |
Uppercase | 0x000A |
Ⅰ | |||||
<font> |
Uppercase | 0x000B |
ℑ | |||||
<circle> |
Uppercase | 0x000C |
Ⓘ | |||||
<small> |
small hiragana (3041, 3043, ...) | 0x000D |
ぁ | |||||
NONE |
normal hiragana (3042, 3044, ...) | 0x000E |
あ | |||||
<small> |
small katakana (30A1, 30A3, ...) | 0x000F |
﹚ | ァ | ||||
<narrow> |
small narrow katakana (FF67..FF6F) | 0x0010 |
ァ | |||||
NONE |
normal katakana (30A2, 30A4, ...) | 0x0011 |
ア | |||||
<narrow> |
narrow katakana (FF71..FF9D), narrow hangul (FFA0..FFDF) |
0x0012 |
ア | |||||
<circle> |
circled katakana (32D0..32FE) | 0x0013 |
㋐ | |||||
<super> |
0x0014 |
⁾ | ||||||
<sub> |
0x0015 |
₎ | ||||||
<vertical> |
0x0016 |
︶ | ||||||
<initial> |
0x0017 |
ﺑ | ||||||
<medial> |
0x0018 |
ﺒ | ||||||
<final> |
0x0019 |
ﺐ | ||||||
<isolated> |
0x001A |
ﺏ | ||||||
<noBreak> |
0x001B |
|||||||
<square> |
0x001C |
㎽ | ||||||
<square>, <super>, <sub> |
Uppercase | 0x001D |
㎿ | |||||
<fraction> |
0x001E |
½ | ||||||
n/a |
(MAX value) | 0x001F |
The MAX weight is used to prevent certain overlaps between collation elements. In particular, if it were not used there would be some unexpected cases where X ≠3 Y, but XY =3 YX. For example, consider:
2024 ; [0273.0020.0004] # ONE DOT LEADER
2025 ; [0273.0020.0004][0273.0020.001F] # TWO DOT LEADER
If the last CE were not given MAX (1F16) as a tertiary, then
The assignment of fourth-level weights in the Default Unicode Collation Element Table is done as follows:
Conditions 1 to 3 are disjoint. Condition 4 is the default assignment of fourth-level weights; it applies in any case which does not meet one of the first three conditions.
Language-sensitive searching and matching are closely related to collation. Strings that compare as equal at some strength level should be matched when doing language-sensitive matching. For example, at a primary strength, "ß" would match against "ss" according to the UCA, and "aa" would match "å" in a Danish tailoring of the UCA. The main difference from the collation comparison operation is that the ordering is not important. Thus for matching it does not matter that "å" would sort after "z" in a Danish tailoring—the only relevant information is that they do not match.
The basic operation is matching: determining whether string X matches string Y. Other operations are built on this:
The collation settings determine the results of the matching operation (see Section 5.1, Parametric Tailoring). Thus users of searching and matching need to be able to modify parameters such as locale or comparison strength. For example, setting the strength to exclude differences at Level 3 has the effect of ignoring case and compatibility format distinctions between letters when matching. Excluding differences at Level 2 has the effect of also ignoring accentual distinctions when matching.
Conceptually, a string matches some target where a substring of the target has the same sort key, but there are a number of complications:
The following are used to provide a clear definition of searching and matching that deal with the above complications:
DS1. Define S[start,end] to be the substring of S that includes the character after the offset start up to the character before offset end. For example, if S is "abcd", then S[1,3] is "bc". Thus S = S[0,length(S)].
DS1a. A boundary condition is a test imposed on an offset within a string. An example includes Whole Word Search, as defined in [UAX29].
The tailoring parameter match-boundaries specifies constraints on matching (see Section 5.1, Parametric Tailoring). The parameter match-boundaries=whole-character requires that the start and end of a match each be on a grapheme boundary. The value match-boundaries=whole-character further requires that the start and end of a match each be on a word boundary as well. For more information on the specification of these boundaries, see [UAX29].
By using grapheme-complete conditions, contractions and combining sequences are not interrupted except in edge cases. This also avoids the need to present visually discontiguous selections to the user (except for BIDI text).
Suppose there is a collation C, a pattern string P and a target string Q, and a boundary condition B. C has some particular set of attributes, such as a strength setting, and choice of variable weighting.
DS2. The pattern string P has a match at Q[s,e] according to collation C if C generates the same sort key for P as for Q[s,e], and the offsets s and e meet the boundary condition B. One can also say P has a match in Q according to C.
DS3. The pattern string P has a canonical match at Q[s,e] according to collation C if there is some Q' that is canonically equivalent to Q[s,e], and P has a match in Q'.
For example, suppose that P is "Å", and Q is "...A◌̥◌̊...". There would not be a match for P in Q, but there would be a canonical match, because P does have a match in "A◌̊◌̥", which is canonically equivalent to "A◌̥◌̊". However, it is not commonly necessary to use canonical matches, so this definition is only supplied for completeness.
Each of the following definitions is a qualification of DS2 or DS3:
DS3a. The match is grapheme-complete if B requires that the offset be at a grapheme cluster boundary. Note that Whole Word Search as defined in [UAX29] is grapheme complete.
DS4. The match is minimal if there is no match at Q[s+i,e-j] for any i and j such that i ≥ 0, j ≥ 0, and i + j > 0. In such a case, one can also say that P has a minimal match at Q[s,e].
DS4a. A medial match is determined in the following way:
DS4b. The match is maximal if there is no match at Q[s-i,e+j] for any i and j such that i ≥ 0, j ≥ 0, and i + j > 0. In such a case, one can also say that P has a maximal match at Q[s,e].
Figure 5 illustrates the differences between these type of matches, where the collation strength is set to ignore punctuation and case, and format indicates the match.
Text | Description | |
---|---|---|
Pattern | *!abc!* | Notice that the *! and !* are ignored in matching. |
Target Text | def$!Abc%$ghi | |
Minimal Match | def$!Abc%$ghi | The minimal match is the tightest one, because $! and %$ are ignored in the target. |
Medial Match | def$!Abc%$ghi | The medial one includes those characters that are binary equal. |
Maximal Match | def$!Abc%$ghi | The maximal match is the loosest one, including the surrounding ignored characters. |
By using minimal, maximal, or medial matches, the issue with ignorables is avoided. Medial matches tend to match user expectations the best.
When an additional condition is set on the match, the types (minimal, maximal, medial) are based on the matches that meet that condition. Consider the example in Figure 6.
Value | Notes | |
---|---|---|
Pattern | abc | |
Strength | primary | thus ignoring combining marks, punctuation |
Text | abc¸-°d | two combining marks, cedilla and ring |
Matches | |abc|¸|-|°|d | four possible end points, indicated by | |
If, for example, the condition is Whole Grapheme, then the matches are restricted to "abc¸|-°|d", thus discarding match positions that would not be on a grapheme cluster boundary. In this case the minimal match would be "abc¸|-°d"
DS6. The first forward match for P in Q starting at b is the least offset s greater than or equal to b such that for some e, P matches within Q[s,e].
DS7. The first backward match for P in Q starting at b is the greatest offset s less than or equal to b such that for some e, P matches within Q[s,e].
In DS6 and DS7, matches can be minimal, medial, or maximal; the only requirement is that the combination in use in DS6 and DS7 be specified. Of course, a possible match can also be rejected on the basis of other conditions, such as being grapheme-complete or applying Whole Word Search, as described in [UAX29]).
The choice of medial or minimal matches for the "starts with" or "ends with" operations only affects the positioning information for the end of the match or start of the match, respectively.
Special Cases. Ideally, the UCA at a secondary level would be compatible with the standard Unicode case folding and removal of compatibility differences, especially for the purpose of matching. For the vast majority of characters, it is compatible, but there are the following exceptions:
In practice, most of these differences are not important for modern text, with one exception: the German ß. Implementations should consider tailoring ß to have a tertiary difference from SS, at least when collation tables are used for matching. Where full compatibility with case and compatibility folding are required, either the text can be preprocessed, or the UCA tables can be tailored to handle the outlying cases.
Matching can be done by using the collation elements, directly, as discussed above. However, because matching does not use any of the ordering information, the same result can be achieved by a folding. That is, two strings would fold to the same string if and only if they would match according to the (tailored) collation. For example, a folding for a Danish collation would map both "Gård" and "gaard" to the same value. A folding for a primary-strength folding would map "Resume" and "résumé" to the same value. That folded value is typically a lowercase string, such as "resume".
A comparison between folded strings cannot be used for an ordering of strings, but it can be applied to searching and matching quite effectively. The data for the folding can be smaller, because the ordering information does not need to be included. The folded strings are typically much shorter than a sort key, and are human-readable, unlike the sort key. The processing necessary to produce the folding string can also be faster than that used to create the sort key.
The following is an example of the mappings used for such a folding using to the [CLDR] tailoring of UCA:
Parameters:
{locale=da_DK, strength=secondary, alternate=shifted}
Mapping:
... ª → a Map compatibility (tertiary) equivalents, such as full-width and superscript characters, to representative character(s) a → a A → a A → a ª → a ... å → aa Map contractions (a + ring above) to equivalent values Å → aa ...
Once the table of such mappings is generated, the folding process is a simple longest-first match-and-replace: a string to be folded is first converted to NFD, then at each point in the string, the longest match from the table is replaced by the corresponding result.
However, ignorable characters need special handling. Characters that are fully ignorable at a given strength level level normally map to the empty string. For example, at strength=quaternary, most controls and format characters map to the empty string; at strength=primary, most combining marks also map to the empty string. In some contexts, however, fully ignorable characters may have an effect on comparison, or characters that are not ignorable at the given strength level may be treated as ignorable.
Users often find asymmetric searching to be a useful option. When doing an asymmetric search, a character (or grapheme cluster) in the query that is unmarked at the secondary and/or tertiary levels will match a character in the target that is either marked or unmarked at the same levels, but a character in the query that is marked at the secondary and/or tertiary levels will only match a character in the target that is marked in the same way.
At a given level, a character is unmarked if it has the lowest collation
weight for that level. For the tertiary level, a plain lowercase ‘r’ would normally be treated as
unmarked, while the uppercase, fullwidth, and circled characters ‘R’, ‘r’, ‘ⓡ’ would be treated
as marked. There is an exception for kana characters, where the "normal" form is unmarked: 0x000E
for hiragana and 0x0011
for katakana.
For the secondary level, an unaccented ‘e’ would be treated as unmarked, while the accented letters ‘é’, ‘è’ would (in English) be treated as marked. Thus in the following examples, a lowercase query character matches that character or the uppercase version of that character even if strength is set to tertiary, and an unaccented query character matches that character or any accented version of that character even if strength is set to secondary.
Query | Target Matches |
---|---|
resume | resume, Resume, RESUME, résumé, rèsumè, Résumé, RÉSUMÉ, … |
Resume | Resume, RESUME, Résumé, RÉSUMÉ, … |
résumé | résumé, Résumé, RÉSUMÉ, … |
Résumé | Résumé, RÉSUMÉ, … |
けんこ | けんこ, げんこ, けんご, げんご, … |
げんご | げんご, … |
Query | Target Matches |
---|---|
resume | resume, Resume, RESUME, résumé, rèsumè, Résumé, RÉSUMÉ, … |
Resume | resume, Resume, RESUME, résumé, rèsumè, Résumé, RÉSUMÉ, … |
résumé | résumé, Résumé, RÉSUMÉ, … |
Résumé | résumé, Résumé, RÉSUMÉ, … |
けんこ | けんこ, ケンコ, げんこ, けんご, ゲンコ, ケンゴ, げんご, ゲンゴ, … |
げんご | げんご, ゲンゴ, … |
When doing an asymmetric search, there are many ways in which results might be returned:
The data files for each version of UCA are located in versioned subdirectories in [Data10]. The main data file with the DUCET data for each version is "allkeys.txt" [Allkeys].
Starting with Version 3.1.1 of UCA, the data directory also contains CollationTest.zip, a zipped file containing conformance test files. The documentation file CollationTest.html describes the format and use of those test files. See also [Tests10].
Starting with Version 6.0.0 of UCA, the data directory also contains CollationAuxiliary.zip, a zipped file containing auxiliary data files used for CLDR, and for comparison across versions. The documentation file CollationAuxiliary.html describes the format and use of those data files.
There is often a good deal of confusion about what is meant by the terms "stable" or "deterministic" when applied to sorting or comparison. This confusion in terms often leads people to make mistakes in their software architecture, or make choices of language-sensitive comparison options that have significant impact in terms of performance and footprint, and yet do not give the results that users expect.
A stable sort is one where two records will retain their order when sorted according to a particular field, even when the two fields have the same contents. Thus those two records come out in the same relative order that they were in before sorting, although their positions relative to other records may change. Importantly, this is a property of the sort algorithm, not the comparison mechanism.
Two examples of differing sort algorithms are Quicksort and Merge sort. Quicksort is not stable while Merge sort is stable. (A Bubble sort, as typically implemented, is also stable.)
Assume the following records:
Record | Last_Name | First_Name |
---|---|---|
1 | Davis | John |
2 | Davis | Mark |
3 | Curtner | Fred |
The results of a Merge sort on the Last_Name field only are:
Record | Last_Name | First_Name |
---|---|---|
3 | Curtner | Fred |
1 | Davis | John |
2 | Davis | Mark |
The results of a Quicksort on the Last_Name field only are:
Record | Last_Name | First_Name |
---|---|---|
3 | Curtner | Fred |
2 | Davis | Mark |
1 | Davis | John |
As is apparent, the Quicksort algorithm is not stable; records 1 and 2 are not in the same order they were in before sorting.
A stable sort is often desirable—for one thing, it allows records to be successively sorted according to different fields, and to retain the correct lexicographic order. Thus, with a stable sort, one could sort all the records by First_Name, and then sort them again by Last_Name, giving the desired results: that all records would be ordered by Last_Name, and in the case where the Last_Name values are the same, be further subordered by First_Name.
A deterministic sort is a very different beast. This is a sort algorithm that returns the same results each time. On the face of it, it would seem odd for any sort algorithm to not be deterministic, but there are examples of real-world sort algorithms that are not.
The key concept is that these sort algorithms are deterministic when two records have unequal fields, but they may return different results at different times when two records have equal fields.
For example, a classic Quicksort algorithm works recursively on ranges of records. For any given range of records, it takes the first element as the pivot element. However, that algorithm performs badly with input data that happens to be already sorted (or mostly sorted). A randomized Quicksort, which picks a random element as the pivot, can on average be faster. Because of this random selection, different outputs can result from exactly the same input: the algorithm is not deterministic.
|
or |
|
As another example, multiprocessor sort algorithms can be non-deterministic. The work of sorting different blocks of data is farmed out to different processors and then merged back together. The ordering of records with equal fields might be different according to when different processors finish different tasks.
Note that a deterministic sort is weaker than a stable sort. A stable sort is always deterministic, but not vice versa. Typically, when people say they want a deterministic sort, they really mean that they want a stable sort.
A deterministic comparison is different than either a stable sort or a deterministic sort; it is a property of a comparison function, not a sort algorithm. This is a comparison where strings that do not have identical binary contents (optionally, after some process of normalization) will compare as unequal. A deterministic comparison is sometimes called a stable (or semi-stable) comparison.
There are many people who confuse a deterministic comparison with a deterministic (or stable) sort, but this ignores the fundamental difference between a comparison and a sort. A comparison is used by a sort algorithm to determine the relative ordering of two fields, such as strings. Using a deterministic comparison cannot cause a sort to be deterministic, nor to be stable. Whether a sort is deterministic or stable is a property of the sort algorithm, not the comparison function, as the prior examples show.
One can produce a deterministic comparison function from a non-deterministic one, in the following way (in pseudo-code):
int new_compare (String a, String b) { int result = old_compare(a, b); if (result == 0) { result = binary_compare(a, b); } return result; }
Programs typically also provide the facility to generate a sort key,
which is a sequences of bytes generated from a string in alignment with a comparison
function. Two sort keys will binary-compare in the same order as their original
strings. The simplest means to create a deterministic sort key that aligns with the above
new_compare
is to append a copy of the original
string to the sort key. This will force the comparison to be deterministic.
byteSequence new_sort_key (String a) { return old_sort_key(a) + SEPARATOR + toByteSequence(a); }
Because sort keys and comparisons must be aligned, a sort key generator is deterministic if and only if a comparison is.
A deterministic comparison is generally not best practice. First, it has a certain performance cost in comparison, and a quite substantial impact on sort key size. (For example, ICU language-sensitive sort keys are generally about the size of the original string, so appending a copy of the original string generally doubles the size of the sort key.) A database using these sort keys can have substantially increased disk footprint and memory footprint, and consequently reduced performance.
More importantly, a deterministic comparison function does not actually achieve the effect people think it will have. Look at the sorted examples above. Whether a deterministic comparison is used or not, there will be no effect on the Quicksort example, because the two records in question have identical Last_Name fields. It does not make a non-deterministic sort into a deterministic one, nor does it make a non-stable sort into a stable one.
Thirdly, a deterministic comparison is often not what is wanted, when people look closely at the implications. Look at the example again, and suppose that this time the user is sorting first by last name, then by first name.
Record | Last_Name | First_Name |
---|---|---|
1 | Davis | John |
2 | Davis | Mark |
3 | Curtner | Fred |
The desired results are the following, which should result whether the sort algorithm is stable or not, because it uses both fields.
Record | Last_Name | First_Name |
---|---|---|
3 | Curtner | Fred |
1 | Davis | John |
2 | Davis | Mark |
Now suppose that in record 2, the source for the data caused the last name to contain a format control character, such as a ZWJ (used to request ligatures on display). In this case there is no visible distinction in the forms, because the font does not have any ligatures for these sequences of Latin letters. The default UCA collation weighting causes the ZWJ to be—correctly—ignored in comparison, since it should only affect rendering. However, if that comparison is changed to be deterministic (by appending the binary values for the original string), then unexpected results will occur.
Record | Last_Name | First_Name |
---|---|---|
3 | Curtner | Fred |
2 | Davis | Mark |
1 | Davis | John |
Typically, what people really want when they say they want a deterministic comparison is actually a stable sort.
One can force a non-stable sort algorithm to produce stable results by how one does the comparison. However, this has literally nothing to do with making the comparison deterministic or not. Forcing stable results can be done by appending the current record number to the strings to be compared. (The implementation may not actually append the number; it may use some other mechanism, but the effect would be the same.)
If such a modified comparison is used, for example, it forces Quicksort to get the same results as a Merge sort. In that case, the irrelevant character ZWJ does not affect the outcome. The correct results occur, as illustrated below.
Record | Last_Name | First_Name |
---|---|---|
3 | Curtner | Fred |
1 | Davis | John |
2 | Davis | Mark |
If anything, this then is what users want when they say they want a deterministic comparison. See also Section 1.6, Merging Sort Keys.
There are a few other terms worth mentioning, simply because they are also subject to considerable confusion. Any or all of the following terms may be easily confused with the discussion above.
A stable comparison is one that does not change over successive software versions. That is, as one uses successive versions of an API, with the same "settings" (such as locale), one gets the same results.
A stable sort key generator is one that generates the same binary sequence over successive software versions.
Warning: If the sort key generator is stable, then the associated comparison will perforce be. However, the reverse is not guaranteed. To take a trivial example, suppose the new version of the software always adds an 0xFF byte at the front of every sort key. The results of any comparison of any two new keys would be identical to the results of the comparison of any two corresponding old keys. However, the bytes have changed, and the comparison of old and new keys would give different results. Thus one can have a stable comparison, yet an associated non-stable sort key generator.
A portable comparison is where corresponding APIs for comparison produce the same results across different platforms. That is, if one uses the same "settings" (such as locale), one gets the same results.
A portable sort key generator is where corresponding sort key APIs produce exactly the same sequence of bytes across different platforms.
Warning: As above, a comparison may be portable without the associated sort key generator being portable.
Ideally, all products would have the same string comparison and sort key generation for, say Swedish, and thus be portable. For historical reasons, this is not the case. Even if the main letters sort the same, there will be differences in the handling of other letters, or of symbols, punctuation, and other characters. There are some libraries that offer portable comparison, such as [ICUCollator], but in general the results of comparison or sort key generation may vary significantly between different platforms.
In a closed system, or in simple scenarios, portability may not matter. Where someone has a given set of data to present to a user, and just wants the output to be reasonably appropriate for Swedish, the exact order on the screen may not matter.
In other circumstances, differences can lead to data corruption. For example, suppose that two implementations do a database SELECT for records between a pair of strings. If the collation is different in the least way, they can get different data results. Financial data might be different, for example, if a city is included in one SELECT on one platform and excluded from the same SELECT on another platform.
The Unicode Collation Algorithm is maintained in synchronization with the International Standard, ISO/IEC 14651 [ISO14651]. Although the presentation and text of the two standards are rather distinct, the approach toward the architecture of multi-level collation weighting and string comparison is closely aligned. In particular, the synchronization between the two standards is built around the data tables which define the default (or tailorable) weights. The UCA adds many additional specifications, implementation guidelines, and test cases, over and above the synchronized weight tables. This relationship between the two standards is similar to that maintained between the Unicode Standard and ISO/IEC 10646.
For each version of the UCA, the Default Unicode Collation Element Table (DUCET) [Allkeys] is constructed based on the repertoire of the corresponding version of the Unicode Standard. The synchronized version of ISO/IEC 14651 has a Common Tailorable Template (CTT) table built for the same repertoire and ordering. The two tables are constructed with a common tool, to guarantee identical default (or tailorable) weight assignments. The CTT table for ISO/IEC 14651 is constructed using only symbols, rather than explicit integral weights, and with the Shift-Trimmed option for variable weighting.
The detailed synchronization points between versions of UCA and published editions (or amendments) of ISO/IEC 14651 are shown in Table 19.
UCA Version | UTS #10 Date | DUCET File Date | ISO/IEC 14651 Reference |
---|---|---|---|
6.2.0 | 2012-08-30 | 2012-08-14 | --- |
6.1.0 | 2012-02-01 | 2011-12-06 | 14561:2011 Amd 1 |
6.0.0 | 2010-10-08 | 2010-08-26 | 14561:2011 (3rd ed.) |
5.2.0 | 2009-10-08 | 2009-09-22 | --- |
5.1.0 | 2008-03-28 | 2008-03-04 | 14561:2007 Amd 1 |
5.0.0 | 2006-07-10 | 2006-07-14 | 14561:2007 (2nd ed.) |
4.1.0 | 2005-05-05 | 2005-05-02 | 14561:2001 Amd 3 |
4.0.0 | 2004-01-08 | 2003-11-01 | 14561:2001 Amd 2 |
9.0 (= 3.1.1) | 2002-07-16 | 2002-07-17 | 14561:2001 Amd 1 |
8.0 (= 3.0.1) | 2001-03-23 | 2001-03-29 | 14561:2001 |
6.0 (= 2.1.9) | 2000-08-31 | 2000-04-18 | --- |
5.0 (= 2.1.9) | 1999-11-22 | 2000-04-18 | --- |
Mark Davis authored most of the original text of this document. Mark Davis and Ken Whistler together have added to and continue to maintain the text.
Thanks to Bernard Desgraupes, Richard Gillam, Kent Karlsson, York Karsunke, Michael Kay, Åke Persson, Roozbeh Pournader, Markus Scherer, Javier Sola, Otto Stolz, Ienup Sung, Yoshito Umaoka, Andrea Vine, Vladimir Weinstein, Sergiusz Wolicki, and Richard Wordingham for their feedback on previous versions of this document, to Jianping Yang and Claire Ho for their contributions on matching, and to Cathy Wissink for her many contributions to the text. Julie Allen helped in copyediting of the text.
[Allkeys] | Default Unicode Collation Element Table (DUCET) For the latest version, see: http://www.unicode.org/Public/UCA/latest/allkeys.txt For the 6.2.0 version, see: http://www.unicode.org/Public/UCA/6.2.0/allkeys.txt |
[CanStd] | CAN/CSA Z243.4.1. For availability see http://www.shopcsa.ca/ |
[CLDR] | Common Locale Data Repository http://unicode.org/cldr/ |
[Data10] | For all UCA implementation and test data For the latest version, see: http://www.unicode.org/Public/UCA/latest/ For the 6.2.0 version, see: http://www.unicode.org/Public/UCA/6.2.0/ For ftp access, see: ftp://www.unicode.org/Public/UCA/ |
[FAQ] | Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues. |
[Feedback] | Reporting Errors and Requesting Information
Online http://www.unicode.org/reporting.html |
[Glossary] | Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents. |
[ICUCollator] | ICU User Guide: Collation Introduction http://userguide.icu-project.org/collation |
[ISO14651] | International Organization for Standardization. Information Technology—International String ordering and comparison—Method for comparing character strings and description of the common template tailorable ordering. (ISO/IEC 14651:2011). For availability see http://www.iso.org |
[JavaCollator] |
http://docs.oracle.com/javase/6/docs/api/java/text/Collator.html, http://docs.oracle.com/javase/6/docs/api/java/text/RuleBasedCollator.html |
[Reports] | Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports. |
[Sample] | http://www.unicode.org/reports/tr10/Sample/ |
[SortAlg] | For background on the names and characteristics
of different sorting methods, see http://en.wikipedia.org/wiki/Sorting_algorithm |
[Tests10] | Conformance Test and Documentation For the latest version, see: http://www.unicode.org/Public/UCA/latest/CollationTest.html http://www.unicode.org/Public/UCA/latest/CollationTest.zip For the 6.2.0 version, see: http://www.unicode.org/Public/UCA/6.2.0/CollationTest.html http://www.unicode.org/Public/UCA/6.2.0/CollationTest.zip |
[UAX15] | UAX #15: Unicode Normalization Forms http://www.unicode.org/reports/tr15/ |
[UAX29] | UAX #29: Unicode Text Segmentation http://www.unicode.org/reports/tr29/ |
[UAX44] | UAX #44: Unicode Character Database http://www.unicode.org/reports/tr44/ |
[Unicode] | The Unicode Consortium. The Unicode Standard, Version 6.2.0
(Mountain View, CA: The Unicode Consortium, 2012. ISBN 978-1-936213-07-8) http://www.unicode.org/versions/Unicode6.2.0/ |
[Unstable] | For a definition of stable sorting, see http://planetmath.org/encyclopedia/UnstableSortingAlgorithm.html |
[UTN5] | UTN #5: Canonical Equivalence in Applications http://www.unicode.org/notes/tn5/ |
[UTS18] | UTS #18: Unicode Regular Expressions http://www.unicode.org/reports/tr18/ |
[UTS35] | UTS #35: Unicode Locale Data Markup Language (LDML) http://www.unicode.org/reports/tr35/ |
[Versions] | Versions of the Unicode Standard http://www.unicode.org/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them. |
This section summarizes important migration issues which may impact implementations of the Unicode Collation Algorithm when they are updated to a new version.
The following summarizes modifications from the previous revisions of this document.
Revision 25 being a proposed update, only changes between revisions 26 and 24 are noted here.
Revision 23 being a proposed update, only changes between revisions 24 and 22 are noted here.
Revision 21 being a proposed update, only changes between revisions 22 and 20 are noted here.
Revision 19 being a proposed update, only changes between revisions 20 and 18 are noted here.
Revision 17 being a proposed update, only changes between revisions 18 and 16 are noted here.
Revision 15 being a proposed update, only changes between revisions 16 and 14 are noted here.
Data tables for 4.1.0 contain the following changes:
Revisions 12 and 13 being proposed updates, only changes between revisions 14 and 11 are noted here.
Revision 10 being a proposed update, only changes between revisions 11 and 9 are noted here.
Copyright © 1998-2012 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.