UTN #36 - A Categorization of Unicode Characters

Unicode Technical Note #36

A Categorization of Unicode Characters

Authors Ken Whistler (ken@unicode.org)
Date 2011-08-11
This Version http://www.unicode.org/notes/tn36/tn36-1.html
Previous Version n/a
Latest Version http://www.unicode.org/notes/tn36/
Revision 1

Summary

This document presents an approach to the categorization of Unicode characters, and documents a data file that implementers can use as a starting point for defining Unicode character categories.

Status

This document is a Unicode Technical Note. Sole responsibility for its contents rests with the author(s). Publication does not imply any endorsement by the Unicode Consortium.

For information on Unicode Technical Notes, including criteria for acceptance, see https://www.unicode.org/notes/.

Please submit corrigenda and other comments directly to the author. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].


1 Introduction

Implementers frequently want to extract good categories for Unicode characters from the Unicode names list. For example, such categories may be needed to develop new character picker applications, which organize characters into groups that will make sense for people searching for characters in graphic panes or other UI elements.

There are two parts to this problem. First, the existing machine-readable data files in the Unicode Character Database [UCD] do not provide a fine enough categorization to meet the requirements of such applications. For example, the General_Category property distinguishes letters from combining marks and punctuation and symbols, but it doesn't drill down to the next level: independent vowel letters versus consonants versus matras; or game symbols versus map symbols versus zodiac symbols versus dingbats, and so on. Second, people who need more finely detailed categorization have been attempting to extract it by making use of the editorial subheaders used in the printing of the Unicode names list, figuring that such information is better than nothing, and assuming that doing the finer-level classification from scratch would be prohibitively complex.

However, the subheaders in the Unicode names list have always been editorial content aimed primarily at structuring the code charts for display, and are not particularly well-suited to a systematic categorization of Unicode characters in any context more extensive than visual display, one chart at a time. Efforts to revise the subheaders to make them work better for machine-extracted categorization of Unicode characters from the Unicode names list are counterproductive. The subheaders would not work very well if reorganized that way, and the net result would be a significant deterioration of the editorial content of the code charts.

The existing subheaders also group characters which other applications might want to distinguish. For example, the header for the range U+2600..U+260D is "Weather and astrological symbols". But it is possible to do much better, distinguishing more precisely those which are weather symbols, such as U+2602 UMBRELLA, those which are astrological symbols, such as U+260A ASCENDING NODE, and those which really are not either, such as U+2606 WHITE STAR.
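The coarseness of General_Category for these three characters can be checked with a short sketch using Python's standard `unicodedata` module. It is only an illustration: the helper name `describe` is made up for this example, and the values come from whatever version of the UCD the Python build ships with. All three characters report the same General_Category value, So (Symbol, other), which is why a finer categorization is needed to tell them apart.

```python
import unicodedata

# The three characters discussed above, all from the range U+2600..U+260D.
examples = ["\u2602", "\u260A", "\u2606"]

def describe(ch):
    """Return (code point label, General_Category, character name)."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

for ch in examples:
    print(*describe(ch))
```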

This document presents an approach that focuses on the character category distinctions needed by such applications, without being entangled with the editorial requirements for the Unicode names list maintenance. It also describes a data file that implementers can use for further refining of Unicode character categories for particular applications.

2 Character Categories

This section describes the approach taken in this note for providing a set of usable categories for Unicode characters.

2.1 Hierarchical Typology

This scheme of categorization uses a hierarchical typology, which assumes that each category provided may itself be further subdivided at another level into more subcategories. Each subcategorization is, in principle, independent of the subcategorization of other categories. Thus, for example, how one might want to subcategorize letters would typically be quite distinct from how one might most usefully subcategorize punctuation marks. Such an approach departs from the structure of partition properties for Unicode characters, such as the General_Category property. A partition property assumes a single dimension of semantic applicability, and then assigns every character a single value within that dimension. Such a character property is easy to implement, but as users of the General_Category property well know, the drawback of such partitions for categorization is their rigidity and the inability to deal with edge cases, overlapping function, and subcategories.

The approach to categorization taken here makes no assumption that any particular level of the hierarchical subcategorization has any fixed significance. A third-level subcategorization of a punctuation mark might involve rather different salient distinctions than a third-level subcategorization of symbols, for example. The typology basically starts with first-level categories roughly based on the General_Category property, but then may diverge arbitrarily on a category-by-category basis, depending on what is most useful for distinguishing characters within each subgroup.

There is no assumption that all levels have to be specified for all characters. Categories defined this way can be extensible based on what level of detail people find useful to maintain for various characters. There is also no assumption that there is actually a single correct solution for categorization. The categorization may be modified and improved over time. Furthermore, it should be expected that actual implementations will merely start with categories in the data file and run with them, to provide whatever additional changes or refinements are needed in their particular domain.
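One way an implementation might layer its own refinements over the supplied categories is sketched below. This is an assumption about how such refinement could be organized, not something the note prescribes: the function `refine`, the dictionary layout, and the "Star" subcategory are all hypothetical, and the entry for U+2606 is a placeholder rather than a value taken from the data file.

```python
def refine(base, overrides):
    """Return a new mapping in which overrides replace or extend base entries."""
    refined = dict(base)       # leave the starting categories untouched
    refined.update(overrides)  # application-specific changes win
    return refined

# Starting point: code point -> tuple of category levels.
base = {
    0x2602: ("Symbol", "Weather"),
    0x2606: ("Symbol",),       # placeholder, not taken from the data file
}

# Hypothetical refinement made by one particular application.
overrides = {0x2606: ("Symbol", "Star")}

refined = refine(base, overrides)
```

Keeping the overrides separate from the base data makes it possible to pick up a revised data file later without redoing the local changes.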

These general principles are illustrated in part by the following examples, for several different major categories. For example, for letters:

Letter

Letter > Vowel

Letter > Vowel > Dependent  (i.e. Indic matras)

Letter > Consonant > Dependent > Subjoined

For symbols:

Symbol

Symbol > Graphic

Symbol > Technical

Symbol > Technical > Keyboard

Symbol > Arrow

Symbol > Arrow > Harpoon

Symbol > Arrow > Harpoon > Double

For punctuation marks:

Punctuation

Punctuation > Space

Punctuation > Quotation

Punctuation > Bracket

Punctuation > Bracket > CJK

Currently the categorization makes use of four levels of hierarchy, but this approach could easily be extended to five (or more), if finer levels of distinction for some groups of characters prove to be desirable. For example, arrows could be further subcategorized based on their shapes and orientations.
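Because no level has a fixed meaning and the depth varies by category, a nested mapping is one natural in-memory form for the hierarchy. The following is a minimal sketch under that assumption; `build_tree` is a hypothetical helper, and the paths are taken from the examples above.

```python
def build_tree(paths):
    """Nest category paths into a dict of dicts keyed by level name."""
    tree = {}
    for path in paths:
        node = tree
        for level in path:
            # Create the subcategory on first sight, then descend into it.
            node = node.setdefault(level, {})
    return tree

paths = [
    ("Symbol", "Arrow", "Harpoon", "Double"),
    ("Symbol", "Technical", "Keyboard"),
    ("Punctuation", "Bracket", "CJK"),
    ("Letter",),
]
tree = build_tree(paths)
```

Nothing in this structure limits the depth to four, so a fifth level would need no change to the code.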

2.2 Names for Categories

Each level of hierarchical categorization is given a conventional name, such as "Letter" or "Symbol" for the highest level, or "Game", "Technical", "Weather", "Astrological", and so forth, for various sub-levels. As far as possible, such names are drawn from actual practice in the Unicode Standard and from the way the UTC refers to various groups of characters.

There are no "empty" intermediate levels. Thus, for instance, if a name is given in the data file for a fourth-level subcategorization for a particular character, there will also always be explicit names given at the first, second, and third level of categories for that character.
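The rule about intermediate levels is easy to check mechanically. The sketch below assumes that an unspecified level is represented as an empty string; `has_no_gaps` is a hypothetical name.

```python
def has_no_gaps(levels):
    """Return True if no named level follows an empty one."""
    seen_empty = False
    for level in levels:
        if not level:
            seen_empty = True
        elif seen_empty:
            return False  # a name appears below an empty level
    return True
```

For example, `has_no_gaps(["Symbol", "Technical", "Keyboard", ""])` holds, while `has_no_gaps(["Symbol", "", "Keyboard", ""])` does not.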

2.3 Display Labels for Categories

Because of the way the hierarchical categorization works, and the way in which names are chosen for the subcategories, it is always possible to create unique identifiers for each terminal subcategory in the hierarchy, simply by concatenating the level names together. Thus, for example, one could have identifiers such as "Letter_Consonant_Dependent_Subjoined" or "Symbol_Technical_Keyboard". However, while unique, such identifiers are not particularly felicitous as display labels for subcategories.
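The concatenation can be sketched in a line or two. The function name and the choice of an underscore separator follow the identifiers quoted above; treating empty strings as unspecified levels is an assumption.

```python
def category_identifier(levels, sep="_"):
    """Join the named levels of a category path into a single identifier."""
    return sep.join(level for level in levels if level)

ident = category_identifier(["Letter", "Consonant", "Dependent", "Subjoined"])
```

Level names that contain a space would carry it into the identifier; how to handle that is left to the implementer.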

Implementers can, of course, apply whatever display labels make sense for their particular context. Some principles which might serve to make usable display labels include:

Although principles such as these are generally good practice, some of the categorial distinctions between Unicode characters are rather technical in nature. Also, there are many characters in the Unicode Standard for writing systems which are mostly unfamiliar to the English-speaking world. In such cases, it is occasionally unavoidable that technical terminology would end up being used in any comprehensive list of display labels.

2.4 Informative Status of the Categories

The categories defined in the data file for this technical note are informative only, have no status in the Unicode Standard itself, and may be changed or augmented in the future. This distinguishes them from the General_Category character property of the Unicode Standard, which is normative and rather constrained by stability guarantees in how it can be changed.

2.5 Categorical Distinctions not Addressed by this Note

There are many possible ways to categorize Unicode characters. The approach taken in this note and the accompanying data file is only one among the many possibilities. To avoid misconceptions about the intent of this categorization, this section lists a few of the kinds of distinctions which are not addressed by the data.

The data does not attempt a shape-based classification of glyphs for Unicode characters. There are no categories which identify all of the "dots", the "circles", the "squares", the "crosses", or any other such shape-based categories.

The data does not attempt a phonological-based classification of the usage of Unicode characters. There is no attempt to identify, across all scripts, classes such as "consonant", "vowel", and so forth. The few instances where "consonant" and "vowel" appear in categories relate to subgrouping of characters in abugida-type scripts, where the classification actually is built into the structure of the script, and where the distinctions between consonants (with inherent vowel), independent vowels, and dependent vowels (matras) are relevant to the encoding decisions regarding the repertoire of characters for the script. In my opinion, for most scripts, even attempting a phonological classification at the character level is hopeless, because phonological status is not inherent to the characters per se, but rather results from the usage of a character (or sequence of characters) to represent data in particular orthographies for particular languages.

The data does not attempt a historical status-based classification of Unicode characters. While there are occasional indications that some subgroup of characters is "historic", derived either from existing subheaders in the Unicode names list or general information provided in character proposals, there is no attempt to comb through thousands of individual characters and for each of them determine which is in current usage and which is obsolete or historic.

The data does not attempt a commonality-based classification of Unicode characters. There are no subcategories such as "common" or "rare" in the data.

The data should not be confused with the kind of classification of characters used in determining collation weighting of Unicode characters. Issues of whether particular characters make secondary weight distinctions, or whether particular compatibility characters are treated as tertiary weight variants of other base characters, and so forth, have no place in the categories data. For such concerns, see [UTS10].

3 Data File

The basic categories data is available in a data file [Data] called Categories.txt. That data file contains a listing of all Unicode characters other than CJK unified ideographs and Hangul syllables, giving informative category values at up to four levels of hierarchical assignment.

The data is formatted in tab-delimited fields, suitable for spreadsheet import. Once in a spreadsheet, the data can easily be further manipulated to whatever end an implementer needs.

The fields, along with a sample of particular category values, are shown below.

Code GC Level1    Level2       Level3      Level4       Name

23CE So Symbol    Technical    Keyboard                 RETURN SYMBOL
...
2460 No Symbol    Number       Circled                  CIRCLED DIGIT ONE
...
25CB So Symbol    Geometric                             WHITE CIRCLE
...
2602 So Symbol    Weather                               UMBRELLA
...
260A So Symbol    Astrological                          ASCENDING NODE
...
2660 So Symbol    Game         Playing card             BLACK SPADE SUIT
...
266D So Symbol    Music        Western      Accidental  MUSIC FLAT SIGN
...
2FBD So Ideograph Radical      CJK          Kangxi      KANGXI RADICAL HAIR
...
A869 Lo Letter    Consonant                             PHAGS-PA LETTER TTA
...
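For implementers who would rather read the file programmatically than through a spreadsheet, a minimal parsing sketch follows. It assumes seven tab-separated fields in the order shown in the sample above, with unspecified levels left as empty fields; the actual layout of Categories.txt (header or comment lines, for instance) should be checked before relying on it. `CategoryRow` and `parse_categories` are hypothetical names.

```python
import csv
import io
from typing import NamedTuple

class CategoryRow(NamedTuple):
    code: int      # code point as an integer
    gc: str        # General_Category value
    levels: tuple  # named category levels, outermost first
    name: str      # Unicode character name

def parse_categories(lines):
    """Yield CategoryRow records from tab-delimited lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 7 or fields[0] == "Code":
            continue  # skip blank, malformed, or header lines
        code, gc, *levels, name = fields
        named = tuple(level for level in levels if level)
        yield CategoryRow(int(code, 16), gc, named, name)

# Two rows from the sample above, written out with explicit tabs.
sample = io.StringIO(
    "23CE\tSo\tSymbol\tTechnical\tKeyboard\t\tRETURN SYMBOL\n"
    "266D\tSo\tSymbol\tMusic\tWestern\tAccidental\tMUSIC FLAT SIGN\n"
)
rows = list(parse_categories(sample))
```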

3.1 Maintenance of the Data File

The approach taken to maintaining this hierarchical typology reuses technology which is currently designed for maintenance of the Unicode names list. In particular, category assignments are treated as annotations over ranges of characters. The annotation file can then be maintained completely independently of the detailed, character-by-character listing files that are part of the UCD—most importantly, UnicodeData.txt. In this way, the annotation information (and the associated development and refinement of categorial assignments) can be version-agnostic, and is not required to be updated in lockstep with each version of the Unicode Standard.

The program that is used to maintain annotations for the Unicode names list has been modified slightly and then used for an automated merger of the categorial annotations file with particular versions of the UnicodeData.txt file. The output is a structured data file containing categorial information about all Unicode characters, with an explicit listing for each separate character, including its code point and Unicode character name.
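The general shape of such a merge can be sketched as follows. This is not the actual maintenance program, whose annotation file format is not described in this note. The in-memory `ANNOTATIONS` list, its ranges, and the function names are hypothetical and purely illustrative; the only external fact relied on is that UnicodeData.txt lines are semicolon-delimited with the code point, name, and General_Category in the first three fields.

```python
import bisect

# Hypothetical annotations: (first, last, category levels), sorted by
# first code point and non-overlapping. Illustrative only.
ANNOTATIONS = [
    (0x2602, 0x2602, ("Symbol", "Weather")),
    (0x2660, 0x2667, ("Symbol", "Game", "Playing card")),
]
_STARTS = [first for first, _, _ in ANNOTATIONS]

def lookup(code):
    """Return the levels annotated for a code point, or () if none."""
    i = bisect.bisect_right(_STARTS, code) - 1
    if i >= 0 and code <= ANNOTATIONS[i][1]:
        return ANNOTATIONS[i][2]
    return ()

def merge(unicode_data_lines):
    """Yield (code, gc, levels, name) for each UnicodeData.txt-style line."""
    for line in unicode_data_lines:
        fields = line.rstrip("\n").split(";")
        code = int(fields[0], 16)
        yield code, fields[2], lookup(code), fields[1]

sample = [
    "2602;UMBRELLA;So;0;ON;;;;;N;;;;;",
    "2660;BLACK SPADE SUIT;So;0;ON;;;;;N;;;;;",
]
merged = list(merge(sample))
```

Because the annotations are keyed by range rather than by character, the same annotation data can be merged against different versions of UnicodeData.txt.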

The merge process omits CJK unified ideographs and Hangul syllables. Categorial information about CJK unified ideographs is better handled by other means, and in particular by the Unihan database. The 11,172 Hangul syllables do not have useful categorial distinctions in ways relevant to other Unicode characters, so including all of them explicitly as part of a category listing would simply be redundant.
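In UnicodeData.txt, these large sets are not listed character by character but as pairs of range-marker lines whose names look like `<CJK Ideograph, First>` or `<Hangul Syllable, Last>`. A merge step could therefore skip them with a simple name test, as in the sketch below. The function name is hypothetical, and the actual program may identify these ranges differently.

```python
# Hangul syllables occupy U+AC00..U+D7A3.
HANGUL_SYLLABLES = range(0xAC00, 0xD7A4)

def is_omitted_range_marker(name):
    """True for UnicodeData.txt range markers of the omitted sets."""
    return name.startswith(("<CJK Ideograph", "<Hangul Syllable"))

hangul_count = len(HANGUL_SYLLABLES)
```

The length of the Hangul range matches the count of 11,172 syllables cited above.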

References

[Charts] The online code charts can be found at http://www.unicode.org/charts/
An index to character names with links to the corresponding chart is found at: http://www.unicode.org/charts/charindex.html
[Data] Unicode character categories (Unicode 6.1 repertoire), for spreadsheet import:
http://www.unicode.org/notes/tn36/Categories.txt
For earlier versions of the data file see prior versions of this note.
[Errata] Updates and errata to the Unicode Standard, as well as other technical standards developed by the Unicode Consortium can be found at http://www.unicode.org/errata/
[Feedback] Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html
[FAQ] Unicode Frequently Asked Questions
http://www.unicode.org/faq/
For answers to common questions on technical issues.
[Glossary] Unicode Glossary
http://www.unicode.org/glossary/
For explanations of terminology used in this and other documents.
[Reports] Unicode Technical Reports
http://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[Stability] Unicode Character Encoding Stability Policy http://www.unicode.org/policies/stability_policy.html
[UCD] Unicode Character Database, http://www.unicode.org/ucd/
For an overview of the Unicode Character Database and a list of its associated files
[Unicode] The Unicode Standard
For the latest version see:
http://www.unicode.org/versions/latest/
[UTC] The Unicode Technical Committee, see http://www.unicode.org/consortium/utc.html for more information on procedures.
[UTR23] Unicode Technical Report #23: The Unicode Character Property Model, http://www.unicode.org/reports/tr23/
[UTS10] Unicode Technical Standard #10: The Unicode Collation Algorithm, http://www.unicode.org/reports/tr10/
[Versions] Versions of the Unicode Standard, http://www.unicode.org/standard/versions/
For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.

Acknowledgements

The author wishes to acknowledge the following people for their comments and suggestions regarding early drafts of this document, and for review of the data file: Mark Davis, Asmus Freytag, Philippe Verdy, Andrew West, A.R. Amaithi Anantham, Behdad Esfahbod, Behnam Esfahbod, Bob Hallissy, Dr. P.R. Nakkeeran, Doug Ewell, Martin Hosken. Julie Allen helped with general editorial review of the contents of the note.

Modifications

Revision 1 [KW]