Writing Direction and Bidirectional Text FAQ
Writing Direction
Q: What does "writing direction" refer to?
Individual writing systems make different default assumptions about how characters are arranged into lines and how lines of text are then arranged on a page or screen. Such assumptions are referred to as a writing system's directionality. For example, in writing systems based on the Latin script, characters are laid out horizontally from left to right to form lines, and lines of text are then laid out running from top to bottom on a page. Because the predominant direction of text flow for the Latin script is from left to right on the page, the Latin script is referred to as a "left-to-right script".
Q: Are there scripts that go from right to left?
Many scripts arrange characters from right to left into lines. For some historic Middle Eastern scripts, all of the characters in text flow from right to left, so those scripts are referred to as "right-to-left scripts". But modern writing systems such as Arabic and Hebrew are more complicated, because in addition to their basic letters, which flow from right to left, they also use other characters, such as digits, which may display the other way. They are also often mixed on the same page with left-to-right scripts such as Latin, so that text needs to run both ways on the same line. When text runs both ways on the same line, we refer to it as bidirectional (or bidi) text.
The W3C has published an overview of languages using right-to-left scripts. It lists 12 scripts and over 200 modern languages using RTL orthographies, including additional information, such as number of speakers.
Q: How does the Unicode Standard deal with writing that mixes directions?
Ordering characters into lines can be very complex when left-to-right and right-to-left scripts are used together. Proper display of Arabic, Hebrew, and similar scripts can require dealing with runs of text that have opposite directions on the same line. Also, the direction and location of punctuation characters is determined by the text that surrounds them, so the actual direction of a specific part of the line depends on context analysis. The Unicode Standard defines an algorithm to determine the layout of a line, including provision for overrides to handle situations that are ambiguous; see UAX #9, Unicode Bidirectional Algorithm for more information.
Q: When is text written vertically?
It is quite common to see text written in vertical lines in East Asia. This practice is still widespread in modern Japanese writing, for example, and used to be standard typography for China and countries influenced by Chinese culture.
When Japanese or Chinese are written vertically, the lines run from top to bottom, and then are arranged in columns that run from right to left on the page. Traditional Mongolian is also written vertically, but for Mongolian the columns run from left to right on the page.
Q: Is vertical text also handled by the Unicode Bidirectional Algorithm?
No. Vertical text in Japanese or Chinese runs in only a single direction, from top to bottom, so it does not require dealing with two opposite directions on the same line. Unlike the bidirectional case, the choice of vertical layout is usually treated just as a formatting style. The Unicode Standard does not provide directionality controls designed to override such behavior.
Q: How does vertical text influence the orientation of characters?
Most characters use the same shape and orientation whether displayed horizontally or vertically, but many punctuation characters change their shape when displayed vertically. Also, letters and words from other scripts are generally rotated by ninety degrees when mixed with vertical Japanese or Chinese writing, so that they, too, read from top to bottom. Letters from left-to-right scripts are rotated clockwise, while letters from right-to-left scripts are rotated counterclockwise.
Some individual letters and digits, as well as short combinations of them, may remain upright, instead of being rotated in vertical text. In some cases, there exist compatibility characters specifically intended to have this upright orientation in East Asian typography.
The Unicode Standard provides a property, Vertical_Orientation, which specifies which characters rotate in a vertical context, and which stay upright in orientation by default. See UAX #50, Unicode Vertical Text Layout for more information.
Q: Are there any other script directions?
Other script directionalities are possible and are found in actual writing systems, mainly in historical ones. For example, some ancient Numidian texts are written bottom-to-top, and Egyptian hieroglyphics can be written with various directions for individual lines.
One prominent example is boustrophedon (literally, "ox-turning"), which is often found in ancient European writing systems such as early Greek. In boustrophedon writing, characters are arranged into horizontal lines, but the individual lines alternate between running right to left and running left to right, the way an ox goes back and forth when plowing a field. The letters themselves use mirrored images in accordance with each individual line's direction. [JJ]
Q: Do developers need to worry about these historical directions?
Not really. Boustrophedon writing is of interest almost exclusively to scholars intent on reproducing the exact visual content of ancient texts. The Unicode Standard does not provide formatting codes to signal boustrophedon text. Specialized word processors for ancient scripts might offer support for this. In the absence of that, fixed texts can be written in boustrophedon by using hard line breaks and directionality overrides. [JJ]
Bidirectional Text
Q: What is the Unicode Bidirectional Algorithm (UBA)?
The Unicode Bidirectional Algorithm, often abbreviated to just "UBA", explains in detail how text should be laid out in lines whenever it consists of a mixture of left-to-right script characters and right-to-left script characters. The UBA is specified in UAX #9, Unicode Bidirectional Algorithm.
Q: How does the UBA tell which characters go left to right and which go right to left?
The UBA depends on a character property called Bidi_Class, which has property values defined for all Unicode characters. Note that in addition to left-to-right characters (for example, in the Latin script) and right-to-left characters (for example, in the Arabic script), there are also many characters with a neutral direction. Their behavior in bidirectional text layout depends on the details of their proximity to other characters of strong right-to-left or left-to-right direction. For example, most punctuation and symbol characters have neutral direction. For a complete listing of the Bidi_Class for all characters, see the data file DerivedBidiClass.txt in the Unicode Character Database.
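As a rough illustration, Python's standard `unicodedata` module exposes Bidi_Class through `unicodedata.bidirectional()`. The sketch below looks up a handful of characters. The sample string and the helper name `bidi_classes` are chosen here for illustration only, and the values returned reflect whichever version of the Unicode Character Database the Python build ships with.

```python
import unicodedata

def bidi_classes(text):
    """Return (character name, Bidi_Class) pairs for each character in text."""
    return [
        (unicodedata.name(ch, "<unnamed>"), unicodedata.bidirectional(ch))
        for ch in text
    ]

# Latin letter, Hebrew letter, Arabic letter, punctuation,
# European digit, Arabic-Indic digit, space
samples = "A\u05D0\u0627!1\u0661 "

for name, cls in bidi_classes(samples):
    print(f"{cls:>3}  {name}")
```

Strong classes (L, R, AL) fix a direction on their own, while neutral and weak classes such as ON, WS, EN, and AN are resolved from their context.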
Q: Do modern bidirectional scripts all behave the same?
Arabic and Hebrew agree on the ordering of digits within a number, with the most-significant digit on the left. However, the layout of entire numbers in context, including groups of numbers and the use of number separators and of numerical and other punctuation, differs both by script and, in the case of Arabic, by which set of digits is used. No matter how the layout is resolved, the order of characters in memory essentially follows the order in which they are typed.
Here are some resources that explore this in depth, with examples:
https://r12a.github.io/scripts/arabic/arb.html#expressions
https://r12a.github.io/scripts/arabic/block.html#ar061C
Q: Do all scripts have the most significant digit on the left?
Not all scripts written right to left display simple numbers the way Arabic and Hebrew do. For example, in Adlam, N'Ko, and various historical scripts, numbers have the most-significant digit on the right.
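One way to see this difference in the character data is to compare the Bidi_Class of digits across scripts. The following is a minimal sketch using Python's `unicodedata` module. The dictionary names are illustrative, and the Adlam lookup assumes a Python build whose Unicode data includes Adlam (Unicode 9.0 or later).

```python
import unicodedata

# One sample digit from each script; the keys are the formal character names.
digits = {
    "DIGIT ONE": "1",
    "ARABIC-INDIC DIGIT ONE": "\u0661",
    "NKO DIGIT ONE": "\u07C1",
    "ADLAM DIGIT ONE": "\U0001E951",
}

digit_classes = {name: unicodedata.bidirectional(ch) for name, ch in digits.items()}

for name, cls in digit_classes.items():
    print(f"{cls:>2}  {name}")
```

Digits with the classes EN (European Number) and AN (Arabic Number) keep a left-to-right order inside a number even in a right-to-left run, whereas the N'Ko and Adlam digits carry the strong class R and therefore flow with the surrounding right-to-left text.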
Q: Why does the Unicode Bidirectional Algorithm depend on giving default values for the Bidi_Class property to unassigned code points?
Default values are defined for unassigned code points for all character properties. While final property assignments are only selected at the time the character is encoded, the default values are chosen to make a change unlikely. As a result, systems supporting earlier versions of Unicode will very likely achieve the same display order, even if the text contains a character that was unassigned at the time. For a discussion of how this works and details about particular default values for the Bidi_Class property used in the Unicode Bidirectional Algorithm, see UAX #44, Unicode Character Database.
Q: Are there any issues with normalizing Arabic and/or Hebrew?
Yes, see the question "Isn't the canonical order for Arabic characters wrong?" for a clarification.
Q: Why do some Kannada characters have General_Category Mn but Bidi_Class L instead of NSM?
Ordinarily, nonspacing combining marks (General_Category=Mn) also get the Bidi_Class NSM. There are exceptions, however. For two Kannada vowels, U+0CBF KANNADA VOWEL SIGN I and U+0CC6 KANNADA VOWEL SIGN E, the Unicode Technical Committee made an explicit decision to give these combining marks the Bidi_Class L (Left-to-Right). This choice preserves canonical equivalence in bidirectional text formatting for those two-part Kannada vowels with either of these two vowels as part of their canonical decompositions.
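The property values described above can be inspected directly. This is a small sketch using Python's `unicodedata` module. The choice of U+0CC0 KANNADA VOWEL SIGN II as the two-part vowel to decompose is just one example, and the results depend on the Unicode data bundled with the Python build.

```python
import unicodedata

# The two Kannada vowel signs that are Mn but carry Bidi_Class L.
for ch in ("\u0CBF", "\u0CC6"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),       # General_Category
        unicodedata.bidirectional(ch),  # Bidi_Class
    )

# A two-part vowel whose canonical decomposition starts with U+0CBF.
decomposed = unicodedata.normalize("NFD", "\u0CC0")
print([f"U+{ord(c):04X}" for c in decomposed])
```

Because the decomposed sequence and the precomposed vowel both consist only of Bidi_Class L characters, they resolve identically in bidirectional layout.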
Q: I have some mixed Arabic and English text. It seems to display incorrectly on my browser! Why?
There are several possible reasons. You may simply not have an appropriate Arabic font on your device. But when Arabic and English text are mixed together on a single line, the exact way they are formatted depends on application of the Unicode Bidirectional Algorithm, and the correct display is not always immediately obvious. In particular, the overall paragraph direction can change how mixed Arabic and English text appears on a line.
Q: How does overall paragraph direction change the display of mixed text?
For example, suppose your text is: " ما هو الترميز الموحد يونيكود؟ in Arabic ". The table below shows this text in the bottom row in both a right-to-left (RTL) and left-to-right (LTR) paragraph direction.
As you understand the Unicode Bidirectional Algorithm, you might think this should be rendered as in the RTL column. For example, it might be what your application does, and an Arabic speaker may have confirmed to you that this is indeed correct. Your browser, however, displays this as in the LTR column. Your understanding of both the algorithm and of how to read bidirectional text implies that the example text is predominantly Arabic and should be a right-to-left (RTL) paragraph. Hence, you should start with the Arabic at the right hand side (reading towards the left) and then continue with the English text after that (reading towards the right).
Logical Order | Display Order, ◀ ◀ ◀ RTL | Display Order, LTR ▶ ▶ ▶ |
---|---|---|
WHAT IS UNICODE؟ in arabic | in arabic ؟EDOCINU SI TAHW | ؟EDOCINU SI TAHW in arabic |
| | ما هو الترميز الموحد يونيكود؟ in Arabic | ما هو الترميز الموحد يونيكود؟ in Arabic |
However, the rendering you actually get depends on the setting of the paragraph direction. The paragraph direction can be based on the first strongly directional character in the text. But as this is often an incorrect guess, one option is to override such a guess by making an explicit choice, whether by means of the document style or the user interface. Depending on the setting for paragraph direction, you would get either the RTL or the LTR display shown above.
So it is likely that your browser is not actually displaying incorrectly. Use of paragraph direction markup is illustrated in the last row of the table above, for which the table cell for the RTL column has an explicit dir="rtl" attribute set, while the table cell for the LTR column has an explicit dir="ltr" attribute set.
To better illustrate this, a schematic text is shown in the row just above the sample text. When viewing this page, your browser should lay out the pieces of the examples in the same way as the schematic. In the schematic text, uppercase letters stand for the Arabic and lowercase letters for the English, and the question mark is an Arabic question mark. The left column shows the schematic in logical order, which follows the order the text was typed. The right hand side shows how this sample would be laid out in RTL and LTR paragraph order, if uppercase characters behaved like Arabic. Because the schematic text only contains ASCII letters, the directional ordering has been simulated and is not affected by your browser.
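The "first strongly directional character" heuristic mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not a full implementation of the paragraph-level rules in UAX #9: the function name `first_strong_direction` and its `default` parameter are hypothetical, and the sketch does not skip characters enclosed in bidi isolates as the full algorithm does.

```python
import unicodedata

def first_strong_direction(text, default="ltr"):
    """Guess paragraph direction from the first strong character.

    Simplification: every character is examined, including any that sit
    inside bidi isolates, which the full algorithm would skip.
    """
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return default  # no strong character found

# Arabic letters first, then English: guessed as right-to-left.
print(first_strong_direction(" \u0645\u0627 \u0647\u0648 in Arabic"))
# English first, then Arabic letters: guessed as left-to-right.
print(first_strong_direction("in Arabic \u0645\u0627 \u0647\u0648"))
```

Since the guess depends only on which script happens to come first, an explicit setting such as dir="rtl" is the more dependable choice when the intended direction is known.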
Q: Can I see another paragraph direction example?
Here is a small Hebrew example. The paragraph direction is set to RTL by putting the dir="rtl" attribute on a "blockquote" element containing the text.
פרטים אודות הקונסורציום של יוניקוד (Unicode Consortium)
הקונסורציום של יוניקוד הוא ארגון ללא מטרת רווח שנוסד כדי לפתח, להרחיב ולקדם את השימוש בתקן יוניקוד, אשר מגדיר את ייצוג הטקסט במוצרי תוכנה ותקנים מודרניים.
Here it is again. In this case the paragraph direction is set instead to LTR.
פרטים אודות הקונסורציום של יוניקוד (Unicode Consortium)
הקונסורציום של יוניקוד הוא ארגון ללא מטרת רווח שנוסד כדי לפתח, להרחיב ולקדם את השימוש בתקן יוניקוד, אשר מגדיר את ייצוג הטקסט במוצרי תוכנה ותקנים מודרניים.
Explicit specification of the paragraph direction as either RTL or LTR by means of an attribute in the "blockquote" element is an example of application of a higher-level protocol. Both displays, if correctly handled by your browser, should be considered conformant with the Unicode Standard—it is not the case that one is correct and one is incorrect. The application of the higher-level protocol simply defines the directional context within which the UBA then determines the appropriate layout of the bidirectional text lines.
Q: How do I set up a web page for bidirectional text?
The W3C has published a number of tutorials and articles on how to set up a web page for bidirectional text and the recommended use of markup and CSS styling. There is also a set of guidelines for authors.
Q: What is a higher-level protocol?
The Unicode Standard defines a higher-level protocol as "any agreement on the interpretation of Unicode characters that extends beyond the scope of [the] standard." The Unicode Bidirectional Algorithm, in particular, allows some options to be set by higher-level protocols. See "Higher-Level Protocols" in UAX #9 for examples. Directional markup for HTML is just one example of a higher-level protocol which can be used with the UBA.
Q: Can a program itself constitute a higher-level protocol for bidirectional text?
A program, such as one implementing a terminal display window, is not generally considered a "protocol", per se. However, it is quite common for such programs to implicitly define an overall directional context for display, and that implicit definition of direction is itself an example of application of a higher-level protocol for the purposes of the UBA. For example, a terminal window may simply assume a left-to-right paragraph direction for display. That is the functional equivalent of explicitly setting dir="ltr" on an HTML form input element.
A terminal display window might also allow a choice between an overall left-to-right paragraph direction or a right-to-left paragraph direction. Such a choice is not required for conformance to the Unicode Standard, although, of course, it might be a useful option for end users. In any case, if a terminal display window can display bidirectional text as illustrated above for Hebrew correctly for a left-to-right directional context or for a right-to-left directional context or for both contexts, that suffices to consider it conformant to the Unicode Bidirectional Algorithm.
Q: How can I insert dynamic text into a bidi context without causing issues?
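If you want to insert a run of text as a unit while preserving its internal layout and not affecting the layout of surrounding text, use "bidi isolation". In plain text, this means surrounding your run with the formatting characters for bidi isolates: an initiator such as U+2068 FIRST STRONG ISOLATE (or U+2066 LEFT-TO-RIGHT ISOLATE or U+2067 RIGHT-TO-LEFT ISOLATE) before the run, and U+2069 POP DIRECTIONAL ISOLATE after it. In a markup language, use the facilities provided there, such as the bdi element in HTML.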
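For plain text, isolation might look like the following sketch. The helper name `isolate`, the variable `user_name`, and the message template are hypothetical. The code points used are U+2068 FIRST STRONG ISOLATE and U+2069 POP DIRECTIONAL ISOLATE.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE: direction taken from the run's own content
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: closes the isolate

def isolate(text):
    """Wrap text so it is laid out as a unit, independent of its surroundings."""
    return f"{FSI}{text}{PDI}"

# Hypothetical dynamic value consisting of Hebrew letters.
user_name = "\u05D0\u05D1\u05D2"
message = f"Welcome, {isolate(user_name)}! You have 3 new messages."
print(message)
```

Using the first-strong isolate means the caller does not need to know the direction of the inserted text in advance.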
Q: How can I override the default display order?
If you are working in a markup language, use the syntax for selecting paragraph direction or directional override. In plain text, you can either use a paired set of bidi-formatting characters to affect the embedding level or force a fixed direction for the enclosed text, or insert a single character, like a "Right To Left Mark". These single-character marks simply act like an invisible letter. While not visible in the output, their presence affects the resolution of directionality. For an example, see https://en.wikipedia.org/wiki/Right-to-left_mark.
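The plain-text options can be illustrated with a short Python sketch. It assumes the standard code points U+200E LEFT-TO-RIGHT MARK, U+200F RIGHT-TO-LEFT MARK, U+202E RIGHT-TO-LEFT OVERRIDE, and U+202C POP DIRECTIONAL FORMATTING. The helper name `force_rtl` is hypothetical.

```python
import unicodedata

LRM = "\u200E"  # LEFT-TO-RIGHT MARK
RLM = "\u200F"  # RIGHT-TO-LEFT MARK
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# The single-character marks are invisible format characters (Cf) that
# carry a strong Bidi_Class, which is why they act like invisible letters.
marks = {unicodedata.name(ch): unicodedata.bidirectional(ch) for ch in (LRM, RLM)}
print(marks)

def force_rtl(text):
    """Wrap text in a paired override so every character displays right to left."""
    return f"{RLO}{text}{PDF}"

print(repr(force_rtl("abc")))
```

Overrides discard the characters' own directionality entirely, so a mark or an isolate is usually the gentler tool when it is sufficient.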
Q: How does the bidi algorithm handle paired punctuation?
The direction of paired punctuation, such as parentheses and brackets, cannot be resolved in isolation: both parts of the pair need to have the same direction. The Unicode Character Database lists the paired punctuation recognized by the Unicode Bidirectional Algorithm (UBA) in the data file BidiBrackets.txt.
Q: What is Mirroring?
Many characters have an alternate form that is a mirror image around the vertical axis. For some of these, the direction of the shape should match the writing direction of the run. In an RTL run, the open parenthesis should not look like '(', but rather like ')'. In some cases, this can be achieved by character substitution on the fly. These cases are listed in the Unicode Character Database (in the data file BidiMirroring.txt) so that layout systems can use this information.
Note that arrows, although they exist in forward and backward forms, are not included in the set of characters that should be mirrored automatically. The reason is that arrows can be used not only to point to the start or end of text, but also to point to other elements on the page (for example, a picture that sits in the margin). It is impossible to predict which sense was intended.
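Whether a character is flagged for mirroring can be checked with `unicodedata.mirrored()` in Python, which reports the Bidi_Mirrored property. The wrapper name `is_mirrored` below is illustrative only, and the sketch reports just the flag, not which glyph to substitute.

```python
import unicodedata

def is_mirrored(ch):
    """True if the character has the Bidi_Mirrored property."""
    return bool(unicodedata.mirrored(ch))

# Parentheses, a comparison sign, a letter, and two arrows.
for ch in "()<A\u2192\u2190":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {is_mirrored(ch)}")
```

The parentheses and the less-than sign are reported as mirrored, while the letter and both arrows are not, matching the note above about arrows.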