Language information and text direction

8 Language information and text direction

Contents

  1. Specifying the language of content: the lang attribute
    1. Language codes
    2. Inheritance of language codes
    3. Interpretation of language codes
  2. Specifying the direction of text and tables: the dir attribute
    1. Introduction to the bidirectional algorithm
    2. Inheritance of text direction information
    3. Setting the direction of embedded text
    4. Overriding the bidirectional algorithm: the BDO element
    5. Character references for directionality and joining control
    6. The effect of style sheets on bidirectionality

This section of the document discusses two important issues that affect the internationalization of HTML: specifying the language (the lang attribute) and direction (the dir attribute) of text in a document.

8.1 Specifying the language of content: the lang attribute

Attribute definitions
lang = language-code [CI]
This attribute specifies the base language of an element's attribute values and text content. The default value of this attribute is unknown.

Language information specified via the lang attribute may be used by a user agent to control rendering in a variety of ways. Some situations where author-supplied language information may be helpful include:

The lang attribute specifies the language of element content and attribute values; whether it is relevant for a given attribute depends on the syntax and semantics of the attribute and the operation involved.

The intent of the lang attribute is to allow user agents to render content more meaningfully based on accepted cultural practice for a given language. This does not imply that user agents should render characters that are atypical for a particular language in less meaningful ways; user agents must make a best attempt to render all characters, regardless of the value specified by lang.

For instance, if characters from the Greek alphabet appear in the midst of English text:

<P><Q lang="en">Her super-powers were the result of
&gamma;-radiation,</Q> he explained.</P>

a user agent (1) should try to render the English content in an appropriate manner (e.g., in its handling the quotation marks) and (2) must make a best attempt to render γ even though it is not an English character.

Please consult the section on undisplayable characters for related information.

8.1.1 Language codes

The lang attribute's value is a language code that identifies a natural language spoken, written, or otherwise used for the communication of information among people. Computer languages are explicitly excluded from language codes.

[RFC1766] defines and explains the language codes that must be used in HTML documents.

Briefly, language codes consist of a primary code and a possibly empty series of subcodes:

        language-code = primary-code ( "-" subcode )*

Here are some sample language codes:

Two-letter primary codes are reserved for [ISO639] language abbreviations. Two-letter codes include fr (French), de (German), it (Italian), nl (Dutch), el (Greek), es (Spanish), pt (Portuguese), ar (Arabic), he (Hebrew), ru (Russian), zh (Chinese), ja (Japanese), hi (Hindi), ur (Urdu), and sa (Sanskrit).

Any two-letter subcode is understood to be a [ISO3166] country code.

8.1.2 Inheritance of language codes

An element inherits language code information according to the following order of precedence (highest to lowest):

In this example, the primary language of the document is French ("fr"). One paragraph is declared to be in Spanish ("es"), after which the primary language returns to French. The following paragraph includes an embedded Japanese ("ja") phrase, after which the primary language returns to French.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<HTML lang="fr">
<HEAD>
<TITLE>Un document multilingue</TITLE>
</HEAD>
<BODY>
...Interpreted as French...
<P lang="es">...Interpreted as Spanish...
<P>...Interpreted as French again...
<P>...French text interrupted by<EM lang="ja">some
         Japanese</EM>French begins here again...
</BODY>
</HTML>
Note. Table cells may inherit lang values not from its parent but from the first cell in a span. Please consult the section on alignment inheritance for details.

8.1.3 Interpretation of language codes

In the context of HTML, a language code should be interpreted by user agents as a hierarchy of tokens rather than a single token. When a user agent adjusts rendering according to language information (say, by comparing style sheet language codes and lang values), it should always favor an exact match, but should also consider matching primary codes to be sufficient. Thus, if the lang attribute value of "en-US" is set for the HTML element, a user agent should prefer style information that matches "en-US" first, then the more general value "en".

Note. Language code hierarchies do not guarantee that all languages with a common prefix will be understood by those fluent in one or more of those languages. They do allow a user to request this commonality when it is true for that user.

8.2 Specifying the direction of text and tables: the dir attribute

Attribute definitions

dir = LTR | RTL [CI]
This attribute specifies the base direction of directionally neutral text (i.e., text that doesn't have inherent directionality as defined in [UNICODE]) in an element's content and attribute values. It also specifies the directionality of tables. Possible values:
  • LTR: Left-to-right text or table.
  • RTL: Right-to-left text or table.

In addition to specifying the language of a document with the lang attribute, authors may need to specify the base directionality (left-to-right or right-to-left) of portions of a document's text, of table structure, etc. This is done with the dir attribute.

The [UNICODE] specification assigns directionality to characters and defines a (complex) algorithm for determining the proper directionality of text. If a document does not contain a displayable right-to-left character, a conforming user agent is not required to apply the [UNICODE] bidirectional algorithm. If a document contains right-to-left characters, and if the user agent displays these characters, the user agent must use the bidirectional algorithm.

Although Unicode specifies special characters that deal with text direction, HTML offers higher-level markup constructs that do the same thing: the dir attribute (do not confuse with the DIR element) and the BDO element. Thus, to express a Hebrew quotation, it is more intuitive to write

<Q lang="he" dir="rtl">...a Hebrew quotation...</Q>

than the equivalent with Unicode references:

&#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C;

User agents must not use the lang attribute to determine text directionality.

The dir attribute is inherited and may be overridden. Please consult the section on the inheritance of text direction information for details.

8.2.1 Introduction to the bidirectional algorithm

The following example illustrates the expected behavior of the bidirectional algorithm. It involves English, a left-to-right script, and Hebrew, a right-to-left script.

Consider the following example text:

  english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

The characters in this example (and in all related examples) are stored in the computer the way they are displayed here: the first character in the file is "e", the second is "n", and the last is "6".

Suppose the predominant language of the document containing this paragraph is English. This means that the base direction is left-to-right. The correct presentation of this line would be:

english1 2WERBEH english3 4WERBEH english5 6WERBEH
         <------          <------          <------
            H                H                H
------------------------------------------------->
                       E

The dotted lines indicate the structure of the sentence: English predominates and some Hebrew text is embedded. Achieving the correct presentation requires no additional markup since the Hebrew fragments are reversed correctly by user agents applying the bidirectional algorithm.

If, on the other hand, the predominant language of the document is Hebrew, the base direction is right-to-left. The correct presentation is therefore:

6WERBEH english5 4WERBEH english3 2WERBEH english1
        ------->         ------->         ------->
            E                E                E
<-------------------------------------------------
                       H

In this case, the whole sentence has been presented as right-to-left and the embedded English sequences have been properly reversed by the bidirectional algorithm.

8.2.2 Inheritance of text direction information

The Unicode bidirectional algorithm requires a base text direction for text blocks. To specify the base direction of a block-level element, set the element's dir attribute. The default value of the dir attribute is "ltr" (left-to-right text).

When the dir attribute is set for a block-level element, it remains in effect for the duration of the element and any nested block-level elements. Setting the dir attribute on a nested element overrides the inherited value.

To set the base text direction for an entire document, set the dir attribute on the HTML element.

For example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<HTML dir="RTL">
<HEAD>
<TITLE>...a right-to-left title...</TITLE>
</HEAD>
...right-to-left text...
<P dir="ltr">...left-to-right text...</P>
<P>...right-to-left text again...</P>
</HTML>

Inline elements, on the other hand, do not inherit the dir attribute. This means that an inline element without a dir attribute does not open an additional level of embedding with respect to the bidirectional algorithm. (Here, an element is considered to be block-level or inline based on its default presentation. Note that the INS and DEL elements can be block-level or inline depending on their context.)

8.2.3 Setting the direction of embedded text

The [UNICODE] bidirectional algorithm automatically reverses embedded character sequences according to their inherent directionality (as illustrated by the previous examples). However, in general only one level of embedding can be accounted for. To achieve additional levels of embedded direction changes, you must make use of the dir attribute on an inline element.

Consider the same example text as before:

english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

Suppose the predominant language of the document containing this paragraph is English. Furthermore, the above English sentence contains a Hebrew section extending from HEBREW2 through HEBREW4 and the Hebrew section contains an English quotation (english3). The desired presentation of the text is thus:

english1 4WERBEH english3 2WERBEH english5 6WERBEH
                 ------->
                    E
         <-----------------------
                    H
------------------------------------------------->
                    E

To achieve two embedded direction changes, we must supply additional information, which we do by delimiting the second embedding explicitly. In this example, we use the SPAN element and the dir attribute to mark up the text:

english1 <SPAN dir="RTL">HEBREW2 english3 HEBREW4</SPAN> english5 HEBREW6

Authors may also use special Unicode characters to achieve multiple embedded direction changes. To achieve left-to-right embedding, surround embedded text with the characters LEFT-TO-RIGHT EMBEDDING ("LRE", hexadecimal 202A) and POP DIRECTIONAL FORMATTING ("PDF", hexadecimal 202C). To achieve right-to-left embedding, surround embedded text with the characters RIGHT-TO-LEFT EMBEDDING ("RTE", hexadecimal 202B) and PDF.

Using HTML directionality markup with Unicode characters. Authors and designers of authoring software should be aware that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [UNICODE] formatting characters. Preferably one or the other should be used exclusively. The markup method offers a better guarantee of document structural integrity and alleviates some problems when editing bidirectional HTML text with a simple text editor, but some software may be more apt at using the [UNICODE] characters. If both methods are used, great care should be exercised to insure proper nesting of markup and directional embedding or override, otherwise, rendering results are undefined.

8.2.4 Overriding the bidirectional algorithm: the BDO element

<!ELEMENT BDO - - (%inline;)*          -- I18N BiDi over-ride -->
<!ATTLIST BDO
  %coreattrs;                          -- id, class, style, title --
  lang        %LanguageCode; #IMPLIED  -- language code --
  dir         (ltr|rtl)      #REQUIRED -- directionality --
  >

Start tag: required, End tag: required

Attribute definitions

dir = LTR | RTL [CI]
This mandatory attribute specifies the base direction of the element's text content. This direction overrides the inherent directionality of characters as defined in [UNICODE]. Possible values:
  • LTR: Left-to-right text.
  • RTL: Right-to-left text.

Attributes defined elsewhere

The bidirectional algorithm and the dir attribute generally suffice to manage embedded direction changes. However, some situations may arise when the bidirectional algorithm results in incorrect presentation. The BDO element allows authors to turn off the bidirectional algorithm for selected fragments of text.

Consider a document containing the same text as before:

english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

but assume that this text has already been put in visual order. One reason for this may be that the MIME standard ([RFC2045], [RFC1556]) favors visual order, i.e., that right-to-left character sequences are inserted right-to-left in the byte stream. In an email, the above might be formatted, including line breaks, as:

english1 2WERBEH english3
4WERBEH english5 6WERBEH

This conflicts with the [UNICODE] bidirectional algorithm, because that algorithm would invert 2WERBEH, 4WERBEH, and 6WERBEH a second time, displaying the Hebrew words left-to-right instead of right-to-left.

The solution in this case is to override the bidirectional algorithm by putting the Email excerpt in a PRE element (to conserve line breaks) and each line in a BDO element, whose dir attribute is set to LTR:

<PRE>
<BDO dir="LTR">english1 2WERBEH english3</BDO>
<BDO dir="LTR">4WERBEH english5 6WERBEH</BDO>
</PRE>

This tells the bidirectional algorithm "Leave me left-to-right!" and would produce the desired presentation:

english1 2WERBEH english3
4WERBEH english5 6WERBEH

The BDO element should be used in scenarios where absolute control over sequence order is required (e.g., multi-language part numbers). The dir attribute is mandatory for this element.

Authors may also use special Unicode characters to override the bidirectional algorithm -- LEFT-TO-RIGHT OVERRIDE (202D) or RIGHT-TO-LEFT OVERRIDE (hexadecimal 202E). The POP DIRECTIONAL FORMATTING (hexadecimal 202C) character ends either bidirectional override.

Note. Recall that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [UNICODE] formatting characters.

Bidirectionality and character encoding According to [RFC1555] and [RFC1556], there are special conventions for the use of "charset" parameter values to indicate bidirectional treatment in MIME mail, in particular to distinguish between visual, implicit, and explicit directionality. The parameter value "ISO-8859-8" (for Hebrew) denotes visual encoding, "ISO-8859-8-i" denotes implicit bidirectionality, and "ISO-8859-8-e" denotes explicit directionality.

Because HTML uses the Unicode bidirectionality algorithm, conforming documents encoded using ISO 8859-8 must be labeled as "ISO-8859-8-i". Explicit directional control is also possible with HTML, but cannot be expressed with ISO 8859-8, so "ISO-8859-8-e" should not be used.

The value "ISO-8859-8" implies that the document is formatted visually, misusing some markup (such as TABLE with right alignment and no line wrapping) to ensure reasonable display on older user agents that do not handle bidirectionality. Such documents do not conform to the present specification. If necessary, they can be made to conform to the current specification (and at the same time will be displayed correctly on older user agents) by adding BDO markup where necessary. Contrary to what is said in [RFC1555] and [RFC1556], ISO-8859-6 (Arabic) is not visual ordering.

8.2.5 Character references for directionality and joining control

Since ambiguities sometimes arise as to the directionality of certain characters (e.g., punctuation), the [UNICODE] specification includes characters to enable their proper resolution. Also, Unicode includes some characters to control joining behavior where this is necessary (e.g., some situations with Arabic letters). HTML 4 includes character references for these characters.

The following DTD excerpt presents some of the directional entities:

   <!ENTITY zwnj CDATA "&#8204;"--=zero width non-joiner-->
   <!ENTITY zwj  CDATA "&#8205;"--=zero width joiner-->
   <!ENTITY lrm  CDATA "&#8206;"--=left-to-right mark-->
   <!ENTITY rlm  CDATA "&#8207;"--=right-to-left mark-->

The zwnj entity is used to block joining behavior in contexts where joining will occur but shouldn't. The zwj entity does the opposite; it forces joining when it wouldn't occur but should. For example, the Arabic letter "HEH" is used to abbreviate "Hijri", the name of the Islamic calendar system. Since the isolated form of "HEH" looks like the digit five as employed in Arabic script (based on Indic digits), in order to prevent confusing "HEH" as a final digit five in a year, the initial form of "HEH" is used. However, there is no following context (i.e., a joining letter) to which the "HEH" can join. The zwj character provides that context.

Similarly, in Persian texts, there are cases where a letter that normally would join a subsequent letter in a cursive connection should not. The character zwnj is used to block joining in such cases.

The other characters, lrm and rlm, are used to force directionality of directionally neutral characters. For example, if a double quotation mark comes between an Arabic (right-to-left) and a Latin (left-to-right) letter, the direction of the quotation mark is not clear (is it quoting the Arabic text or the Latin text?). The lrm and rlm characters have a directional property but no width and no word/line break property. Please consult [UNICODE] for more details.

Mirrored character glyphs. In general, the bidirectional algorithm does not mirror character glyphs but leaves them unaffected. An exception are characters such as parentheses (see [UNICODE], table 4-7). In cases where mirroring is desired, for example for Egyptian Hieroglyphs, Greek Bustrophedon, or special design effects, this should be controlled with styles.

8.2.6 The effect of style sheets on bidirectionality

In general, using style sheets to change an element's visual rendering from block-level to inline or vice-versa is straightforward. However, because the bidirectional algorithm relies on the inline/block-level distinction, special care must be taken during the transformation.

When an inline element that does not have a dir attribute is transformed to the style of a block-level element by a style sheet, it inherits the dir attribute from its closest parent block element to define the base direction of the block.

When a block element that does not have a dir attribute is transformed to the style of an inline element by a style sheet, the resulting presentation should be equivalent, in terms of bidirectional formatting, to the formatting obtained by explicitly adding a dir attribute (assigned the inherited value) to the transformed element.