Authoring HTML: Language declarations

Abstract

Specifying the language of content is useful for a wide range of applications, from linguistically-sensitive searching to applying language-specific display properties. In some cases the potential applications for language information are still waiting for implementations to catch up, whereas in others it is a necessity today. Adding markup for language information to content is something that can and should be done as content is first developed. If not, it will be much more difficult to take advantage of any future developments.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document provides advice on practical techniques related to the creation of content in HTML that is language aware.

This document was published by the Internationalization Working Group as a Working Group Note. If you wish to make comments regarding this document, please send them to www-international@w3.org (subscribe, archives). All comments are welcome.

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This document may be updated, replaced or obsoleted by other documents at any time. Therefore, quotes or references to specific information in the document should include the publication date of this version, 03 June 2014. It is inappropriate to cite this document as other than a Working Group Note, which is not an endorsed W3C Recommendation.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1. Introduction

1.1 Who should use this document?

All authors and producers of HTML and CSS.

This document provides guidance for developers of HTML that enables support for international deployment. Enabling international deployment is the responsibility of all content authors, not just localization groups or vendors, and is relevant for all content.

It is assumed that readers of this document are proficient in developing HTML pages – this document provides advice specifically related to internationalization.

1.2 How to use this document

Note
If you don't know much about using language in HTML, you may find it useful to familiarise yourself with the concepts introduced in the tutorial Working with language in HTML. That tutorial will help you understand the essential aspects of how to work with language information when authoring HTML and CSS.

This document lists a number of do's and don'ts, which we will refer to as techniques, related to authoring pages with language information. Each technique is followed by a 'more...' link which points to an article or page that gives further details and explanations. You can find additional information by following the links after each section, which point to sections of the technique index.

If a technique says 'consider', there are usually pros and cons involved in following the advice given, and you should follow the link to more detailed information to be sure you understand these. In some cases it may be that not all browsers support the features described. In other cases, it may be purely up to you to decide whether or not this is a good idea.

1.3 Why read this document?

Applications already exist that can use information about the natural language (i.e. the human, non-programmatic language) of content to deliver to users the most relevant information or styling. The more content is tagged, and tagged correctly, the more useful and pervasive such applications will become.

Language information is useful for things such as authoring tools, translation tools, accessibility, font selection, page rendering, search, and scripting.

These applications can't work, however, if the information about the language of the text is not available. Language information should therefore be specified for the page as a whole, and wherever language changes within the page.

In the future there will be other applications for language information, driven by developments in technology. For example, implementations of the CSS3 :first-letter pseudo-element will need language information to apply correct styling. However, we are currently faced with a circular problem. People who don't see the application of language information do not provide information about their content, and language-related applications are slow to be deployed until this information is widely available. This cycle can be broken by content authors taking steps now to declare language information. This is usually very easy to do, and carries no penalties.

2. Metadata vs. text-processing

2.1 The language of the intended audience

Metadata that describes the language of the intended audience is about the document as a whole. Such metadata may be used for searching, serving the right language version, classification, etc. Where the language changes within a document, information about the language of the intended audience is not specific enough to support text-processing tasks such as text-to-speech, styling, or automatic font assignment.

The language of the intended audience does not include every language used in a document. Many documents on the Web contain embedded fragments of content in different languages, even though the page is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but it is aimed at a German-speaking audience, not a Chinese one.

On the other hand, it is also possible to imagine a situation where a document contains the same or parallel content in more than one language. For example, a Web page may welcome Canadian readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two audience languages. This situation is not as common on the Web as in printed material since it is easy to link to separate pages on the Web for different audiences, but it does occur where there are multilingual communities. Another use case is a blog or a news page aimed at a multilingual community, where some articles on a page are in one language and some in another.

There are also pages where the navigational information, including the page title, is in one language but the real content of the page is in another. While this is not necessarily good practice, it doesn't change the fact that the language of the intended audience is usually that of the content, regardless of the language at the top of the document source.

Metadata about the language of the intended audience is usually best declared outside the document in the HTTP Content-Language header.

2.2 The text-processing language

When specifying the text-processing language you are declaring the language in which a specific range of text is actually written, so that user agents or applications that manipulate the text, such as voice browsers, spell checkers, or style processors, can effectively handle the text in question. So we are, by necessity, talking about associating a single language with a specific range of text.

This specificity distinguishes the declaration of the language for text-processing from the language of the intended audience.

The language for text-processing is usually best declared using attributes on elements, including the html element, which contains all the content of the document. Enclosed elements inherit the declared value, but you can, of course, override an initial declaration by specifying a different language on embedded elements where the language changes, e.g. a French phrase in an English paragraph.
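As a minimal sketch of the inheritance and override just described (the page title and sample sentence are made up for illustration), the html element sets the default language and a span overrides it for a single phrase:

```html
<!DOCTYPE html>
<!-- Default text-processing language for the whole document. -->
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Travel notes</title>
</head>
<body>
  <!-- The paragraph inherits lang="en"; only the span is marked as French. -->
  <p>The waiter wished us <span lang="fr">bon appétit</span> and left.</p>
</body>
</html>
```

Everything outside the span is still treated as English, because the value declared on the html element is inherited by enclosed elements.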

2.3 Relationships between language, character encoding and directionality

There are separate mechanisms for declaring character encoding and directionality in HTML, and these ideas should not be confused with mechanisms for declaring language.

Character encoding refers to the sequences of bytes that are used to represent characters in text. It is important to declare which encoding is being used for your document, but this is a separate issue from declaring language. (To better understand character encoding declarations see Handling character encodings in HTML and CSS.)

Some people think that information about language can be inferred from the character encoding, but this is not true. There would have to be a one-to-one mapping between encoding and language for this to work, and there isn't. A single character encoding such as ISO 8859-1 (Latin1) could encode both French and English, as well as a great many other languages. In addition, different character encodings can be used for a single language, e.g. Arabic could be encoded with 'Windows-1256' or 'ISO 8859-6' or 'UTF-8'.

Nowadays, this argument should be moot anyway, because content authors should always use UTF-8 as the character encoding. Since UTF-8 covers all but the rarest language needs with a single encoding, there is normally no need to match language and encoding.

Text direction is another thing that should not be confused with language. In some scripts, such as Arabic and Hebrew, displayed text is read predominantly from right to left, although within that flow, numbers and text from other scripts are displayed from left to right. Markup is needed to set the overall right-to-left context, and in some circumstances markup is needed to correctly render bidirectional text, but this cannot necessarily be done using language markup. (To better understand text direction and markup see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.)

As with encodings and language, there is not always a one-to-one mapping between language and script, and therefore between language and directionality. For example, Azerbaijani can be written using both right-to-left and left-to-right scripts, and the language code az can be relevant for either. In addition, text direction markup used with inline text applies a range of different values to the text, whereas language is a simple switch that cannot express those distinctions.
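The following sketch shows the three mechanisms declared independently of one another. The sample phrases (the name of the Azerbaijani language in Latin and in Arabic script) and the page title are illustrative choices, not taken from this document:

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <!-- Character encoding: declared on its own and says nothing about language. -->
  <meta charset="utf-8">
  <title>Language, encoding and direction</title>
</head>
<body>
  <!-- Same language code, different scripts: direction is set separately with dir. -->
  <p lang="az" dir="ltr">Azərbaycan dili</p>
  <p lang="az" dir="rtl">آذربایجان دیلی</p>
</body>
</html>
```

Neither the charset declaration nor the lang value tells a browser which way the text runs; that information comes only from the dir attribute.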

3. Declaring the overall language of a page

Always declare the default language for text in the page using attributes on the html element, unless the document contains content aimed at speakers of more than one language. more...

Do not use the meta element with the http-equiv attribute set to Content-Language. more...

Use language attributes rather than HTTP to declare the default language for text processing. more...

Do not declare the default language of a document in the body element; use the html element instead. more...

Use the lang attribute for pages served as HTML, and the xml:lang attribute for pages served as XML. For XHTML 1.x and HTML5 polyglot documents, use both together. more...
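As a hedged illustration of the techniques in this section, the sketch below assumes a polyglot document whose text is French (the content itself is invented). The language is declared once, on the html element, with lang and xml:lang carrying the same value:

```html
<!DOCTYPE html>
<!-- Polyglot sketch: lang is read by HTML parsers, xml:lang by XML parsers. -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr" xml:lang="fr">
  <head>
    <meta charset="utf-8"/>
    <title>Bienvenue</title>
  </head>
  <body>
    <!-- No language attribute on body: the default comes from the html element. -->
    <p>Bonjour tout le monde.</p>
  </body>
</html>
```

For a page served only as HTML, the start tag would reduce to the lang attribute alone, for example an html element with lang="fr".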

4. Identifying in-document language changes

Use the lang and/or xml:lang attributes around text to indicate any changes in language. more...

Use the lang attribute for pages served as HTML, and the xml:lang attribute for pages served as XML. For XHTML 1.x and HTML5 polyglot documents, use both together. more...

If the text in attribute values and element content is in different languages, consider using a nested approach. more...
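One possible shape for the nested approach is sketched below. The file name index.es.html is a hypothetical target. The title text is English, so it sits on an outer element that keeps the surrounding language, while the Spanish link text gets its own declaration on the inner element:

```html
<!-- The outer span carries the English title attribute; the inner link is Spanish. -->
<p lang="en">
  This page is also available in
  <span title="Spanish"><a lang="es" href="index.es.html">Español</a></span>.
</p>
```

Putting both title="Spanish" and lang="es" on the same element would label the English attribute text as Spanish too, which is the situation the nesting avoids.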

5. Choosing language values

Use subtags as defined by BCP 47 for language attribute values. more...

Use the shortest possible language tag values. more...

Where possible, use the codes zh-Hans and zh-Hant to refer to Simplified and Traditional Chinese, respectively. more...

Use the subtag zxx when the text is known not to be in any language. more...

When the language is undetermined and you have to label it in XML, use xml:lang="" if the format you are using supports it; otherwise use xml:lang="und". more...
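A few attribute values that follow the techniques above are sketched here. The sample text, including the made-up part number, is purely illustrative:

```html
<!-- Shortest useful tag: a language subtag alone, with no region subtag. -->
<p lang="fr">Bonjour</p>

<!-- Script subtags distinguish Simplified from Traditional Chinese. -->
<p lang="zh-Hans">简体中文</p>
<p lang="zh-Hant">繁體中文</p>

<!-- zxx: content known not to be in any language, such as a part number. -->
<p lang="en">Part number: <code lang="zxx">X9KQ-77ZB</code></p>
```

A longer tag, such as one with a region subtag, is only worth using when the extra distinction actually matters for the content.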

6. Declaring metadata about the language(s) of the intended audience

Consider using a Content-Language HTTP header to declare metadata about the language(s) of the intended audience of a document. more...

Where a document contains content aimed at speakers of more than one language, use the HTTP Content-Language header with a comma-separated list of language tags. more...

7. Indicating the language of a link destination

When pointing to a resource in another language, consider the pros and cons before indicating the language of the target document. more...

If you want to indicate that the target document of an a element is in another language, consider the pros and cons before using hreflang with CSS. more...

Do not use flag icons to indicate languages. more...
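If, after weighing the pros and cons, you do decide to expose the language of a link target, one way to combine hreflang with CSS is sketched below. The file name guide.fr.html and the choice to append the language code in parentheses are assumptions made for this example:

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Guides</title>
  <style>
    /* Append a text marker (not a flag icon) after links that declare hreflang="fr". */
    a[hreflang="fr"]::after { content: " (fr)"; }
  </style>
</head>
<body>
  <p>Read the <a href="guide.fr.html" hreflang="fr">French version of the guide</a>.</p>
</body>
</html>
```

The marker is generated purely from the attribute, so it is only as reliable as the hreflang values in the markup.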

A. Revision Log

This version introduces the following changes:

B. Acknowledgements

Members of the Internationalization Working Group and former GEO Working Group have contributed their time and valuable comments to shaping these guidelines.