Document Status Update 2022-08-30: This version is outdated! For the latest version, please look at https://www.w3.org/International/techniques/authoring-html#language.
Copyright © 2007-2014 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
Specifying the language of content is useful for a wide range of applications, from linguistically-sensitive searching to applying language-specific display properties. In some cases the potential applications for language information are still waiting for implementations to catch up, whereas in others it is a necessity today. Adding markup for language information to content is something that can and should be done as content is first developed. If not, it will be much more difficult to take advantage of any future developments.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document provides advice on practical techniques related to the creation of content in HTML that is language aware. This document was published by the Internationalization Working Group as a Working Group Note. If you wish to make comments regarding this document, please send them to www-international@w3.org (subscribe, archives). All comments are welcome.
Publication as a Working Group Note does not imply endorsement by the W3C Membership. This document may be updated, replaced or obsoleted by other documents at any time. Therefore, quotes or references to specific information in the document should include the publication date of this version, 03 June 2014. It is inappropriate to cite this document as other than a Working Group Note, which is not an endorsed W3C Recommendation.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
All authors and producers of HTML and CSS.
This document provides guidance for developers of HTML that enables support for international deployment. Enabling international deployment is the responsibility of all content authors, not just localization groups or vendors, and is relevant for all content.
It is assumed that readers of this document are proficient in developing HTML pages – this document provides advice specifically related to internationalization.
This document lists a number of do's and don'ts, which we will refer to as techniques, related to authoring pages with language information. Each technique is followed by a 'more...' link which points to an article or page that gives further details and explanations. You can find additional information by following the links after each section, which point to sections of the technique index.
If a technique says 'consider', there are usually pros and cons involved in following the advice given, and you should follow the link to more detailed information to be sure you understand these. In some cases it may be that not all browsers support the features described. In other cases, it may be purely up to you to decide whether or not this is a good idea.
Applications already exist that can use information about the natural language (i.e. the human, non-programmatic language) of content to deliver to users the most relevant information or styling. The more content is tagged, and tagged correctly, the more useful and pervasive such applications will become.
Language information is useful for things such as authoring tools, translation tools, accessibility, font selection, page rendering, search, and scripting.
These applications can't work, however, if the information about the language of the text is not available. Language information should therefore be specified for the page as a whole, and wherever language changes within the page.
In the future there will be other applications for language information, driven by developments in technology. For example, implementations of the CSS3 :first-letter pseudo-element will need language information to apply correct styling. However, we are currently faced with a circular problem. People who don't see the application of language information do not provide information about their content, and language-related applications are slow to be deployed until this information is widely available. This cycle can be broken by content authors taking steps now to declare language information. This is usually very easy to do, and carries no penalties.
Metadata that describes the language of the intended audience is about the document as a whole. Such metadata may be used for searching, serving the right language version, classification, etc. Where there are language changes in a document, information about the language of the intended audience is not specific enough to support text-processing of the kind needed for text-to-speech, styling, automatic font assignment, etc.
The language of the intended audience does not include every language used in a document. Many documents on the Web contain embedded fragments of content in different languages, whereas the page is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but it is aimed at a German-speaking audience, not a Chinese one.
On the other hand, it is also possible to imagine a situation where a document contains the same or parallel content in more than one language. For example, a Web page may welcome Canadian readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two audience languages. This situation is not as common on the Web as in printed material since it is easy to link to separate pages on the Web for different audiences, but it does occur where there are multilingual communities. Another use case is a blog or a news page aimed at a multilingual community, where some articles on a page are in one language and some in another.
There are also pages where the navigational information, including the page title, is in one language but the real content of the page is in another. While this is not necessarily good practice, it doesn't change the fact that the language of the intended audience is usually that of the content, regardless of the language at the top of the document source.
Metadata about the language of the intended audience is usually best declared outside the document in the HTTP Content-Language header.
When specifying the text-processing language you are declaring the language in which a specific range of text is actually written, so that user agents or applications that manipulate the text, such as voice browsers, spell checkers, or style processors can effectively handle the text in question. So we are, by necessity, talking about associating a single language with a specific range of text.
This specificity distinguishes the declaration of the language for text-processing from the language of the intended audience.
The language for text-processing is usually best declared using attributes on elements, including the html element, which contains all the content of the document. Enclosed elements inherit the declared value, but you can, of course, override an initial declaration by specifying a different language on embedded elements where the language changes, e.g. a French phrase in an English paragraph.
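As a minimal sketch of the inheritance just described, the following page declares English as the default on the html element and overrides it for a single French phrase. The title and sample sentence are illustrative and not taken from this document.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Illustrative example</title>
</head>
<body>
  <!-- This paragraph inherits lang="en" from the html element -->
  <p>The French for "thank you very much" is
    <!-- The span overrides the inherited value for this phrase only -->
    <span lang="fr">merci beaucoup</span>.</p>
</body>
</html>
```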
There are separate mechanisms for declaring character encoding and directionality in HTML, and these ideas should not be confused with mechanisms for declaring language.
Character encoding refers to the sequences of bytes that are used to represent characters in text. It is important to declare which encoding is being used for your document, but this is a separate issue from declaring language. (To better understand character encoding declarations see Handling character encodings in HTML and CSS.)
Some people think that information about language can be inferred from the character encoding, but this is not true. There would have to be a one-to-one mapping between encoding and language for this to work, and there isn't. A single character encoding, such as ISO 8859-1 (Latin1), could encode both French and English, as well as a great many other languages. In addition, different character encodings can be used for a single language, e.g. Arabic could be encoded with 'Windows-1256' or 'ISO 8859-6' or 'UTF-8'.
Nowadays, this argument should be moot anyway, because content authors should always use UTF-8 as the character encoding. Since UTF-8 covers all but the rarest language needs with a single encoding, there is normally no need to match language and encoding.
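To illustrate that the two declarations are independent, here is a small sketch in which the encoding and the language are each declared with their own mechanism. The French sample text is an assumption chosen for illustration.

```html
<!DOCTYPE html>
<!-- Language declaration: the text of the page is French -->
<html lang="fr">
<head>
  <!-- Encoding declaration: how bytes map to characters; it says nothing about language -->
  <meta charset="utf-8">
  <title>Exemple</title>
</head>
<body>
  <p>Bonjour le monde.</p>
</body>
</html>
```

Changing lang to another value would not require any change to the charset declaration, and vice versa.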
Text direction is another thing that should not be confused with language. In some scripts, such as Arabic and Hebrew, displayed text is read predominantly from right to left, although within that flow, numbers and text from other scripts are displayed from left to right. Markup is needed to set the overall right-to-left context, and in some circumstances markup is needed to correctly render bidirectional text, but this cannot necessarily be done using language markup. (To better understand text direction and markup see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.)
As with encodings and language, there is not always a one-to-one mapping between language and script, and therefore directionality. For example, Azerbaijani can be written using both right-to-left and left-to-right scripts, and the language code az can be relevant for either. In addition, text direction markup used with inline text applies a range of different values to the text, whereas language is a simple switch that is not up to the tasks required.
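The Azerbaijani case can be sketched as follows: both paragraphs carry the same language code, and the base direction is set separately with the dir attribute. The sample words (the name of the country in each script) are an illustrative assumption.

```html
<!-- Same language code for both; direction is declared with dir, not inferred from lang -->
<p lang="az" dir="rtl">آذربایجان</p>
<p lang="az" dir="ltr">Azərbaycan</p>
```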
Always declare the default language for text in the page using attributes on the html tag, unless the document contains content aimed at speakers of more than one language. more...
Do NOT use the meta element with the http-equiv attribute set to Content-Language. more...
Use language attributes rather than HTTP to declare the default language for text processing. more...
Do not declare the default language of a document in the body element; use the html element instead. more...
Use the lang attribute for pages served as HTML, and the xml:lang attribute for pages served as XML. For XHTML 1.x and HTML5 polyglot documents, use both together. more...
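The page-level techniques above might look like this in practice. The three start tags are alternatives for different serving scenarios, not parts of one document, and the language value en is only an example.

```html
<!-- Page served as HTML: lang on the html element (not on body) -->
<html lang="en">

<!-- XHTML 1.x or HTML5 polyglot document: both attributes, with the same value -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">

<!-- Page served as XML: xml:lang -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
```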
Use the lang and/or xml:lang attributes around text to indicate any changes in language. more...
Use the lang attribute for pages served as HTML, and the xml:lang attribute for pages served as XML. For XHTML 1.x and HTML5 polyglot documents, use both together. more...
If the text in attribute values and element content is in different languages, consider using a nested approach. more...
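One way to read the nested approach is sketched below: an outer element carries the language of the attribute text, and an inner element carries the language of the visible content. The link target and title text are hypothetical.

```html
<!-- Outer span: the title attribute text is English.
     Inner link: the visible text is Spanish. -->
<span lang="en" title="Spanish version"><a lang="es" href="/es/">Español</a></span>
```

Because a lang attribute applies to both an element's content and its attribute values, a single element cannot label the two differently; nesting gives each its own declaration.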
Use subtags as defined by BCP 47 for language attribute values. more...
Use the shortest possible language tag values. more...
Where possible, use the codes zh-Hans and zh-Hant to refer to Simplified and Traditional Chinese, respectively. more...
Use the subtag zxx when the text is known to be not in any language. more...
If using XML, and the format you are using supports it, use xml:lang="", otherwise use xml:lang="und" when the language is undetermined and you have to label it. more...
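A few of these tag values are sketched below in HTML. The sample text and the part number are illustrative assumptions; the XML-only xml:lang cases are not shown.

```html
<!-- Script subtags distinguish Simplified and Traditional Chinese -->
<p lang="zh-Hans">简体中文</p>
<p lang="zh-Hant">繁體中文</p>

<!-- zxx: content known not to be in any language (a hypothetical part number) -->
<p>Part number: <code lang="zxx">XK-2291-7</code></p>

<!-- Shortest possible tag: "fr" rather than "fr-FR" when the region adds nothing -->
<p lang="fr">Bonjour</p>
```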
Consider using a Content-Language HTTP header to declare metadata about the language(s) of the intended audience of a document. more...
Where a document contains content aimed at speakers of more than one language, use the HTTP Content-Language header with a comma-separated list of language tags. more...
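For a bilingual page such as the Canadian welcome page mentioned earlier, the combination could be sketched as follows. The header is shown in a comment because it is sent by the server rather than written in the markup; the header value and the sample text are assumptions for illustration.

```html
<!DOCTYPE html>
<!-- Assumed HTTP response header, configured on the server:
       Content-Language: fr, en
     No single default is declared on the html element, because the page
     is aimed at speakers of two languages. -->
<html>
<head>
  <meta charset="utf-8">
  <title>Bienvenue / Welcome</title>
</head>
<body>
  <!-- Each block declares its own text-processing language -->
  <div lang="fr"><p>Bienvenue au Canada.</p></div>
  <div lang="en"><p>Welcome to Canada.</p></div>
</body>
</html>
```

The header describes the audience of the page as a whole, while the lang attributes tell text-processing applications which language each range of text is written in.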
When pointing to a resource in another language, consider the pros and cons before indicating the language of the target document. more...
If you want to indicate that the target document of an a element is in another language, consider the pros and cons before using hreflang with CSS. more...
Do not use flag icons to indicate languages. more...
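If, after weighing the pros and cons, you do decide to mark links by target language, one possible sketch uses an attribute selector on hreflang to append a short text marker instead of a flag icon. The URL, link text, and marker format are hypothetical.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Illustrative example</title>
  <style>
    /* Append a text marker to links whose target is declared as German */
    a[hreflang="de"]::after { content: " (de)"; }
  </style>
</head>
<body>
  <p>Read the <a href="/de/guide" hreflang="de">city guide</a>.</p>
</body>
</html>
```

The hreflang attribute describes the language of the linked resource only; it does not change the language of the link text itself, which still inherits from the surrounding content.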
This version introduces the following changes:
Members of the Internationalization Working Group and former GEO Working Group have contributed their time and valuable comments to shaping these guidelines.