Choosing & applying a character encoding

Question

Which character encoding should I use for my content, and how do I apply it to my content?

Content is composed of a sequence of characters. Characters represent letters of the alphabet, punctuation, etc. But content is stored in a computer as a sequence of bytes, which are numeric values. Sometimes more than one byte is used to represent a single character. Like codes used in espionage, the way that the sequence of bytes is converted to characters depends on what key was used to encode the text. In this context, that key is called a character encoding.
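To make the espionage analogy concrete, here is a minimal sketch in Python showing that the very same bytes yield different characters depending on which encoding is used as the key:

```python
# The same sequence of bytes decodes to different characters depending on
# which encoding (the "key") is assumed.
data = b'\xc3\xa9'                        # two bytes

assert data.decode('utf-8') == 'é'        # one character in UTF-8
assert data.decode('iso-8859-1') == 'Ã©'  # two characters in Latin-1
```

This is exactly the kind of mismatch (often called mojibake) you see when a page is saved in one encoding but interpreted in another.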

This article offers simple advice on which character encoding to use for your content and how to apply it, i.e. how to actually produce a document in that encoding.

If you need to better understand what characters and character encodings are, see the article Character encodings for beginners.

Quick answer

Choose UTF-8 for all content and consider converting any content in legacy encodings to UTF-8.

If you really can't use a Unicode encoding, check that there is wide browser support for the page encoding that you have selected, and that the encoding is not on the list of encodings to be avoided according to recent specifications.

Check whether your choice will be affected by HTTP server-side settings.

In addition to declaring the encoding of the document inside the document and/or on the server, you need to save the text in that encoding to apply it to your content.

Developers also need to ensure that the various parts of the system can communicate with each other.

Details

Applying an encoding to your content

Content authors should declare the character encoding of their pages using one of the methods described in Declaring character encodings in HTML.
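As a quick illustration (the linked article describes all the available methods), the usual in-document declaration for an HTML page is a meta element near the top of the head; the HTML specification requires it to appear within the first 1024 bytes of the file:

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>An example page</title>
  </head>
  <body>
  </body>
</html>
```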

However, it is important to understand that just declaring an encoding inside a document or on the server won't actually change the bytes; you need to save the text in that encoding to apply it to your content. (The declaration just helps the browser interpret the sequences of bytes in which the text is stored.)
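The distinction can be sketched in Python (the file name here is just an example): the declaration is text inside the document, but the encoding is applied by how the file is written to disk.

```python
import os
import tempfile

# Declaring an encoding does not by itself change the stored bytes; the
# file must actually be saved (encoded) as UTF-8. In Python you do that
# by passing the encoding explicitly when writing.
html = '<!DOCTYPE html>\n<meta charset="utf-8">\n<p>café</p>\n'

path = os.path.join(tempfile.gettempdir(), 'page.html')  # example file name
with open(path, 'w', encoding='utf-8') as f:
    f.write(html)

# On disk, the é is now stored as the UTF-8 byte sequence 0xC3 0xA9.
with open(path, 'rb') as f:
    raw = f.read()
assert b'caf\xc3\xa9' in raw
```

Had the file been saved with `encoding='iso-8859-1'` instead, the meta declaration would still say utf-8, but the bytes would not match it, and browsers would display the wrong characters.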

If necessary, set up UTF-8 as the default for new documents in your editor. The picture below shows how you would do that in the preferences of an editor such as Dreamweaver.

Dreamweaver's new document preferences allow you to specify a default encoding.

You may also need to check that your server is serving documents with the right HTTP declarations, since these override the in-document information (see below).

Developers also need to ensure that the various parts of the system can communicate with each other. Web pages must be able to communicate seamlessly with back-end scripts, databases, and such. These, of course, all work best with UTF-8, too. Developers can find a detailed set of things to consider in the article Migrating to Unicode.

Why use UTF-8?

An HTML page can only be in one encoding. You cannot encode different parts of a document in different encodings.

A Unicode-based encoding such as UTF-8 can support many languages and can accommodate pages and forms in any mixture of those languages. Its use also eliminates the need for server-side logic to individually determine the character encoding for each page served or each incoming form submission. This significantly reduces the complexity of dealing with a multilingual site or application.

A Unicode encoding also allows many more languages to be mixed on a single page than any other choice of encoding.

Any barriers to using Unicode are very low these days. In fact, in January 2012 Google reported that over 60% of the Web in their sample of several billion pages was now using UTF-8. Add to that the figure for ASCII-only web pages (since ASCII is a subset of UTF-8), and the figure rises to around 80%.

There are three different Unicode character encodings: UTF-8, UTF-16 and UTF-32. Of these three, only UTF-8 should be used for Web content. The HTML5 specification says "Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. Authoring tools should default to using UTF-8 for newly-created documents."

Note, in particular, that all ASCII characters in UTF-8 use exactly the same bytes as an ASCII encoding, which often helps with interoperability and backwards compatibility.
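A minimal Python sketch of that backwards compatibility:

```python
# Every ASCII character occupies exactly the same single byte in UTF-8,
# so pure-ASCII files are already valid UTF-8.
text = 'Hello, world!'
assert text.encode('ascii') == text.encode('utf-8')

# Only non-ASCII characters are encoded differently (as multi-byte sequences).
assert 'é'.encode('utf-8') == b'\xc3\xa9'
```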

Taking the HTTP header into account

Any character encoding declaration in the HTTP header will override declarations inside the page. If the HTTP header declares an encoding that is not the same as the one you want to use for your content, this will cause a problem unless you are able to change the server settings.

You may not have control over the declarations that come with the HTTP header, and may have to contact the people who manage the server for help. On the other hand there are sometimes ways you can fix things on the server if you have limited access to server setup files or are generating pages using scripting languages. For example, see Setting the HTTP charset parameter for more information about how to change the encoding information, either locally for a set of files on a server, or for content generated using a scripting language.
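For example (assuming an Apache server; other servers have equivalent settings), it is the charset parameter on the Content-Type header that the browser obeys, and a directive in a local .htaccess file can often set it even without full access to the server configuration:

```
# HTTP response header sent with the page; its charset parameter
# overrides any declaration inside the document:
Content-Type: text/html; charset=utf-8

# Apache directive (e.g. in a .htaccess file) that makes the server
# send that charset parameter by default:
AddDefaultCharset UTF-8
```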

Typically, before doing so, you need to check whether the HTTP header is actually declaring the character encoding. You could use the W3C Internationalization Checker to find out what character encoding, if any, is specified in the HTTP header. Alternatively, the article Checking HTTP Headers points to some other tools for checking the encoding information passed by the server.

Additional information

The information in this section relates to things you should not normally need to know, but which are included here for completeness.

What if I can't use UTF-8?

If you really can't avoid using a non-UTF-8 character encoding, you will need to choose from a limited set of encoding names to ensure maximum interoperability and the longest possible useful life for your content, and to minimise security vulnerabilities.

Until recently the IANA registry was the place to find names for encodings. The IANA registry commonly includes multiple names for the same encoding. In this case you should use the name designated as 'preferred'.

The new Encoding specification now provides a list that has been tested against actual browser implementations. You can find the list in the table in the section called Names and labels. It is best to use the names in the left column of that table.

Note, however, that the presence of a name in either of these sources doesn't necessarily mean that it is OK to use that encoding. See the next section for encodings that you should avoid.

Avoid these encodings

The HTML5 specification calls out a number of encodings that you should avoid.

Documents must not use JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), encodings based on ISO-2022, or encodings based on EBCDIC. This is because they allow ASCII code points to represent non-ASCII characters, which poses a security threat.

Documents must also not use CESU-8, UTF-7, BOCU-1, or SCSU encodings, since they were never intended for Web content and the HTML5 specification forbids browsers from recognising them.

The specification also strongly discourages the use of UTF-16, and the use of UTF-32 is 'especially discouraged'.

Other character encodings listed in the Encoding specification should also be avoided. These include the Big5 and EUC-JP encodings, which have interoperability issues. ISO-8859-8 (a Hebrew encoding for visually ordered text) should also be avoided, in favour of an encoding that works with logically ordered text (i.e. UTF-8, or failing that ISO-8859-8-i).

The replacement encoding, listed in the Encoding specification, is not actually an encoding; it is a fallback that maps every octet to the Unicode code point U+FFFD REPLACEMENT CHARACTER. Obviously, it is not useful to transmit data in this encoding.

The x-user-defined encoding is a single-byte encoding whose lower half is ASCII and whose upper half is mapped into the Unicode Private Use Area (PUA). Like the PUA in general, using this encoding on the public Internet is best avoided because it damages interoperability and long-term use.