Curling Quotes in HTML, XML, and SGML

by David A. Wheeler

Summary

If you’re creating HTML, SGML, or XML directly, perhaps using a text editor or writing a program, the safest approach is to use “decimal numeric character references” for curling single and double quote characters (these marks are also called “smart quotes,” “curly quotes,” “curled quotes,” “curling quotes,” or “curved quotes”). In other words, for left and right double quotation marks, use &#8220; and &#8221;, and for left and right single quotation marks (and apostrophes), use &#8216; and &#8217;. You’ll be glad you did. This approach complies with the relevant international standards and works essentially everywhere.

Here’s a table showing what I mean.

To show | In HTML, SGML, or XML use | Displays on your system as
Left Double Quotation Mark | &#8220; | “
Right Double Quotation Mark | &#8221; | ”
Left Single Quotation Mark | &#8216; | ‘
Right Single Quotation Mark (including English possessives and contractions) | &#8217; | ’

By doing this, your text will look good on a very wide variety of browsers and viewers, and you can easily cut-and-paste portions of data between HTML, SGML, and XML documents (letting you dynamically query and create new material from existing material, without having to deal with the complexities of translating between character sets).
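
As a rough illustration, if you're writing a program that emits HTML, SGML, or XML, the substitution can be done with a small lookup table. The following Python sketch is only one way to do it; the function name curly_quotes_to_ncr is made up for this example, and it assumes the input is already a Unicode string.

```python
# Map each curling quote character to its decimal numeric character reference.
CURLY_TO_NCR = {
    "\u201c": "&#8220;",  # left double quotation mark
    "\u201d": "&#8221;",  # right double quotation mark
    "\u2018": "&#8216;",  # left single quotation mark
    "\u2019": "&#8217;",  # right single quotation mark (also the apostrophe)
}

# Build the translation table once so it can be reused for every call.
_TABLE = str.maketrans(CURLY_TO_NCR)


def curly_quotes_to_ncr(text: str) -> str:
    """Replace curling quote characters with decimal numeric character references."""
    return text.translate(_TABLE)


print(curly_quotes_to_ncr("\u201cIt\u2019s here,\u201d she said."))
# prints: &#8220;It&#8217;s here,&#8221; she said.
```

All other characters pass through untouched, so the function is safe to run over text that already uses these references.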

If you don't want to do this directly, use tools that will do it for you. If you're using simple ASCII text files, SmartyPants can do this for you.

The best alternative is using UTF-8. UTF-8 is fantastic, but other charsets are still in use and can cause problems.

Rationale

There are many advantages to this particular recommendation. These are the official, standard, vendor-neutral encodings for these characters according to both Unicode and ISO 10646, so you don’t need to worry about them not working in the future. They also work across XML, HTML, and SGML, which simplifies data extraction; alternatives such as named character entity references do not easily work across XML and HTML in particular. Systems that can display curling quotes (with the current fonts) will do so, and practically without exception they will gracefully fall back to neutral (vertical) characters if they can’t, even in a somewhat old browser. I’ve tested this approach on several versions of Internet Explorer, Netscape (the old 4.5 and 6.X), Mozilla (0.9.9 and 1.0), and lynx (a text browser), on a variety of systems (Windows, Linux, Sun Solaris). The one minor problem is that on some older X Window System installations with old fonts, the left single quotation mark may get mapped to the angled character used for the right single quotation mark. Even so, it doesn’t look bad, the alternatives look far worse everywhere else, and this solution is “future-proof”.

Do not use the various alternatives:

Now, why is this a problem? Normal English uses matched pairs of curled single quotation marks and double quotation marks to indicate quotation. Unfortunately, the original designers of the ASCII character set didn’t define a standard method for identifying properly curved quotation marks, so computers have had problems with properly exchanging quotation marks ever since.

Other Sources of Information

Markus Kuhn’s “ASCII and Unicode Quotation Marks” describes the general problem well. He summarizes this way:

Please do not use the ASCII grave accent (0x60) as a left quotation mark together with the ASCII apostrophe (0x27) as the corresponding right quotation mark. Your text will otherwise appear rather strange with most modern fonts (e.g., on Windows and Mac systems). Only old X Window System fonts and some old video terminals show ASCII 0x60/0x27 as left and right quotation marks, while most modern systems follow the ISO and Unicode standards instead. If you can use only ASCII’s typewriter characters, then use the apostrophe character (0x27) as both the left and right quotation mark. If you can use Unicode characters, nice directional quotation marks are available in the form of characters U+2018 and U+2019.

There’s an interesting test page that tests some characters. The W3C has a page on character encodings.

Unfortunately, Kuhn doesn’t describe how to specifically deal with the problem in HTML, XML, and SGML, which is why I wrote this page.

If you’re curious, here are the text pages I used to examine the issue on a wide variety of machines:

Note that this approach means that if you're trying to generate simple ASCII text from HTML, SGML, or XML, you will need to translate curved quotes into straight quotes. But this is true in general: if you start with a richer character set (such as HTML, SGML, or XML when using numeric character references) and have to move to a poorer character set, you should expect that some characters will need to be translated. There are many other characters you have to handle anyway, so this is a step you would have to do regardless.
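
For what it's worth, here is a minimal Python sketch of that translation step. It assumes the input is a fragment of text content (no tags) and uses the standard library's html.unescape to decode the numeric character references first; the function name ascii_quotes is hypothetical. Following Kuhn's advice quoted above, both single quotes become the plain apostrophe (0x27).

```python
import html

# Curved quotes mapped to their straight ASCII stand-ins.
_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / apostrophe
})


def ascii_quotes(fragment: str) -> str:
    """Decode character references, then straighten any curved quotes."""
    decoded = html.unescape(fragment)  # turns &#8220; into the character U+201C
    return decoded.translate(_STRAIGHT)


print(ascii_quotes("&#8220;Don&#8217;t panic,&#8221; he said."))
# prints: "Don't panic," he said.
```

A real converter would need a similar table entry for every other non-ASCII character it expects to meet, such as dashes.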

After I wrote this page, I found that others have come to the same conclusion (for the same reasons). For example, Peter K. Sheerin’s The Trouble with EM ‘n EN proposes the same solution, for many of the same reasons (although he doesn’t note the issues with SGML and XML, which I think are important too). He also discusses proper use of the em dash (—), which is used to indicate a sudden break in thought; the en dash (–), which is used to indicate a range or connection between things; and the single prime (′), which is used to represent feet or minutes. Again, the solution is to use decimal numeric character references.

Note that the W3C recommends only using such escapes as an exception. They suggest using a Unicode-based encoding (UTF-8, UTF-16, or UTF-32), and for XML using UTF-8 or UTF-16. The problem is that we're still in a transition period where tools don't all handle them so well, and the recommendation made here will ALWAYS work (now and in the future).

If you want detailed specifications on some of this, here are a few pointers: here is the Microsoft Windows Codepage 1252 (Windows Latin 1), as well as the Microsoft Windows Codepage 1253 (there are many more). A summary of the PalmOS code page is available. Possibly more importantly, here are some mapping documents that show how to convert from some of these character encodings into Unicode/ISO 10646: Microsoft Windows 1252 to Unicode, MacOS Roman to Unicode, and here are the set of mappings from various encodings into Unicode/ISO 10646.
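
To make these mappings a bit more concrete, here is a small Python sketch that decodes the same quoted word from Windows 1252 bytes and from MacOS Roman bytes. It relies on Python's built-in cp1252 and mac_roman codecs, which I'm assuming follow the published mapping tables; the sample bytes are invented for illustration.

```python
# The word "quoted" wrapped in curling double quotes, as stored by two
# different vendor encodings. The quote bytes differ between the two.
windows_bytes = b"\x93quoted\x94"  # Windows codepage 1252
mac_bytes = b"\xd2quoted\xd3"      # MacOS Roman

windows_text = windows_bytes.decode("cp1252")
mac_text = mac_bytes.decode("mac_roman")

# Both decode to the same Unicode / ISO 10646 characters.
print(windows_text == mac_text)
# prints: True

# Show the code points of the non-ASCII characters that were recovered.
print([f"U+{ord(ch):04X}" for ch in windows_text if ord(ch) > 127])
# prints: ['U+201C', 'U+201D']
```

The decimal values of those two code points, 8220 and 8221, are exactly the numbers used in the numeric character references recommended here.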

Tools

I make available an open source software / free software (OSS/FS) tool, quoter, which uses heuristics to try to fix quotation marks in HTML, XML, or SGML (it’s smart enough to leave quotes alone when they are used in tags). It’s free, so feel free to use it (it requires a Unix-like system or Cygwin on Windows).

The demoroniser program fixes many incompatible Microsoft punctuation marks so that they comply with standards, but unfortunately, the last version I’ve seen (published January 1998) only converts the Microsoft quotation marks into straight ASCII quotation marks instead of implementing the approach described here. The demoroniser results at least look better and are more interoperable than doing nothing, but they aren’t as good as the approach recommended here. This perhaps makes sense; in January 1998, there were still some old tools that did not handle quotation marks well, but by now that workaround is unnecessary. My quoter tool does a better job of translating quotation marks; you can use demoroniser after using quoter if you’d like to fix other characters.

Composer, the HTML editor in Mozilla and Netscape 6 (and later), will normally correctly edit files that include curled quotes defined this way. In other words, if the file has them, and you edit the file, they’ll be fine. However, if you set the Content-type value in the HTML file, be sure to use a setting such as ascii or iso-8859-1. Here’s an example of the HTML code you should use if you choose to set the Content-type (often a good idea):

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

If you set some other charset that can represent quote characters directly, such as utf-8, then Composer will automatically convert any numeric character references to that character set. This is reasonable for Composer to do, but it may cause trouble when you try to combine files later (you may then have to use various conversion tools). You may also want to use the Edit/Preferences menu and select “Retain original source formatting.” Sadly, at the time of this writing, Composer doesn’t have a preference setting that lets you automatically use curling quotes when pressing a straight quote key (the capability is sometimes called “smart quotes”), and its Insert/Characters and Symbols capability doesn’t have curled quotes as an option. I’ve entered a suggestion to add this; please look at the bug report #145765 and vote to add this capability. Thus, for the moment, to enter curled quotes while in Composer you have to switch to HTML source view; this works, but is slow on extremely large files. An alternative is to just edit files normally, and then use tools such as my quoter tool to fix things after editing.

Plucker, as of version 1.2, handles these quotes correctly.

MacOS X’s Cocoa supports curling quotes, both entering and displaying them, using the standard Unicode character values advocated here. However, users may not remember how to enter the curly quotes. Andrew C. Stone shows how to automatically add curly quotes to Cocoa’s Text system.

If you have existing text in one character set, particularly a non-standard one like Windows’, you can use one of many tools to convert it to something else. Unix-like systems such as GNU/Linux usually have iconv, which will let you convert between the character sets to a single uniform character set (iconv comes with the GNU C library). Changing everything using iconv to something standard (like utf-8), and then running a simple program to change all non-ASCII characters into decimal numeric character references, would be a very good way to turn random text in various character sets into a single, uniform result.
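
The “simple program” in that last step can be quite short. Here is a minimal Python sketch under a few assumptions: Python's own decoder stands in for iconv, the source character set is known in advance, and the function name to_ascii_with_ncrs is made up for this example. The xmlcharrefreplace error handler is a standard part of Python's codecs and emits decimal numeric character references.

```python
def to_ascii_with_ncrs(raw: bytes, source_charset: str) -> str:
    """Decode bytes from a known charset, then replace every non-ASCII
    character with its decimal numeric character reference."""
    text = raw.decode(source_charset)
    # Characters that cannot be encoded as ASCII are written as &#NNNN;
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Sample input: curling quotes and an accented letter in Windows 1252.
print(to_ascii_with_ncrs(b"\x93caf\xe9\x94", "cp1252"))
# prints: &#8220;caf&#233;&#8221;
```

The result is pure ASCII, so it can be pasted into a document that declares ascii, iso-8859-1, or utf-8 without any further conversion.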

Feel free to see my home page or my paper Why OSS/FS? Look at the Numbers!.