WaSP Asks the W3C

Specifying Character Encoding

This month kicks off our new “WaSP Asks the W3C” Question and Answer project. WaSP members take frequently asked questions about standards, posed to WaSP by Web authors and designers, and submit them to the W3C’s Quality Assurance Group. The answers are published and archived both here and on the W3C Web Standards Education list, where follow-up discussion also takes place. Sign-up details can be found at the end of this article.

WaSP asks

There are several ways of specifying the character encoding for a particular document. Which of the following methods (or combination thereof) does the W3C recommend, and why?

The W3C responds

These three ways of providing the character encoding of a document are not equivalent. When trying to figure out the character encoding of a resource, user agents will try, in this order:

Since the HTTP Content-Type header has precedence, and is also the easiest information to retrieve (user agents do not have to parse the resource to get it), it is almost always the preferred way to provide the character encoding for an (X)HTML document.

However, in at least two cases, this is simply not possible:

In these cases, an HTML document should provide the character encoding via a meta element, and an XML document can provide it via the XML declaration. If the XML document uses one of the default encodings (UTF-8 or UTF-16), no encoding declaration is needed.
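The lookup described above can be sketched in Python. This is a minimal illustration, not how any particular browser works: the function names `charset_from_header` and `declared_encoding` are hypothetical, the regular expressions only cover simple declarations like the ones in this article, and checking the XML declaration before the meta element is an assumption.

```python
import re
from email.message import Message

# Simplified patterns (assumption: they only cover plain, well-formed declarations).
XML_DECL = re.compile(rb'^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
META_CHARSET = re.compile(
    rb'<meta[^>]+content\s*=\s*["\'][^"\']*charset=([A-Za-z0-9._-]+)',
    re.IGNORECASE,
)


def charset_from_header(content_type):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def declared_encoding(content_type, body):
    """Return (charset, source), trying the HTTP header before the document itself."""
    charset = charset_from_header(content_type)
    if charset:
        return charset, "http"
    head = body[:1024]  # in-document declarations sit near the start of the file
    match = XML_DECL.search(head)
    if match:
        return match.group(1).decode("ascii").lower(), "xml"
    match = META_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    return None, None
```

Note that the header wins even when the document says something different, which is why a misconfigured server can override a correct in-document declaration.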

To sum it up

Examples

Example of an XHTML 1.0 document written in French with an ISO-8859-1 encoding:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
<head>
  <title>Exemple de document XHTML 1.0</title>
</head>
<body>
  <h1>Portrait Intérieur</h1>
  <h2>Rainer-Maria Rilke</h2>
  <p>Ce ne sont pas des souvenirs<br />
  qui, en moi, t'entretiennent ;<br />
  tu n'es pas non plus mienne<br />
  par la force d'un beau désir.</p>
</body>
</html>

Example of an HTML 4.01 document written in French with a UTF-8 encoding:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<html lang="fr">
<head>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <title>Exemple de document HTML 4.01</title>
</head>
<body>
  <h1>Portrait Intérieur</h1>
  <h2>Rainer-Maria Rilke</h2>
  <p>Ce ne sont pas des souvenirs<br>
  qui, en moi, t'entretiennent ;<br>
  tu n'es pas non plus mienne<br>
  par la force d'un beau désir.</p>
</body>
</html>

On the popular Apache Web server, the HTTP Content-Type header for a resource can be set up in the .htaccess file, as follows:

<Files example.html>
  ForceType text/html;charset=ISO-8859-1
</Files>

This would force the file example.html to be served as ISO-8859-1 even if the server had a different global configuration.
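To see why the declared encoding has to match the bytes actually stored in the file, here is a small Python sketch. The helper name `roundtrip` is hypothetical; it simply encodes a string with one charset and decodes it with another, which is roughly what happens when a server announces the wrong charset.

```python
TEXT = "Portrait Intérieur"


def roundtrip(text, written_as, read_as):
    """Encode text with one charset, then decode the bytes with another.

    Bytes that are invalid in the reading charset become U+FFFD,
    the replacement character that browsers typically show as a question mark.
    """
    return text.encode(written_as).decode(read_as, errors="replace")


# Matching charsets: the text survives unchanged.
print(roundtrip(TEXT, "iso-8859-1", "iso-8859-1"))
# ISO-8859-1 bytes announced as UTF-8: the accented letter is lost.
print(roundtrip(TEXT, "iso-8859-1", "utf-8"))
# UTF-8 bytes announced as ISO-8859-1: the accented letter turns into two characters.
print(roundtrip(TEXT, "utf-8", "iso-8859-1"))
```

The same mismatch can come from a global server default, which is the situation the per-file override above is meant to correct.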

WaSP comments

WaSP and W3C member Tim Bray commented on this answer:

“If you know that the document you’re sending is going to get read by an XML processor, the server should get the charset right. If the server makes any mistake the rules say that the processor is supposed to do the wrong thing! On the other hand, if the document is going to any kind of HTML reader, the server can usefully try to help and do what is suggested here. So it turns out that it matters whether you serve it as html or xhtml+xml.”
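The strictness of XML processors can be illustrated with Python's standard `xml.etree.ElementTree` parser. This is only a sketch of the general behaviour, not of any browser: the helper `parse_text` is hypothetical, and the sample bytes are the ISO-8859-1 form of a word from the examples above.

```python
import xml.etree.ElementTree as ET


def parse_text(xml_bytes):
    """Return the root element's text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_bytes).text
    except ET.ParseError:
        return None


# The byte 0xE9 is "é" in ISO-8859-1 but is not valid on its own in UTF-8.
correct = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>d\xe9sir</p>'
mislabeled = b'<?xml version="1.0" encoding="UTF-8"?><p>d\xe9sir</p>'

print(parse_text(correct))     # parsed with the declared encoding
print(parse_text(mislabeled))  # the parser refuses the document
```

An XML parser stops at the first byte that does not fit the declared encoding, whereas HTML readers generally try to carry on and display something.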

How to serve HTML and XHTML will be discussed in the next issue of WaSP Asks the W3C.

Discussion

For clarification and discussion on this topic, please address your comments and questions to the W3C Web Standards Education list.

To subscribe to the list, send an email to [email protected] with “Subject: subscribe”. You can read archived posts at http://lists.w3.org/Archives/Public/public-evangelist/.