About Character Sets

Character sets are usually described as single-byte or multibyte, according to the number of bytes needed to represent each character used in a language. English, German, and French (among many others) are single-byte languages; only 1 byte is necessary to represent a character such as the letter a or the number 9. Single-byte code sets have, at most, 256 characters, including the entire set of ASCII characters, accented characters, and other characters necessary for formatting.

Multibyte code sets have more than 256 characters, including all single-byte characters as a subset. Languages that commonly rely on multibyte encodings include traditional and simplified Chinese, Japanese, and Korean; in UTF-8, scripts such as Thai, Arabic, and Hebrew also take more than 1 byte per character. A good example is the word Tokyo, the capital of Japan. In English, it is spelled with five letters (four distinct characters), using a total of 5 bytes. In Japanese, the word is written with two characters, tou and kyou, each of which uses 2 bytes in an encoding such as Shift_JIS, for a total of 4 bytes (in UTF-8, each takes 3 bytes, for a total of 6).
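
The byte counts for Tokyo can be checked with a minimal PHP sketch. It assumes PHP 7 or later for the \u{...} escape, which always produces UTF-8 bytes. The Shift_JIS bytes are written out by hand for illustration, and the variable names are hypothetical.

```php
<?php
// strlen() counts bytes, not characters, which is what we want here.

// "Tokyo" in ASCII: five letters, one byte each.
$english = "Tokyo";

// The two kanji for Tokyo (U+6771, U+4EAC) encoded as UTF-8.
$japaneseUtf8 = "\u{6771}\u{4EAC}";

// The same two kanji as Shift_JIS bytes, hand-written for illustration.
$japaneseSjis = "\x93\x8C\x8B\x9E";

echo strlen($english), "\n";      // 5
echo strlen($japaneseUtf8), "\n"; // 6 (3 bytes per character)
echo strlen($japaneseSjis), "\n"; // 4 (2 bytes per character)
```

The same word therefore has a different byte length in each character set, which is why the browser has to be told which one a page uses.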

This is a considerable simplification of character sets and the technology behind them, but the relevance is this: to properly interpret and display the text of Web pages in their intended language, you must tell the Web browser which character set to use. You do this by sending the appropriate headers before any content.

If you have a set of pages that includes Japanese text and you do not send the correct headers regarding language and character set, those pages will render incorrectly in Web browsers whose primary language is not Japanese. When no character set information is included, the browser assumes that it is to render the text using its own default character set. For example, if your Japanese pages use the Shift_JIS or UTF-8 character set and your browser is set for ISO-8859-1, your browser will try to render the Japanese text using the single-byte ISO-8859-1 character set. It will fail unless the headers tell it to use Shift_JIS or UTF-8 and you have the appropriate libraries and language packs installed on your operating system.
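
The failure described above can be imitated in a few lines of PHP. This is only a simplified model of what a browser does, not its actual decoding logic: the hypothetical helper read_as_latin1() treats each byte as one ISO-8859-1 character and re-encodes the result as UTF-8 so it can be printed.

```php
<?php
// Hypothetical helper: treat every byte of $bytes as one ISO-8859-1
// character and return the result as UTF-8, mimicking a reader that
// applies the wrong single-byte character set.
function read_as_latin1(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $code = ord($byte);
        if ($code < 0x80) {
            $out .= $byte; // ASCII bytes are unchanged
        } else {
            // ISO-8859-1 code points 0x80-0xFF take two bytes in UTF-8
            $out .= chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
        }
    }
    return $out;
}

$tokyo = "\u{6771}\u{4EAC}";       // two kanji, six UTF-8 bytes
$garbled = read_as_latin1($tokyo); // six unrelated Latin-1 characters
echo $garbled, "\n";
```

Plain ASCII text such as Tokyo passes through unchanged, which is why English pages often appear to work even when the character set is never declared.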

The headers in question are the Content-Type and Content-Language headers, and these can also be set as META tags. Because you have all the tools for a dynamic environment, it's best to both send the appropriate headers before your text and print the correct META tags in your document. The following is an example of the header() function outputting the proper character information for an English site:

header("Content-Type: text/html;charset=utf-8");
header("Content-Language: en");

The accompanying META tags would be these:

<META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">
<META HTTP-EQUIV="Content-Language" content="en">

A German site would use the same character set but a different language code:

header("Content-Type: text/html;charset=utf-8");
header("Content-Language: de");

The accompanying META tags would be these:

<META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">
<META HTTP-EQUIV="Content-Language" content="de">

A Japanese site can use the same UTF-8 character set, so again only the language code changes:

header("Content-Type: text/html;charset=utf-8");
header("Content-Language: ja");

The accompanying META tags would be these:

<META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">
<META HTTP-EQUIV="Content-Language" content="ja">
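
Because the three examples differ only in the language code, the header lines and META tags could be generated from one place. The following is a minimal sketch under that assumption; charset_markup() is a hypothetical helper, not a built-in PHP function, and it expects trusted, developer-supplied values.

```php
<?php
// Hypothetical helper, not part of PHP: build the two header lines and
// the two matching META tags for a language code and character set.
function charset_markup(string $language, string $charset = 'utf-8'): array
{
    $contentType = "text/html; charset=$charset";
    return [
        'headers' => [
            "Content-Type: $contentType",
            "Content-Language: $language",
        ],
        'meta' => [
            '<META HTTP-EQUIV="Content-Type" content="' . $contentType . '">',
            '<META HTTP-EQUIV="Content-Language" content="' . $language . '">',
        ],
    ];
}

$markup = charset_markup('ja');

// Send the headers before any output, then print the META tags.
if (!headers_sent()) {
    foreach ($markup['headers'] as $line) {
        header($line);
    }
}
echo implode("\n", $markup['meta']), "\n";
```

Calling charset_markup('en') or charset_markup('de') would produce the English and German variants shown earlier, which keeps the headers and the META tags from drifting apart.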
