JavaScript Editor Source code editor     What Is Ajax 

Main Page

A.11. MySQL 5.0 FAQ — MySQL Chinese, Japanese, and Korean Character Sets

This set of Frequently Asked Questions derives from the experience of MySQL's Support and Development groups in handling many inquiries about CJK (Chinese-Japanese-Korean) issues.


Questions and Answers

26.11.1: I have inserted CJK characters into my table. Why does SELECT display them as “?” characters?

This problem is usually due to a setting in MySQL that doesn't match the settings for the application program or the operating system. Here are some common steps for correcting these types of issues:

26.11.2: What GB (Chinese) character sets does MySQL support?

MySQL supports the two common variants of the GB (Guojia Biaozhun, or National Standard) character sets which are official in the People's Republic of China: gb2312 and gbk. Sometimes people try to insert gbk characters into gb2312, and it works most of the time because gbk is a superset of gb2312 — but eventually they try to insert a rarer Chinese character and it doesn't work. (See Bug#16072 for an example).

Here, we try to clarify exactly what characters are legitimate in gb2312 or gbk, with reference to the official documents. Please check these references before reporting gb2312 or gbk bugs.

26.11.3: What problems should I be aware of when working with the Big5 Chinese character set?

MySQL supports the Big5 character set which is common in Hong Kong and Taiwan (Republic of China). MySQL's big5 is in reality Microsoft code page 950, which is very similar to the original big5 character set. We changed to this character set starting with MySQL version 4.1.16 / 5.0.16 (as a result of Bug#12476). For example, the following statements work in current versions of MySQL, but not in old versions:

Query OK, 0 rows affected (0.13 sec)

mysql> INSERT INTO big5 VALUES (0xf9dc);
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM big5;
| big5 |
| 嫺  |
1 row in set (0.02 sec)

A feature request for adding HKSCS extensions has been filed. People who need this extension may find the suggested patch for Bug#13577 to be of interest.

26.11.4: Why do Japanese character set conversions fail?

MySQL supports the sjis, ujis, cp932, and eucjpms character sets, as well as Unicode. A common need is to convert between character sets. For example, there might be a Unix server (typically with sjis or ujis) and a Windows client (typically with cp932).

In the following conversion table, the ucs2 column represents the source, and the sjis, cp932, ujis, and eucjpms columns represent the destinations — that is, the last 4 columns provide the hexadecimal result when we use CONVERT(ucs2) or we assign a ucs2 column containing the value to an sjis, cp932, ujis, or eucjpms column.

Character Nameucs2sjiscp932ujiseucjpms

Now consider this portion of the table:


This means that MySQL converts the NOT SIGN (Unicode U+00AC) to sjis code point 0x81CA and to cp932 code point 3F. (3F is the question mark (“?”) — this is what is always used when the conversion cannot be performed.

26.11.5: What should I do if I want to convert SJIS 81CA to cp932?

Our answer is: “?”. There are serious complaints about this: many people would prefer a “loose” conversion, so that 81CA (NOT SIGN) in sjis becomes 81CA (FULLWIDTH NOT SIGN) in cp932. We are considering a change to this behavior.

26.11.6: How does MySQL represent the Yen (Ґ) sign?

A problem arises because some versions of Japanese character sets (both sjis and euc) treat 5C as a reverse solidus (\ — also known as a backslash), and others treat it as a yen sign (Ґ).

MySQL follows only one version of the JIS (Japanese Industrial Standards) standard description. In MySQL, 5C is always the reverse solidus (\).

26.11.7: Do MySQL plan to make a separate character set where 5C is the Yen sign, as at least one other major DBMS does?

This is one possible solution to the Yen sign issue; however, this will not happen in MySQL 5.1 or 5.2.

26.11.8: Of what issues should I be aware when working with Korean character sets in MySQL?

In theory, while there have been several versions of the euckr (Extended Unix Code Korea) character set, only one problem has been noted.

We use the “ASCII” variant of EUC-KR, in which the code point 0x5c is REVERSE SOLIDUS, that is \, instead of the “KS-Roman” variant of EUC-KR, in which the code point 0x5c is WON SIGN(). This means that you cannot convert Unicode U+20A9 to euckr:

mysql> SELECT
    ->     CONVERT('₩' USING euckr) AS euckr,
    ->     HEX(CONVERT('₩' USING euckr)) AS hexeuckr;
| euckr | hexeuckr |
| ?     | 3F       |
1 row in set (0.00 sec)

MySQL's graphic Korean chart is here:

26.11.9: Why do I get Data truncated error messages?

For illustration, we'll create a table with one Unicode (ucs2) column and one Chinese (gb2312) column.

mysql> CREATE TABLE ch
    -> (ucs2 CHAR(3) CHARACTER SET ucs2,
    -> gb2312 CHAR(3) CHARACTER SET gb2312);
Query OK, 0 rows affected (0.05 sec) 

We'll try to place the rare character in both columns.

mysql> INSERT INTO ch VALUES ('A汌B','A汌B');
Query OK, 1 row affected, 1 warning (0.00 sec) 

Ah, there's a warning. Let's see what it is.

| Level   | Code | Message                                     |
| Warning | 1265 | Data truncated for column 'gb2312' at row 1 |
1 row in set (0.00 sec)

So it's a warning about the gb2312 column only.

mysql> SELECT ucs2,HEX(ucs2),gb2312,HEX(gb2312) FROM ch;
| ucs2  | HEX(ucs2)    | gb2312 | HEX(gb2312) |
| A汌B | 00416C4C0042 | A?B    | 413F42      |
1 row in set (0.00 sec)

There are several things that need explanation here.

  1. The fact that it's a “warning” rather than an “error” is characteristic of MySQL. We like to try to do what we can, to get the best fit, rather than give up.

  2. The character isn't in the gb2312 character set. We described that problem earlier.

  3. Admittedly the message is misleading. We didn't “truncate” in this case, we replaced with a question mark. We've had a complaint about this message (See Bug#9337). But until we come up with something better, just accept that error/warning code 2165 can mean a variety of things.

  4. With SQL_MODE=TRADITIONAL, there would be an error message, but instead of error 2165 you would see: ERROR 1406 (22001): Data too long for column 'gb2312' at row 1.

26.11.10: Why does my GUI front end or browser not display CJK characters correctly in my application using Access, PHP, or another API?

Obtain a direct connection to the server using the mysql client (Windows: mysql.exe), and try the same query there. If mysql responds correctly, then the trouble may be that your application interface requires initialization. Use mysql to tell you what character set or sets it uses with the statement SHOW VARIABLES LIKE 'char%';. If you are using Access, then you are most likely connecting with MyODBC. In this case, you should check Section 23.1.3, “Connector/ODBC Configuration”. If, for instance, you use big5, you would enter SET NAMES 'big5'. (Note that no ; is required in this case). If you are using ASP, you might need to add SET NAMES in the code. Here is an example that has worked in the past:

Dim strConnection
Dim Conn
strConnection="driver={MySQL ODBC 3.51 Driver};server=server;uid=username;" \ 
               & "pwd=password;database=database;stmt=SET NAMES 'big5';"
Set Conn = Server.CreateObject("ADODB.Connection") 
Conn.Open strConnection

In much the same way, if you are using any character set other than latin1 with Connector/NET, then you must specify the character set in the connection string. See Section, “Connecting to MySQL Using Connector/NET”, for more information.

If you are using PHP, try this:

  $link = mysql_connect($host, $usr, $pwd); 

  if( mysql_error() ) { print "Database ERROR: " . mysql_error(); }
  mysql_query("SET NAMES 'utf8'", $link); 

In this case, we used SET NAMES to change character_set_client and character_set_connection and character_set_results.

We encourage the use of the newer mysqli extension, rather than mysql. Using mysqli, the previous example could be rewritten as shown here:

  $link = new mysqli($host, $usr, $pwd, $db);
  if( mysqli_connect_errno() )
    printf("Connect failed: %s\n", mysqli_connect_error());

  $link->query("SET NAMES 'utf8'"); 

Another issue often encountered in PHP applications has to do with assumptions made by the browser. Sometimes adding or changing a <meta> tag suffices to correct the problem: for example, to insure that the user agent interprets page content as UTF-8, you should include <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in the <head> of the HTML page.

If you are using Connector/J, see Section, “Using Character Sets and Unicode”.

26.11.11: I've upgraded to MySQL 5.0. How can I revert to behavior like that in MySQL 4.0 with regard to character sets?

In MySQL Version 4.0, there was a single “global” character set for both server and client, and the decision as to which character to use was made by the server administrator. This changed starting with MySQL Version 4.1. What happens now is a “handshake”, as described in Section 10.4, “Connection Character Sets and Collations”:

When a client connects, it sends to the server the name of the character set that it wants to use. The server uses the name to set the character_set_client, character_set_results, and character_set_connection system variables. In effect, the server performs a SET NAMES operation using the character set name.

The effect of this is that you cannot control the client character set by starting mysqld with --character-set-server=utf8. However, some of our Asian customers have said that prefer the MySQL 4.0 behavior. To make it possible to retain this behavior, we added a mysqld switch, --character-set-client-handshake, which can be turned off with --skip-character-set-client-handshake. If you start mysqld with --skip-character-set-client-handshake, then, when a client connects, it sends to the server the name of the character set that it wants to use — however, the server ignores this request from the client.

By way of example, suppose that your favorite server character set is latin1 (unlikely in a CJK area, but this is the default value). Suppose further that the client uses utf8 because this is what the client's operating system supports. Now, start the server with latin1 as its default character set:

mysqld --character-set-server=latin1

And then start the client with the default character set utf8:

mysql --default-character-set=utf8

The current settings can be seen by viewing the output of SHOW VARIABLES:

mysql> SHOW VARIABLES LIKE 'char%';
| Variable_name            | Value                                  |
| character_set_client     | utf8                                   |
| character_set_connection | utf8                                   |
| character_set_database   | latin1                                 |
| character_set_filesystem | binary                                 |
| character_set_results    | utf8                                   |
| character_set_server     | latin1                                 |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
8 rows in set (0.01 sec)

Now stop the client, and then stop the server using mysqladmin. Then start the server again, but this time tell it to skip the handshake like so:

mysqld --character-set-server=utf8 --skip-character-set-client-handshake

Start the client with utf8 once again as the default character set, then display the current settings:

mysql> SHOW VARIABLES LIKE 'char%';
| Variable_name            | Value                                  |
| character_set_client     | latin1                                 |
| character_set_connection | latin1                                 |
| character_set_database   | latin1                                 |
| character_set_filesystem | binary                                 |
| character_set_results    | latin1                                 |
| character_set_server     | latin1                                 |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
8 rows in set (0.01 sec)

As you can see by comparing the differing results from SHOW VARIABLES, the server ignores the client's initial settings if the --skip-character-set-client-handshake is used.

26.11.12: Why do some LIKE and FULLTEXT searches with CJK characters fail?

There is a very simple problem with LIKE searches on BINARY and BLOB columns: we need to know the end of a character. With multi-byte character sets, different characters might have different octet lengths. For example, in utf8, A requires one byte but requires three bytes, as shown here:

| OCTET_LENGTH(_utf8 'A') | OCTET_LENGTH(_utf8 'ペ') |
|                       1 |                         3 |
1 row in set (0.00 sec)

If we don't know where the first character ends, then we don't know where the second character begins, in which case even very simple searches such as LIKE '_A%' fail. The solution is to use a regular CJK character set in the first place, or to convert to a CJK character set before comparing.

This is one reason why MySQL cannot allow encodings of nonexistent characters. If it is not strict about rejecting bad input, then it has no way of knowing where characters end.

For FULLTEXT searches, we need to know where words begin and end. With Western languages, this is rarely a problem because most (if not all) of these use an easy-to-identify word boundary — the space character. However, this is not usually the case with Asian writing. We could use arbitrary halfway measures, like assuming that all Han characters represent words, or (for Japanese) depending on changes from Katakana to Hiragana due to grammatical endings. However, the only sure solution requires a comprehensive word list, which means that we would have to include a dictionary in the server for each Asian language supported. This is simply not feasible.

26.11.13: What CJK character sets are available in MySQL?

The list of CJK character sets may vary depending on your MySQL version. For example, the eucjpms character set was not supported prior to MySQL 5.0.3 (see Section E.1.27, “Changes in release 5.0.3 (23 March 2005: Beta)”). However, since the name of the applicable language appears in the DESCRIPTION column for every entry in the INFORMATION_SCHEMA.CHARACTER_SETS table, you can obtain a current list of all the non-Unicode CJK character sets using this query:

    -> OR DESCRIPTION LIKE '%Japanese%'
    -> OR DESCRIPTION LIKE '%Korean%'
| big5               | Big5 Traditional Chinese  |
| cp932              | SJIS for Windows Japanese |
| eucjpms            | UJIS for Windows Japanese |
| euckr              | EUC-KR Korean             |
| gb2312             | GB2312 Simplified Chinese |
| gbk                | GBK Simplified Chinese    |
| sjis               | Shift-JIS Japanese        |
| ujis               | EUC-JP Japanese           |
8 rows in set (0.01 sec)

(See Section 20.9, “The INFORMATION_SCHEMA CHARACTER_SETS Table”, for more information.)

26.11.14: How do I know whether character X is available in all character sets?

The majority of simplified Chinese and basic non-halfwidth Japanese Kana characters appear in all CJK character sets. This stored procedure accepts a UCS-2 Unicode character, converts it to all other character sets, and displays the results in hexadecimal.


CREATE PROCEDURE p_convert(ucs2_char CHAR(1) CHARACTER SET ucs2)

             (ucs2 CHAR(1) character set ucs2,
              utf8 CHAR(1) character set utf8,
              big5 CHAR(1) character set big5,
              cp932 CHAR(1) character set cp932,
              eucjpms CHAR(1) character set eucjpms,
              euckr CHAR(1) character set euckr,
              gb2312 CHAR(1) character set gb2312,
              gbk CHAR(1) character set gbk,
              sjis CHAR(1) character set sjis,
              ujis CHAR(1) character set ujis);

INSERT INTO tj (ucs2) VALUES (ucs2_char);

UPDATE tj SET utf8=ucs2,

/* If there's a conversion problem, UPDATE will produce a warning. */

SELECT hex(ucs2) AS ucs2,
       hex(utf8) AS utf8,
       hex(big5) AS big5,
       hex(cp932) AS cp932,
       hex(eucjpms) AS eucjpms,
       hex(euckr) AS euckr,
       hex(gb2312) AS gb2312,
       hex(gbk) AS gbk,
       hex(sjis) AS sjis,
       hex(ujis) AS ujis
FROM tj;



The input can be any single ucs2 character, or it can be the code point value (hexadecimal representation) of that character. For example, from Unicode's list of ucs2 encodings and names (, we know that the Katakana character Pe appears in all CJK character sets, and that its code point value is 0x30da. If we use this value as the argument to p_convert(), the result is as shown here:

mysql> CALL p_convert(0x30da)//
| ucs2 | utf8   | big5 | cp932 | eucjpms | euckr | gb2312 | gbk  | sjis | ujis |
| 30DA | E3839A | C772 | 8379  | A5DA    | ABDA  | A5DA   | A5DA | 8379 | A5DA |
1 row in set (0.04 sec)

Since none of the column values is 3F — that is, the question mark character (?) — we know that every conversion worked.

26.11.15: Why don't CJK strings sort correctly in Unicode? (I)

Sometimes people observe that the result of a utf8_unicode_ci or ucs2_unicode_ci search, or of an ORDER BY sort is not what they think a native would expect. Although we never rule out the possibility that there is a bug, we have found in the past that many people do not read correctly the standard table of weights for the Unicode Collation Algorithm. MySQL uses the table found at This is not the first table you will find by navigating from the home page, because MySQL uses the older 4.0.0 “allkeys” table, rather than the more recent 4.1.0 table. This is because we are very wary about changing ordering which affects indexes, lest we bring about situations such as that reported in Bug#16526, illustrated as follows:

mysql< CREATE TABLE tj (s1 CHAR(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci);
Query OK, 0 rows affected (0.05 sec)

mysql> INSERT INTO tj VALUES ('が'),('か');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM tj WHERE s1 = 'か';
| s1   |
| が  |
| か  |
2 rows in set (0.00 sec)

The character in the first result row is not the one that we searched for. Why did MySQL retrieve it? First we look for the Unicode code point value, which is possible by reading the hexadecimal number for the ucs2 version of the characters:

mysql> SELECT s1, HEX(CONVERT(s1 USING ucs2)) FROM tj;
| s1   | HEX(CONVERT(s1 USING ucs2)) |
| が  | 304C                        |
| か  | 304B                        |
2 rows in set (0.03 sec)

Now we search for 304B and 304C in the 4.0.0 allkeys table, and find these lines:

304B  ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA
304C  ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM

The official Unicode names (following the “#” mark) tell us the Japanese syllabary (Hiragana), the informal classification (letter, digit, or punctuation mark), and the Western identifier (KA or GA, which happen to be voiced and unvoiced components of the same letter pair). More importantly, the primary weight (the first hexadecimal number inside the square brackets) is 1E57 on both lines. For comparisons in both searching and sorting, MySQL pays attention to the primary weight only, ignoring all the other numbers. This means that we are sorting and correctly according to the Unicode specification. If we wanted to distinguish them, we'd have to use a non-UCA (Unicode Collation Algorithm) collation (utf8_unicode_bin or utf8_general_ci), or to compare the HEX() values, or use ORDER BY CONVERT(s1 USING sjis). Being correct “according to Unicode” isn't enough, of course: the person who submitted the bug was equally correct. We plan to add another collation for Japanese according to the JIS X 4061 standard, in which voiced/unvoiced letter pairs like KA/GA are distinguishable for ordering purposes.

26.11.16: Why don't CJK strings sort correctly in Unicode? (II)

If you are using Unicode (ucs2 or utf8), and you know what the Unicode sort order is (see Section A.11, “MySQL 5.0 FAQ — MySQL Chinese, Japanese, and Korean Character Sets”), but MySQL still seems to sort your table incorrectly, then you should first verify the table character set:

******************** 1. row ******************
Table: t
Create Table: CREATE TABLE `t` (
) ENGINE=MyISAM DEFAULT charset=utf-8
1 row in set (0.00 sec)

Since the character set appears to be correct, let's see what information the INFORMATION_SCHEMA.COLUMNS table can provide about this column:

    -> WHERE COLUMN_NAME = 's1'
    -> AND TABLE_NAME = 't';
| s1          | ucs2               | ucs2_general_ci |
1 row in set (0.01 sec)

(See Section 20.3, “The INFORMATION_SCHEMA COLUMNS Table”, for more information.)

You can see that the collation is ucs2_general_ci instead of ucs2_unicode_ci. The reason why this is so can be found using SHOW CHARSET, as shown here:

mysql> SHOW CHARSET LIKE 'ucs2%';
| Charset | Description   | Default collation | Maxlen |
| ucs2    | UCS-2 Unicode | ucs2_general_ci   |      2 |
1 row in set (0.00 sec)

For ucs2 and utf8, the default collation is “general”. To specify a Unicode collation, use COLLATE ucs2_unicode_ci.

26.11.17: Why are my supplementary characters rejected by MySQL?

MySQL does not support supplementary characters — that is, characters which need more than 3 bytes — for UTF-8. We support only what Unicode calls the Basic Multilingual Plane / Plane 0. Only a few very rare Han characters are supplementary; support for them is uncommon. This has led to reports such as that found in Bug#12600, which we rejected as “not a bug”. With utf8, we must truncate an input string when we encounter bytes that we don't understand. Otherwise, we wouldn't know how long the bad multi-byte character is.

One possible workaround is to use ucs2 instead of utf8, in which case the “bad” characters are changed to question marks; however, no truncation takes place. You can also change the data type to BLOB or BINARY, which perform no validity checking.

We intend at some point in the future to add support for UTF-16, which would solve such issues by allowing 4-byte characters. However, we have as yet set no definite timetable for doing so.

26.11.18: Shouldn't it be “CJKV”?

No. The term “CJKV” (Chinese Japanese Korean Vietnamese) refers to Vietnamese character sets which contain Han (originally Chinese) characters. MySQL has no plan to support the old Vietnamese script using Han characters. MySQL does of course support the modern Vietnamese script with Western characters.

Bug#4745 is a request for a specialized Vietnamese collation, which we might add in the future if there is sufficient demand for it.

26.11.19: Does MySQL allow CJK characters to be used in database and table names?

This issue is fixed in MySQL 5.1, by automatically rewriting the names of the corresponding directories and files.

For example, if you create a database named on a server whose operating system does not support CJK in directory names, MySQL creates a directory named @0w@00a5@00ae. which is just a fancy way of encoding E6A5AE — that is, the Unicode hexadecimal representation for the character. However, if you run a SHOW DATABASES statement, you can see that the database is listed as .

26.11.20: Where can I find translations of the MySQL Manual into Chinese, Japanese, and Korean?

A Simplified Chinese version of the Manual, current for MySQL 5.1.12, can be found at The Japanese translation of the MySQL 4.1 manual can be downloaded from

26.11.21: Where can I get help with CJK and related issues in MySQL?

The following resources are available:


Bitcoin Gambling - The Original Crypto Dice Game . help with my essay

JavaScript Editor Source code editor     What Is Ajax