Character Sets

Introduction

"Character sets" are standards established for two main purposes: education and computing. Educational character sets are not the focus here, but it is important to note that in both China and Taiwan, national educational standards have been incorporated into the character sets used for computing. An excellent place to begin learning about Chinese, Japanese, and Korean (CJK) character set standards is the introductory chapter in Ken Lunde's book CJKV Information Processing: [CJKVInfoProc.Chap1.pdf]

"Encodings" map character sets to hexadecimal integers. Hexadecimals are base-16 numbers, written using 0-9 (for 0-9) and A-F (for 10-15) as digits. The encoding orders the characters in the set and assigns a value to each, known as a "code point."

"Character encoding forms" or "transformation formats" map the encoding's code points to units of data that your computer can understand. This data is conceived as sequences of binary digits (known as "bits"), each with a value of either 0 or 1. Until recently, most systems used 8-bit sequences (known as "bytes" or "octets") to process text, for which there are 256 possible sequences. These are represented by two-digit hexadecimal code points (00-FF, a total of 256 values). Clearly, this is not sufficient for the many thousands of Chinese characters, known as hanzi in Chinese, kanji in Japanese, and hanja in Korean. "Double-byte" character encoding forms use two bytes for each character, represented as four-digit hexadecimal code points (0000-FFFF, a total of 65,536 values).

Most recent systems use 16-bit sequences to process text. Mac OS X, for example, uses the Unicode 16-bit character encoding form, UTF-16. Nonetheless, double-byte encodings are a legacy of 8-bit data processing that will be with us for years to come, especially on the Internet. HTML and MIME, for example, are 8-bit protocols. Unicode has an 8-bit encoding form, UTF-8.
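The value counts for one-byte and two-byte sequences follow directly from the bit widths. A minimal Python sketch of that arithmetic (the variable names are chosen only for illustration):

```python
# Number of distinct values that fit in one byte and in two bytes.
one_byte_values = 2 ** 8      # 256
two_byte_values = 2 ** 16     # 65536

# The same counts expressed through hexadecimal digit ranges.
assert one_byte_values == 16 ** 2 == int("FF", 16) + 1
assert two_byte_values == 16 ** 4 == int("FFFF", 16) + 1

# Hexadecimal notation for a value: 255 is written FF, 65535 is FFFF.
print(format(255, "02X"), format(65535, "04X"))  # FF FFFF
```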

Speaking of protocols, most of the encodings discussed here have official charset names registered with the Internet Assigned Numbers Authority (IANA). The names are used to identify the encoding used in web pages and emails. No distinction is made between upper- and lowercase letters. For example, here is a typical meta element from a web page's header that sets the encoding to Big Five:

<meta http-equiv="content-type" content="text/html; charset=big5">

For plain-text email, the encoding is set in the "content-type" header, as follows:

content-type: text/plain; format=flowed; charset=big5
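As a rough illustration of how software might resolve such charset names, the following Python sketch looks them up in Python's own codec registry. It uses only the standard library, and the helper name `canonical` is hypothetical. Python's registry is not the IANA registry, so this only shows the general idea.

```python
import codecs

def canonical(charset_name):
    """Return Python's canonical codec name for a charset label."""
    return codecs.lookup(charset_name).name

# Upper and lower case are treated alike.
assert canonical("BIG5") == canonical("big5") == "big5"
assert canonical("GB2312") == canonical("gb2312")

# Decode bytes using the charset declared in a header.
declared = "Big5"
raw = b"\xa4\xa4\xa4\xe5"          # the bytes of "中文" in Big Five
print(raw.decode(declared))        # 中文
```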

A good way to appreciate and learn about the evolution of Chinese character-set standards is to read about the history of the supplements to Adobe's "character collections" for its fonts over the years:

Adobe-GB1: https://github.com/adobe-type-tools/Adobe-GB1

  • The current version is Adobe-GB1 Supplement 5 (2005), which supports GB 18030-2000, plus the Yi (Sichuan) character set from GB 18030-2005.

Adobe-CNS1: https://github.com/adobe-type-tools/Adobe-CNS1/

  • The current version is Adobe-CNS1 Supplement 7 (2017), which supports CNS 11643-1992 Planes 1 and 2 (i.e., Big Five) and HKSCS-2016.

We don't cover them here, but Adobe's character collections for Japanese [Adobe-Japan1] and Korean [Adobe-Korea1 and the forthcoming Adobe-KR (third draft)] are also quite helpful for understanding those character-set standards.

Chinese Standards

GB 2312

GB = Guójiā Biāozhǔn 国家标准, "National Standard"

Simplified-Chinese only. GB 2312 (1980) includes 6,763 hanzi on two levels (the first arranged by reading, the second by radical and then number of strokes), along with symbols and punctuation, Japanese kana, the Greek and Cyrillic alphabets, Zhuyin, and two sets of Pinyin letters with tone marks (full-width and half-width). Some of these letters were added with the first extension to GB 2312, GB 6345 (1986), which also contained two corrections. Most "GB 2312" fonts contain this extension, including those distributed by Apple. Two later extensions were not as widely adopted, but all three were incorporated into GBK in 1995. GB 2312 and all of its extensions were replaced by GB 18030 in 2000.

GB 2312 has an analog character set in which traditional forms replace simplified forms, known as GB/T 12345 (1990). This is a bit complicated, since four Level 1 characters switch glyphs with Level 2 characters (后 and 後; 征 and 徵; 么 and 麽; 余 and 馀/餘) and 103 additional characters are needed to complete the de-simplification.

GB 2312 fonts sometimes come in pairs, one with the GB 2312 character set [a.k.a. "GB-2312简体/簡體"] and the other with the GB/T 12345 character set [a.k.a. "GB-2312繁体/繁體"].

Charset name: GB2312. In Windows, the charset name GB2312 includes all of its extensions, including GBK.
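To make the scope of the base set concrete, here is a small sketch using Python's `gb2312` codec. That codec is assumed here to cover only the base GB 2312 set, unlike the broader Windows usage of the name described above.

```python
# Simplified forms are in GB 2312; each hanzi takes two bytes.
encoded = "中国".encode("gb2312")
print(encoded.hex().upper())       # D6D0B9FA

# The traditional form 國 is outside GB 2312, so a strict encoder rejects it.
try:
    "國".encode("gb2312")
    traditional_ok = True
except UnicodeEncodeError:
    traditional_ok = False
print(traditional_ok)              # False
```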

A PDF chart of GB 2312 is available at ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/AppE/

A PDF chart of GB/T 12345 is available at ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/AppF/

GBK

GBK = Guójiā Biāozhǔn Kuòzhǎn 国家标准扩展, "GB Extension"

GBK (1995) is an extension to GB 2312 that includes all 20,902 hanzi in the original CJK Unified Ideographs block of Unicode, plus 101 additional hanzi, for a total of 21,003.

In Windows 95 and later, the scope of the charset name GB2312 includes GBK. This makes sense, as GBK is an extension of GB 2312. The charset name GBK was not recognized until 2002. See: http://www.iana.org/assignments/charset-reg/GBK.

Microsoft code page 936 is based on GBK: http://www.microsoft.com/globaldev/reference/dbcs/936.htm
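The relationship between GB 2312 and GBK can be checked with a short sketch. It relies on Python's `gbk` codec, for which Python also accepts the label `cp936`; whether that codec matches Microsoft's code page in every detail is not something this example tests.

```python
import codecs

# GB 2312 characters keep the same bytes in GBK.
assert "中".encode("gbk") == "中".encode("gb2312") == b"\xd6\xd0"

# GBK also covers traditional forms from Unicode's CJK Unified Ideographs
# block, which the base GB 2312 set lacks.
gbk_bytes = "國".encode("gbk")
print(len(gbk_bytes))              # 2

# Python resolves the code page label to its GBK codec.
print(codecs.lookup("cp936").name) # gbk
```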

GB 18030

GB 18030 (2000) is the current Chinese national standard coded character set. It replaces GB 2312 and its major extension, GBK. All characters in GB 2312 and GBK are at the same code points in GB 18030. As of 2005, all GB 18030 characters map to Unicode characters.

GB 18030-2005 includes seven additional groups of characters: the remainder of Extension B, plus six regional scripts: Korean, Mongolian, Tai Le (Yunnan), Tibetan, Uighur, and Yi (Sichuan). None of these 2005 groups is currently required for GB 18030 compliance.

Charset name: GB18030.

http://www.iana.org/assignments/charset-reg/GB18030
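The compatibility with GBK, and the reach beyond it, can be sketched with Python's `gb18030` codec. The choice of U+20000 (the first Extension B character) as a test case is an assumption made for illustration.

```python
# Characters from GB 2312 and GBK keep their two-byte codes.
assert "中".encode("gb18030") == "中".encode("gbk")

# A supplementary-plane hanzi (U+20000, in Extension B) has no GBK code,
# but GB 18030 encodes it using four bytes.
ext_b = "\U00020000"
four_byte = ext_b.encode("gb18030")
print(len(four_byte))                      # 4
assert four_byte.decode("gb18030") == ext_b
```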

Big Five

Traditional-Chinese only. Big Five (1984, "Big-5") gets its name from the consortium of five companies in Taiwan that developed it. It contains 13,051 distinct hanzi, arranged in two levels by total number of strokes and then radical. The most common extension to Big Five is ETen, which includes additional punctuation and numerals, 25 radicals and radical-like elements, a full set of Japanese kana, and more. Most Big Five fonts contain this extension, including those distributed by Apple.

Big Five has an unofficial analog character set, developed by font vendors, in which simplified forms replace its traditional forms. This is known as GB Five, usually written as "GB5." Big Five fonts sometimes come in pairs, one with the standard Big-5 character set [a.k.a. "Big-5繁體"] and the other with the GB-5 character set [a.k.a. "Big-5簡體"]. This works fine going from Traditional text to Simplified text, but it is problematic going the other way.
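The asymmetry can be seen in a small sketch. The table below is a hypothetical six-entry sample written for illustration, not data from any font or standard: several traditional forms collapse into one simplified form, so the reverse lookup is ambiguous.

```python
# Hypothetical sample: traditional form -> simplified form.
TRAD_TO_SIMP = {
    "後": "后", "后": "后",
    "發": "发", "髮": "发",
    "徵": "征", "征": "征",
}

def simplify(text):
    """Traditional to simplified is a plain one-way lookup."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)

# Inverting the table yields several candidates per simplified character.
SIMP_TO_TRAD = {}
for trad, simp in TRAD_TO_SIMP.items():
    SIMP_TO_TRAD.setdefault(simp, []).append(trad)

print(simplify("皇后以後"))        # 皇后以后
print(SIMP_TO_TRAD["后"])          # ['後', '后']
```

Choosing among the candidates needs context (a word list or a human reader), which a glyph-swapping font cannot supply.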

Charset name: BIG5.

Microsoft code page 950 is based on Big Five: http://www.microsoft.com/globaldev/reference/dbcs/950.htm
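A short sketch of the "Traditional-Chinese only" scope, using Python's `big5` codec. The example characters are chosen for illustration.

```python
# Traditional forms are in Big Five; each hanzi takes two bytes.
big5_bytes = "中".encode("big5")
print(big5_bytes.hex().upper())    # A4A4

# The simplified form 汉 is outside Big Five, so strict encoding fails.
try:
    "汉".encode("big5")
    simplified_ok = True
except UnicodeEncodeError:
    simplified_ok = False
print(simplified_ok)               # False
```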

Big Five Extension

Traditional-Chinese only. Big Five Extension (1998, "Big-5E") adds a select group of 3,954 hanzi inside the code space reserved by Big Five. They appear in three blocks of code points: 8140-86DF, 86E0-875C, and 8E40-A0FE.
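As a sketch, the three blocks can be written as inclusive ranges and checked numerically. The names `BIG5E_BLOCKS` and `in_big5e_block` are hypothetical, and the check is deliberately simple.

```python
# The three Big-5E blocks listed above, as inclusive two-byte code ranges.
BIG5E_BLOCKS = [(0x8140, 0x86DF), (0x86E0, 0x875C), (0x8E40, 0xA0FE)]

def in_big5e_block(code):
    """Return True if a two-byte code value lies inside one of the blocks.

    This is a numeric range check only; it does not verify that the
    trail byte is valid or that a character is assigned at that code.
    """
    return any(low <= code <= high for low, high in BIG5E_BLOCKS)

print(in_big5e_block(0x8140))      # True
print(in_big5e_block(0xA4A4))      # False (a standard Big Five code)
```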

The Traditional Chinese Input Method in Mac OS X 10.3 and above supports Big5E in the fonts LiHei 儷黑 Pro and LiSong 儷宋 Pro.

http://www.cmex.org.tw/

Hong Kong SCS

In 1995, the government of Hong Kong created its own extension to Big Five, calling it the Government Common Character Set (GCCS). In 1999, they revised it and renamed it the Hong Kong Supplementary Character Set (HKSCS or Hong Kong SCS). It was updated in 2001, 2004, 2008, and 2016 for a current total of 5,009 traditional-form hanzi.

As of 2004, all HKSCS characters map to Unicode characters. HKSCS-2008 was the last version published with Big Five code points. This allowed HKSCS-2016 to introduce 22 "horizontal" replacements for Big Five forms from Unicode (e.g., replacing 兌 with 兑, 慍 with 愠, and 告 with 吿).

The Traditional Chinese Input Method in Mac OS X 10.3 and above supports HKSCS in the fonts LiHei 儷黑 Pro and LiSong 儷宋 Pro.

Charset name: BIG5-HKSCS.

http://www.ogcio.gov.hk/ccli/eng/hkscs/

CNS 11643

CNS = Chinese National Standard

CNS 11643 is the official Taiwan national standard. The first two planes were adopted in 1986 as a corrected and reorganized edition of Big Five. In practice, "Big Five" is the standard, with one exception: CNS 11643 has been implemented for the Unix platform in EUC-TW.

In 1992, CNS 11643 was extended to seven planes and a total of 48,027 hanzi.

http://www.cns11643.gov.tw/

EUC

Extended Unix Code (EUC) is the internal code processed by Unix software configured for a specific locale:

  • EUC-TW (Taiwan) encodes the CNS 11643 character set. Charset name: EUC-TW.
  • EUC-CN (China) is identical to GB 2312. Charset name: EUC-CN.
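For EUC-CN, the equivalence with GB 2312 is visible in Python, which treats the two labels as the same codec. This sketch covers EUC-CN only and assumes nothing beyond the standard library.

```python
import codecs

# Python treats the EUC-CN label as another name for its GB 2312 codec.
assert codecs.lookup("EUC-CN").name == codecs.lookup("GB2312").name

# Every byte of an encoded hanzi has its high bit set (0x80 or above),
# which keeps the two-byte codes apart from 7-bit ASCII.
encoded = "中".encode("euc_cn")
print([hex(b) for b in encoded])   # ['0xd6', '0xd0']
```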

7-bit Encodings

7-bit encodings are "mail-safe" transformation formats used primarily for internet services like TELNET and USENET:

  • HZ (1989) encodes GB 2312. See RFC 1843. Charset name: HZ-GB-2312.
  • ISO 2022-CN (1996) encodes GB 2312 and CNS 11643 Planes 1 and 2 (the Big Five character set). See RFC 1922. Charset name: ISO-2022-CN.
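To show what "7-bit" means in practice, here is a sketch using Python's `hz` codec. Only HZ is shown; the example character is chosen for illustration.

```python
# HZ wraps GB 2312 text in the escape sequences ~{ and ~} and sends each
# byte with its high bit cleared, so the output is pure 7-bit ASCII.
hz_bytes = "中".encode("hz")
print(hz_bytes)                            # b'~{VP~}'
assert all(b < 0x80 for b in hz_bytes)

# Clearing the high bit of the GB 2312 bytes gives the two middle bytes.
gb_bytes = "中".encode("gb2312")           # b'\xd6\xd0'
assert bytes(b & 0x7F for b in gb_bytes) == b"VP"
assert hz_bytes.decode("hz") == "中"
```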

Unicode

"Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. ... These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption. ... Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."

http://www.unicode.org/
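The conflict described in the quotation is easy to reproduce with two of the encodings covered on this page. A minimal sketch, with the byte values chosen for illustration:

```python
# One byte sequence, two different characters depending on the encoding.
raw = b"\xa4\xa4"
as_big5 = raw.decode("big5")       # 中
as_gbk = raw.decode("gbk")         # a Japanese kana, not a hanzi
assert as_big5 != as_gbk

# One character, different bytes in each legacy encoding.
assert "中".encode("big5") != "中".encode("gbk")

# Its Unicode number is the same everywhere.
print(f"U+{ord(as_big5):04X}")     # U+4E2D
```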

ISO 10646 is fully coordinated with the Unicode Standard. Thus, ISO/IEC 10646-1:2000 has exactly the same character set and encoding as version 3.0 of the Unicode Standard, and so on. ISO 10646 is a character set. Unicode is an encoding.

The Unicode Standard defines three character encoding forms that allow the same data to be handled in 8, 16, or 32 bits per code unit, called UTF-8, UTF-16, and UTF-32.

  • UTF-8 (charset name: UTF-8) is designed for use with 8-bit protocols like HTML. It uses one to four bytes per character. Thus, UTF-8 code points can have two, four, six, or eight hexadecimal digits. Mac OS 8 and above provide support for UTF-8.
  • UTF-16 (charset name: UTF-16) uses one to two 16-bit sequences per character. Thus, UTF-16 code points have either four or eight hexadecimal digits. Mac OS 9 and above provide support for UTF-16.

"U+" is the standard notation for a Unicode scalar value, a hexadecimal number defined for use by standards such as SGML, XML, and HTML. Unicode's Basic Multilingual Plane (BMP) has room for 65,536 characters (U+0000-FFFF). The Unicode scalar values for characters in the BMP are identical to their UTF-16 code points.

Unicode 1.0 (1991) included the CJK "Han Ideographs" block, but the full specification was not published until Unicode 1.1 (1993). As of Unicode 10.0 (2017), there are a total of 27,565 "unified" CJK (Chinese, Japanese, Korean) characters in the BMP.
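For a character in the BMP, the U+ notation and the UTF-16 value can be compared directly. A minimal sketch; the helper names `u_plus` and `in_bmp` are hypothetical.

```python
def u_plus(ch):
    """Format a character's scalar value in U+ notation (at least 4 digits)."""
    return f"U+{ord(ch):04X}"

def in_bmp(ch):
    """Return True if the character lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF

ch = "中"
print(u_plus(ch), in_bmp(ch))                      # U+4E2D True

# For a BMP character, the single UTF-16 code unit equals the scalar value.
code_unit = int.from_bytes(ch.encode("utf-16-be"), "big")
assert code_unit == ord(ch) == 0x4E2D
```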

Unicode also provides space for over a million more characters in 16 additional planes (U+10000-10FFFF). As of Unicode 10.0 (2017), the Supplementary Ideographic Plane (SIP) contains 60,317 additional unified characters.
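The plane layout, and the way the encoding forms handle characters outside the BMP, can be sketched as follows. The helper name `plane_of` and the three sample characters are assumptions made for illustration.

```python
PLANE_SIZE = 0x10000                       # 65,536 code points per plane

def plane_of(ch):
    """Return the plane number (0 = BMP, 2 = SIP) of a character."""
    return ord(ch) // PLANE_SIZE

# Sixteen supplementary planes give just over a million code points.
assert 16 * PLANE_SIZE == 0x10FFFF - 0x10000 + 1 == 1_048_576

# ASCII letter, BMP hanzi, and a hanzi from the SIP.
for ch in ["A", "中", "\U00020000"]:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")         # big-endian, no byte order mark
    print(f"U+{ord(ch):04X}", plane_of(ch),
          utf8.hex().upper(), utf16.hex().upper())

# U+0041 0 41 0041
# U+4E2D 0 E4B8AD 4E2D
# U+20000 2 F0A08080 D840DC00
```

The last line shows the SIP character taking four bytes in UTF-8 and two 16-bit units in UTF-16.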

For a good introduction to all of the East Asian scripts encoded in Unicode, see chapter 18 of the current standard.

For specific information about individual hanzi, see the Unihan Database, introduced in John Jenkins and Richard Cook's A User's Guide to the Unihan Database (Unicode Technical Report #38).

The database contains extensive information on each CJK Unified Ideograph, with locations in standard print dictionaries and word lists based on CEDICT (Chinese) and EDICT (Japanese).

The Unihan.txt file contains all of the information in the database. The latest version of the file is available at http://www.unicode.org/Public/UNIDATA/
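For readers who want to work with the data file directly, here is a minimal parsing sketch. It assumes the tab-separated layout of code point, field name, and value, with "#" marking comment lines. The sample lines are written for this sketch rather than copied from the file, and the function name `parse_unihan` is hypothetical.

```python
# Illustrative sample in a tab-separated layout: code point, field name,
# value. Lines starting with "#" are comments.
SAMPLE = """\
# sample written for this sketch
U+4E2D\tkTotalStrokes\t4
U+4E2D\tkMandarin\tzhōng
U+6587\tkTotalStrokes\t4
"""

def parse_unihan(text):
    """Build {character: {field: value}} from Unihan-style lines."""
    entries = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        code, field, value = line.split("\t", 2)
        ch = chr(int(code[2:], 16))          # strip the "U+" prefix
        entries.setdefault(ch, {})[field] = value
    return entries

data = parse_unihan(SAMPLE)
print(data["中"]["kMandarin"])     # zhōng
```

To read a real file, the same function could be applied to its contents after opening it with `encoding="utf-8"`.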

The ISO working group charged with the task of processing CJK characters proposed for inclusion in Unicode is called the Ideographic Rapporteur Group (IRG). They have a web site at: http://www.cse.cuhk.edu.hk/~irg/

The "unification" of the Han script in Unicode was not without controversy and confusion. Ken Whistler's On the Encoding of Latin, Greek, Cyrillic, and Han (Unicode Technical Note #26) provides an excellent review of the salient issues.

Andrew West's BabelStone site is focused on Chinese and other scripts, like 'Phags-pa, Tibetan, Mongolian, Manchu, Khitan, Jurchen, and Tangut: http://www.babelstone.co.uk