Character Sets

Introduction

"Character sets" are standards established for two main purposes: education and computing. In both China and Taiwan, national educational standards have been incorporated into the character sets used in computers. An excellent place to begin learning about Chinese, Japanese, and Korean (CJK) character set standards is the introductory chapter in Ken Lunde's book CJKV Information Processing: [CJKVInfoProc.Chap1.pdf]

"Encodings" map character sets to hexadecimal integers. Hexadecimals are base-16 numbers, written using 0-9 (for 0-9) and A-F (for 10-15) as digits. The encoding orders the characters in the set and assigns a value to each, known as a "code point."

"Character encoding forms" or "transformation formats" map the encoding's code points to units of data that your computer can understand. This electronic data is conceived as sequences of binary digits (known as "bits") with a value of either 0 or 1. Until recently, most systems used 8-bit sequences (known as "bytes" or "octets") to process text, for which there are 256 possible sequences. These are represented by two-digit hexadecimal code points (00-FF, a total of 256 values). Obviously, this is not sufficient for the many thousands of Chinese characters, known as hanzi in Chinese, kanji in Japanese, and hanja in Korean. "Double-byte" character encoding forms use two bytes for each character, represented as four-digit hexadecimal code points (0000-FFFF, a total of 65,636 values).

Most recent systems use 16-bit sequences to process text. Mac OS X, for example, uses the Unicode 16-bit character encoding form, UTF-16. Nonetheless, double-byte encodings are a legacy of 8-bit data processing that will be with us for years to come, especially on the Internet. HTML (web pages) and MIME (email) are 8-bit protocols. Unicode has an 8-bit encoding form, UTF-8. The encodings discussed below all have an official "charset name" registered with the Internet Assigned Numbers Authority (IANA). These names are used to identify the encoding of web pages and email.
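
As a rough illustration, the charset names given in the sections below can be passed to Python's codec registry, which resolves each one to an internal codec. The internal names it reports are Python's own and are not part of any standard; the list itself is just the names quoted on this page that we are confident Python recognizes:

```python
import codecs

# Charset names as they appear on this page.
CHARSET_NAMES = [
    "GB2312", "GBK", "GB18030",
    "BIG5", "BIG5-HKSCS",
    "HZ-GB-2312", "UTF-8", "UTF-16",
]

for name in CHARSET_NAMES:
    info = codecs.lookup(name)  # raises LookupError for an unknown name
    print(f"{name:12} -> Python codec '{info.name}'")
```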

A good way to learn about the evolution of Chinese character-set standards is to follow the history of the supplements to Adobe's "character collections" for its fonts:

Adobe-GB1: https://github.com/adobe-type-tools/Adobe-GB1

  • The current version is Adobe-GB1 Supplement 5 (2005), which supports GB 18030-2000, plus the Yi (Sichuan) character set from GB 18030-2005.

Adobe-CNS1: https://github.com/adobe-type-tools/Adobe-CNS1

  • The current version is Adobe-CNS1 Supplement 7 (2017), which supports CNS 11643-1992 Planes 1 and 2 (i.e., Big Five) and HKSCS-2016.

We don't cover them here, but Adobe's character collections for Japanese [Adobe-Japan1] and Korean [Adobe-KR] are also quite helpful for understanding CJK character-set standards.

Educational Standards

China

In 2013, the Chinese government published a revised national educational standard, Tōngyòng Guīfàn Hànzìbiǎo ("TGH") 通用规范汉字表 [PDF]. It specifies the standard simplified forms for a total of 8,105 hanzi "in current use" [通用] on three levels, listed by number of strokes, then radical: [HTML] [TXT]

DISCUSS DERIVED SIMPLIFICATIONS

DISCUSS TRADITIONAL/VARIANTS CHARSET (APPENDIX 1 OF TGH-2013 SPEC: 规范字与繁体字、异体字对照表): [HTML] [TXT]

All TGH-2013 simplified hanzi are in Unicode: 7,832 in the CJK Unified Ideographs block, 77 in Extension A, 36 in Extension B, 44 in Extension C, 8 in Extension D, and 108 in Extension E.

Academia Sinica

Founded in Nanjing in 1928, Academia Sinica [中央研究院] is based in Taiwan. It serves as the research institute for government agencies, like the Ministry of Education. In 1982, it published lists specifying a total of 11,149 traditional hanzi "in common use" [常用] on two levels, primary and secondary. In 1983, a list of traditional hanzi "in rare use" [罕用] was published. In 1984, an index of variant forms [異體] of traditional hanzi was published.

  • 《常用國字標準字體表》 September 1982 (4,808 hanzi) [HTML] [TXT]
  • 《次常用國字標準字體表》 October 1982 (6,341 hanzi) [HTML] [TXT]
  • 《罕用國字標準字體表》 October 1983 (18,388 hanzi)
  • 《異體國字字表》 March 1984 (18,610 hanzi)

Computing Standards

GB 2312

GB = Guójiā Biāozhǔn 国家标准, "National Standard"

Simplified-Chinese only. GB 2312 (1980) includes 6,763 hanzi on two levels (the first is arranged by reading, the second by radical then number of strokes). A series of minor extensions and corrections began in 1986, all of which were incorporated into GBK in 1995. GB 2312 and GBK were replaced by GB 18030 in 2000.

Charset name: GB2312.

GB 2312 has an official analog character set, in which traditional forms replace simplified forms, known as GB/T 12345 (1990). This is a bit complicated, since four Level 1 characters switch glyphs with Level 2 characters (后 and 後; 征 and 徵; 么 and 麽; 余 and 馀/餘), and 103 additional hanzi are needed for simplified-to-traditional conversions. GB 2312 fonts sometimes come in pairs, one with the GB 2312 character set [a.k.a. "GB-2312简体/簡體"] and the other with the GB/T 12345 character set [a.k.a. "GB-2312繁体/繁體"].

A PDF chart of GB 2312 is available at ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/AppE/

A PDF chart of GB/T 12345 is available at ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/AppF/

GBK

GBK = Guójiā Biāozhǔn Kuòzhǎn 国家标准扩展, "GB Extension"

GBK (1995) is an extension to GB 2312 and GB/T 12345 that includes all 20,914 hanzi in the CJK Unified Ideographs block of Unicode, plus 101 additions.

In Windows 95 and later, the scope of the charset name GB2312 includes GBK. The charset name GBK was not registered until 2002:

http://www.iana.org/assignments/charset-reg/GBK

GB 18030

GB 18030 (2000) is the current Chinese national standard coded character set. It replaces GB 2312 and its major extension, GBK. All characters in GB 2312 and GBK are at the same code points in GB 18030. As of 2005, all GB 18030 characters map to Unicode characters.

GB 18030-2005 added seven groups of characters: Unicode's CJK Unified Ideographs Extension B, plus six regional scripts: Korean, Mongolian, Tai Le (Yunnan), Tibetan, Uighur, and Yi (Sichuan). None of these groups was required for GB 18030 compliance.

Charset name: GB18030.

http://www.iana.org/assignments/charset-reg/GB18030
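
A brief Python sketch of the compatibility described above, using Python's built-in codecs. The sample characters are our own choice: 汉 is in GB 2312, while U+20000 is the first character of Extension B and needs GB 18030's four-byte form:

```python
def same_code_point(ch: str) -> bool:
    """Check that GB 2312, GBK, and GB 18030 give a character the same bytes."""
    return ch.encode("gb2312") == ch.encode("gbk") == ch.encode("gb18030")

print(same_code_point("汉"))  # True

# A character outside GBK is still encodable in GB 18030, in four bytes.
ext_b = "\U00020000"
four_byte = ext_b.encode("gb18030")
print(len(four_byte))  # 4

# GB 18030 covers all of Unicode, so the round trip is lossless.
print(four_byte.decode("gb18030") == ext_b)  # True
```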

CNS 11643

CNS = Chinese National Standard

CNS 11643 [中文標準交換碼] is the official Taiwan national standard. The first two planes were adopted in 1986, in a correction and revision of the Big Five (see below) character set, along with a third plane containing 6,148 hanzi ("in use by government agencies") that were not part of Big Five. While CNS 11643 is the national character-set standard in Taiwan, Big Five has been the de facto encoding standard. Since about 2004, the widespread adoption of Unicode has rendered the issue largely moot.

In 1992, CNS 11643 was extended to seven planes and a total of 48,027 hanzi. It is basically a corrected and revised version of the comprehensive indices of 48,147 traditional hanzi published from 1982 to 1984 by Academia Sinica (see above).

In 2004, CNS 11643 was extended to 15 planes, while its encoding was expanded to 80 planes: cns11643expand.ppt (in Chinese).

In 2007, the third edition of CNS 11643 was published. The project is ongoing, with a current (November 2017) total of 58,010 TSource hanzi in Unicode, and 20,135 TSource hanzi not in Unicode (see below).

http://www.cns11643.gov.tw/

https://data.gov.tw/dataset/5961

Using this data in conjunction with the Unihan database, we provide HTML pages with links to all Unicode hanzi in CNS 11643 Planes 1-7, current as of April 26, 2017. In the case of conflicts, we follow the Unicode data, as noted at the bottom of each page:

CNS 11643 beyond Planes 1-7 is more problematic. Unicode mixes TSource hanzi (from various sources in Taiwan) in with those from other sources. CNS, in turn, mixes blocks of non-TSource Unicode hanzi in among its TSource blocks. Plus, not all TSource hanzi make it through the Unihan process, leaving gaps in the original CNS blocks. Not to mention duplicates, and variants that are treated as duplicates. CNS even has its own Unicode block dedicated to them, the CJK Compatibility Ideographs Supplement: [PDF] It's a tangled web, tied to the history of Taiwan, and unraveling it here is beyond our scope. Instead, we provide a list of current Unihan TSource Hanzi in CNS 11643 Planes 8+.

For fonts and glyphs of CNS hanzi not in Unicode, see CNS 11643 in Unicode's Supplementary Private Use Area.

Big Five

Traditional-Chinese only. Big Five (1984, "Big-5") gets its name from the consortium of five companies in Taiwan that developed it. The starting point was two lists of "common" [常用] hanzi sponsored by the Ministry of Education in 1982 (a total of 11,149 hanzi), with an additional selection of "rare" [罕用] hanzi used for scientific and other kinds of specialized publishing. The basic Big Five character set contains 13,051 distinct hanzi, arranged in two separate levels by total number of strokes, then radical. Most Big Five fonts also contain the ETen (倚天) supplement, which adds three Private Use Areas, seven additional hanzi [碁銹裏墻恒粧嫺] needed to support IBM systems in Taiwan, and more.

Charset name: BIG5.

Big Five Plus (1997, Big-5+) moved to expand Big Five's code space to accommodate an additional 4,670 "standard" [標準] characters (not all of them hanzi) and 3,250 "recommended" [推薦] CNS 11643 characters needed in Taiwan. Apple, Microsoft, and everyone else (other than TwinBridge), however, did not implement this expansion of Big Five, because all of these characters were already supported by Unicode. The long-term solution was to support Unicode.

Big Five Extension (1998, Big-5E) was a short-term, practical solution to Taiwan's immediate need at the time for more hanzi to be encoded in Big Five. It placed a select group of 3,954 "standard" hanzi from Big5+ inside Big Five's legacy code space, appropriating three blocks of code points in the Private Use Areas: 8E40-A0FE, 8140-86DF, and 86E0-875C.
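
As a minimal sketch, the three blocks listed above can be written down as a range check in Python. The names BIG5E_RANGES and in_big5e_extension are hypothetical, and the check is purely numeric: it does not verify that the second byte is a valid Big Five trail byte.

```python
# The three blocks of code points quoted above, as (first, last) pairs.
BIG5E_RANGES = [
    (0x8E40, 0xA0FE),
    (0x8140, 0x86DF),
    (0x86E0, 0x875C),
]

def in_big5e_extension(code: int) -> bool:
    """True if a two-byte code falls inside one of the Big5E blocks."""
    return any(first <= code <= last for first, last in BIG5E_RANGES)

print(in_big5e_extension(0x8E40))  # True: first code of the first block
print(in_big5e_extension(0xA440))  # False: outside all three blocks
```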

The Traditional Chinese Input Method in Mac OS X 10.3 and above supports the Big5E character set in the fonts LiHei 儷黑 Pro and LiSong 儷宋 Pro.

http://www.cns11643.gov.tw

Note: Big Five has an unofficial analog character set in which simplified forms replace its traditional forms. This is known as GB Five, usually written as "GB-5." Big Five fonts sometimes come in pairs, one with the standard Big5 character set [a.k.a. "Big-5繁體"] and the other with the GB5 character set [a.k.a. "Big-5簡體"]. This is only intended to go in one direction for one purpose, from Traditional print to Simplified print. It is useless for anything else.

Hong Kong SCS

In 1995, the government of Hong Kong created its own extension to Big Five, calling it the Government Common Character Set (GCCS). It was revised in 1999 and renamed the Hong Kong Supplementary Character Set (HKSCS or Hong Kong SCS), updated in 2001, 2004, 2008, and 2016 for a current total of 5,009 traditional-form hanzi.

All HKSCS characters map to Unicode characters. HKSCS-2008 was the last version published with Big Five code points. HKSCS-2016 introduced 22 "horizontal" replacements of Big Five forms from Unicode (for example, replacing 兌 with 兑, 慍 with 愠, and 告 with 吿).

Informative, up-to-date charts of HKSCS are available at 中文編碼網頁. The Traditional Chinese Input Method in Mac OS X 10.3 and above supports HKSCS in the fonts LiHei 儷黑 Pro and LiSong 儷宋 Pro.

Charset name: BIG5-HKSCS.

http://www.ogcio.gov.hk/ccli/eng/hkscs/

EUC

Extended Unix Code (EUC) is the internal code processed by Unix software configured for a specific locale:

  • EUC-TW (Taiwan) encodes the CNS 11643 character set. Charset name: EUC-TW.
  • EUC-CN (China) is identical to GB 2312. Charset name: EUC-CN.
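
The EUC-CN case can be illustrated in Python, which treats euc-cn as another name for its gb2312 codec. This sketch covers EUC-CN only; EUC-TW is omitted:

```python
text = "汉字"

euc_cn = text.encode("euc-cn")
gb2312 = text.encode("gb2312")

print(euc_cn == gb2312)        # True: the same bytes under either name
print(euc_cn.hex().upper())    # BABAD7D6

# Every byte of a hanzi in EUC-CN has its high bit set, which is
# what makes this an 8-bit encoding.
print(all(b >= 0xA1 for b in euc_cn))  # True
```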

7-bit Encodings

7-bit encodings are "mail-safe" transformation formats used primarily for internet services like TELNET and USENET:

  • HZ (1989) encodes GB 2312. See RFC 1843. Charset name: HZ-GB-2312.
  • ISO-2022-CN (1996) encodes GB 2312 and CNS 11643 Planes 1 and 2 (the Big Five character set). See RFC 1922. Charset name: ISO-2022-CN.
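
To show what "7-bit" means in practice, here is a small Python sketch of HZ using Python's hz codec. It covers HZ only, not ISO-2022-CN. In HZ, the markers ~{ and ~} switch into and out of GB 2312 mode, and each GB 2312 byte is sent with its high bit cleared:

```python
text = "汉字"

hz = text.encode("hz")
print(hz)  # b'~{::WV~}'

# Every byte is below 0x80, so the data survives 7-bit channels.
print(all(b < 0x80 for b in hz))  # True

# The payload between the markers is the GB 2312 bytes
# with the high bit of each byte cleared.
payload = bytes(b & 0x7F for b in text.encode("gb2312"))
print(payload)  # b'::WV'
```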

Unicode

"Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. ... These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption. ... Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."

http://www.unicode.org/

ISO 10646 is fully coordinated with the Unicode Standard. Thus, ISO/IEC 10646-1:2000 has exactly the same character set and encoding as version 3.0 of the Unicode Standard, and so on. ISO 10646 is a character set. Unicode is an encoding.

The Unicode Standard defines three character encoding forms that allow the same data to be handled in 8, 16, or 32 bits per code unit, called UTF-8, UTF-16, and UTF-32.

  • UTF-8 (charset name: UTF-8) is designed for use with 8-bit protocols like HTML. It uses one to four bytes per character. Thus, UTF-8 code points can have two, four, six, or eight hexadecimal digits. Mac OS 8 and above provide support for UTF-8.
  • UTF-16 (charset name: UTF-16) uses one to two 16-bit sequences per character. Thus, UTF-16 code points have either four or eight hexadecimal digits. Mac OS 9 and above provide support for UTF-16.
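
The byte counts described above can be seen in a short Python sketch. The sample characters are our own choice, and we use the big-endian codec utf-16-be so that no byte order mark is added to the output:

```python
samples = ["A", "é", "汉", "\U00020000"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    print(f"U+{ord(ch):04X}  UTF-8: {utf8.hex().upper():8}  "
          f"UTF-16: {utf16.hex().upper()}")

# U+0041   UTF-8: 41        UTF-16: 0041
# U+00E9   UTF-8: C3A9      UTF-16: 00E9
# U+6C49   UTF-8: E6B189    UTF-16: 6C49
# U+20000  UTF-8: F0A08080  UTF-16: D840DC00
```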

"U+" is the standard notation for a "Unicode scalar value," a hexadecimal number defined for use by standards such as SGML, XML, and HTML. Unicode's Basic Multilingual Plane (BMP) has room for 65,536 characters (U+0-FFFF). The Unicode scalar values for characters in the BMP are the identical to their UTF-16 code points.

Unicode 1.0 (1991) included the CJK "Han Ideographs" block, but the full specification was not published until Unicode 1.1 (1993). As of Unicode 10.0 (2017), there are a total of 27,565 "unified" (see below) CJK (Chinese, Japanese, Korean) characters in the BMP.

Unicode also provides space for over a million more characters in 16 additional planes (U+10000-10FFFF). As of Unicode 10.0 (2017), the Supplementary Ideographic Plane (SIP) contains 60,317 additional unified characters. The unified ideograph blocks in the BMP and the SIP are:

  • CJK Unified Ideographs block (U+4E00-9FEA, 20,971 total): U4E00.pdf [TXT] [HTML]
    • Twelve additional CJK Unified Ideographs in the CJK Compatibility Ideographs block: UF900.pdf
  • CJK Unified Ideographs Extension A block (U+3400-4DB5, 6,582 total): U3400.pdf [TXT] [HTML]
  • CJK Unified Ideographs Extension B block (U+20000-2A6D6, 42,711 total): U20000.pdf [TXT] [HTML]
  • CJK Unified Ideographs Extension C block (U+2A700-2B734, 4,149 total): U2A700.pdf [TXT] [HTML]
  • CJK Unified Ideographs Extension D block (U+2B740-2B81D, 222 total): U2B740.pdf [TXT] [HTML]
  • CJK Unified Ideographs Extension E block (U+2B820-2CEA1, 5,762 total): U2B820.pdf [TXT] [HTML]
  • CJK Unified Ideographs Extension F block (U+2CEB0-2EBE0, 7,473 total): U2CEB0.pdf [TXT] [HTML]
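
The ranges in this list can be turned into a small lookup table. The Python sketch below uses the Unicode 10.0 ranges exactly as quoted above (later versions of Unicode extend some of them), and the names BLOCKS, block_of, and size are our own. It also confirms that the per-block counts add up to the totals given earlier:

```python
# (block name, first code point, last assigned code point), per Unicode 10.0
BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FEA),
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DB5),
    ("CJK Unified Ideographs Extension B", 0x20000, 0x2A6D6),
    ("CJK Unified Ideographs Extension C", 0x2A700, 0x2B734),
    ("CJK Unified Ideographs Extension D", 0x2B740, 0x2B81D),
    ("CJK Unified Ideographs Extension E", 0x2B820, 0x2CEA1),
    ("CJK Unified Ideographs Extension F", 0x2CEB0, 0x2EBE0),
]

# Unified ideographs that sit in the CJK Compatibility Ideographs block.
UNIFIED_IN_COMPATIBILITY_BLOCK = 12

def block_of(ch: str):
    """Return the name of the unified ideograph block containing ch, or None."""
    cp = ord(ch)
    for name, first, last in BLOCKS:
        if first <= cp <= last:
            return name
    return None

def size(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1

bmp_total = sum(size(f, l) for _, f, l in BLOCKS if l <= 0xFFFF)
bmp_total += UNIFIED_IN_COMPATIBILITY_BLOCK
sip_total = sum(size(f, l) for _, f, l in BLOCKS if f >= 0x20000)

print(block_of("汉"))   # CJK Unified Ideographs
print(bmp_total)        # 27565
print(sip_total)        # 60317
```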

For an introduction to all of the East Asian scripts encoded in Unicode, see chapter 18 of the current standard:

For specific information about individual hanzi, see the Unihan Database, introduced in John Jenkins and Richard Cook's A User's Guide to the Unihan Database (Unicode Technical Report #38). The home page for the database is:

The Unihan.zip folder contains all of the information in the database. The latest version is always available at http://www.unicode.org/Public/UNIDATA/

The "unification" of the Han script in Unicode was not without controversy and confusion. Ken Whistler's On the Encoding of Latin, Greek, Cyrillic, and Han (Unicode Technical Note #26) provides an excellent review of the salient issues.

The ISO working group charged with the task of processing CJK characters proposed for inclusion in Unicode is called the Ideographic Rapporteur Group (IRG). They have a web site at: http://www.cse.cuhk.edu.hk/~irg/

Finally, Andrew West's BabelStone site is focused on Chinese and other scripts, like 'Phags-pa, Tibetan, Mongolian, Manchu, Khitan, Jurchen, and Tangut: http://www.babelstone.co.uk