Character encoding

File Format
Name	Character encoding
Ontology	Electronic File Formats, Character Encodings

Revision as of 16:03, 6 September 2014

Character encodings are methods of representing the characters of text, usually as numeric values that computers can store as bits and bytes, but sometimes in other forms (Braille, for example, represents characters as patterns of raised dots). They are sometimes also called "character sets", but strictly speaking a character set is only a repertoire: the list of characters supported by some system, protocol, or file format, with no inherent order or numbering. A character encoding assigns a specific value, in some coding system, to each character. In practice the distinction blurs. There are multiple levels of abstraction: Unicode defines a set of characters and assigns each a numeric code point, but leaves it to more specific encodings such as UTF-8 to define the bits and bytes that represent them in a file. Some protocols even use parameter names such as 'charset' to indicate which character encoding is in use, so the terminology is loose even in technical usage. This section documents character sets and encodings of all kinds.
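
The layering between a character, its code point, and its stored bytes can be made concrete with a short Python sketch. It takes one character, reports its Unicode code point (the abstract number), and then lists the different byte sequences that UTF-8, UTF-16, and UTF-32 use to store that same code point. The helper name `describe` is made up for this illustration; only standard-library calls are used.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


def describe(ch: str) -> dict:
    """Return the code point of one character and its bytes in three encodings."""
    return {
        "code_point": f"U+{ord(ch):04X}",                # abstract Unicode number
        "utf-8": hex_bytes(ch.encode("utf-8")),          # variable width, 1 to 4 bytes
        "utf-16-be": hex_bytes(ch.encode("utf-16-be")),  # 2 or 4 bytes, big-endian
        "utf-32-be": hex_bytes(ch.encode("utf-32-be")),  # always 4 bytes, big-endian
    }


# "\u00e9" is LATIN SMALL LETTER E WITH ACUTE; "\u20ac" is EURO SIGN.
for ch in ("A", "\u00e9", "\u20ac"):
    print(describe(ch))
```

The code point stays the same in every row of the output; only the byte representation changes with the chosen encoding.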

See Fonts for how encoded characters are rendered on screens and printouts. The visual form of a character is known as a "glyph"; a font is a set of glyphs mapped onto the more abstractly defined characters of the character set that a character encoding covers.

Specific character sets or encodings

Adobe Standard Encoding
Amstrad CP/M Plus character set
ANSEL
- MARC-8
APL code page
ARMSCII
ASCII
ATASCII (used by Atari computers)
Baudot code
Braille
- BRF
- Nemeth Code
- Taylor Code
Compucolor character set
DEC (Digital Equipment Corporation)
- PDP-1 alphanumeric codes
EBCDIC
- CP037
- CP285
- CP424
- CP500
- CP875
- CP1026
- CP1047
- CP1140
- CP1148
- CP1155
- CP4971
- CP9067
- CP12712
- EBCDIC 6-Bit
Flag semaphore
GB 2312
ISO 646
- ISO 646-CA (Canada / French)
- ISO 646-CA-2 (Canada / French)
- ISO 646-CH (Switzerland)
- ISO 646-CN (China / Basic Latin)
- ISO 646-CU (Cuba / Spanish)
- ISO 646-DE (Germany)
- ISO 646-DK (Denmark)
- ISO 646-FI (Finland)
- ISO 646-FR (France)
- ISO 646-GB (Great Britain)
- ISO 646-HU (Hungary)
- ISO 646-IRV (International Reference Version)
- ISO 646-IT (Italy)
- ISO 646-JP (Japan / Romaji)
- ISO 646-JP OCR-B (Japan / Romaji)
- ISO 646-KR (Korea / Latin)
- ISO 646-MT (Malta)
- ISO 646-NL (Netherlands)
- ISO 646-NO (Norway)
- ISO 646-NO-2 (Norway)
- ISO 646-PT (Portugal)
- ISO 646-SE (Sweden)
- ISO 646-SE-2 (Sweden)
- ISO 646-US (Same as ASCII)
- ISO 646-YU (Yugoslavia)
ISO 2022
ISO 8859
- ISO 8859-1 (Latin-1)
- ISO 8859-2 (Latin-2, Central/East European)
- ISO 8859-3 (Latin-3, Esperanto, Galician, Maltese, and Turkish)
- ISO 8859-4 (Latin-4, Scandinavian and Baltic)
- ISO 8859-5 (Cyrillic)
- ISO 8859-6 (Arabic)
- ISO 8859-7 (Modern Greek)
- ISO 8859-8 (Hebrew)
- ISO 8859-9 (Latin-5, Turkish)
- ISO 8859-10 (Latin-6, Lappish, Nordic, and Inuit)
- ISO 8859-11 (Thai)
- ISO 8859-13 (Latin-7, Baltic Rim)
- ISO 8859-14 (Celtic)
- ISO 8859-15 (Latin-9, Latin-1 with a Euro sign)
- ISO 8859-16 (Romanian)
JIS
- JIS X 0201
- JIS X 0208
- Shift-JIS
KOI8
- KOI8-CS (Czechoslovakia)
- KOI8-R (Russia)
- KOI8-U (Ukraine)
Macintosh encodings
- MacCE
- MacCyrillic
- MacDingbat
- MacGreek
- MacGujarati
- MacGurmukhi
- MacIceland
- MacRoman
- MacRomania
- MacSymbol
- MacThai
- MacTurkish
- MacUkraine
Mattel Aquarius character set
Morse code
MS-DOS encodings (IBM PC code pages)
- CP437 (Latin US)
- CP737 (Greek)
- CP775 (Baltic Rim)
- CP850 (Latin-1)
- CP851 (Greek 1)
- CP852 (Latin-2)
- CP855 (Cyrillic)
- CP857 (Turkish)
- CP860 (Portuguese)
- CP861 (Icelandic)
- CP862 (Hebrew)
- CP863 (French Canada)
- CP864 (Arabic)
- CP865 (Nordic)
- CP866 (Cyrillic CIS 1)
- CP869 (Greek 2)
PETSCII (or PET ASCII or CBM ASCII; used by Commodore computers)
Unicode
- UTF-1
- UTF-7
- UTF-8
- CESU-8
- UTF-EBCDIC
- UTF-9
- UTF-16
- UCS-2
- UTF-18
- UTF-32 (UCS-4)
- GB18030
- Punycode
VISCII
Windows encodings
- Windows 1252 (ISO 8859-1 plus additional characters)
- Windows 1255 (Hebrew)
- Windows 1256 (Arabic, Farsi, Urdu)
- Windows 1257 (Baltic Rim)
- Windows 1258 (Vietnamese)
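
Many of the single-byte encodings listed above assign different characters to the same byte value, which is why a file's encoding has to be known before it can be read correctly. The following Python sketch decodes one byte under a handful of them. The function name `decode_byte` and the choice of codecs are illustrative assumptions; the codec names are those used by Python's standard library, which covers only part of the list above.

```python
# A few of the encodings listed above, under the names Python's codecs use.
CODECS = ["latin-1", "iso8859-15", "cp1252", "cp437"]


def decode_byte(value: int) -> dict:
    """Decode a single byte value under each codec in CODECS."""
    raw = bytes([value])
    result = {}
    for name in CODECS:
        try:
            result[name] = raw.decode(name)
        except UnicodeDecodeError:
            result[name] = None  # this byte is unassigned in that codec
    return result


# 0xA4 is the currency sign in ISO 8859-1 but the Euro sign in ISO 8859-15.
print(decode_byte(0xA4))

# EBCDIC code pages place even basic letters at different values than ASCII.
print("A".encode("ascii"), "A".encode("cp037"))
```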

Format details

Byte Order Mark
C0 controls (ASCII control characters, 7 bit)
C1 controls (extended control characters, 8 bit)
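
As a rough illustration of these details, the Python sketch below checks a byte string for a leading Unicode byte order mark and classifies byte values as C0 or C1 controls. The names `sniff_bom`, `is_c0`, and `is_c1` are hypothetical helpers, not part of any library; the BOM constants come from the standard `codecs` module. The ranges assumed are 0x00 to 0x1F for C0 and 0x80 to 0x9F for C1.

```python
import codecs

# Longer marks are tested first, because the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def is_c0(value: int) -> bool:
    """True for C0 control codes, 0x00 through 0x1F."""
    return 0x00 <= value <= 0x1F


def is_c1(value: int) -> bool:
    """True for C1 control codes, 0x80 through 0x9F."""
    return 0x80 <= value <= 0x9F


print(sniff_bom(b"\xef\xbb\xbfhello"))
print(is_c0(0x0A), is_c1(0x85))
```

A missing BOM does not identify the encoding, and a file that begins with FF FE 00 00 could in principle be UTF-16 LE text starting with a NUL character, so the result is best treated as a hint.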

Character escape codes

(used to enter characters in various systems and formats)

Alt codes (DOS/Windows)
Backslash escapes (used in various programming and markup languages)
HTML character references (entities and numeric values)
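
The escape mechanisms listed above are different ways of writing the same character when it cannot be typed or stored directly. The Python sketch below, offered as an illustration rather than a complete treatment, produces U+00E9 (e with acute accent) through several backslash escapes and HTML character references. The function name `to_numeric_refs` is made up for this example. The last lines assume that an Alt code such as Alt+130 selects byte 130 of the DOS code page, taken here to be CP437.

```python
import html

# Backslash escapes, as written in Python string literals.
by_hex = "\xe9"                                     # two hex digits
by_unicode = "\u00e9"                               # four hex digits
by_name = "\N{LATIN SMALL LETTER E WITH ACUTE}"     # Unicode character name

# HTML character references, decoded with the standard html module.
by_entity = html.unescape("&eacute;")    # named entity
by_decimal = html.unescape("&#233;")     # decimal numeric reference
by_hexref = html.unescape("&#xE9;")      # hexadecimal numeric reference


def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal HTML reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_numeric_refs("caf\u00e9"))

# Assumption: Alt+130 on a CP437 system enters the character at byte 130.
alt_130 = bytes([130]).decode("cp437")
print(alt_130 == by_unicode)
```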

Tools

Kreative Recode: software to convert character encodings

Commentary and satire

References

Ken Lunde, CJKV Information Processing, O'Reilly 2008, ISBN 978-0-596-51447-1 (has lots of information on encodings and Unicode in general, not only for CJKV locales)
IBM 3270 character set reference (1987)

