Unicode
== Other links ==
* [http://en.wikipedia.org/wiki/Unicode Wikipedia entry on Unicode]
* [http://babelstone.blogspot.co.uk/2013/10/whats-new-in-unicode-70.html What's new in Unicode 7.0?]
* [http://www.unicode.org/versions/beta-7.0.0.html Unicode 7.0 beta]
* [https://plus.google.com/109925364564856140495/posts Fake Unicode Consortium]
* [http://parkerhiggins.net/2013/01/writing-the-prince-symbol-in-unicode/ Writing the Prince symbol in Unicode]
* [http://blog.unicode.org/2015/06/announcing-unicode-standard-version-80.html Announcing the Unicode standard version 8.0]
* [http://hea-www.harvard.edu/~fine/OSX/unicode_apple_logo.html The (nonstandard) Apple logo in Unicode]
Latest revision as of 04:54, 30 August 2019
Unicode is a standard character set: an assignment of numeric values to characters. A huge number of characters from various writing systems (modern or ancient), as well as special symbols of many types, are each given a number. Work on it began in 1987, and the first version was published in 1991. Subsequent revisions have continually expanded its character repertoire.
Unicode was developed in reaction to the unwieldy multiplicity of character sets that had arisen to cover various subsets of the many characters left out of the English-centric ASCII set. It has been successful to the point where just about all technical standards dealing with characters are now defined in terms of Unicode, and even the older proprietary encodings are cross-referenced to the Unicode characters they encode.
The Unicode character set is also defined in ISO/IEC 10646, and the two standards are kept synchronized. The Unicode standard adds algorithms and rules for how to use the character set. For example, it defines rules for composing characters from a base character plus separate diacritical marks, and for positioning left-to-right and right-to-left text, so processing Unicode can be more complex than just converting a series of numbers into characters.
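As a rough illustration of the composition and directionality rules mentioned above, the sketch below uses Python's standard `unicodedata` module. The helper names `same_text` and `bidi_class` are made up for this example, and NFC normalization is only one of several ways to compare composed and decomposed text.

```python
import unicodedata

# Two spellings of "é": one precomposed code point, or a base letter
# followed by a separate combining diacritic.
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 + U+0301 COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (composed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def bidi_class(char: str) -> str:
    """Return the bidirectional class, e.g. "L" (left-to-right) or "R"."""
    return unicodedata.bidirectional(char)


print(composed == decomposed)           # False: different code point sequences
print(same_text(composed, decomposed))  # True: same text once normalized
print(bidi_class("A"), bidi_class("\u05d0"))  # L R (Latin A, Hebrew alef)
```

Comparing strings code point by code point treats the two spellings as different, which is why software that searches or sorts text usually normalizes it first.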
The term character is ambiguous, and Unicode encodes many things that are arguably not characters, so the term code point is often used instead. Code point technically refers to the numeric value, but in practice it also refers to the entity encoded by that value.
The standard way to denote a Unicode code point is to prefix it with "U+", and write the number in hexadecimal, with a minimum of four hex digits. For example, code point 42 is written as U+002A, and code point 1,114,109 is U+10FFFD.
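The notation can be produced and parsed mechanically. The following is a minimal sketch; the function names `u_plus` and `parse_u_plus` are illustrative and not part of any standard library.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point (1,114,111)


def u_plus(code_point: int) -> str:
    """Format an integer as U+XXXX, padding to at least four hex digits."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point}")
    return f"U+{code_point:04X}"


def parse_u_plus(text: str) -> int:
    """Parse a string such as "U+002A" back into an integer."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"missing U+ prefix: {text!r}")
    code_point = int(text[2:], 16)
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"out of range: {text!r}")
    return code_point


print(u_plus(42))              # U+002A
print(u_plus(1114109))         # U+10FFFD
print(parse_u_plus("U+002A"))  # 42
```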
Each encoded character is also assigned a human-readable name, which may be written after the "U+" notation. For example, you might see "U+002A ASTERISK" or "U+03A9 GREEK CAPITAL LETTER OMEGA".
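Character names can be looked up programmatically. This sketch assumes Python's standard `unicodedata` module, whose name table depends on the Unicode version bundled with the interpreter. The helper `describe` is a hypothetical name, and the `<no name>` fallback is an arbitrary choice for code points for which `unicodedata` has no name (such as the control characters).

```python
import unicodedata


def describe(char: str) -> str:
    """Return the "U+XXXX NAME" form for a single character."""
    name = unicodedata.name(char, "<no name>")
    return f"U+{ord(char):04X} {name}"


print(describe("*"))       # U+002A ASTERISK
print(describe("\u03a9"))  # U+03A9 GREEK CAPITAL LETTER OMEGA

# The reverse direction: find a character from its name.
omega = unicodedata.lookup("GREEK CAPITAL LETTER OMEGA")
print(omega == "\u03a9")   # True
```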
Details
Early versions of Unicode were designed as a 16-bit character encoding, in which each of a potential 65,536 code points could be represented as a 16-bit (2-byte) unsigned integer. Because of the "big-endian vs. little-endian" problem, a given document could correspond to either of two byte streams, but the Byte Order Mark character could be used to distinguish them.
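The two byte orders and the role of the Byte Order Mark can be seen in a short sketch. It uses the BOM constants from Python's standard `codecs` module. The function name `utf16_byte_order` is made up for this example, and it assumes the input is already known to be 16-bit encoded text (a UTF-32 little-endian BOM begins with the same two bytes as the 16-bit little-endian one).

```python
import codecs


def utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of 16-bit encoded text from its leading BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return "unknown"


text = "A"  # U+0041
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")     # FE FF 00 41
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # FF FE 41 00

print(big.hex(" "), utf16_byte_order(big))        # fe ff 00 41 big-endian
print(little.hex(" "), utf16_byte_order(little))  # ff fe 41 00 little-endian
```

Without a BOM the byte order has to come from somewhere else, such as a format specification or a declared encoding name.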
Later versions of Unicode expanded the potential number of code points (to a range of 0 to 1,114,111), so a single 16-bit value is no longer enough to represent every possible code point.
Unicode is sometimes described as consisting of 17 planes of 65,536 code points each, with plane 0 ranging from U+0000 to U+FFFF, plane 1 ranging from U+10000 to U+1FFFF, and so on. Plane 0 is known as the Basic Multilingual Plane or BMP, and an attempt is made to place the most important characters in it.
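The plane arithmetic is simple enough to sketch directly. The constant and function names below (`PLANE_SIZE`, `plane_of`, `in_bmp`) are illustrative only.

```python
PLANE_SIZE = 0x10000                            # 65,536 code points per plane
PLANE_COUNT = 17
MAX_CODE_POINT = PLANE_SIZE * PLANE_COUNT - 1   # 0x10FFFF = 1,114,111


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) containing a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point}")
    return code_point // PLANE_SIZE


def in_bmp(code_point: int) -> bool:
    """True if the code point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


print(plane_of(0x03A9), in_bmp(0x03A9))    # 0 True   (Greek capital omega)
print(plane_of(0x1F4A9), in_bmp(0x1F4A9))  # 1 False
print(plane_of(0x10FFFD))                  # 16
```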
The first 128 Unicode code points, 0-127, correspond to the same code points in ASCII (including both printable characters and the C0 controls). The next 128, 128-255, correspond to the same points in ISO 8859-1 (including the C1 controls). Since ISO 8859-1 itself matches ASCII at 0-127, the first 256 Unicode code points are identical to ISO 8859-1.
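This correspondence can be checked with a few lines of code. The sketch relies on Python's built-in `ascii` and `latin-1` codecs (the latter being Python's name for ISO 8859-1, which maps all 256 byte values including the control ranges). The variable names are arbitrary.

```python
all_bytes = bytes(range(256))

# Decoding every possible byte as ISO 8859-1 yields code points 0-255 in order.
latin1_text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in latin1_text]

# The lower half decodes identically as ASCII.
ascii_text = all_bytes[:128].decode("ascii")

print(code_points == list(range(256)))   # True
print(ascii_text == latin1_text[:128])   # True
```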
Encodings
Once numbers are assigned to characters, they can be encoded as sequences of bytes in various ways, as defined in the specifications of particular character encodings.
The most common Unicode encodings are UTF-8, UTF-16, and UTF-32. See Character Encodings for a longer list.
There is no encoding named simply "Unicode". If a format specification says that text is encoded in "Unicode", it probably means UTF-16 or UCS-2. If the document is related to Microsoft Windows, it probably means UTF-16LE.
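To make the difference between a code point and its encoded bytes concrete, the sketch below encodes the same text in several Unicode encodings using Python's built-in codecs. The helper name `encoded_forms` is hypothetical. Explicit `-le`/`-be` codec names are used so that no Byte Order Mark is added.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le")


def encoded_forms(text: str) -> dict[str, bytes]:
    """Encode the same text in several Unicode encodings."""
    return {name: text.encode(name) for name in ENCODINGS}


# U+0041 is in the BMP; U+1F4A9 is in plane 1 and needs more bytes.
for char in ("A", "\U0001F4A9"):
    print(f"U+{ord(char):04X}")
    for name, data in encoded_forms(char).items():
        print(f"  {name:<10} {data.hex(' ')}")
```

For U+1F4A9 the UTF-16 forms are four bytes long because a code point outside the BMP is written as two 16-bit units, while UTF-8 also needs four bytes and UTF-32 always uses four.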
Notes
And if you think Unicode is full of crap, you've got some support: there is a character for it (U+1F4A9, "Pile of Poo").
Code charts and references
- Unicode official site -- has lots of standards documents and code charts
- Wikipedia list of Unicode Characters
- ConScript Unicode Registry: an unofficial registry for Private Use Area blocks used for constructed scripts (e.g., Klingon), not part of the official Unicode standard
- Miscellaneous symbols in Unicode
- Unicode heart -- if you're looking for all the heart characters in Unicode
- I ♥ Unicode - a detailed and useful presentation describing Unicode and the challenges around it.
- Unicode Collation Algorithm
Tools
- Unicode character lookup
- Unicode tools
- Shapecatcher - Site for finding Unicode characters by drawing them
- Tool for converting Unicode into other character formats such as UTF-8, HTML, etc.
- UniView
- Unicode search at Fileformat.Info
- Unicode text converter (convert text to obscure characters from Unicode)