UTF-8
From Just Solve the File Format Problem
Dan Tobias (Talk | contribs)
Revision as of 20:46, 22 February 2013
UCS Transformation Format, 8-bit (UTF-8) is a byte-oriented Unicode character encoding. It offers good compatibility with ASCII: bytes 0–127 (00–7F hexadecimal) represent the equivalent ASCII characters, and bytes in this range never appear as part of a multi-byte sequence.
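The ASCII compatibility described above can be checked with a minimal Python sketch. It relies only on the built-in `str.encode`; the sample strings are arbitrary illustrations, not taken from any specification.

```python
# ASCII-only text produces identical bytes under ASCII and UTF-8.
ascii_text = "Hello, world!"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Non-ASCII characters (U+00E9, U+20AC, U+1F600) use 2, 3, and 4 bytes.
# Every byte of those sequences is 0x80 or above, so a byte in the
# range 0x00-0x7F always stands for an ASCII character.
non_ascii = "\u00e9\u20ac\U0001F600"
all_high = all(b >= 0x80 for b in non_ascii.encode("utf-8"))

print(same_bytes, all_high)  # True True
```

One consequence is that code which only looks for ASCII delimiters such as `/` or a newline can scan UTF-8 data byte by byte without decoding it.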
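UTF-8 is most compact for text that consists mainly of ASCII characters, such as scripts that make heavy use of the Roman alphabet, where most characters take a single byte. For other scripts it can be less compact than UTF-16: characters from U+0800 to U+FFFF, which include most CJK characters, take 3 bytes in UTF-8 but only 2 in UTF-16.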
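A rough way to see the size trade-off is to encode the same text both ways and compare byte counts. The sketch below is illustrative only: `encoded_sizes` is a hypothetical helper, the sample strings are arbitrary, and UTF-16 is measured via Python's `utf-16-le` codec so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count without BOM) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

latin = "file format"                   # 11 ASCII characters
greek = "\u03b1\u03b2\u03b3"            # 3 Greek letters (U+0080..U+07FF)
cjk = "\u6587\u4ef6\u683c\u5f0f"        # 4 CJK characters (U+0800..U+FFFF)

print(encoded_sizes(latin))  # (11, 22)
print(encoded_sizes(greek))  # (6, 6)
print(encoded_sizes(cjk))    # (12, 8)
```

Real documents often mix scripts with ASCII markup, digits, and whitespace, so the actual ratio depends on the content.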
Format
A Unicode code point is encoded as 1, 2, 3, or 4 bytes. (Early versions of UTF-8 defined sequences of up to 6 bytes, but sequences longer than 4 bytes are obsolete.) Code points U+0000 to U+007F use 1 byte, U+0080 to U+07FF use 2, U+0800 to U+FFFF use 3, and U+10000 to U+10FFFF use 4.
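The ranges above translate directly into a length lookup. The following Python sketch uses a hypothetical helper named `utf8_length` and cross-checks it against Python's built-in encoder for a few arbitrary sample code points. It does not treat the surrogate range U+D800 to U+DFFF specially, although those code points cannot be encoded in well-formed UTF-8.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for a code point, by range."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# Compare with the built-in encoder on one sample per range.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))

print([utf8_length(cp) for cp in (0x41, 0xE9, 0x20AC, 0x1F600)])  # [1, 2, 3, 4]
```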
See also
- Byte Order Mark
- CESU-8
Specifications
- STD 63 (RFC 3629)
- Unicode 6.0, Chapter 3 (2011) – §3.9 D92, §3.10 D95
- ISO/IEC 10646:2003 Annex D (2003)