UTF-8

File Format
Name	UTF-8
Ontology	Electronic File Formats > Character Encodings > UTF-8

Revision as of 20:46, 22 February 2013

UCS Transformation Format—8-bit (UTF-8) is a byte-oriented Unicode character encoding. It offers good compatibility with ASCII, because codes 0–127 (00–7F hexadecimal) represent the equivalent ASCII characters, and these codes are never used in any other context.
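The ASCII compatibility property can be checked with a short Python sketch. It relies only on the built-in `str.encode`; the helper name `multibyte_bytes_are_high` and the sample strings are illustrative and are not part of any specification.

```python
def multibyte_bytes_are_high(text: str) -> bool:
    """Return True if every non-ASCII character in `text` encodes to
    UTF-8 bytes that are all in the range 0x80-0xFF."""
    for ch in text:
        if ord(ch) > 0x7F:
            if any(b < 0x80 for b in ch.encode("utf-8")):
                return False
    return True


# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_text = "file.txt"
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# In mixed text, an ASCII byte such as '/' (0x2F) can only come from
# a real '/' character, never from inside a multi-byte sequence.
sample = "naïve café/日本語"
slash_count = sample.encode("utf-8").count(b"/")

print(same_as_ascii, multibyte_bytes_are_high(sample), slash_count)
```

This is why byte-oriented tools that search for ASCII delimiters, such as path separators or newlines, can generally process UTF-8 data without decoding it first.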

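UTF-8 is most efficient with scripts that make heavy use of the Roman alphabet, because those characters take one byte each. With other scripts it may not provide as efficient an encoding as UTF-16: characters in the range U+0800 to U+FFFF, which includes most Chinese, Japanese, and Korean characters, take 3 bytes in UTF-8 but 2 bytes in UTF-16.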
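The size difference can be measured directly. The following Python sketch compares encoded lengths for two sample strings; the helper `encoded_sizes` and the samples are illustrative, and `utf-16-le` is used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for `text`, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


latin = "hello"   # 5 characters, all in U+0000..U+007F
cjk = "日本語"     # 3 characters, all in U+0800..U+FFFF

print(encoded_sizes(latin))  # (5, 10)
print(encoded_sizes(cjk))    # (9, 6)
```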

Format

A Unicode code point is encoded as either 1, 2, 3, or 4 bytes. (Early versions of UTF-8 defined sequences with more than 4 bytes, but they are obsolete.) Code points U+0000 to U+007F use 1 byte, U+0080 to U+07FF use 2, U+0800 to U+FFFF use 3, and U+10000 to U+10FFFF use 4.
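The byte-length ranges above can be illustrated with a hand-written encoder. This is a minimal sketch rather than a reference implementation: the function name `encode_code_point` is hypothetical, and the lead-byte and continuation-byte bit patterns, as well as the exclusion of surrogates (U+D800 to U+DFFF), are taken from RFC 3629 and are not spelled out in the text above. The result is checked against Python's built-in encoder.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# One sample code point from each length range, compared with str.encode.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_code_point(cp) == chr(cp).encode("utf-8")

print(encode_code_point(0x20AC).hex())  # e282ac
```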

Specifications

STD 63
- RFC 3629 (2003-11)
- RFC 2279 (1998-01)
- RFC 2044 (1996-10)
Unicode 6.0, Chapter 3 (2011) – §3.9 D92, §3.10 D95
ISO/IEC 10646:2003 Annex D (2003)

See also

Byte Order Mark
CESU-8

External links
