UTF-8

File Format
Name	UTF-8
Ontology	Electronic File Formats Character Encodings UTF-8

UCS Transformation Format 8-bit (UTF-8) is a byte-oriented Unicode character encoding. It is backward compatible with ASCII: code points 0–127 (00–7F hexadecimal) are encoded as single bytes with the same values as the equivalent ASCII characters, and bytes in that range never appear inside the encoding of any other character.
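The ASCII compatibility described above can be checked with a short Python sketch. It relies only on Python's built-in "utf-8" and "ascii" codecs; the helper name ascii_bytes_match_ascii_chars is hypothetical and chosen for illustration.

```python
def ascii_bytes_match_ascii_chars(text: str) -> bool:
    """Return True if the bytes below 0x80 in the UTF-8 encoding of `text`
    are exactly the ASCII characters of `text`, in the same order."""
    encoded = text.encode("utf-8")
    ascii_chars = [ord(ch) for ch in text if ord(ch) < 0x80]
    ascii_bytes = [b for b in encoded if b < 0x80]
    return ascii_chars == ascii_bytes


# Pure ASCII text produces identical bytes under both encodings.
print("plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii"))  # True

# Mixed text: the non-ASCII characters contribute only bytes >= 0x80.
print(ascii_bytes_match_ascii_chars("na\u00efve caf\u00e9"))  # True
```

Because of this property, a program that searches UTF-8 data for an ASCII byte such as "/" or a newline cannot accidentally match part of a multi-byte character.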

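UTF-8 is most compact for text written mainly in the Latin alphabet, where most characters fall in the ASCII range and take 1 byte each. For scripts whose characters take 3 bytes in UTF-8, such as Chinese, Japanese, and Korean, UTF-16 is usually more compact, because it encodes most of those characters in 2 bytes.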
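As a rough illustration of the size difference, the following Python sketch compares the encoded length of two short sample strings. The sample strings and the helper name encoded_sizes are assumptions made for this example; "utf-16-le" is used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for `text`."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Latin-script sample: every character is ASCII, so 1 byte each in UTF-8.
print(encoded_sizes("Hello, world"))  # (12, 24)

# Japanese sample: each hiragana character takes 3 bytes in UTF-8
# but only 2 bytes in UTF-16.
print(encoded_sizes("こんにちは"))  # (15, 10)
```

Real documents often mix scripts with ASCII markup, digits, and whitespace, so the actual ratio depends on the content.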

Format

A Unicode code point is encoded as either 1, 2, 3, or 4 bytes. (Early versions of UTF-8 defined sequences with more than 4 bytes, but they are obsolete.) Code points U+0000 to U+007F use 1 byte, U+0080 to U+07FF use 2, U+0800 to U+FFFF use 3, and U+10000 to U+10FFFF use 4.
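The ranges above can be turned into a minimal encoder sketch. The lead-byte and continuation-byte bit patterns (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx, and 10xxxxxx) and the exclusion of the surrogate range U+D800 to U+DFFF are taken from RFC 3629 rather than from the text above. The function name encode_code_point is hypothetical; in practice Python's built-in str.encode("utf-8") does this work.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare the sketch with Python's built-in encoder at a few sample points.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_code_point(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encode_code_point(cp).hex(' ')}")
```

The loop at the end prints the byte sequence for one code point from each of the four length ranges.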

Specifications

STD 63
- RFC 3629 (2003-11)
- RFC 2279 (1998-01)
- RFC 2044 (1996-10)
Unicode 6.0, Chapter 3 (2011) – §3.9 D92, §3.10 D95
ISO/IEC 10646:2003 Annex D (2003)
