UCS-2

From Just Solve the File Format Problem
Latest revision as of 02:35, 21 May 2019

File Format
Name: UCS-2
Ontology: electronic > Character encoding > Unicode
IANA charset: ISO-10646-UCS-2
IANA aliases: csUnicode
IANA MIBenum: 1000

UCS-2 is the trivial 16-bit Unicode encoding. It was defined in versions of Unicode prior to 2.0, and is now considered to be obsolete. It was standardized as part of ISO 10646.

UCS-2 was at one time the only popular Unicode encoding, so there was little need to distinguish between the terms Unicode and UCS-2. If an old format specification says that text is encoded in "Unicode", it probably means UCS-2.

There may be some disagreement about precisely what "UCS-2" means. Besides the original definition, it could mean the encoding with surrogate pairs allowed (see below), or it could be an old name for the encoding whose current version is UTF-16.

The terms UCS-2 and UTF-16 are sometimes used interchangeably, even though they really shouldn't be. However, the data formats do end up being identical in many real-world situations.

Original format

UCS-2 encodes a sequence of Unicode code points in a sequence of unsigned 16-bit integers, one code point per integer, in the obvious way (U+0000=0x0000, U+0001=0x0001, ..., U+FFFF=0xffff). It is only capable of encoding code points up to U+FFFF, and does not support the higher code points (U+10000 through U+10FFFF).

Because code points usually have to be stored as bytes rather than as 16-bit integers, UCS-2 has two byte-serialized flavors: UCS-2BE (big-endian) and UCS-2LE (little-endian).
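
As an illustration of the mapping described above, here is a minimal Python sketch of a UCS-2 encoder and decoder. The function names `encode_ucs2` and `decode_ucs2` and the `byteorder` parameter are hypothetical, chosen only for this example; they are not part of any standard API.

```python
import struct


def encode_ucs2(text, byteorder="big"):
    """Encode text as UCS-2 bytes: one unsigned 16-bit integer per code point."""
    if byteorder not in ("big", "little"):
        raise ValueError("byteorder must be 'big' or 'little'")
    fmt = ">H" if byteorder == "big" else "<H"
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Original UCS-2 has no representation for U+10000..U+10FFFF.
            raise ValueError("U+%04X cannot be represented in UCS-2" % cp)
        out += struct.pack(fmt, cp)
    return bytes(out)


def decode_ucs2(data, byteorder="big"):
    """Decode UCS-2 bytes into a list of 16-bit values, one per code point."""
    if byteorder not in ("big", "little"):
        raise ValueError("byteorder must be 'big' or 'little'")
    if len(data) % 2:
        raise ValueError("UCS-2 data must contain an even number of bytes")
    prefix = ">" if byteorder == "big" else "<"
    return list(struct.unpack("%s%dH" % (prefix, len(data) // 2), data))
```

For example, `encode_ucs2("A\u20ac", "big")` yields the bytes `00 41 20 AC`, and the same text in little-endian order yields `41 00 AC 20`.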

Surrogate pairs

[Note: This is not the orthodox way of explaining surrogate pairs.]

Although UCS-2 does not support code points beyond U+FFFF, a hack called surrogate pairs was invented to allow such code points to safely pass through UCS-2 systems (e.g. file formats, databases, strings in some programming languages) in many cases.

When using this hack, a code point beyond U+FFFF is encoded as two code points in the reserved range U+D800 through U+DFFF. Each of these code points is then called a surrogate, and together they form a surrogate pair. Assuming that the reserved code points are not used in any other way, this format is identical to UTF-16.
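
The text above does not spell out how the two surrogates are computed. The following sketch uses the arithmetic defined for UTF-16: the high surrogate falls in U+D800 through U+DBFF and the low surrogate in U+DC00 through U+DFFF. The function names are hypothetical and used only for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, U+1F600 splits into the pair 0xD83D, 0xDE00, and recombining that pair gives U+1F600 again.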

When converting from UCS-2 to another encoding, it is a good idea to be aware of this hack. In most cases, sequences that look like surrogate pairs should be interpreted as such. One should keep in mind that UCS-2 systems generally aren't aware of UTF-16's rules, so it might be a bad idea to blindly interpret UCS-2 data as if it were UTF-16. For example, a string containing only the single code unit 0xDFFF is valid UCS-2, but invalid UTF-16.
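
One possible way to follow this advice is a lenient converter that combines sequences that look like surrogate pairs and passes any unpaired surrogate through unchanged instead of rejecting it. This is only a sketch of one policy, not a prescribed algorithm; the function name is hypothetical, and a real converter might prefer to substitute U+FFFD or report an error for unpaired surrogates.

```python
def ucs2_units_to_code_points(units):
    """Convert 16-bit UCS-2 code units to code points, leniently.

    A high surrogate (0xD800..0xDBFF) immediately followed by a low
    surrogate (0xDC00..0xDFFF) is combined into one code point beyond
    U+FFFF. Any other unit, including an unpaired surrogate, is kept as-is.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if (0xD800 <= unit <= 0xDBFF
                and nxt is not None
                and 0xDC00 <= nxt <= 0xDFFF):
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points
```

With this policy, the single unit 0xDFFF comes back unchanged as `[0xDFFF]`, whereas a strict UTF-16 decoder such as Python's `b"\xdf\xff".decode("utf-16-be")` raises `UnicodeDecodeError` on the same data.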
