Plain text
Revision as of 12:25, 20 June 2023
Plain text files (also known by the extension TXT) consist of characters encoded sequentially in some particular character encoding. Plain text files contain no formatting information other than white space characters. Some data formats (usually those intended to be human-readable) are based on plain text; see Text-based data for some structured formats that are stored in plain text (and hence can be opened in a plain text editor if no more specific program is available).
Traditionally, ASCII was used much of the time for maximum interoperability, though many platform-specific character sets were also in use. For non-English text an encoding supporting a broader character repertoire is needed, often UTF-8 nowadays. Note that if the file consists only of 7-bit ASCII characters, the bytes of the file are identical in us-ascii, ISO-8859-1, UTF-8, and a number of other encodings, so such a file can be identified as any of these depending on what is most convenient for a particular application. It is only when characters out of this repertoire are used that encoding-specific details need be considered. Some formats, such as HTML and XML, provide some sort of escape sequences (such as ampersands used for character references and entities) allowing special characters to be referenced within the document while leaving the document itself entirely ASCII.
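The compatibility claim above is easy to demonstrate: a file of pure 7-bit ASCII bytes decodes to the same string under all of these encodings, so labeling it as any one of them is harmless.

```python
# Every byte here is below 0x80, so all ASCII-compatible encodings agree.
data = b"Hello, plain text!\n"

decoded = {enc: data.decode(enc)
           for enc in ("ascii", "iso-8859-1", "utf-8", "cp1252")}
assert len(set(decoded.values())) == 1  # identical result under each
```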
Another point of contention or incompatibility in text-file formats is the conventions for line and paragraph breaks. Depending on what system the file was created on or intended to be viewed on, line breaks may be done as Carriage Return (ASCII 0D hex) and Linefeed (ASCII 0A hex) together (usually in that order, though in rare cases in the opposite order), or just one of those characters alone. Some text viewing or editing programs that are not cross-platform-friendly will really mess up badly in attempting to view/edit files using a different line break convention than the program expects, so you might see lines overwriting one another instead of going to the next line, or peculiar control characters show up within the file, or other strangeness. Files with linefeed alone are often referred to as "UNIX mode" (and the linefeed, in this context, referred to as NL for Newline), while files with carriage return alone are referred to as "Mac mode" (though it's also common in other early platforms such as the Apple II and Commodore 64, and no longer used in current Macs), while the CR+LF format is called "DOS" or "PC" or "Windows" mode (though it was used in various mainframes and network protocols as well).
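Sniffing which of the three conventions a file uses amounts to counting CR+LF pairs against lone CRs and lone LFs; a minimal sketch (the function name and labels are illustrative):

```python
def newline_convention(text: str) -> str:
    """Guess the line-break convention by counting each variant."""
    crlf = text.count("\r\n")
    cr = text.count("\r") - crlf   # lone carriage returns ("Mac mode")
    lf = text.count("\n") - crlf   # lone line feeds ("UNIX mode")
    counts = {"DOS/Windows (CR+LF)": crlf,
              "UNIX (LF)": lf,
              "classic Mac (CR)": cr}
    return max(counts, key=counts.get)
```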
Files may also use hard line breaks to keep line length within a fixed number of columns (usually 80, but other values such as 40 or 65 are used sometimes), or just have line breaks at the end of paragraphs and expect systems to word-wrap long lines; encountering files of a different convention than you expect may result in lines running way off to the right of the screen and requiring horizontal scrolling, or else short, choppy lines. Many text editors have a "paragraph reformat" command to bring paragraphs into compliance with your desired conventions.
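The "paragraph reformat" operation mentioned above is essentially re-wrapping to a fixed column width; Python's standard textwrap module does exactly this:

```python
import textwrap

# Re-wrap a long single-line paragraph with hard breaks at 80 columns,
# as an editor's paragraph-reformat command would.
paragraph = ("Some text editors expect hard line breaks at a fixed "
             "column width, while others word-wrap long lines on the fly "
             "and only break at paragraph boundaries.")
wrapped = textwrap.fill(paragraph, width=80)
assert all(len(line) <= 80 for line in wrapped.splitlines())
```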
Most operating systems include a simple text editor (e.g., Windows Notepad) which can open text files, but many other text editors exist (and computer people sometimes have "holy wars" over which one is best). Some of the common text editors are EMACS, vi, and UltraEdit. In the earlier days of computing, there was less distinction between text editors and word processors than there is now, as word processors generally used a format that was mostly plain text and could even be completely plain text if you refrained from using special embedded commands and features. However, modern word processors such as Microsoft Word default to using program-specific save formats that have little resemblance to plain text, unless you go out of your way to "Save As" .txt. A common "newbie error" is to attempt to create or edit plain text files in such a program, leaving the files as proprietarily-formatted in a way that messes up the operation of other programs that expect to find plain text.
Creating artwork using text characters is known as ASCII Art, or other variants such as ANSI Art if special control or escape codes are used in addition to the plain text characters.
Extension
The traditional extension for text files is .txt, but lots of other extensions have been used. Occasionally, on systems permitting extensions longer than three letters, .text has been used, and .asc for ASCII has also had some use; .doc has also sometimes been used for files "documenting" something (like the manual accompanying a piece of downloaded software), but that went out of common use once that extension became associated with Microsoft Word's DOC format.
Identification
UTF-32 text files are usually detected by starting with the Byte Order Mark (BOM): the code point U+FEFF serialized as the bytes FF FE 00 00 (little endian) or 00 00 FE FF (big endian). In some cases UTF-32 files may occur without the BOM; however, only 0x00000000–0x0000D7FF and 0x0000E000–0x0010FFFF are valid ranges for dwords, while 0x0000D800–0x0000DFFF and 0x00110000–0xFFFFFFFF are invalid, so a file whose dwords all fall in the valid ranges under one byte order is a good UTF-32 candidate.
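The BOM check and dword-validity heuristic described above can be sketched as follows (the function name is illustrative, not from any standard library):

```python
import struct

def detect_utf32(data: bytes):
    """Sketch of UTF-32 detection: BOM first, then dword-validity check."""
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le"
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be"
    # Without a BOM, require every dword to be a valid Unicode scalar
    # value: 0x0000..0xD7FF or 0xE000..0x10FFFF.
    if not data or len(data) % 4 != 0:
        return None
    for fmt, name in (("<I", "utf-32-le"), (">I", "utf-32-be")):
        values = [struct.unpack_from(fmt, data, i)[0]
                  for i in range(0, len(data), 4)]
        if all(v <= 0xD7FF or 0xE000 <= v <= 0x10FFFF for v in values):
            return name
    return None
```

For ordinary text the check is decisive even without a BOM: an ASCII letter read with the wrong endianness becomes a dword far above 0x10FFFF and is rejected.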
UTF-16 text files are usually detected by starting with the byte order mark (BOM): the code point U+FEFF serialized as the bytes FF FE (little endian) or FE FF (big endian). In some cases, however, UTF-16 files may occur without the BOM, in which case detection is not guaranteed to be reliable. Still, the byte reversal of the line feed (0x000A), namely 0x0A00, is not assigned in Unicode 15.0, and null bytes are unlikely to occur in other text encodings, so the presence of a word-aligned 00 0A or 0A 00 rules out 8-bit encodings and one of the two byte orders, and therefore may be used for UTF-16 detection. On the other hand, the bytes 0D 0A read in little endian form U+0A0D, which is not assigned in Unicode 15.0 either, but CR LF is a common newline sequence in 8-bit encodings, so that pair is less conclusive. The detection of UCS-2 text works similarly, since UCS-2 is the precursor of UTF-16; UTF-16 additionally introduced surrogate pairs, formed by 0xD800–0xDBFF followed by 0xDC00–0xDFFF, with other combinations involving 0xD800–0xDFFF being invalid.
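A minimal sketch of the BOM and word-aligned line-feed heuristics described above (the function name is illustrative):

```python
def detect_utf16(data: bytes):
    """Sketch of UTF-16 detection via BOM, then the line-feed heuristic."""
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    # Without a BOM: look for a word-aligned 0A 00 (little endian line
    # feed) or 00 0A (big endian line feed); since U+0A00 is unassigned,
    # a match strongly suggests the corresponding byte order.
    words = [data[i:i + 2] for i in range(0, len(data) - 1, 2)]
    if b"\x0a\x00" in words:
        return "utf-16-le"
    if b"\x00\x0a" in words:
        return "utf-16-be"
    return None
```

Note this only inspects newlines, so a BOM-less single-line file remains undetected; that matches the caveat above that detection without a BOM is not guaranteed to be reliable.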
ASCII-only text files may be detected by verifying that every byte of the file falls in the range 0x01–0x7F.
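In code, that whole-file check is a one-liner (the function name is illustrative):

```python
def is_ascii_only(data: bytes) -> bool:
    # Every byte must fall in 0x01-0x7F, as described above.
    return bool(data) and all(0x01 <= b <= 0x7F for b in data)
```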
UTF-8 text files may be detected by the presence of bytes in the range 0x80–0xFF, by the absence of null bytes (if UTF-16 hasn't been ruled out yet), or by verifying that the file is valid UTF-8. UTF-8 has many error cases; the only valid bit patterns are 0xxxxxxx (where the x bits form 0x00–0x7F), 110xxxxx 10xxxxxx (where the x bits form 0x0080–0x07FF, but not 0x00–0x7F), 1110xxxx 10xxxxxx 10xxxxxx (where the x bits form 0x0800–0xD7FF or 0xE000–0xFFFF, but not 0x0000–0x07FF or 0xD800–0xDFFF), and 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (where the x bits form 0x10000–0x10FFFF, but not 0x0000–0xFFFF or 0x110000–0x1FFFFF). UTF-8 text files may also start with the UTF-8 byte order mark (EF BB BF).
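Rather than hand-rolling a state machine for the bit patterns above, one can lean on an existing strict decoder; Python's built-in utf-8 codec rejects overlong sequences, surrogates, and values above 0x10FFFF. A sketch (the function name and return labels are illustrative):

```python
def detect_utf8(data: bytes):
    """Sketch of UTF-8 detection: BOM, then strict validation."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8 (with BOM)"
    try:
        data.decode("utf-8")  # enforces the valid bit patterns above
    except UnicodeDecodeError:
        return None
    # Valid UTF-8; bytes >= 0x80 prove it is not plain ASCII.
    return "utf-8" if any(b >= 0x80 for b in data) else "ascii"
```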
When a file is known to be a plain text file but UTF-32, UTF-16, ASCII, and UTF-8 have already been ruled out, only 8-bit encodings or mixed single-byte/double-byte encodings (such as Shift JIS) remain. In this case, the only thing left (short of applying complex heuristics) is to assume the regional or system text encoding, such as CP1252, CP1250, CP437, CP852, etc.
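As a last-resort sketch of that fallback, one might query the system's preferred encoding and decode with it, replacing any bytes that still don't fit (the function name is illustrative):

```python
import locale

def decode_with_fallback(data: bytes) -> str:
    # When the Unicode encodings are ruled out, fall back to the
    # regional/system encoding (e.g. "cp1252" on many Western Windows
    # systems, "UTF-8" on most modern Unix-likes).
    enc = locale.getpreferredencoding(False)
    return data.decode(enc, errors="replace")
```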