CSV
CSV (Comma Separated Values) is a text-based format typically used for the storage or exchange of database-like records. In Microsoft Windows systems, it will commonly open in Excel, but it is not a proprietary format and can be used in many other programs.
Contents |
Description
In essence, CSV files consist of a series of records each of which contains zero or more fields. The fields are separated by a known delimiter - canonically a comma - and the records are typically separated by whatever constitutes a newline on the system which generated the CSV file. A quote character is used to surround fields which themselves contain the delimiter character or the quote character, and in some implementations is used to surround any field which contains non-alphanumeric characters. The quote character is typically " but is often '.
RFC 4180 gives a spec for the format. CSV Schema is a format for providing formal definitions of the field structure of a particular CSV-based format. JSON Table Schema is another format that can provide such definitions for CSV or other tabular data formats. CSV Dialect Description Format can describe the particulars of a variety of CSV or other delimited file (e.g., the delimiters and quoting rules).
A simplistic and quite possibly syntactically invalid BNF definition for CSV is as follows:
<CSVFile> ::= <Record>*
<Record> ::= { <Field> (<Delimiter> <Field>)* } <EOL>
<Field> ::= <SimpleField> <QuotedField>
<SimpleField> :== AlphaNum* ; Any sequence of alpha-numeric characters
<QuotedField> :== <QuoteChar> <Anychar>* <QuoteChar> ; See below for quite how flexible <anychar> is
<QuoteChar> :== " | ' ;but note that they generally must match
<Delimiter> :== ","
Implementations vary in their interpretation and generation of CSV files. The best, as ever, are strict in what they generate but generous in what they accept. Known variants of what is acceptable include:
- Whether quoted fields must be quoted with " or ' or whether either is acceptable
- Whether <EOL> is ASCII NL, CR, CR NL, NL CR or any combination of these. (For instance, some implementations expecting a bare NL but seeing a record ending in CR NL will treat the CR as part of the final field; some will see it as a record delimiter on its own, making a blank record following; some will correctly recognise CR NL as a variant form of <EOL>)
- Whether all records must contain the same number of fields or not
- Whether special interpretation can be given to the first record, naming the fields in subsequent records (implementations that accept this will typically expect every record to contain the same number of fields as the first record)
- Whether quotes only appear around fields which themselves contain either a quote character, a delimiter or a newline or whether quotes can be placed around any field
- Whether quotes inside a field are doubled or escaped. E.g. if the quote character is " and a field's value is you"re should the field appear as "you\"re" or "you""re" ? (The second option is the most widely adopted, and that used by MS Excel)
- Whether it is possible to place quotes or delimiters inside field values at all (Some simplistic CSV libraries simply split the input line at every occurrence of a delimiter, ignoring quoting.)
- Whether quoting a field allows it to contain a newline or whether a newline always terminates a record regardless of where it appears.
With the generic interpretation of CSV given above, it is possible to see Tab-Delimited-Values as a special case of CSV that frequently avoids many of these problems with quoting, since typical textual data is less likely to contain real tab characters than commas or quotes. Tab delimited files rarely if ever need to use quote characters to surround fields and therefore can be handled simply by splitting lines at each tab character, as long as one does not need to support multi-line fields. Other characters are also sometimes used as delimiters, including the pipe (upright bar: |). The ASCII character set's C0 controls actually include control characters such as Unit Separator intended for use as delimiters, but those are not often encountered.
A Microsoft Excel CSV file may write or expect a semicolon as separator, depending on the system's regional settings. This is to provide support in areas where the comma is used as the decimal separator instead of the period.
CSVs loaded into Excel and saved back out might have some of their data "munged" in various ways; for instance, ZIP codes have their leading zeroes stripped, and account numbers (and other long items with all-numeric characters) could get placed into exponential notation, losing information. Files may also be truncated if they have too many lines for Excel to handle (varying by version). Unfortunately, with Windows defaulting to opening CSV files in Excel, and often browsers, e-mail programs, and other software launching downloaded CSV files automatically in that program, office workers often mess up data in this manner instead of transferring them unmolested. If you receive a CSV file from one source (e.g., a Web download or e-mail attachment) and need to send it on to another place (e.g., data import to accounting software or e-mail attachment to somebody else), it's best to save it straight to your computer's hard disk from the place you get it, then upload it from there; don't open it in Excel and save it from there.
Byte Order Mark
Some UTF-8 CSV files are written with a prepending "Byte Order Mark" or BOM consisting of 3 bytes: EF BB BF
.
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature." [1]
See also
Examples
- Sample CSV file (with double quotes around values)
- Subject themed sample datasets
- National Archives (UK) datasets (includes some CSV files)
Software
- Tabula: converts tabular data in PDFs to CSV
- csvkit 0.7.3: utilities for CSV
- csvkit 0.8.0: utilities for CSV
- CSV Validator 1.0 (more info)
- CSV Validator 1.1
- CSV Validator 1.1.3
- testdat
- Comma Chameleon (Discussion)
- q: Command line tool to do SQL-like queries on CSV and similar files
Online utilities
- CSV Lint
- PapaParse: in-browser conversions between CSV and JSON
- Validator for CKAN CSV datasets (used by UK government)
Code
- NDAD The public release of the NDAD code contained a generic set of routines to read and write CSV files which supported arbitrary quoting and multi-line fields, as well as supporting files which contained different numbers of fields in each record. readcsv_v & writecsv are buried in this tarball:
- There's also readcsv, a simpler implementation which expects the caller to know how many fields the input record will contain and returns an error if that isn't what it finds.
- GDAL/OGR, an open-source C++ library designed for geospatial data formats, includes routines to read and write CSV files. GDAL/OGR assumes CSV files are in UTF-8.
- csv is a Python module in Python Standard Library since version 2.3. It implements classes to read and write tabular data in CSV format.
- Unix: Chopping up CSV files (using Perl module)
- Bits and pieces required to complete CSV ingest in Rosetta by Ex-Libris at Archives New Zealand
Identifiers
- File extension: .CSV
- MIME type (Internet media type): text/csv is the officially-registered type, but a number of other types have been used starting before the registration took place, but continuing afterward. text/x-csv and application/x-csv are common, and application/csv, text/x-comma-separated-values, and text/comma-separated-values are also encountered. Sometimes such files are even served as application/vnd.ms-excel in order to get them to load in MS Excel (and be unnecessarily perceived as in a proprietary format by those who can't or won't use Microsoft products). Incidentally, Windows in its default configuration refers to files with a .csv extension as "Microsoft Office Excel Comma Separated Values File", implying that MS is claiming ownership of the format.
- Non-computer-geek people will often mistakenly refer to these files as "CVS". They're probably thinking of the drugstore chain; CVS is also a revision control system for software projects, but you can't expect non-geeks to know that.
Hints and tips
Other links and references
- RFC 4180
- Comma-separated values : Wikipedia
- Comma Separated Value (.csv) from GDAL/OGR documentation
- Flight MH370 data was released as a PDF, but somebody extracted it to CSV to make it more useful for data analysis.
- CSV on the Web: A Primer (W3C)
- CSV on the Web Working Group (W3C)
- Generating RDF from tabular data on the Web (W3C)
- Generating JSON from tabular data on the Web (W3C)
- Working with CSV using PDL
- It's Time to Retire the CSV