Avro

File Format
Name	Avro
Ontology	Electronic File Formats > Serialization > Avro
Extension(s)	.avsc, .avro
Released	2009

Avro, usually called "Apache Avro" after the organization that created it, is a serialization format (and RPC system) used largely within "big data" systems, specifically the Apache Hadoop environment. It was designed for extreme "flexibility", which in practice seems to mean that it tries to meet several apparently contradictory requirements at once; as a result, its structure is somewhat confusing.

At its core, Avro is an imitation of Protocol Buffers: a schema written in JSON defines a wire format^[Note 1] for serialized data that essentially consists of the field values converted to binary and concatenated. (Instead of this binary encoding, the message itself may also be serialized as JSON, although this is relatively rare.^[2]) Matters are complicated, however, by the fact that the schema is usually stored with the message, which gives rise to Avro's distinctive qualities.
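The "field values converted to binary and concatenated" idea can be illustrated with a minimal sketch. It assumes the binary encoding rules as commonly described for Avro (integers as zigzag varints, strings as a length followed by UTF-8 bytes, records as their fields in schema order) and covers only a few primitive types. The "User" schema and the function names are hypothetical, chosen for illustration.

```python
import json

def encode_long(n: int) -> bytes:
    """Encode a signed integer as a zigzag value followed by a base-128 varint."""
    z = (n << 1) ^ (n >> 63)          # zigzag: small negatives become small positives
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)

def encode(schema, value) -> bytes:
    """Encode `value` according to `schema` (a parsed JSON schema)."""
    if isinstance(schema, dict) and schema["type"] == "record":
        # A record is just its field values, concatenated in schema order.
        return b"".join(encode(f["type"], value[f["name"]])
                        for f in schema["fields"])
    if schema in ("int", "long"):
        return encode_long(value)
    if schema == "string":
        raw = value.encode("utf-8")
        return encode_long(len(raw)) + raw
    if schema == "boolean":
        return b"\x01" if value else b"\x00"
    raise NotImplementedError(f"type not covered by this sketch: {schema!r}")

# Hypothetical example schema, not taken from any real system.
SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age",  "type": "int"}]}
""")

print(encode(SCHEMA, {"name": "Ada", "age": 36}).hex())  # 0641646148
```

Note that the output contains no field names or type tags: 06 is the zigzag-encoded length 3, 41 64 61 is "Ada", and 48 is the zigzag-encoded 36. Without the schema, the bytes cannot be interpreted.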

One of the features frequently touted as an advantage of Avro is that it can work without Protocol Buffers-like code generation^[3]^[4] (although in practice code generation seems to be used anyway). This only affects ease of programming, not the data format. (Technically, it allows messages to be parsed without knowing anything about the schema in advance, but this is only useful when writing something like a debug tool.)
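What "working without code generation" amounts to can be sketched as a decoder that receives the schema as data at run time and walks it while reading bytes. This is an illustrative sketch, not the API of any Avro library; it handles only records and a few primitive types, and the function names and example schema are hypothetical.

```python
import io
import json

def read_long(buf) -> int:
    """Read a base-128 varint and undo the zigzag mapping."""
    shift = 0
    acc = 0
    while True:
        b = buf.read(1)[0]
        acc |= (b & 0x7F) << shift
        if not b & 0x80:               # continuation bit clear: last byte
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)

def decode(schema, buf):
    """Decode one value from `buf`, guided only by the run-time schema."""
    if isinstance(schema, dict) and schema["type"] == "record":
        return {f["name"]: decode(f["type"], buf) for f in schema["fields"]}
    if schema in ("int", "long"):
        return read_long(buf)
    if schema == "string":
        length = read_long(buf)
        return buf.read(length).decode("utf-8")
    if schema == "boolean":
        return buf.read(1) == b"\x01"
    raise NotImplementedError(f"type not covered by this sketch: {schema!r}")

# The schema arrives as plain JSON text; no classes were generated from it.
schema_text = """
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age",  "type": "int"}]}
"""
message = bytes.fromhex("0641646148")
print(decode(json.loads(schema_text), io.BytesIO(message)))
# {'name': 'Ada', 'age': 36}
```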

Extensions

Schema files use the extension ".avsc", while encoded files (of any sort) use the extension ".avro".^[5]

Parsing Canonical Form

The "parsing canonical form" is used to test two schemas for equality. It is computed by taking a schema, specified in JSON, and performing a number of transformations on it. If the result of this process is the same for two schemas, they are considered identical.^[6] The parsing canonical form becomes important when schemas are embedded in files (as discussed in the next section).
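The following is a simplified sketch of such a normalization, based on a reading of the specification's transformation list: reduce primitives to their bare names, replace names with full names, keep only the attributes relevant to parsing, emit them in a fixed order, and drop whitespace. It is an assumption-laden approximation rather than a conforming implementation; in particular, it resolves bare type references against the enclosing namespace only and ignores the rules for string escapes and integer literals. The helper names and example schemas are hypothetical.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long",
              "float", "double", "bytes", "string"}

def canonical(schema, namespace=""):
    """Return a normalized copy of a parsed schema (simplified)."""
    if isinstance(schema, str):
        # A primitive name, an already-qualified name, or a bare reference.
        if schema in PRIMITIVES or "." in schema or not namespace:
            return schema
        return f"{namespace}.{schema}"
    if isinstance(schema, list):                     # union
        return [canonical(s, namespace) for s in schema]
    kind = schema["type"]
    if kind in PRIMITIVES:                           # {"type": "int"} -> "int"
        return kind
    out = {}                                         # insertion order = output order
    if kind in ("record", "enum", "fixed"):
        name = schema["name"]
        if "." in name:
            namespace = name.rsplit(".", 1)[0]
        else:
            namespace = schema.get("namespace", namespace)
            if namespace:
                name = f"{namespace}.{name}"
        out["name"] = name
    out["type"] = kind
    if kind == "record":
        out["fields"] = [{"name": f["name"],
                          "type": canonical(f["type"], namespace)}
                         for f in schema["fields"]]
    elif kind == "enum":
        out["symbols"] = list(schema["symbols"])
    elif kind == "array":
        out["items"] = canonical(schema["items"], namespace)
    elif kind == "map":
        out["values"] = canonical(schema["values"], namespace)
    elif kind == "fixed":
        out["size"] = schema["size"]
    return out

def parsing_canonical_form(schema_json: str) -> str:
    """Serialize the normalized schema without any whitespace."""
    return json.dumps(canonical(json.loads(schema_json)),
                      separators=(",", ":"), ensure_ascii=False)

# Two hypothetical schemas that differ in layout, docs, defaults and naming style.
a = """{"type": "record", "name": "User", "namespace": "example",
        "doc": "a user",
        "fields": [{"name": "name", "type": {"type": "string"}},
                   {"name": "age", "type": "int", "default": 0}]}"""
b = """{"fields": [{"type": "string", "name": "name"},
                   {"type": "int", "name": "age"}],
        "name": "example.User", "type": "record"}"""

print(parsing_canonical_form(a) == parsing_canonical_form(b))  # True
```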

Embedded Schemas and their Consequences

The distinctive feature of Avro is that it almost always stores or transmits the serialized message along with its schema. This is how Avro tries to solve the problem of backwards and forwards compatibility in schema versions (a difficult problem for serialization formats; see e.g. ^[7]).

The Wrapper Formats

The schema and the message are stored together in one of two formats: the "single-object encoding" format and the "object container file" format. The former does not actually store the schema in full, only a reference to it.

Single-Object Encoding

The "single-object encoding" consists of^[8]:

The magic number c3 01.
A 64-bit fingerprint of the schema, generated by a variant of the Rabin fingerprint (itself apparently a type of CRC) that the specification calls "CRC-64-AVRO".
The encoded message.

This allows for the schema to be looked up in a database.
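The layout above can be sketched as follows. The fingerprint routine is a Python transcription of the table-driven algorithm as recalled from the specification, including its starting constant, and should be checked against the specification before being relied on. The sketch also assumes that the fingerprint is stored as 8 little-endian bytes and is computed over the schema's parsing canonical form. The function names are hypothetical.

```python
import struct

EMPTY = 0xC15D213AA4D7A795   # starting value, as recalled from the specification
MAGIC = b"\xc3\x01"

def _build_table():
    """Precompute the 256-entry lookup table for the fingerprint."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table

FP_TABLE = _build_table()

def fingerprint64(data: bytes) -> int:
    """64-bit fingerprint of `data`, one table lookup per input byte."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ FP_TABLE[(fp ^ byte) & 0xFF]
    return fp

def wrap_single_object(canonical_schema: str, payload: bytes) -> bytes:
    """Prefix an encoded message with the magic number and schema fingerprint."""
    fp = fingerprint64(canonical_schema.encode("utf-8"))
    return MAGIC + struct.pack("<Q", fp) + payload

def unwrap_single_object(message: bytes):
    """Split a single-object message into (fingerprint, encoded payload)."""
    if message[:2] != MAGIC:
        raise ValueError("not a single-object encoded message")
    (fp,) = struct.unpack("<Q", message[2:10])
    return fp, message[10:]

# Hypothetical schema (already in canonical form) and payload.
schema = '{"name":"User","type":"record","fields":[{"name":"age","type":"int"}]}'
wrapped = wrap_single_object(schema, b"\x48")
fp, payload = unwrap_single_object(wrapped)
print(len(wrapped), payload)   # 11 b'H'
```

A reader would use the recovered fingerprint as a key into whatever schema store it has available, then decode the payload with the schema it finds there.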

Object Container File

The "object container file", which seems to be more widely used, is more complicated. It consists of^[9]:

The magic number 4f 62 6a 01.
The full JSON schema and the name of the compression algorithm, encoded together as an Avro "map" object.
16 random bytes called the "sync marker".
Several "blocks" of the encoded binary message, with optional compression (the specification calls a compression algorithm a "codec", which is unusual usage outside of media compression).
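The structure listed above can be sketched as a minimal writer and header reader. Several details are assumptions drawn from a reading of the specification rather than from this article: the metadata keys "avro.schema" and "avro.codec", the "null" (no compression) codec name, the map being terminated by a zero count, and each block consisting of an object count, a byte length, the data, and a repeat of the sync marker. The function names are hypothetical.

```python
import io
import json
import os

MAGIC = b"Obj\x01"                       # 4f 62 6a 01

def write_long(n: int) -> bytes:
    z = (n << 1) ^ (n >> 63)             # zigzag, then base-128 varint
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def read_long(buf) -> int:
    shift = acc = 0
    while True:
        b = buf.read(1)[0]
        acc |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)

def write_container(schema: dict, records: list) -> bytes:
    """Write a header plus one uncompressed block of pre-encoded records."""
    sync = os.urandom(16)
    meta = {"avro.schema": json.dumps(schema).encode("utf-8"),
            "avro.codec": b"null"}
    out = bytearray(MAGIC)
    out += write_long(len(meta))         # map block: number of entries
    for key, value in meta.items():
        raw_key = key.encode("utf-8")
        out += write_long(len(raw_key)) + raw_key
        out += write_long(len(value)) + value
    out += write_long(0)                 # zero count ends the map
    out += sync
    body = b"".join(records)
    out += write_long(len(records)) + write_long(len(body)) + body + sync
    return bytes(out)

def read_header(buf):
    """Return (metadata dict, sync marker); leaves `buf` at the first block."""
    if buf.read(4) != MAGIC:
        raise ValueError("not an object container file")
    meta = {}
    while (count := read_long(buf)) != 0:
        if count < 0:                    # negative count: a byte size follows
            count = -count
            read_long(buf)
        for _ in range(count):
            key = buf.read(read_long(buf)).decode("utf-8")
            meta[key] = buf.read(read_long(buf))
    return meta, buf.read(16)

def read_block(buf, sync: bytes):
    """Return (object count, raw block bytes) and verify the sync marker."""
    count, size = read_long(buf), read_long(buf)
    data = buf.read(size)
    if buf.read(16) != sync:
        raise ValueError("sync marker mismatch")
    return count, data

# Hypothetical schema and one pre-encoded record.
schema = {"type": "record", "name": "User",
          "fields": [{"name": "name", "type": "string"},
                     {"name": "age", "type": "int"}]}
stream = io.BytesIO(write_container(schema, [bytes.fromhex("0641646148")]))
meta, sync = read_header(stream)
print(meta["avro.codec"], read_block(stream, sync))
```

Because the sync marker is repeated after every block, a reader that starts in the middle of a large file can scan forward to the next marker and resume at a block boundary.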

See #Specifications for more details.

Specifications

1.9.0 Specification

Links

Notes

[Note 1] In this case, "wire format" is meant in the generic sense, not Avro's specific usage of it to refer to its network protocol.

References

[1] Date of the earliest release, i.e. https://archive.apache.org/dist/avro/avro-1.0.0/

[2] https://avro.apache.org/docs/current/spec.html#Encodings

[3] Wikipedia:Apache Avro

[4] http://cloudurable.com/blog/avro/index.html

[5] https://avro.apache.org/docs/current/gettingstartedpython.html

[6] https://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas

[7] https://capnproto.org/faq.html#how-do-i-make-a-field-required-like-in-protocol-buffers

[8] https://avro.apache.org/docs/current/spec.html#single_object_encoding

[9] https://avro.apache.org/docs/current/spec.html#Object+Container+Files
