Compact Serialization Binary Specification
The Compact Serialization Binary Specification defines the binary format of the Compact Serialization, the supported types, and the Compact Schema.
The supported data types include:
- Fixed-size types like boolean, int8, int16, int32, int64, float32, and float64
- Nullable versions of the fixed-size types
- Variable-size types like string, decimal, time, date, timestamp, and timestamp with timezone
- Arrays of the types listed above
- Nested Compact serializable objects containing the types listed above, and arrays of them
Note: All types other than the fixed-size types are considered variable-size.
See Compact Serialization for details on implementing and using this Hazelcast-specific serialization option.
Introduction
Every serialized user object consists of a header and data. Compact Serialization includes a schema that describes the serialized data and a schema ID, represented by an 8-byte long fingerprint, that uniquely identifies the schema. The schema and the data are separate entities that are connected by the schema ID.
Data Types
The table below shows the data types that are supported by the Compact Serialization.
All primitive types, user types, array sizes, item counts, data lengths, schema IDs, and offsets are written with a default endianness of BIG_ENDIAN.
You cannot redefine the endianness for Compact Serialization alone; it is inherited from the existing configuration.
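As a minimal illustration of what the byte order means for a fixed-size value, the sketch below encodes an int32 with `java.nio.ByteBuffer`. The class and method names are hypothetical and are not part of any Hazelcast API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

final class EndiannessSketch {
    // Encodes an int32 using the given byte order.
    static byte[] encodeInt32(int value, ByteOrder order) {
        return ByteBuffer.allocate(Integer.BYTES).order(order).putInt(value).array();
    }

    public static void main(String[] args) {
        // BIG_ENDIAN puts the most significant byte first: [0, 0, 0, 1]
        System.out.println(Arrays.toString(encodeInt32(1, ByteOrder.BIG_ENDIAN)));
        // LITTLE_ENDIAN puts the least significant byte first: [1, 0, 0, 0]
        System.out.println(Arrays.toString(encodeInt32(1, ByteOrder.LITTLE_ENDIAN)));
    }
}
```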
Type | Java | C++ | C# | Python | Node.js | Go | SQL | Description | Fixed Size |
---|---|---|---|---|---|---|---|---|---|
boolean | boolean | bool | bool | bool | boolean | bool | BOOLEAN | true or false represented by 1 bit as either 1 or 0. Up to 8 booleans are packed into a single byte | yes |
int8 | byte | int8_t | sbyte | int | number | int8 | TINYINT | 8-bit two’s-complement signed integer | yes |
int16 | short | int16_t | short | int | number | int16 | SMALLINT | 16-bit two’s-complement signed integer | yes |
int32 | int | int32_t | int | int | number | int32 | INTEGER | 32-bit two’s-complement signed integer | yes |
int64 | long | int64_t | long | int | Long | int64 | BIGINT | 64-bit two’s-complement signed integer | yes |
float32 | float | float | float | float | number | float32 | REAL | 32-bit IEEE 754 floating-point number | yes |
float64 | double | double | double | float | number | float64 | DOUBLE | 64-bit IEEE 754 floating-point number | yes |
string | String | optional<std::string> | string? | Optional[str] | string \| null | *string | STRING | null or number of bytes in the string (int32) + UTF-8 string https://tools.ietf.org/html/rfc3629 | no |
decimal | BigDecimal | optional<hazelcast:client:decimal> | HBigDecimal? | Optional[decimal.Decimal] | BigDecimal \| null | *types.Decimal | DECIMAL | null or | no |
time | LocalTime | optional<hazelcast:client:local_time> | HLocalTime? | Optional[datetime.time] | LocalTime \| null | *types.LocalTime | TIME | null or | no (since it is nullable) |
date | LocalDate | optional<hazelcast:client:local_date> | HLocalDate? | Optional[datetime.date] | LocalDate \| null | *types.LocalDate | DATE | null or | no (since it is nullable) |
timestamp | LocalDateTime | optional<hazelcast:client:local_date_time> | HLocalDateTime? | Optional[datetime.datetime] | LocalDateTime \| null | *types.LocalDateTime | TIMESTAMP | null or | no (since it is nullable) |
timestampWithTimeZone | OffsetDateTime | optional<hazelcast:client:offset_date_time> | HOffsetDateTime? | Optional[datetime.datetime] | OffsetDateTime \| null | *types.OffsetDateTime | TIMESTAMP W/ TZ | null or | no (since it is nullable) |
compact | T | template<typename T> optional<T> | T? | Optional[Any] | T \| null | interface{} | OBJECT | A user defined compact | no |
boolean[] | boolean[] | optional<std::vector<bool>> | bool[]? | Optional[list[bool]] | boolean[] \| null | []bool | | Array of booleans | no |
int8[] | byte[] | optional<std::vector<int8_t>> | sbyte[]? | Optional[list[int]] | Buffer \| null | []int8 | | Array of int8s | no |
int16[] | short[] | optional<std::vector<int16_t>> | short[]? | Optional[list[int]] | number[] \| null | []int16 | | Array of int16s | no |
int32[] | int[] | optional<std::vector<int32_t>> | int[]? | Optional[list[int]] | number[] \| null | []int32 | | Array of int32s | no |
int64[] | long[] | optional<std::vector<int64_t>> | long[]? | Optional[list[int]] | Long[] \| null | []int64 | | Array of int64s | no |
float32[] | float[] | optional<std::vector<float>> | float[]? | Optional[list[float]] | number[] \| null | []float32 | | Array of float32s | no |
float64[] | double[] | optional<std::vector<double>> | double[]? | Optional[list[float]] | number[] \| null | []float64 | | Array of float64s | no |
string[] | String[] | optional<std::vector<optional<std::string>>> | string?[]? | Optional[list[Optional[str]]] | (string \| null)[] \| null | []*string | | Array of strings | no |
decimal[] | BigDecimal[] | optional<std::vector<optional<decimal>>> | HBigDecimal?[]? | Optional[list[Optional[decimal.Decimal]]] | (BigDecimal \| null)[] \| null | []*types.Decimal | | Array of Decimals | no |
time[] | LocalTime[] | optional<std::vector<optional<hazelcast:client:local_time>>> | HLocalTime?[]? | Optional[list[Optional[datetime.time]]] | (LocalTime \| null)[] \| null | []*types.LocalTime | | Array of Times | no |
date[] | LocalDate[] | optional<std::vector<optional<hazelcast:client:local_date>>> | HLocalDate?[]? | Optional[list[Optional[datetime.date]]] | (LocalDate \| null)[] \| null | []*types.LocalDate | | Array of Dates | no |
timestamp[] | LocalDateTime[] | optional<std::vector<optional<hazelcast:client:local_date_time>>> | HLocalDateTime?[]? | Optional[list[Optional[datetime.datetime]]] | (LocalDateTime \| null)[] \| null | []*types.LocalDateTime | | Array of Timestamps | no |
timestampWithTimeZone[] | OffsetDateTime[] | optional<std::vector<optional<hazelcast:client:offset_date_time>>> | HOffsetDateTime?[]? | Optional[list[Optional[datetime.datetime]]] | (OffsetDateTime \| null)[] \| null | []*types.OffsetDateTime | | Array of TimestampWithTimeZones | no |
compact[] | T[] | template<typename T> optional<std::vector<optional<T>>> | T?[]? | Optional[list[Optional[Any]]] | (T \| null)[] \| null | []interface{} | | Array of compacts | no |
nullable-boolean | Boolean | optional<bool> | bool? | Optional[bool] | boolean \| null | *bool | | null, or int8 1 for true and int8 0 for false | no |
nullable-int8 | Byte | optional<int8_t> | sbyte? | Optional[int] | number \| null | *int8 | | An int8 that can also be null | no |
nullable-int16 | Short | optional<int16_t> | short? | Optional[int] | number \| null | *int16 | | An int16 that can also be null | no |
nullable-int32 | Integer | optional<int32_t> | int? | Optional[int] | number \| null | *int32 | | An int32 that can also be null | no |
nullable-int64 | Long | optional<int64_t> | long? | Optional[int] | Long \| null | *int64 | | An int64 that can also be null | no |
nullable-float32 | Float | optional<float> | float? | Optional[float] | number \| null | *float32 | | A float32 that can also be null | no |
nullable-float64 | Double | optional<double> | double? | Optional[float] | number \| null | *float64 | | A float64 that can also be null | no |
nullable-boolean[] | Boolean[] | optional<std::vector<optional<bool>>> | bool?[]? | Optional[list[Optional[bool]]] | (boolean \| null)[] \| null | []*bool | | Array of nullable booleans | no |
nullable-int8[] | Byte[] | optional<std::vector<optional<int8_t>>> | sbyte?[]? | Optional[list[Optional[int]]] | (number \| null)[] \| null | []*int8 | | Array of nullable int8s | no |
nullable-int16[] | Short[] | optional<std::vector<optional<int16_t>>> | short?[]? | Optional[list[Optional[int]]] | (number \| null)[] \| null | []*int16 | | Array of nullable int16s | no |
nullable-int32[] | Integer[] | optional<std::vector<optional<int32_t>>> | int?[]? | Optional[list[Optional[int]]] | (number \| null)[] \| null | []*int32 | | Array of nullable int32s | no |
nullable-int64[] | Long[] | optional<std::vector<optional<int64_t>>> | long?[]? | Optional[list[Optional[int]]] | (Long \| null)[] \| null | []*int64 | | Array of nullable int64s | no |
nullable-float32[] | Float[] | optional<std::vector<optional<float>>> | float?[]? | Optional[list[Optional[float]]] | (number \| null)[] \| null | []*float32 | | Array of nullable float32s | no |
nullable-float64[] | Double[] | optional<std::vector<optional<double>>> | double?[]? | Optional[list[Optional[float]]] | (number \| null)[] \| null | []*float64 | | Array of nullable float64s | no |
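The string row above describes a non-null string as an int32 byte count followed by the UTF-8 bytes. A minimal sketch of that layout follows, assuming the default big-endian byte order. The class name is hypothetical, and null handling is left out on purpose.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class StringEncodingSketch {
    // Writes the UTF-8 byte count as an int32, then the UTF-8 bytes.
    static byte[] encode(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        // A freshly allocated ByteBuffer is big-endian by default.
        return ByteBuffer.allocate(Integer.BYTES + utf8.length)
                .putInt(utf8.length)
                .put(utf8)
                .array();
    }

    // Reads the int32 byte count, then decodes that many bytes as UTF-8.
    static String decode(byte[] encoded) {
        ByteBuffer in = ByteBuffer.wrap(encoded);
        byte[] utf8 = new byte[in.getInt()];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

Note that the length prefix counts bytes, not characters, so a string with non-ASCII characters has a prefix larger than its character count.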
Type IDs
Each type supported in the wire format has its own type ID. The type IDs are used while constructing the schemas, performing type checks when accessing fields, and are exposed in a public API.
Header
The partition hash and the type ID are common for all serialization methods supported by Hazelcast, including Compact Serialization. Every serialized object has a header and the payload on the wire.
Name | Type | Description |
---|---|---|
Partition hash | i32 | |
Type ID | i32 | |
Var-size Objects
Var-size objects are user-defined objects whose binary representation consists of Header, Data, and Offsets sections, given in this order.
Data
Name | Description |
---|---|
Fixed-size Fields | Fixed-size field offsets are deduced from the schema. |
Variable-size Fields | |
Offsets
Name | Type | Description |
---|---|---|
Variable-Size FieldOffset index 0 | u8/u16/i32 | The index of a field offset is written in the schema. Offsets of variable length fields. -1 for null |
Variable-Size FieldOffset index 1 | u8/u16/i32 | The index of a field offset is written in the schema. Offsets of variable length fields. -1 for null |
… | … | |
Variable-Size FieldOffset index n | u8/u16/i32 | The index of a field offset is written in the schema. Offsets of variable length fields. -1 for null |
Note: If the composed data does not include any variable-size field in the schema, Variable-Size FieldOffset and DataLength will not exist on the wire.
Similarly, if there is no fixed-size field in the schema, Fixed-Size Fields will not exist on the wire.
Variable-Size FieldOffsets are calculated from the beginning of the Data section shown in the table above.
Variable-Size FieldOffset sizes vary depending on the Data Length.
- Data Length ≤ 254, offsets are u8 (255 is reserved for null)
- Data Length ≤ 65534, offsets are u16 (65535 is reserved for null)
- Otherwise, offsets are i32.
Length is written before offsets so that the binary can be skipped even when the schema cannot be found.
A Variable-Size FieldOffset is -1 if a Variable-Size field is null. In the u8 and u16 widths, -1 corresponds to the reserved values 255 and 65535.
Fixed-Size Fields cannot be null.
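The width rules above can be sketched as a small helper. This is an illustrative sketch only: the class, method, and constant names are hypothetical, and the thresholds are taken directly from the list above.

```java
final class OffsetWidthSketch {
    static final int NULL_OFFSET = -1;

    // Number of bytes used for every offset of one object, chosen from its data length.
    static int offsetWidth(int dataLength) {
        if (dataLength <= 254) {
            return 1; // u8, 255 is reserved for null
        }
        if (dataLength <= 65534) {
            return 2; // u16, 65535 is reserved for null
        }
        return 4;     // i32, -1 is used for null
    }

    // Value stored on the wire for an offset; a null field maps to the all-ones value of the width.
    static int wireValue(int offset, int width) {
        if (offset != NULL_OFFSET) {
            return offset;
        }
        switch (width) {
            case 1:
                return 0xFF;
            case 2:
                return 0xFFFF;
            default:
                return -1;
        }
    }
}
```

Reserving the largest value of each width for null is why the u8 and u16 ranges stop at 254 and 65534 rather than 255 and 65535.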
Fixed-Size Fields
The fixed-size fields are written consecutively after the Length field. Fixed-size fields are accessed via their offsets that are calculated from the schema.
On the schema, the offset for a fixed-size field is determined as follows:
- The first field always starts from offset 0.
- Fields are ordered by their size in descending order.
- When field sizes are the same, fields are ordered by the field name.
- Each offset is calculated by adding the size of the previous field to the offset of the previous field.
The only exception to the above rule is boolean fields. The size of a boolean is a byte, but 8 booleans can be packed into a single byte and these booleans can share the same offset and byte. To achieve that, extra information is stored in the schema (nothing extra on the data) for the bit index of the boolean fields. Boolean fields are written at the end of the fixed-size fields.
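The ordering rules and the boolean exception can be sketched as follows. This is a simplified illustration, not the actual schema code: the `Field` class and `layout` method are hypothetical, and ordering booleans among themselves by name is an assumption that the text above does not state.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FixedSizeLayoutSketch {
    /** Hypothetical field descriptor; sizeInBytes is ignored for booleans. */
    static final class Field {
        final String name;
        final int sizeInBytes;
        final boolean isBoolean;

        Field(String name, int sizeInBytes, boolean isBoolean) {
            this.name = name;
            this.sizeInBytes = sizeInBytes;
            this.isBoolean = isBoolean;
        }
    }

    /** Returns name -> {byte offset, bit index}; the bit index is -1 for non-boolean fields. */
    static Map<String, int[]> layout(List<Field> fields) {
        List<Field> sized = new ArrayList<>();
        List<Field> booleans = new ArrayList<>();
        for (Field f : fields) {
            (f.isBoolean ? booleans : sized).add(f);
        }
        // Larger fields first; fields of equal size are ordered by name.
        sized.sort((a, b) -> a.sizeInBytes != b.sizeInBytes
                ? Integer.compare(b.sizeInBytes, a.sizeInBytes)
                : a.name.compareTo(b.name));
        // Assumption: booleans are ordered by name among themselves.
        booleans.sort((a, b) -> a.name.compareTo(b.name));

        Map<String, int[]> positions = new LinkedHashMap<>();
        int offset = 0;
        for (Field f : sized) {
            positions.put(f.name, new int[] {offset, -1});
            offset += f.sizeInBytes;
        }
        // Booleans come after the other fixed-size fields, eight per byte.
        int bit = 0;
        for (Field f : booleans) {
            positions.put(f.name, new int[] {offset, bit});
            if (++bit == 8) {
                bit = 0;
                offset++;
            }
        }
        return positions;
    }
}
```

For example, an int64 `id`, two int32 fields `age` and `count`, and two booleans would be placed at offsets 0, 8, 12, and a shared byte at offset 16 with bit indexes 0 and 1.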
Variable-size Fields
The offsets of variable-size fields are written at the end in the alphabetical order of the field names. Each variable-size field offset has an index, starting from 0, written in the schema. To read a variable-size field from the data, one should read the index of the offset from the schema. Then, the offset associated with the related index is read from the end of the data. The variable-size field can be read using this offset.
On the schema, the index for a variable-size field is determined as follows:
- The fields are given the index incrementally, according to the order of the field names, starting from 0.
Based on the length of the serialized data, the offsets of the variable-size fields might be represented by 1, 2, or 4 bytes. That is, within one serialized object all variable-size field offsets have the same width of 1, 2, or 4 bytes, which is chosen from the data length.
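A rough sketch of the index assignment and of locating an offset entry is given below. The names are hypothetical, `String` natural ordering is assumed for the field-name order, and the start position of the Offsets section is taken as an input because this section does not define how it is derived.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class VarSizeIndexSketch {
    // Assigns indexes 0..n-1 to the variable-size fields in field-name order.
    static Map<String, Integer> assignIndexes(List<String> varSizeFieldNames) {
        List<String> sorted = new ArrayList<>(varSizeFieldNames);
        Collections.sort(sorted);
        Map<String, Integer> indexes = new LinkedHashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            indexes.put(sorted.get(i), i);
        }
        return indexes;
    }

    // Byte position of the offset entry for the given index, where offsetsStart is the
    // position of the Offsets section and offsetWidth is 1, 2, or 4 for this object.
    static int offsetEntryPosition(int offsetsStart, int index, int offsetWidth) {
        return offsetsStart + index * offsetWidth;
    }
}
```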
Schema
Name | Type |
---|---|
type name | string |
number of fields | i32 |
name of field 0 | string |
type ID of field 0 | i32 |
name of field 1 | string |
type ID of field 1 | i32 |
… | … |
name of field n | string |
type ID of field n | i32 |
A schema keeps the type name of the Compact type and its fields. A schema represents the structure of Compact serialized data and allows reading the data. The type name is used to differentiate user types that have the same number of fields with the same types.
Only the schema ID is written to the serialized data. However, schemas need to be exchanged between a client and a member to be able to serialize that data. A schema orders its fields by field name during initialization, so that the schema in both the client and the member produces the same schema ID.
During initialization, field offsets and indexes are assigned as well. Offsets are pointers to the locations of the fields in the serialized data. Indexes are for enumerating variable-size fields' offsets. Variable-size fields' offsets are written in the serialized data. For details on how fixed-size field offsets are assigned, see Fixed-Size Fields. For details on how variable-size field indexes and offsets are assigned, see Variable-size Fields.
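To illustrate why ordering fields by name gives the same bytes on both sides, here is a sketch that lays out the schema table above as a byte array. It is not the actual implementation: the class name is hypothetical, strings are assumed to be written as an int32 byte count plus UTF-8 bytes, and the type IDs passed in are placeholders.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

final class SchemaBytesSketch {
    // Assumption: a string is written as an int32 byte count followed by its UTF-8 bytes.
    private static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    // fieldTypeIds maps field name -> type ID (placeholder values in this sketch).
    static byte[] toBytes(String typeName, Map<String, Integer> fieldTypeIds) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes); // writes big-endian
            writeString(out, typeName);
            out.writeInt(fieldTypeIds.size());
            // TreeMap iterates in field-name order, independent of insertion order.
            for (Map.Entry<String, Integer> e : new TreeMap<>(fieldTypeIds).entrySet()) {
                writeString(out, e.getKey());
                out.writeInt(e.getValue());
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the fields are sorted before writing, two parties that declare the same fields in a different order produce identical bytes.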
In the schema class, each field will either:
- have a non-negative offset, if it is a fixed-size field
- have a non-negative index, if it is a variable-size field
- have a non-negative bit offset, if it is a boolean field, which is the offset within the byte given by the normal offset.
Schema ID
The schema ID is a 64-bit Rabin fingerprint.
The Rabin fingerprint was chosen mostly because it is recommended in Avro’s documentation as follows.
At the opposite extreme, the smallest fingerprint recommended is a 64-bit Rabin fingerprint. Below, there is a provided pseudo-code for this algorithm that can be easily translated into any programming language. 64-bit fingerprints should guarantee uniqueness for schema caches of up to a million entries (for such a cache, the chance of a collision is 3E-8). It is not recommended to use shorter fingerprints, as the chances of collisions are too high (for example, with 32-bit fingerprints, a cache with as few as 100,000 schemas has a 50% chance of having a collision).
According to the quote, even with a schema cache with a million entries, the chance of a collision is very low. Therefore, there should not be a need to change the number of bits of the hashing algorithm soon.
The schema ID is calculated from the byte array representation of the schema described above.
The implementation is as follows:
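// Returns the 64-bit Rabin fingerprint of buf.
long fingerprint64(byte[] buf) {
    if (FP_TABLE == null) initFPTable();
    long fp = EMPTY;
    for (int i = 0; i < buf.length; i++)
        // Consume one input byte per step using the precomputed table.
        fp = (fp >>> 8) ^ FP_TABLE[(int)(fp ^ buf[i]) & 0xff];
    return fp;
}

// Fingerprint of the empty input; also used as the polynomial when building the table.
static long EMPTY = 0xc15d213aa4d7a795L;
// Lookup table with one entry per byte value, built lazily on first use.
static long[] FP_TABLE = null;

void initFPTable() {
    FP_TABLE = new long[256];
    for (int i = 0; i < 256; i++) {
        long fp = i;
        // Process the 8 bits of i one at a time.
        for (int j = 0; j < 8; j++)
            fp = (fp >>> 1) ^ (EMPTY & -(fp & 1L));
        FP_TABLE[i] = fp;
    }
}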
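For experimentation, the same algorithm can be wrapped in a self-contained class as in the sketch below. The class name and the input bytes are placeholders; a real schema ID is computed over the byte representation of the schema.

```java
import java.nio.charset.StandardCharsets;

final class RabinFingerprintSketch {
    static final long EMPTY = 0xc15d213aa4d7a795L;
    private static final long[] FP_TABLE = new long[256];

    static {
        // Same table construction as above, done once when the class is loaded.
        for (int i = 0; i < 256; i++) {
            long fp = i;
            for (int j = 0; j < 8; j++) {
                fp = (fp >>> 1) ^ (EMPTY & -(fp & 1L));
            }
            FP_TABLE[i] = fp;
        }
    }

    static long fingerprint64(byte[] buf) {
        long fp = EMPTY;
        for (byte b : buf) {
            fp = (fp >>> 8) ^ FP_TABLE[(int) (fp ^ b) & 0xff];
        }
        return fp;
    }

    public static void main(String[] args) {
        // Placeholder input, not a real schema.
        byte[] bytes = "example".getBytes(StandardCharsets.UTF_8);
        System.out.printf("%016x%n", fingerprint64(bytes));
    }
}
```

The fingerprint of an empty byte array is the `EMPTY` constant itself, which is a quick sanity check for a port to another language.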
Arrays
Arrays of fixed-size items cannot have null items. On the other hand, arrays of variable-size items may contain null items.
Array of Fixed-size Items
Name | Type |
---|---|
Number of items | i32 |
item 0 | item type |
item 1 | item type |
item 2 | item type |
item n | item type |
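As an illustration of the layout in the table above, the following sketch encodes and decodes an int32[] as an item count followed by the items. The class name is hypothetical and the default big-endian byte order is assumed.

```java
import java.nio.ByteBuffer;

final class Int32ArraySketch {
    // Writes the number of items as an int32, then each item as an int32.
    static byte[] encode(int[] items) {
        ByteBuffer out = ByteBuffer.allocate(Integer.BYTES + items.length * Integer.BYTES);
        out.putInt(items.length);
        for (int item : items) {
            out.putInt(item);
        }
        return out.array();
    }

    // Reads the item count, then that many int32 items.
    static int[] decode(byte[] encoded) {
        ByteBuffer in = ByteBuffer.wrap(encoded);
        int[] items = new int[in.getInt()];
        for (int i = 0; i < items.length; i++) {
            items[i] = in.getInt();
        }
        return items;
    }
}
```

No per-item offsets are needed here because every item has the same known size.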
Offsets
Name | Type |
---|---|
Item 0 offset | u8/u16/i32 |
Item 1 offset | u8/u16/i32 |
… | … |
Item n offset | u8/u16/i32 |
An array can contain only a single type of item. In the case of Compact[], all the items must have the same schema, that is, their schema ID must be equal.
Offsets are calculated from the beginning of the Data section.
Data Length is the length of the Data section.
Offset sizes vary depending on the Data Length.
- For Data Length ≤ 254, offsets are u8 (255 is reserved for null)
- For 254 < Data Length ≤ 65534, offsets are u16 (65535 is reserved for null)
- For Data Length > 65534, offsets are i32.
Variable-size items can be null. The corresponding offset will be set to -1 in that case.
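Reading an item offset and detecting a null item can be sketched as follows. The names are hypothetical, and the position of the Offsets section is passed in because its derivation is not covered here. Big-endian byte order is assumed, which is the `ByteBuffer` default.

```java
import java.nio.ByteBuffer;

final class OffsetReaderSketch {
    static final int NULL_OFFSET = -1;

    // Reads the offset entry at the given index; width is 1 (u8), 2 (u16), or 4 (i32).
    // Returns NULL_OFFSET when the entry holds the value reserved for null.
    static int readOffset(ByteBuffer buf, int offsetsStart, int index, int width) {
        int pos = offsetsStart + index * width;
        switch (width) {
            case 1: {
                int value = buf.get(pos) & 0xFF;        // treat the byte as unsigned
                return value == 0xFF ? NULL_OFFSET : value;
            }
            case 2: {
                int value = buf.getShort(pos) & 0xFFFF; // treat the short as unsigned
                return value == 0xFFFF ? NULL_OFFSET : value;
            }
            case 4:
                return buf.getInt(pos);                 // -1 already means null
            default:
                throw new IllegalArgumentException("Unsupported offset width: " + width);
        }
    }
}
```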