VectorCollection data structure design

A Hazelcast vector database engine is a specialized type of database, which is optimized for storing, searching, and managing vector embeddings and additional metadata. You can include values in the metadata to provide filtering, additional processing, or analysis of vectors.

For further information on the design of the Vector Collection data structure, see Vector Collection.

Architecture

The main components of vector search are illustrated in the following high-level diagram:

The components shown in the diagram are as follows:

Clients: methods and libraries for connecting and working with the vector database. Currently, two language clients are available: Java and Python.
Collections: a set consisting of one or several indexes and a one metadata storage. For further information about vector collection, see Vector Collection.
Indexes: named sets of vectors. Vectors within the same index must have the same dimensions and be compared using the same metric. For further information on indexes, see Index Overview.
Metadata storage: key-value storage for storing metadata values. The key and value can be any objects; for example, JSON, unstructured text, or java-serialized POJO. For more information about the available storage types, see Serializing Objects and Classes

An overview of vector collection is illustrated below:

Vector collection overview

A user-defined key, which is assigned during the addition of the vector to the collection, establishes a link between the vector and the metadata. Each key corresponds to a single vector from each index.

Index

Essentially, each index serves as a distinct vector space. In terms of storage, the index is a graph where each node represents a vector, and the edges are organized to optimize search efficiency.

The index is based on the JVector library, which implements a DiskANN algorithm for similarity search.

Partitioning and Replication

Each collection is partitioned and replicated based on the system’s general partitioning rules. Data partitioning is carried out using the collection key.

For further information on Hazelcast partitioning, see Data Partitioning and Replication.

Version 5.5/beta supports partitioning and migration but does not include support for the backup process.

Data store

Hazelcast stores data in-memory (RAM) for faster access. Presently, the only available data storage option is the JVM heap store.

Fault Tolerance

Hazelcast distributes storage data across all cluster members. In the event of a graceful shutdown, the data is migrated to remaining active members. In version 5.5, there is no automatic data restoration in the event of an unexpected member loss.

Partitioned similarity search

Vector indexes are partitioned, so when you execute similarity search all partitions need to be searched and partial results aggregated. This process impacts search performance and recall:

considering more candidate results usually yields better recall but uses more resources,
staged execution affects latency because there is more communication and some stages need to wait for all partial results before proceeding.

Two-stage search

The default search algorithm is a two-stage search which works as follows:

The member that received the query becomes a coordinator.
The coordinator distributes the search request to each member (including the coordinator member). Each member is tasked with returning results from partitions it owns.
Each member executes the search on owned partitions in parallel and aggregates the partition results.
Each member returns the partially aggregated results to the coordinator.
The coordinator aggregates the partial results and generates the final result. If required, searches on some partitions can be retried individually. For example, this can be useful for migrations, when members leave the cluster, or to resolve errors.

At each stage, aggregation is based on score and only the best results are retained.

Two important parameters in this search algorithm determine the amount of data sent between the members and the quality of the final result. These parameters are as follows:

partitionLimit - number of search results obtained from each partition
memberLimit - number of search results returned from member to coordinator

To allow the system to return enough results, the following conditions must be satisfied where topK denotes the number of entries requested:

partitionLimit * partitionCount >= topK, partitionLimit <= topK
memberLimit * memberCount >= topK, memberLimit <= topK

By default, partitionLimit and memberLimit are equal to topK. While this satisfies the inequalities given above, it can result in the processing of more results than requested. This improves the overall quality of the results but can have a significant performance overhead because more entries are fetched from each partition of the index and sent between the members.

Consider tuning partitionLimit based on quality and latency requirements. The number of partitions must also be considered and updated as required when making adjustments to partitionLimit. For further information on the implications of the partition count, see Partition Count Impact. memberLimit is less critical for overall behavior if there are only a few members.

Single-stage search

A simplified search algorithm can be used, which does not perform intermediate aggregation of results at member level. It is used where the cluster has only a single member, or can be enabled using search hint.

A single-stage search request is executed in parallel on all partitions (on their owners) and partition results are aggregated directly on the coordinator member to produce the final result.

This search algorithm uses the partitionLimit parameter, which behaves in the same way as for two-stage search.

Partition count impact

The number of partitions has a big impact on the performance of the vector collection. The conflicting factors that can impact the selection of an optimal partition count are as follows:

data ingestion: a greater number of partitions results in improved parallelism, up to around the total number of partition threads in the cluster. After this point, more partitions will not significantly improve ingestion speed. If there are fewer partitions than number of cores, not all available resources will be utilized during ingestion because updates on a given partition are executed by single thread.
similarity search: in general, having fewer partitions results in better search performance and reduced latency. However, the impact on quality/recall is complicated and depends also on partitionLimit.
migration: avoid partitions with a large memory size, including metadata, vectors and vector index internal representation. In general, the recommendation is for a partition size of around 50-100MB per partition, which results in fast migrations and small pressure on heap during migration. However, for vector search, the partition size can be increased above that general recommendation provided that there is enough heap memory for migrations (see below).
other data structures: number of partitions is a cluster-wide setting shared by all data structures. If the needs are vastly different, you might consider creating separate clusters.

It is not possible to change the number of partitions for an existing cluster.

For this Beta version, the following apply:

The default value of 271 partitions can result in inefficient vector similarity searches. We recommend that you tune the number of partitions for use in clusters with vector collections.
The entire collection partition is migrated as a single chunk. If using partitions that are larger than the recommended size, ensure that you have sufficient heap memory to run migrations. The amount of heap memory required is approximately the size of the vector collection partition multiplied by the number of parallel migrations. To decrease pressure on heap memory, you can decrease the number of parallel migrations using hazelcast.partition.max.parallel.migrations and hazelcast.partition.max.parallel.replications.

Tuning tips

For searches with small topK (for example, 10) it may be beneficial to artificially increase topK, adjust partitionLimit accordingly, and discard extra results. If you need 10 results, a good starting point for tuning could be topK=100 and a partitionLimit between 50 and 100. While this will make the search slower, it will also improve quality, sometimes significantly. Overall, this setup can be more efficient than increasing index build parameters (max-degree, ef-construction) which results in slower index builds and searches. With a very small topK or paritionLimit, the search algorithm is less able to escape local minima and find the best results.
Vector deduplication does not incur significant overhead for uploads (usually less than 1%) and searches. You may consider disabling it to get slightly better performance and smaller memory usage if your dataset does not contain duplicated vectors. However, be aware that in the presence of many duplicated vectors with deduplication disabled, a similarity search may return poor quality results.
For a given query, each vector index partition is searched by 1 thread. The number of concurrent partition searches is configured by specifying a pool size for hz:query executor, which by default has 16 threads per member. If optimizing for search, we recommend setting the hz:query pool size to be that of the physical core count of your host machines: this will result in a good balance between search throughput and CPU utilization. Setting hz:query to have a pool size greater than that of the physical core count will not deliver a significant increase in throughput but it will increase total CPU utilization. The hz:query pool size can be changed as follows:
- Java
- XML
- YAML
Config cfg = new Config(); cfg.getExecutorConfig("hz:query").setPoolSize(100);
<hazelcast> ... <executor-service name="hz:query"> <pool-size>100</pool-size> </executor-service> ... </hazelcast>
hazelcast: ... executor-service: "hz:query": pool-size: 100
If there are fewer partitions than available cores, not all cores will be used for single search execution. This is ok if you are focused on throughput, as in general fewer partitions means you need less resources. However, if you want to achieve the best latency for a single client, it is better to distribute the search to as many cores as possible, which requires having at least as many partitions as cores in the cluster.