VectorCollection data structure design
The Hazelcast vector database engine is a specialized database optimized for storing, searching, and managing vector embeddings together with additional metadata. You can use the metadata values for filtering, additional processing, or analysis of vectors.
For further information on the design of the Vector Collection data structure, see Vector Collection.
Architecture
The main components of vector search are illustrated in the following high-level diagram:
The components shown in the diagram are as follows:
- Clients: methods and libraries for connecting and working with the vector database. Currently, two language clients are available: Java and Python.
- Collections: a set consisting of one or several indexes and one metadata storage. For further information about vector collection, see Vector Collection.
- Indexes: named sets of vectors. Vectors within the same index must have the same dimensions and be compared using the same metric. For further information on indexes, see Index Overview.
- Metadata storage: key-value storage for storing metadata values. The key and value can be any objects; for example, JSON, unstructured text, or a Java-serialized POJO. For more information about the available storage types, see Serializing Objects and Classes.
An overview of vector collection is illustrated below:
A user-defined key, which is assigned during the addition of the vector to the collection, establishes a link between the vector and the metadata. Each key corresponds to a single vector from each index.
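The relationship between keys, metadata, and indexes can be sketched as a toy in-memory model. This is only an illustration of the structure described above: VectorCollectionModel and its fields are hypothetical names, not part of any Hazelcast API, and the rule that every put supplies one vector for each index is this sketch's reading of the text.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class VectorCollectionModel:
    """Toy model (hypothetical): one metadata store plus one or more named indexes."""
    index_dimensions: Dict[str, int]  # index name -> required vector dimension
    metadata: Dict[Any, Any] = field(default_factory=dict)
    indexes: Dict[str, Dict[Any, List[float]]] = field(default_factory=dict)

    def put(self, key: Any, value: Any, vectors: Dict[str, List[float]]) -> None:
        # The key links the metadata value to exactly one vector in each index.
        if set(vectors) != set(self.index_dimensions):
            raise ValueError("one vector per index is required")
        for name, vector in vectors.items():
            # Vectors within the same index must have the same dimension.
            if len(vector) != self.index_dimensions[name]:
                raise ValueError(f"wrong dimension for index {name!r}")
        self.metadata[key] = value
        for name, vector in vectors.items():
            self.indexes.setdefault(name, {})[key] = vector
```

For example, a collection with a two-dimensional "text" index and a three-dimensional "image" index stores one metadata value and two vectors under each key.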
Index
Essentially, each index serves as a distinct vector space. In terms of storage, the index is a graph where each node represents a vector, and the edges are organized to optimize search efficiency.
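To give an intuition for why a graph helps with search, the following is a minimal sketch of a generic greedy walk over a neighbor graph. It is not the algorithm Hazelcast implements; the function name, the adjacency-dict layout, and the use of Euclidean distance are all assumptions made for illustration.

```python
import math
from typing import Dict, Hashable, List, Sequence


def greedy_graph_search(
    vectors: Dict[Hashable, Sequence[float]],
    neighbors: Dict[Hashable, List[Hashable]],
    query: Sequence[float],
    entry: Hashable,
) -> Hashable:
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_dist = math.dist(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for node in neighbors[current]:
            d = math.dist(vectors[node], query)
            if d < best_dist:
                best, best_dist = node, d
        if best == current:
            # No neighbor is closer: this is a (possibly only local) minimum.
            return current
        current, current_dist = best, best_dist
```

A walk like this visits only a small part of the graph, which is why the arrangement of edges matters for both speed and result quality.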
Partitioning and Replication
Each collection is partitioned and replicated based on the system’s general partitioning rules. Data partitioning is carried out using the collection key.
For further information on Hazelcast partitioning, see Data Partitioning and Replication.
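As a rough illustration of key-based partitioning, the sketch below maps a serialized key to a partition by hashing it and taking the remainder modulo the partition count. The CRC32 hash is only a stand-in chosen because it is in the Python standard library; it is not the hash function Hazelcast uses, and partition_id is a hypothetical helper. The value 271 is the default Hazelcast partition count.

```python
import zlib


def partition_id(serialized_key: bytes, partition_count: int = 271) -> int:
    """Map a serialized key to a partition (stand-in hash, for illustration only)."""
    if partition_count <= 0:
        raise ValueError("partition_count must be positive")
    return zlib.crc32(serialized_key) % partition_count
```

Because the mapping depends only on the key, the vectors and the metadata stored under the same key always land in the same partition.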
Version 5.5/beta supports partitioning and migration but does not include support for the backup process.
Partitioned similarity search
Vector indexes are partitioned, so when you execute a similarity search, all partitions must be searched and the partial results aggregated. This process impacts search performance and recall:
- considering more candidate results usually yields better recall but uses more resources,
- staged execution affects latency because there is more communication and some stages need to wait for all partial results before proceeding.
Two-stage search
The default search algorithm is a two-stage search which works as follows:
- The member that received the query becomes a coordinator.
- The coordinator distributes the search request to each member (including the coordinator member). Each member is tasked with returning results from partitions it owns.
- Each member executes the search on owned partitions in parallel and aggregates the partition results.
- Each member returns the partially aggregated results to the coordinator.
- The coordinator aggregates the partial results and generates the final result. If required, searches on some partitions can be retried individually. For example, this can be useful for migrations, when members leave the cluster, or to resolve errors.
At each stage, aggregation is based on score and only the best results are retained.
Two important parameters in this search algorithm determine the amount of data sent between the members and the quality of the final result. These parameters are as follows:
- partitionLimit - number of search results obtained from each partition
- memberLimit - number of search results returned from member to coordinator
To allow the system to return enough results, the following conditions must be satisfied, where topK denotes the number of entries requested:
- partitionLimit * partitionCount >= topK, partitionLimit <= topK
- memberLimit * memberCount >= topK, memberLimit <= topK
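These conditions can be checked before a search is issued. The following is a minimal sketch; check_limits is a hypothetical helper written for this page and is not part of any Hazelcast client.

```python
from typing import List


def check_limits(
    top_k: int,
    partition_limit: int,
    partition_count: int,
    member_limit: int,
    member_count: int,
) -> List[str]:
    """Return a list of violated conditions (empty if all of them hold)."""
    problems = []
    if partition_limit * partition_count < top_k:
        problems.append("partitionLimit * partitionCount < topK")
    if partition_limit > top_k:
        problems.append("partitionLimit > topK")
    if member_limit * member_count < top_k:
        problems.append("memberLimit * memberCount < topK")
    if member_limit > top_k:
        problems.append("memberLimit > topK")
    return problems
```

For instance, with topK=100, 50 partitions, and partitionLimit=1, at most 50 candidates can reach the members, so the first condition is reported as violated.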
By default, partitionLimit and memberLimit are equal to topK. While this satisfies the inequalities given above, it can result in the processing of more results than requested.
This improves the overall quality of the results but can have a significant performance overhead because more entries are fetched from each partition of the index and sent between the members.
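The two-stage aggregation and the effect of the two limits can be simulated in a few lines. This is a sketch under simplifying assumptions: each partition is modelled as a precomputed list of (score, key) pairs instead of a real index search, a higher score is assumed to be better, and two_stage_search is a hypothetical name. Retries and parallel execution are left out.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Result = Tuple[float, str]  # (score, key); higher score is assumed to be better


def two_stage_search(
    members: Dict[str, List[List[Result]]],
    top_k: int,
    partition_limit: Optional[int] = None,
    member_limit: Optional[int] = None,
) -> List[Result]:
    """Simulate two-stage aggregation over member -> owned partitions -> results."""
    # Both limits default to topK, as described above.
    partition_limit = top_k if partition_limit is None else partition_limit
    member_limit = top_k if member_limit is None else member_limit

    partial_results = []
    for owned_partitions in members.values():
        # Stage 1: take the best results of each owned partition, then
        # aggregate them on the member and keep at most member_limit.
        candidates = []
        for partition in owned_partitions:
            candidates.extend(heapq.nlargest(partition_limit, partition))
        partial_results.append(heapq.nlargest(member_limit, candidates))

    # Stage 2: the coordinator merges the partial results by score.
    merged = [result for partial in partial_results for result in partial]
    return heapq.nlargest(top_k, merged)
```

Lowering either limit reduces the number of candidates moved between stages, at the risk of dropping results that would otherwise have made it into the final list.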
Consider tuning partitionLimit based on quality and latency requirements. The number of partitions must also be considered and updated as required when making adjustments to partitionLimit. For further information on the implications of the partition count, see Partition Count Impact.
memberLimit is less critical for overall behavior if there are only a few members.
Single-stage search
A simplified search algorithm can be used, which does not perform intermediate aggregation of results at member level. It is used when the cluster has only a single member, or it can be enabled using a search hint.
A single-stage search request is executed in parallel on all partitions (on their owners) and partition results are aggregated directly on the coordinator member to produce the final result.
This search algorithm uses the partitionLimit parameter, which behaves in the same way as for two-stage search.
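For comparison, a single-stage search can be sketched as one merge on the coordinator. As before, this is an illustration only: partitions are modelled as precomputed lists of (score, key) pairs, a higher score is assumed to be better, and single_stage_search is a hypothetical name.

```python
import heapq
from typing import List, Optional, Tuple

Result = Tuple[float, str]  # (score, key); higher score is assumed to be better


def single_stage_search(
    partitions: List[List[Result]],
    top_k: int,
    partition_limit: Optional[int] = None,
) -> List[Result]:
    """Merge per-partition results directly, with no member-level aggregation."""
    limit = top_k if partition_limit is None else partition_limit
    candidates = [
        result
        for partition in partitions
        for result in heapq.nlargest(limit, partition)
    ]
    return heapq.nlargest(top_k, candidates)
```

The coordinator receives up to partitionLimit results from every partition, so this variant trades more data sent to the coordinator for one fewer aggregation stage.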
Partition count impact
The number of partitions has a big impact on the performance of the vector collection. The conflicting factors that can impact the selection of an optimal partition count are as follows:
- data ingestion: a greater number of partitions results in improved parallelism, up to around the total number of partition threads in the cluster. After this point, more partitions will not significantly improve ingestion speed. If there are fewer partitions than cores, not all available resources will be utilized during ingestion because updates on a given partition are executed by a single thread.
- similarity search: in general, having fewer partitions results in better search performance and reduced latency. However, the impact on quality/recall is complicated and also depends on partitionLimit.
- migration: avoid partitions with a large memory size, including metadata, vectors, and the internal representation of the vector index. In general, the recommendation is a partition size of around 50-100MB, which results in fast migrations and low heap pressure during migration. However, for vector search, the partition size can be increased above that general recommendation provided that there is enough heap memory for migrations (see below).
- other data structures: the number of partitions is a cluster-wide setting shared by all data structures. If the needs are vastly different, you might consider creating separate clusters.
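When weighing the migration factor, a back-of-the-envelope estimate of the partition size can help. The sketch below is only an approximation under stated assumptions: vector components are assumed to take 4 bytes each, data is assumed to be spread evenly across partitions, and index_overhead_factor is a hypothetical multiplier for the internal index representation that you would need to measure for your own data.

```python
def estimated_partition_mb(
    entry_count: int,
    dimension: int,
    partition_count: int,
    metadata_bytes_per_entry: int = 0,
    index_overhead_factor: float = 1.0,
) -> float:
    """Rough average partition size in MB (assumes 4-byte vector components)."""
    vector_bytes = entry_count * dimension * 4
    total_bytes = (
        vector_bytes * index_overhead_factor
        + entry_count * metadata_bytes_per_entry
    )
    return total_bytes / partition_count / (1024 * 1024)
```

The result can be compared against the 50-100MB guidance above to judge whether a candidate partition count is in a reasonable range.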
It is not possible to change the number of partitions for an existing cluster.
For this Beta version, the following apply:
Tuning tips
- For searches with a small topK (for example, 10), it may be beneficial to artificially increase topK, adjust partitionLimit accordingly, and discard the extra results. If you need 10 results, a good starting point for tuning could be topK=100 and a partitionLimit between 50 and 100. While this makes the search slower, it also improves quality, sometimes significantly. Overall, this setup can be more efficient than increasing the index build parameters (max-degree, ef-construction), which results in slower index builds and searches. With a very small topK or partitionLimit, the search algorithm is less able to escape local minima and find the best results.
- Vector deduplication does not incur significant overhead for uploads (usually less than 1%) and searches. You may consider disabling it to get slightly better performance and smaller memory usage if your dataset does not contain duplicated vectors. However, be aware that in the presence of many duplicated vectors with deduplication disabled, a similarity search may return poor quality results.
- For a given query, each vector index partition is searched by one thread. The number of concurrent partition searches is configured by specifying a pool size for the hz:query executor, which by default has 16 threads per member. In configurations with many cores, the pool size can be increased to fully utilize all available cores.
- If there are fewer partitions than available cores, not all cores will be used for a single search execution. This is acceptable if you are focused on throughput because, in general, fewer partitions means that fewer resources are needed. However, if you want to achieve the best latency for a single client, it is better to distribute the search across as many cores as possible, which requires having at least as many partitions as cores in the cluster.
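The first tip (request more results than needed, then discard the extras) can be sketched as a thin wrapper around a search call. The search argument here is a hypothetical callable standing in for whatever client call you use; it is assumed to accept top_k and partition_limit keyword arguments and to return results ordered best first. The defaults follow the starting point suggested above.

```python
from typing import Callable, List, Sequence, Tuple

Result = Tuple[float, str]  # (score, key)


def search_with_oversampling(
    search: Callable[..., List[Result]],
    query: Sequence[float],
    wanted: int,
    oversample_top_k: int = 100,
    partition_limit: int = 50,
) -> List[Result]:
    """Run the search with an inflated topK and keep only the best `wanted` results."""
    results = search(query, top_k=oversample_top_k, partition_limit=partition_limit)
    # Results are assumed to be ordered best first, so a slice is enough.
    return results[:wanted]
```

Whether the extra latency is worth the quality gain depends on the dataset, so it is worth measuring recall with and without oversampling.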