Cardinality Estimator Service
Hazelcast’s cardinality estimator service is a data structure which implements Flajolet’s HyperLogLog algorithm for estimating cardinalities of unique objects in theoretically huge data sets. The implementation offered by Hazelcast includes improvements from Google’s version of the algorithm, i.e., HyperLogLog++.
The cardinality estimator service does not provide any ways to configure its properties, but rather uses some well tested defaults:
-
P
: Stands for precision with a default value of 14 (using the 14 LSB of the hash for the index) -
M
: 2 ^ P = 16384 (16K) registers -
P'
: Stands for sparse precision with a default value of 25 -
Durability
: Count of backups for each estimator with a default value of 2
It is important to understand that this data structure is not
100% accurate, it is used to provide estimates. The error rate is typically
a result of 1.04/sqrt(M) which in our implementation is around 0.81% for high percentiles.
|
The memory consumption of this data structure is close to 16K despite the size of elements in the source data set or stream.
There are two phases in using the cardinality estimator.
-
Add objects to the instance of the estimator, e.g., for IPs
estimator.add("0.0.0.0.")
. The provided object is first serialized and then the byte array is used to generate a hash for that object.Objects must be serializable in a form that Hazelcast understands. -
Compute the estimate of the set so far
estimator.estimate()
.
See the cardinality estimator Javadoc for more information on its API.
The following is an example code.
HazelcastInstance hz = Hazelcast.newHazelcastInstance();
CardinalityEstimator visitorsEstimator = hz.getCardinalityEstimator("visitors");
InputStreamReader isr = new InputStreamReader(ExampleCardinalityEstimator.class.getResourceAsStream("visitors.txt"));
BufferedReader br = new BufferedReader(isr);
try {
String visitor = br.readLine();
while (visitor != null) {
visitorsEstimator.add(visitor);
visitor = br.readLine();
}
} catch (IOException e) {
e.printStackTrace();
} finally {
closeResource(br);
closeResource(isr);
}
System.out.printf("Estimated unique visitors seen so far: %d%n", visitorsEstimator.estimate());
Hazelcast.shutdownAll();
Split-Brain Protection for Cardinality Estimator
Cardinality Estimator can be configured to check for a minimum number of available members before applying its operations (see the Split-Brain Protection section). This is a check to avoid performing successful queue operations on all parts of a cluster during a network partition.
The following is a list of methods, grouped by quorum type, that support split-brain protection checks:
-
WRITE, READ_WRITE:
-
add
-
addAsync
-
-
READ, READ_WRITE:
-
estimate
-
estimateAsync
-
Configuring Split-Brain Protection
Split-brain protection for Cardinality Estimator can be configured
programmatically using the method setQuorunName(),
or declaratively using the element quorum-ref
.
Following is an example declarative configuration:
<hazelcast>
...
<cardinality-estimator name="default">
<quorum-ref>quorumname</quorum-ref>
</cardinality-estimator>
...
</hazelcast>
The value of quorum-ref
should be the quorum configuration name which you configured under the quorum
element as explained in the
Split-Brain Protection section.
Configuring Merge Policy
While recovering from a split-brain syndrome, Cardinality Estimator
in the small cluster merges into the bigger cluster based on a configured
merge policy. When an estimator merges into the cluster, an estimator with
the same name might already exist in the cluster.
So the merge policy resolves these kinds of conflicts with different out-of-the-box strategies.
It can be configured programmatically using the method
setMergePolicyConfig(),
or declaratively using the element merge-policy
.
Following is an example declarative configuration:
<hazelcast>
...
<cardinality-estimator name="default">
<merge-policy>HyperLogLogMergePolicy</merge-policy>
</cardinality-estimator>
...
</hazelcast>
The following out-of-the-box merge policies are available:
-
DiscardMergePolicy
: Estimator from the smaller cluster is discarded. -
HyperLogLogMergePolicy
: Estimator merges with the existing one, using the algorithmic merge for HyperLogLog. This is the default policy. -
PassThroughMergePolicy
: Estimator from the smaller cluster wins. -
PutIfAbsentMergePolicy
: Estimator from the smaller cluster wins if it doesn’t exist in the cluster.