MapReduce Deprecation and Removal
This section informs Hazelcast users about the MapReduce deprecation and removal, its motivation and replacements.
We’ve decided to deprecate the MapReduce framework in Hazelcast IMDG 3.8. MapReduce support was completely removed in Hazelcast IMDG 4.0. The MapReduce framework provided the distributed computing model and it was used to back the old Aggregations system. Unfortunately the implementation didn’t live up to the expectations and adoption wasn’t high, so it never got out of Beta status. Apart from that the current shift in development away from M/R-like processing to a more near-realtime, streaming approach left us with the decision to deprecate and finally remove the MapReduce framework from Hazelcast IMDG. With that said, we want to introduce the successors and replacements; fast Aggregations on top of Query infrastructure and the Hazelcast Jet distributed computing platform.
MapReduce is a very powerful tool, however it’s demanding in terms of space, time and bandwidth. We realized that we don’t need so much power when we simply want to find out a simple metric such as the number of entries matching a predicate. Therefore, the built-in aggregations were rebuilt on top of the existing Query infrastructure (count, sum, min, max, mean, variance) which automatically leverages any matching query index. The aggregations are computed in tho phases:
1st phase: on each member (scatter)
2nd phase: one member aggregates responses from members (gather)
It is not as flexible as a full-blown M/R system due to the 2nd phase being single-member and the input can be massive in some use cases. The member doing the 2nd step needs enough capacity to hold all intermediate results from all members from the 1st step, but in practice it is sufficient for many aggregation tasks like "find average" or "find highest" and other common examples.
The benefits are:
utilization of existing indexes.
Hazelcast has native support for aggregation operations on the contents of its distributed data structures. They operate on the assumption that the aggregating function is commutative and associative, which allows the two-tiered approach where first the local data is aggregated, then all the local subresults sent to one member, where they are combined and returned to the user. This approach works quite well as long as the result is of manageable size. Many interesting aggregations produce an O(1) result and for those, the native aggregations are a good match.
The main area where native aggregations may not be sufficient are the operations that
group the data by key and produce results of size O (
keyCount). The architecture of
Hazelcast aggregations is not well adapted to this use case, although it still works
even for moderately-sized results (up to 100 MB, as a ballpark figure). Beyond these
numbers, and whenever something more than a single aggregation step is needed, Jet
becomes the preferred choice. In the mentioned use case Jet helps because it doesn’t
send the entire hashtables in serialized form and materialize all the results on the
user’s machine, but rather streams the key-value pairs directly into a target IMap.
Since it is a distributed structure, it doesn’t focus its load on a single member.
Jet’s DAG paradigm offers much more than the basic map-reduce-combine cascade. Among other setups, it can compose several such cascades and also perform co-grouping, joining and many other operations in complex combinations.