Process Customer Satisfaction Scores on Hazelcast Cloud

In this tutorial, you’ll build an application that calculates the standard deviation of customer satisfaction scores, using code that’s executed on each member of a Hazelcast Cloud cluster.

Context

Knowing the center of a dataset (average, mean, or median) doesn’t provide enough insight into the best and worse scores. To find out how the customer satisfaction scores are spread out, it is better to use standard deviation.

Standard deviation is a statistical calculation that has been adopted for calculating how numbers deviate from the average. The business logic for standard deviation includes these steps:

  1. Calculate the average of the input numbers.

  2. Calculate the running total of the square of how much each number differs from the average.

  3. Calculate the average of the running total.

  4. Take the square root of the running total divided by the number of inputs and call it the standard deviation.

The second step is open to parallelism. If you have more than one CPU, you can run this step for several numbers at once, as long as you have a way to combine these independent answers into the running total. This is what you’ll use Hazelcast for.

Here is an example of standard deviation for two sets of numbers.

Table 1. First set of numbers
Number Difference from the average (3) Square Running total

1

2

4

4

2

1

1

5

3

0

0

5

4

1

1

6

5

2

4

10

Table 2. Second set of numbers
Number Difference from the average (3) Square Running total

1

2

4

4

1

2

4

8

3

0

0

8

5

2

4

12

5

2

4

16

Examples:

  • The running total for the first set is 10 and there were 5 numbers. 10 divided by 5 is 2. The square root of 2 gives the standard deviation of 1.41.

  • The running total for the second set is 16 and there were 5 numbers. 16 divided by 5 is 3.2. The square root of 3.2 gives the standard deviation of 1.78.

The standard deviation results prove that the first set of numbers doesn’t vary as much as the second set.

In this tutorial, you’ll deploy the following to a Cloud Standard cluster:

  • A Customer class to store Customer objects in a map. The objects will include a customer satisfaction score.

  • An executor that the cluster will use to calculate the standard deviation of all customer satisfaction scores in the map.

You’ll then use a client to connect to the cluster, load some Customer objects into a map, and calculate the standard deviation of all customer satisfaction scores.

The client code shows both the junior developer and senior developer approach. The junior developer approach does the processing on the client-side. The senior developer approach is more efficient because it offloads the processing to the Cloud Standard cluster, using the executor.

The code in this tutorial is available as a sample app on GitHub.

Before you Begin

You’ll need the following to complete this tutorial:

Step 1. Clone the Sample

In this step, you’ll clone the sample code from GitHub and learn how it works.

Clone the GitHub repository.

  • HTTPS

  • SSH

git clone https://github.com/hazelcast-guides/standard-deviation-parallel-processing.git

cd standard-deviation-parallel-processing
git clone git@github.com:hazelcast-guides/standard-deviation-parallel-processing.git

cd standard-deviation-parallel-processing

The code is separated into two parts:

  • The client/ directory contains the client application that connects to the cluster and runs both the junior developer code and the senior developer code.

  • The cluster-side/ directory contains the classes that need to be uploaded to the cluster to allow the cluster to store and process the standard deviation of customer satisfaction scores. Only the senior developer code triggers the processing of customer satisfaction scores on the cluster. The junior developer code processes them on the client side.

The important part to understand is the difference between the junior developer code and the senior developer code.

In each example, the cluster stores five Customer objects.

The junior developer code requests five integers from the cluster in order to run the calculation. The senior developer code runs the calculation on the cluster, which returns one double from each member.

Now imagine that you have 1,000,000 Customer objects stored in the cluster. The junior developer approach now requests 1,000,000 integers from the cluster across the network and the senior developer approach still returns one double from each member.

It’s clear that if data volumes increase, the cluster-side computation copes better. The same amount of data still has to be examined, but less data moves across the network.

Client JuniorDeveloper.java

In this class, the average is calculated like this:

public static double average(IMap<Integer, Customer> iMap) {

    int count = 0;
    double total = 0;

    for (Integer key : iMap.keySet()) {
        count++;
        total += iMap.get(key).getSatisfaction();
    }

    return (total / count);
}

This code is easy to understand, which is good for maintenance. However, it has three flaws:

  • The code requires every customer record to be moved from where the data is stored, in this case the Hazelcast cluster, to where the calculation is run. Even if a projection were added, this approach places a heavy load on the network.

  • The keySet() operation produces a collection that contains all the keys to iterate across. If this collection is large, it could result in memory overflow.

  • If the map is empty, the code results in a division by zero. A good unit test would find this.

The sum of the square of the differences is calculated on the client-side like this:

public static double totalDifferenceSquared(IMap<Integer, Customer> iMap, double average) {

    double total = 0;

    for (Integer key : iMap.keySet()) {
        int satisfaction = iMap.get(key).getSatisfaction();
        double difference = satisfaction - average;
        total += difference * difference;
    }

    return total;
}
Client SeniorDeveloper.java

In this class, the average is calculated in one line. Hazelcast provides built-in functions for this type of calculation.

public static double average(IMap<Integer, Customer> iMap) {
    return iMap.aggregate(Aggregators.integerAvg("satisfaction"));
}

The sum of the square of the differences is calculated on the cluster-side like this:

public static double totalDifferenceSquared(IMap<Integer, Customer> iMap, double average,
        HazelcastInstance hazelcastInstance) {

    TotalDifferenceSquaredCallable totalDifferenceSquaredCallable =
            new TotalDifferenceSquaredCallable(iMap.getName(), average);

    IExecutorService executorService =
            hazelcastInstance.getExecutorService("default");

    // Run the Callable on all members in parallel.
    Map<Member, Future<Double>> results =
            executorService.submitToAllMembers(totalDifferenceSquaredCallable);

    double total = 0;

    for (Entry<Member, Future<Double>> entry: results.entrySet()) {
        try {
            total += entry.getValue().get();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    return total;
}

The client uses the ExecutorService API to run a Callable task on each member in the cluster. The Callable class is named TotalDifferenceSquaredCallable.java and it runs the calculation on the Hazelcast member where it is invoked.

Each member calculates a subtotal only for the data that member owns. If you have two members, the calculation takes half the time. If you have ten members, the calculation takes a tenth of the time. Each member runs its calculation independently, the runtime is dependent on how much data each member hosts rather than how much data exists as a whole.

To do this, the task uses the localKeySet() method instead of the keySet() method. This localKeySet() method returns the keys that are held by the current process, and since the task is run on all processes, all the keys are included.

Each member in the cluster only needs to work with the keys, and therefore entries, that it owns. As a result, members don’t have to do any network calls to retrieve the keys from elsewhere.

Finally, the task is submitted across each member in the cluster, which results in a collection of Future objects. One Future is returned for each member that runs the task, and the task needs to wait for them all to finish. This is not exactly difficult, since all members will have roughly the same amount of data. You can assume that the execution time is the same for all members and iterate across this collection running Future.get() to get the subtotal from each member.

The code in this example disregards concurrency. Step 1 calculates the average, then step 2 calculates the deviation from the average. It is possible that records are added or removed between step 1 and 2, which makes the calculation wrong.

Step 2. Deploy the Classes to the Cluster

In this step, you’ll use the Hazelcast Cloud Maven plugin to package the cluster-side modules into a single JAR file and upload that file to your cluster.

  1. Open the pom.xml file in the cluster-side/ directory.

  2. Configure the Maven plugin with values for the following elements:

    Element Location in Cloud console

    <clusterName>

    Next to Connect Client, select any client and go to Advanced setup. The cluster name/ID is at the top of the list.

    <apiKey> and <apiSecret>

    You can create only one API key and secret pair on your account. If you need to change your API credentials, you must first remove your existing credentials, and then create new credentials.

    To create a set of API credentials, do the following:

    1. Sign into the Cloud console

    2. Select Account from the side navigation bar

    3. Select Developer from the Account options

      The Developer screen displays.

    4. Select the Generate New API Key button

    Use these credentials in your applications to manage all clusters in your account.

    After creating your credentials, ensure that you applications are correctly configured to use the API credentials.

    <plugin>
        <groupId>com.hazelcast.cloud</groupId>
        <artifactId>hazelcast-cloud-maven-plugin</artifactId>
        <version>0.2.0</version>
        <configuration>
            <clusterId></clusterId>
            <apiKey></apiKey>
            <apiSecret></apiSecret>
        </configuration>
    </plugin>
  3. Change into the cluster-side/ directory.

  4. Execute the following goal of the Maven plugin to package the project into a JAR file and deploy that file to your cluster:

    mvn clean package hazelcast-cloud:deploy

The file is uploaded and is ready to be used:

[INFO] Artifact with custom classes standard-deviation-cluster-side-0.1-SNAPSHOT.jar was uploaded and is ready to be used

Step 3. Run the Client Application

Now that the cluster-side modules are ready to use, you can run the client to trigger the process of calculating the standard deviation of customer satisfaction scores.

  1. Open the Cloud console.

  2. Next to Connect Client, select any client.

  3. Select Advanced setup and click Download keystore file.

  4. Extract the files and copy the client.keystore and client.truststore files to the client/src/main/resources directory.

  5. Leave the Cloud console open. You’ll need some of these details in next steps.

  6. Change into the client/ directory.

  7. Execute the client.

    mvn clean compile exec:java@standard-deviation -Dexec.cleanupDaemonThreads=false
  8. When prompted, enter the cluster name, keystore password, and discovery token for your cluster. These details are in the Advanced setup tab in the Cloud console.

The client connects to the cluster and calculates the standard deviation, using both the junior developer code and the senior developer code.

--------------------------------------------------
Hazelcast client 'hz.client_1', using map 'your.company.name.Customer'
--------------------------------------------------
TestDataLoader.loadTestData('your.company.name.Customer')
 -> 0 Customer [firstName=Brian, satisfaction=4]
 -> 1 Customer [firstName=Mick, satisfaction=1]
 -> 2 Customer [firstName=Keith, satisfaction=2]
 -> 3 Customer [firstName=Bill, satisfaction=5]
 -> 4 Customer [firstName=Charlie, satisfaction=3]

Step 1 : ----------------
Locally calculated average..: 3.0
Remotely calculated average.: 3.0
-------------------------
Step 2 : ----------------
Locally calculated total difference squared..: 10.0
Remotely calculated total difference squared.: 10.0
-------------------------
Step 3 : ----------------
Locally calculated average difference squared..: 2.0
Remotely calculated average difference squared.: 2.0
-------------------------
Step 4 : ----------------
Locally calculated STANDARD DEVIATION..: 1.4142135623730951
Remotely calculated STANDARD DEVIATION.: 1.4142135623730951
-------------------------
--------------------------------------------------
Disconnecting the Hazelcast client
--------------------------------------------------

Summary

In this tutorial, you learned how to do the following:

  • Use an executor to process data in parallel on a Cloud Standard cluster.

  • Deploy cluster-side modules to a Cloud Standard cluster using the Hazelcast Cloud Maven plugin.