Get Started with SQL Over Files
In this tutorial, you learn the basics of querying files in SQL by creating and querying a CSV file.
Before You Begin
To complete this tutorial, you need the following:
Prerequisites | Useful resources |
---|---|
A Hazelcast cluster in client/server mode and an instance of Management Center running on your local network. |
|
A connection to the SQL shell |
Step 1. Create a Data Source to Query
SQL can query data in maps, Kafka topics and the local cluster’s file system.
In this step, you create a CSV file that you can use to query.
-
Create a file named
likes.csv
. -
Add the following data to your file.
msg_id,likes,dislikes 1,20,13 2,108,25 3,122,73 4,9,88 5,51,42
This example file contains a record of the number of likes each person has.
Step 2. Create a Mapping to the File
To allow Hazelcast to find and recognize the data in your CSV file, you need to create a mapping to it.
In the SQL shell, use the CREATE MAPPING
statement to configure the file connector and give Hazelcast access to the data in the likes.csv
file.
CREATE MAPPING csv_likes (msg_id INT, likes INT, dislikes INT)
TYPE File
OPTIONS ('format'='csv',
'path'='/absolute/path/to/current/directory', 'glob'='likes.csv');
Make sure you replace the path option with the absolute path to your CSV file.
|
Step 3. Run Ad-Hoc Queries
Ad-hoc queries allow you to retrieve a small subset of data. Usually these queries are simple and you can have many of them running concurrently in a Hazelcast cluster.
-
Use a
SELECT
statement to query all the data in thelikes.csv
file.SELECT * FROM csv_likes;
You should see the following:
+------------+------------+------------+ | msg_id| likes| dislikes| +------------+------------+------------+ | 1| 20| 13| | 4| 9| 88| | 3| 122| 73| | 2| 108| 25| | 5| 51| 42| +------------+------------+------------+
-
Query only the
msg_id
andlikes
columns, by adding them as a comma-separated list after theSELECT
statement.SELECT msg_id, likes FROM csv_likes;
+------------+------------+ | msg_id| likes| +------------+------------+ | 1| 20| | 3| 122| | 4| 9| | 2| 108| | 5| 51| +------------+------------+
-
Use a filter to display only the message numbers with more than 50 likes.
SELECT msg_id FROM csv_likes WHERE likes > 50;
+------------+ | msg_id| +------------+ | 2| | 5| | 3| +------------+
-
Give the
msg_id
column an alias for the query results.This clause does not rename the column in the table. SELECT msg_id AS message_number, likes, dislikes FROM csv_likes WHERE likes > 20;
+--------------+------------+------------+ |message_number| likes| dislikes| +--------------+------------+------------+ | 2| 108| 25| | 3| 122| 73| | 5| 51| 42| +--------------+------------+------------+
-
To filter rows on more than one condition, you can join conditions with the
AND
,OR
, andNOT
operators.SELECT * FROM csv_likes WHERE likes > 100 OR dislikes < 30;
+------------+------------+------------+ | msg_id| likes| dislikes| +------------+------------+------------+ | 1| 20| 13| | 2| 108| 25| | 3| 122| 73| +------------+------------+------------+
SELECT * FROM csv_likes WHERE likes > 100 AND dislikes < 30;
+------------+------------+------------+ | msg_id| likes| dislikes| +------------+------------+------------+ | 2| 108| 25| +------------+------------+------------+
If you need more control over how your data is being transformed and aggregated, you may want to build a pipeline with the Jet API.
Step 4. Run Federated Queries
Federated queries are those that join tables from different datasets.
Normally, SQL queries are executed on one particular database or dataset. However, with Hazelcast, you can pull information from different sources and present a more complete picture of the data.
-
Configure the map connector to create a new table called
dislikes
.CREATE MAPPING names TYPE IMap OPTIONS ('keyFormat'='int', 'valueFormat'='varchar');
This table is mapped to a distributed map in Hazelcast where the key is an integer and the value is a string.
-
Use
SINK INTO
statements to add some entries to the map.SINK INTO names VALUES (1, 'Greg'), (2, 'Jerry'), (3, 'Mary'), (4, 'Jerry'), (5, 'Joe');
-
Use the
JOIN
clause to merge results from themessages
andnames
tables so you can see who has the most likes and dislikes.The data source on the right of the join must always be a map. SELECT names.this, csv_likes.likes, csv_likes.dislikes FROM csv_likes JOIN names ON csv_likes.msg_id = names.__key;
+--------------------+------------+------------+ |this | likes| dislikes| +--------------------+------------+------------+ |Jerry | 108| 25| |Greg | 20| 13| |Jerry | 9| 88| |Joe | 51| 42| |Mary | 122| 73| +--------------------+------------+------------+
-
Use the
ORDER BY
clause to order the results by name and use theLIMIT
clause to limit them so that only the first two are displayed. Change the header of the names column to name.SELECT names.this AS name, csv_likes.likes, csv_likes.dislikes FROM csv_likes JOIN names ON csv_likes.msg_id = names.__key ORDER BY names.this LIMIT 2;
+--------------------+------------+------------+ |name | likes| dislikes| +--------------------+------------+------------+ |Greg | 20| 13| |Jerry | 9| 88| +--------------------+------------+------------+
-
Use the
SUM()
function to aggregate the total number of likes for each person and group the results by name.SELECT names.this AS name, sum(csv_likes.likes) AS total_likes FROM csv_likes JOIN names ON csv_likes.msg_id = names.__key GROUP BY name;
You should see the following:
+--------------------+--------------------+ |name | total_likes| +--------------------+--------------------+ |Greg | 20| |Mary | 122| |Joe | 51| |Jerry | 117| +--------------------+--------------------+
The results do not include a row for each Jerry because the
GROUP BY
statement groups the results by name. -
Filter for the names that have more than 100 likes combined, using the
HAVING
clause. This clause is equivalent to theWHERE
clause but for aggregate groups.SELECT names.this AS most_liked FROM csv_likes JOIN names ON csv_likes.msg_id = names.__key GROUP BY names.this HAVING SUM(likes) > 100;
+--------------------+ |most_liked | +--------------------+ |Jerry | |Mary | +--------------------+
For a list of available aggregations, see SQL Functions and Operators.
Step 5. Ingest Query Results into a Hazelcast Map
To save your query results as a view, you can cache them in Hazelcast by ingesting them into a map.
-
Configure the map connector to create a new table called
likes_and_dislikes
.CREATE MAPPING likes_and_dislikes ( __key INT, name VARCHAR, likes INT, dislikes INT ) TYPE IMap OPTIONS ('keyFormat'='int', 'valueFormat'='json-flat');
This table is mapped to a distributed map in Hazelcast where the key is an integer and the value is an object that’s serialized to JSON.
-
Run the
JOIN
query to merge results from the CSV file and thedislikes
map and insert them into thelikes_and_dislikes
map.INSERT INTO likes_and_dislikes SELECT csv_likes.msg_id, names.this, csv_likes.likes, csv_likes.dislikes FROM csv_likes JOIN names ON csv_likes.msg_id = names.__key;
-
Make sure that the query results were added to the map.
SELECT * FROM likes_and_dislikes ORDER BY __key;
+------------+--------------------+------------+------------+ | __key|name | likes| dislikes| +------------+--------------------+------------+------------+ | 1|Greg | 20| 13| | 2|Jerry | 108| 25| | 3|Mary | 122| 73| | 4|Jerry | 9| 88| | 5|Joe | 51| 42| +------------+--------------------+------------+------------+
Next Steps
Learn how to query maps with SQL.
Explore all available SQL statements.