Get Started with SQL Over Files
In this tutorial, you learn the basics of querying files in SQL by creating and querying a CSV file.
Before You Begin
To complete this tutorial, you need the following:
Prerequisites | Useful resources |
---|---|
A Hazelcast cluster in client/server mode and an instance of Management Center running on your local network. |
|
A connection to the SQL shell |
Step 1. Create a Data Source to Query
SQL can query data in maps, Kafka topics and the local cluster’s file system.
In this step, you create a CSV file that you can use to query.
Step 2. Create a Mapping to the File
To allow Hazelcast to find and recognize the data in your CSV file, you need to create a mapping to it.
In the SQL shell, use the CREATE MAPPING
statement to configure the file connector and give Hazelcast access to the data in the likes.csv
file.
Make sure you replace the path option with the absolute path to your CSV file.
|
Step 3. Run Ad-Hoc Queries
Ad-hoc queries allow you to retrieve a small subset of data. Usually these queries are simple and you can have many of them running concurrently in a Hazelcast cluster.
-
Use a
SELECT
statement to query all the data in thelikes.csv
file.You should see the following:
-
Query only the
msg_id
andlikes
columns, by adding them as a comma-separated list after theSELECT
statement. -
Use a filter to display only the message numbers with more than 50 likes.
-
Give the
msg_id
column an alias for the query results.This clause does not rename the column in the table. -
To filter rows on more than one condition, you can join conditions with the
AND
,OR
, andNOT
operators.
If you need more control over how your data is being transformed and aggregated, you may want to build a pipeline with the Jet API.
Step 4. Run Federated Queries
Federated queries are those that join tables from different datasets.
Normally, SQL queries are executed on one particular database or dataset. However, with Hazelcast, you can pull information from different sources and present a more complete picture of the data.
-
Configure the map connector to create a new table called
dislikes
.This table is mapped to a distributed map in Hazelcast where the key is an integer and the value is a string.
-
Use the
SINK INTO
statement to add some entries to the map. -
Use the
JOIN
clause to merge results from themessages
andnames
tables so you can see who has the most likes and dislikes.
+
-
Use the
ORDER BY
clause to order the results by name and use theLIMIT
clause to limit them so that only the first two are displayed. Change the header of thenames
column toname
. -
Use the
SUM()
function to aggregate the total number of likes for each person and group the results by name.You should see the following:
The results do not include a row for each Jerry because the
GROUP BY
statement groups the results by name. -
Filter for the names that have more than 100 likes combined, using the
HAVING
clause. This clause is equivalent to theWHERE
clause but for aggregate results.
For a list of available aggregations, see Expressions, Functions and Operators.
Step 5. Ingest Query Results into a Hazelcast Map
To save your query results as a view, you can cache them in Hazelcast by ingesting them into a map.
-
Configure the map connector to create a new table called
likes_and_dislikes
.This table is mapped to a distributed map in Hazelcast where the key is an integer and the value is an object that’s serialized to JSON.
-
Run the
JOIN
query to merge results from the CSV file and thedislikes
map and insert them into thelikes_and_dislikes
map. -
Make sure that the query results were added to the map.
+------------+--------------------+------------+------------+ | __key|name | likes| dislikes| +------------+--------------------+------------+------------+ | 1|Greg | 20| 13| | 2|Jerry | 108| 25| | 3|Mary | 122| 73| | 4|Jerry | 9| 88| | 5|Joe | 51| 42| +------------+--------------------+------------+------------+
Next Steps
Learn how to query maps with SQL.
Explore all available SQL statements.