Creating a Mapping to a File System
To query data in files, you need to create a mapping to the file system, using the unified file connector, so that the SQL service knows how to access the data.
You can only read from file mappings. You cannot add data to files.
What is the Unified File Connector?
The unified file connector provides a simple way to read files across different sources. Using this connector, you can read files from the local filesystem, HDFS, and cloud storage systems such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. This connector supports various data formats, including CSV, JSON, and Avro.
Installing the Connector
To read from a local or shared filesystem, you do not need to install any additional modules as these are included in Hazelcast.
To access Hadoop or any of the cloud-based file systems, you may need to add one of the downloadable modules listed in Supported Storage Systems to your members' classpaths.
Depending on the format of your files, you may also need to add other modules to the classpath.
Permissions
Enterprise
If security is enabled, you can set up permissions to restrict clients' access to your files. For details, see Securing Jobs.
Configuration Options
Use these options to configure the file connector, using either the CREATE MAPPING
statement or a file table function.
CREATE MAPPING my_files
TYPE File
OPTIONS (
/* insert configuration options here */
);

SELECT * FROM TABLE(
JSON_FLAT_FILE(/* insert configuration options here */)
);
Option | Default | Description | Example |
---|---|---|---|
path | none | The absolute path to a directory in one of the supported file systems. This option does not support wildcard characters. If you omit a column list, Hazelcast will try to determine column names and types by reading a sample file in the given directory. | 'path' = '/path/to/directory' |
format | none | The format of the files to read. See Supported File Formats. | 'format' = 'csv' |
glob | * (all files) | A filename pattern that uses wildcard characters such as * and ? to filter the files in the path directory. | If a directory contains JSON files named using a <year>-<month>.json pattern, 'glob' = '2020-*.json' reads only the files for 2020. |
ignoreFileNotFound | false | Return zero results instead of throwing an error when files in the path directory are not found. If you set this option to true, you must specify a column list. This option is not valid for file table functions, because they always need at least one record from which to derive the column list. | 'ignoreFileNotFound' = 'true' |
File system options | none | Options to pass to the file system such as authentication options. See Authentication for Remote File Systems. | 'fs.s3a.access.key' = 'my-access-key' |
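For example, the following mapping combines the path, format, and glob options to read only the CSV files in a directory. This is a minimal sketch: the mapping name and directory path are placeholders, not values from your environment.
CREATE MAPPING csv_orders /* hypothetical mapping name */
TYPE File
OPTIONS (
'path' = '/path/to/directory', /* absolute directory path on a supported file system */
'format' = 'csv', /* one of the supported file formats */
'glob' = '*.csv' /* read only files that end in .csv */
);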
Supported File Systems
Hazelcast supports the following file systems.
If you use the slim distribution of Hazelcast, be sure to add the respective modules to your members' classpaths.
Any path that does not start with a scheme is considered local or shared on the Hazelcast members.
Storage system | Module | Example path |
---|---|---|
Member Filesystems (both shared and local) | Included in both full and slim distributions of Hazelcast. | /path/to/a/directory |
Hadoop Distributed File System (HDFS) | hazelcast-jet-hadoop-all | hdfs://path/to/a/directory |
Amazon S3 | hazelcast-jet-files-s3 | s3a://bucket-name/path/in/the/bucket |
Google Cloud Storage | hazelcast-jet-files-gcs | gs://bucket-name/path/in/the/bucket |
Windows Azure Blob Storage | hazelcast-jet-files-azure | wasbs://container-name@account-name.blob.core.windows.net/path/in/the/container |
Azure Data Lake Generation 1 | hazelcast-jet-files-azure | adl://account-name.azuredatalakestore.net/path/in/the/container |
Azure Data Lake Generation 2 | hazelcast-jet-files-azure | abfs://container-name@account-name.dfs.core.windows.net/path/in/the/container |
Although these are the officially supported sources, you can also read from any file system that’s compatible with Hadoop.
Supported File Formats
Hazelcast supports reading from the following file formats.
Avro
The Avro format allows you to read data from files in the Avro Object Container File format. To use the Avro format, you must have the hazelcast-jet-avro module on your members' classpaths.
CREATE MAPPING users
TYPE File
OPTIONS (
'path' = '/users',
'format' = 'avro',
'glob' = '*.avro'
)
Avro type | SQL type |
---|---|
BOOLEAN | BOOLEAN |
INT | INT |
LONG | BIGINT |
FLOAT | REAL |
DOUBLE | DOUBLE |
STRING | VARCHAR |
All other types | OBJECT |
CSV
CSV files must have a header. If you omit the column list from the CREATE MAPPING statement, Hazelcast will try to infer the column names from the file header. Columns that do not match any header fields are ignored.
CREATE MAPPING my_files
TYPE File
OPTIONS (
'path' = '/path/to/directory',
'format' = 'csv'
)
All columns have the VARCHAR type.
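Because every field is read as VARCHAR, queries that compare numeric values need an explicit cast unless you declare the column types in the mapping (see Changing the Default Data Type below). A sketch, assuming the CSV header contains a hypothetical age field:
SELECT *
FROM my_files
WHERE CAST(age AS INT) >= 18; /* age is read as VARCHAR, so cast before comparing */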
JSON
JSON files must be in the JSON Lines format, where files are expected to contain one valid JSON document per
line and be UTF-8
encoded. If you omit any mapping columns from the
declaration, Hazelcast infers names and types based on a sample file.
CREATE MAPPING my_files
TYPE File
OPTIONS (
'path' = '/path/to/directory',
'format' = 'json-flat'
)
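For example, if each line of a file in the directory were a document such as {"id": 1, "name": "Alice"}, you could declare the columns explicitly instead of relying on a sample file. The mapping name and column names below are illustrative assumptions:
CREATE MAPPING my_json_files (
id INT,
name VARCHAR
)
TYPE File
OPTIONS (
'path' = '/path/to/directory',
'format' = 'json-flat'
);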
Parquet
Apache Parquet is a columnar storage format. It describes how data is stored on disk, but it does not specify how the data is deserialized; instead, it relies on other libraries for that. Hazelcast uses Apache Avro for deserialization.
Parquet has a dependency on Hadoop, so it can be used only with one of the Hadoop-based modules.
CREATE MAPPING my_files
TYPE File
OPTIONS (
'path' = 'hdfs://path/to/directory',
'format' = 'parquet'
/* more Hadoop options ... */
)
Changing the Default Data Type
Depending on the file format that you choose, data is converted to a default SQL type. For example, the CSV format uses the VARCHAR type for all fields in the file.
To change this default type, name the columns explicitly in the column list when you create the mapping. For example, if you declare a header field of a CSV file as INT, the behavior is as if CAST(column AS INT) were used to convert the VARCHAR value to INT.
CREATE MAPPING my_files (
name VARCHAR,
id INT
)
TYPE File
OPTIONS (
'path' = '/path/to/directory',
'format' = 'csv'
)
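With this column list, id is already mapped as INT, so queries can compare it directly without a cast, for example:
SELECT name
FROM my_files
WHERE id > 100; /* no CAST needed: id was declared as INT in the mapping */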
For details about which data types can be converted to others, see SQL Data Types.
Authentication for Remote File Systems
To access data on remote file systems (files that aren’t on Hazelcast members), you must also configure the authentication credentials in options
.
Amazon S3
To connect to Amazon S3, you must add the following authentication credentials as keys and values in options
.
fs.s3a.access.key
fs.s3a.secret.key
If your file is not stored in the default region (us-west-1
), you must also provide your region in the fs.s3a.endpoint
option.
CREATE MAPPING nasdaqlisted
TYPE File
OPTIONS (
'path' = 's3a://diagnostic-test-alexander/',
'format' = 'csv',
'glob' = 'nasdaqlisted.txt',
'fs.s3a.endpoint' = 's3.us-west-2.amazonaws.com',
'fs.s3a.access.key' = 'my-access-key',
'fs.s3a.secret.key' = 'my-secret-key'
);
For additional ways to authenticate, see the Hadoop-AWS documentation and the Amazon S3 documentation.
Google Cloud Storage
Provide the location of the keyfile via the google.cloud.auth.service.account.json.keyfile option.
The file must be available on all cluster members.
For additional ways to authenticate, see the Google Cloud Hadoop connector documentation.
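As a sketch, a mapping might pass the keyfile location alongside the other options as follows; the bucket name, directory, and keyfile path are placeholders:
CREATE MAPPING gcs_files
TYPE File
OPTIONS (
'path' = 'gs://my-bucket/my-directory', /* placeholder bucket and directory */
'format' = 'csv',
'google.cloud.auth.service.account.json.keyfile' = '/path/to/keyfile.json' /* the keyfile must exist on all members */
);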
Windows Azure Blob Storage
Provide an account key via the fs.azure.account.key.<your account name>.blob.core.windows.net option.
For additional ways to authenticate, see Hadoop Azure Blob Storage support.
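For example, a mapping might pass the account key as follows; the account name, container name, and key value are placeholders:
CREATE MAPPING wasb_files
TYPE File
OPTIONS (
'path' = 'wasbs://my-container@my-account.blob.core.windows.net/my-directory', /* placeholder container and account */
'format' = 'json-flat',
'fs.azure.account.key.my-account.blob.core.windows.net' = 'my-account-key' /* substitute your account name and key */
);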
Azure Data Lake Generation 1
Provide the following options:
- fs.adl.oauth2.access.token.provider.type
- fs.adl.oauth2.refresh.url
- fs.adl.oauth2.client.id
- fs.adl.oauth2.credential
For additional ways to authenticate, see Hadoop Azure Data Lake Support.
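A sketch of how these options might appear in a mapping; every value here is a placeholder, and ClientCredential is just one of the provider types described in the Hadoop documentation:
CREATE MAPPING adl_files
TYPE File
OPTIONS (
'path' = 'adl://my-account.azuredatalakestore.net/my-directory', /* placeholder account and path */
'format' = 'csv',
'fs.adl.oauth2.access.token.provider.type' = 'ClientCredential',
'fs.adl.oauth2.refresh.url' = 'https://login.microsoftonline.com/<tenant-id>/oauth2/token',
'fs.adl.oauth2.client.id' = 'my-client-id',
'fs.adl.oauth2.credential' = 'my-client-secret'
);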
Azure Data Lake Generation 2
For additional ways to authenticate, see Hadoop Azure Data Lake Storage Gen2.
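The Hadoop ABFS connector typically accepts an account key via a property of the form fs.azure.account.key.<your account name>.dfs.core.windows.net; this is an assumption drawn from the Hadoop documentation rather than an option listed on this page. A placeholder sketch:
CREATE MAPPING abfs_files
TYPE File
OPTIONS (
'path' = 'abfs://my-container@my-account.dfs.core.windows.net/my-directory', /* placeholder container and account */
'format' = 'parquet',
'fs.azure.account.key.my-account.dfs.core.windows.net' = 'my-account-key' /* assumed Hadoop ABFS shared-key property */
);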
Turning off Authentication Caching
To improve performance, authentication credentials are cached by default. However, cached credentials can cause issues when you submit jobs that use different credentials, or even the same jobs with new credentials, for example after credential rotation.
To turn off authentication caching, set the fs.<prefix>.impl.disable.cache option to true, where <prefix> is the scheme of your file system.
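For example, for Amazon S3 paths the scheme is s3a, so the earlier S3 mapping could disable credential caching like this:
CREATE MAPPING nasdaqlisted
TYPE File
OPTIONS (
'path' = 's3a://diagnostic-test-alexander/',
'format' = 'csv',
'glob' = 'nasdaqlisted.txt',
'fs.s3a.access.key' = 'my-access-key',
'fs.s3a.secret.key' = 'my-secret-key',
'fs.s3a.impl.disable.cache' = 'true' /* <prefix> is s3a, the scheme of the path */
);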