Hive connector with Alluxio#

Примечание

Ниже приведена оригинальная документация Trino. Скоро мы ее переведем на русский язык и дополним полезными примерами.

The Hive коннектор can read and write tables stored in the Alluxio Data Orchestration System, leveraging Alluxio’s distributed block-level read/write caching functionality. The tables must be created in the Hive metastore with the alluxio:// location prefix (see Running Apache Hive with Alluxio for details and examples).

Trino queries will then transparently retrieve and cache files or objects from a variety of disparate storage systems including HDFS and S3.

Alluxio client-side configuration#

To configure Alluxio client-side properties on Trino, append the Alluxio configuration directory (${ALLUXIO_HOME}/conf) to the Trino JVM classpath, so that the Alluxio properties file alluxio-site.properties can be loaded as a resource. Update the Trino JVM config file etc/jvm.config to include the following:

-Xbootclasspath/a:<path-to-alluxio-conf>

The advantage of this approach is that all the Alluxio properties are set in the single alluxio-site.properties file. For details, see Customize Alluxio Presto Properties.

Alternatively, add Alluxio configuration properties to the Hadoop configuration files (core-site.xml, hdfs-site.xml) and configure the Hive connector to use the Hadoop configuration files via the hive.config.resources connector property.

Deploy Alluxio with Trino#

To achieve the best performance running Trino on Alluxio, it is recommended to collocate Trino workers with Alluxio workers. This allows reads and writes to bypass the network (short-circuit). See Performance Tuning Tips for Presto with Alluxio for more details.

Alluxio catalog service#

An alternative way for Trino to interact with Alluxio is via the Alluxio catalog service. The primary benefits for using the Alluxio catalog service are simpler deployment of Alluxio with Trino, and enabling schema-aware optimizations such as transparent caching and transformations. Currently, the catalog service supports read-only workloads.

The Alluxio catalog service is a metastore that can cache the information from different underlying metastores. It currently supports the Hive metastore as an underlying metastore. In order for the Alluxio catalog to manage the metadata of other existing metastores, the other metastores must be «attached» to the Alluxio catalog. To attach an existing Hive metastore to the Alluxio catalog, simply use the Alluxio CLI attachdb command. The appropriate Hive metastore location and Hive database name need to be provided.

./bin/alluxio table attachdb hive thrift://HOSTNAME:9083 hive_db_name

Once a metastore is attached, the Alluxio catalog can manage and serve the information to Trino. To configure the Hive connector for Alluxio catalog service, simply configure the connector to use the Alluxio metastore type, and provide the location to the Alluxio cluster. For example, your etc/catalog/alluxio.properties should include the following:

connector.name=hive
hive.metastore=alluxio-deprecated
hive.metastore.alluxio.master.address=HOSTNAME:PORT

Replace HOSTNAME with the Alluxio master hostname, and replace PORT with the Alluxio master port. An example of an Alluxio master address is master-node:19998. Now, Trino queries can take advantage of the Alluxio catalog service, such as transparent caching and transparent transformations, without any modifications to existing Hive metastore deployments.