Hudi connector#

Note

Below is the original Trino documentation. We will soon translate it into Russian and supplement it with useful examples.

The Hudi connector enables querying Hudi tables.

Requirements#

To use the Hudi connector, you need:

  • Network access from the Trino coordinator and workers to the Hudi storage.

  • Access to the Hive metastore service (HMS).

  • Network access from the Trino coordinator to the HMS.

Configuration#

The connector requires a Hive metastore for table metadata and supports the same metastore configuration properties as the Hive connector. At a minimum, hive.metastore.uri must be configured. The connector recognizes Hudi tables synced to the metastore by the Hudi sync tool.

To create a catalog that uses the Hudi connector, create a catalog properties file, for example etc/catalog/example.properties, that references the hudi connector. Update the hive.metastore.uri with the URI of your Hive metastore Thrift service:

connector.name=hudi
hive.metastore.uri=thrift://example.net:9083
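
After restarting Trino with the new catalog file, you can confirm that the coordinator reaches the metastore by listing the schemas the catalog exposes. The catalog name example below comes from the example.properties file name; substitute your own:

SHOW SCHEMAS FROM example;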

Additionally, the following configuration properties can be set depending on the use case.

Hudi configuration properties#

hudi.metadata-enabled
    Fetch the list of file names and sizes from metadata rather than storage.
    Default: false

hudi.columns-to-hide
    List of column names that are hidden from the query output. It can be used to hide Hudi meta fields. By default, no fields are hidden.

hudi.parquet.use-column-names
    Access Parquet columns using names from the file. If disabled, columns are accessed using their index. Only applicable to the Parquet file format.
    Default: true

hudi.min-partition-batch-size
    Minimum number of partitions returned in a single batch.
    Default: 10

hudi.max-partition-batch-size
    Maximum number of partitions returned in a single batch.
    Default: 100

hudi.size-based-split-weights-enabled
    Unlike uniform splitting, size-based splitting ensures that each batch of splits has enough data to process. Enabled by default to improve performance.
    Default: true

hudi.standard-split-weight-size
    The split size corresponding to the standard weight (1.0) when size-based split weights are enabled.
    Default: 128MB

hudi.minimum-assigned-split-weight
    Minimum weight that a split can be assigned when size-based split weights are enabled.
    Default: 0.05

hudi.max-splits-per-second
    Rate at which splits are queued for processing. The queue is throttled if this rate limit is breached.
    Default: Integer.MAX_VALUE

hudi.max-outstanding-splits
    Maximum outstanding splits in a batch enqueued for processing.
    Default: 1000
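
As an illustration, a catalog file that enables metadata-based listing, tunes partition batching, and hides the Hudi meta fields might look like the following sketch. The tuning values are illustrative rather than recommendations, and the hidden columns listed are the standard Hudi meta fields, assuming the usual comma-separated Trino list syntax:

connector.name=hudi
hive.metastore.uri=thrift://example.net:9083
# Illustrative tuning values; adjust for your workload
hudi.metadata-enabled=true
hudi.min-partition-batch-size=20
hudi.max-partition-batch-size=200
# Standard Hudi meta fields, hidden here as an example
hudi.columns-to-hide=_hoodie_commit_time,_hoodie_commit_seqno,_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name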

Supported file types#

The connector supports the Parquet file type.

SQL support#

The connector provides read access to data in Hudi tables that have been synced to the Hive metastore. The globally available and read operation statements are supported.
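
For instance, the usual metadata and read statements work against any Hudi table visible in the metastore; the schema and table names below are placeholders:

SHOW TABLES FROM example.myschema;
DESCRIBE example.myschema.stock_ticks_cow;
SELECT * FROM example.myschema.stock_ticks_cow LIMIT 10;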

Supported query types#

Hudi supports two types of tables, depending on how the data is indexed and laid out on the file system. The following table shows the query type the connector supports for each table type.

 Table type    | Supported query type
---------------+-------------------------
 Copy on write | Snapshot queries
 Merge on read | Read optimized queries

Example queries#

In the queries below, stock_ticks_cow is the Hudi copy-on-write table referred to in the Hudi quickstart documentation.

Here are some sample queries:

USE example.myschema;

SELECT symbol, max(ts)
FROM stock_ticks_cow
GROUP BY symbol
HAVING symbol = 'GOOG';
 symbol |        _col1
--------+---------------------
 GOOG   | 2018-08-31 10:59:00
(1 row)

SELECT dt, symbol
FROM stock_ticks_cow
WHERE symbol = 'GOOG';
     dt     | symbol
------------+--------
 2018-08-31 | GOOG
(1 row)

SELECT dt, count(*)
FROM stock_ticks_cow
GROUP BY dt;
     dt     | _col1
------------+-------
 2018-08-31 |    99
(1 row)
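
Unless they are hidden with hudi.columns-to-hide, the Hudi meta fields can be queried like regular columns. A minimal sketch, assuming the standard _hoodie_commit_time meta field is exposed on the table:

SELECT _hoodie_commit_time, symbol, ts
FROM stock_ticks_cow
WHERE symbol = 'GOOG'
LIMIT 5;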