Hudi коннектор#

Примечание

Ниже приведена оригинальная документация Trino. Скоро мы ее переведем на русский язык и дополним полезными примерами.

The Hudi connector enables querying Hudi tables.

Требования#

To use the Hudi connector, you need:

Network access from the Trino coordinator and workers to the Hudi storage.
Access to the Hive metastore service (HMS).
Network access from the Trino coordinator to the HMS.

Конфигурация#

The connector requires a Hive metastore for table metadata and supports the same metastore configuration properties as the Hive connector. At a minimum, hive.metastore.uri must be configured. The connector recognizes Hudi tables synced to the metastore by the Hudi sync tool.

To create a catalog that uses the Hudi connector, create a catalog properties file etc/catalog/example.properties that references the hudi connector. Update the hive.metastore.uri with the URI of your Hive metastore Thrift service:

connector.name=hudi
hive.metastore.uri=thrift://example.net:9083

Additionally, following configuration properties can be set depending on the use-case.

Hudi configuration properties#
Property name	Description	Default
`hudi.metadata-enabled`	Fetch the list of file names and sizes from metadata rather than storage.	`false`
`hudi.columns-to-hide`	List of column names that are hidden from the query output. It can be used to hide Hudi meta fields. By default, no fields are hidden.
`hudi.parquet.use-column-names`	Access Parquet columns using names from the file. If disabled, then columns are accessed using the index. Only applicable to Parquet file format.	`true`
`hudi.min-partition-batch-size`	Minimum number of partitions returned in a single batch.	`10`
`hudi.max-partition-batch-size`	Maximum number of partitions returned in a single batch.	`100`
`hudi.size-based-split-weights-enabled`	Unlike uniform splitting, size-based splitting ensures that each batch of splits has enough data to process. By default, it is enabled to improve performance.	`true`
`hudi.standard-split-weight-size`	The split size corresponding to the standard weight (1.0) when size-based split weights are enabled.	`128MB`
`hudi.minimum-assigned-split-weight`	Minimum weight that a split can be assigned when size-based split weights are enabled.	`0.05`
`hudi.max-splits-per-second`	Rate at which splits are queued for processing. The queue is throttled if this rate limit is breached.	`Integer.MAX_VALUE`
`hudi.max-outstanding-splits`	Maximum outstanding splits in a batch enqueued for processing.	`1000`

Supported file types#

The connector supports Parquet file type.

SQL support#

The connector provides read access to data in the Hudi table that has been synced to Hive metastore. The globally available and read operation statements are supported.

Supported query types#

Hudi supports two types of tables depending on how the data is indexed and laid out on the file system. The following table displays a support matrix of tables types and query types for the connector.

Table type	Supported query type
Copy on write	Snapshot queries
Merge on read	Read optimized queries

Examples queries#

In the queries below, stock_ticks_cow is a Hudi copy-on-write table that we refer in the Hudi quickstart documentation.

Here are some sample queries:

USE example.example_schema;

SELECT symbol, max(ts)
FROM stock_ticks_cow
GROUP BY symbol
HAVING symbol = 'GOOG';

  symbol   |        _col1         |
-----------+----------------------+
 GOOG      | 2018-08-31 10:59:00  |
(1 rows)

SELECT dt, symbol
FROM stock_ticks_cow
WHERE symbol = 'GOOG';

    dt      | symbol |
------------+--------+
 2018-08-31 |  GOOG  |
(1 rows)

SELECT dt, count(*)
FROM stock_ticks_cow
GROUP BY dt;

    dt      | _col1 |
------------+--------+
 2018-08-31 |  99  |
(1 rows)