Metastores#
Object storage access is mediated through a metastore. Metastores provide information on directory structure, file format, and metadata about the stored data. Object storage connectors support the use of one or more metastores. A supported metastore is required to use any object storage connector.
Additional configuration is required in order to access tables with Athena partition projection metadata or implement first class support for Avro tables. These requirements are discussed later in this topic.
General metastore configuration properties#
The following table describes general metastore configuration properties, most of which are used with either metastore.
At a minimum, each Delta Lake, Hive or Hudi object storage catalog file must set the hive.metastore configuration property to define the type of metastore to use. Iceberg catalogs instead use the iceberg.catalog.type configuration property to define the type of metastore to use.
Additional configuration properties specific to the Thrift and Glue Metastores are also available. They are discussed later in this topic.
Property Name |
Description |
Default |
---|---|---|
|
The type of Hive metastore to use. Trino currently supports the default Hive
Thrift metastore ( |
|
|
The Iceberg table format manages most metadata in metadata files in the object storage itself. A small amount of metadata, however, still requires the use of a metastore. In the Iceberg ecosystem, these smaller metastores are called Iceberg metadata catalogs, or just catalogs. The examples in each subsection depict the contents of a Trino catalog file that uses the Iceberg connector to configure different Iceberg metadata catalogs. You must set this property in all Iceberg catalog property files. Valid
values are |
|
|
Enable caching for partition metadata. You can disable caching to avoid inconsistent behavior that results from it. |
|
|
Duration of how long cached metastore data is considered valid. |
|
|
Duration of how long cached metastore statistics are considered valid. |
|
|
Maximum number of metastore data objects in the Hive metastore cache. |
|
|
Asynchronously refresh cached metastore data after access if it is older than this but is not yet expired, allowing subsequent accesses to see fresh data. |
|
|
Maximum threads used to refresh cached metastore data. |
|
|
Controls whether to hide Delta Lake tables in table listings. Currently applies only when using the AWS Glue metastore. |
|
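As an illustration of the caching properties described in the table above, the following sketch shows a Hive catalog file that caches metastore data for a short period. The cache property names are an assumption taken from upstream Trino documentation, because they can differ between releases, and the host name and durations are placeholders. Check each one against the reference for your version.

```properties
connector.name=hive
hive.metastore=thrift
hive.metastore.uri=thrift://metastore.example.net:9083
# Assumed property names and illustrative values; verify for your Trino version.
# Keep cached metastore data valid for five minutes.
hive.metastore-cache-ttl=5m
# Asynchronously refresh entries older than one minute after they are accessed.
hive.metastore-refresh-interval=1m
# Upper bound on the number of cached metastore data objects.
hive.metastore-cache-maximum-size=10000
```

Choosing a refresh interval shorter than the cache duration lets frequently accessed entries be refreshed in the background instead of expiring.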
Thrift metastore configuration properties#
In order to use a Hive Thrift metastore, you must configure the metastore with hive.metastore=thrift and provide further details with the following properties:
Property name |
Description |
Default |
---|---|---|
|
The URIs of the Hive metastore to connect to using the Thrift protocol.
If a comma-separated list of URIs is provided, the first URI is used by
default, and the rest of the URIs are fallback metastores. This property
is required. Example: |
|
|
The username Trino uses to access the Hive metastore. |
|
|
Hive metastore authentication type. Possible values are |
|
|
Socket connect timeout for metastore client. |
|
|
Socket read timeout for metastore client. |
|
|
Enable Hive metastore end user impersonation. |
|
|
Enable usage of table statistics generated by Apache Spark when Hive table statistics are not available. |
|
|
Time to live of the delegation token cache for the metastore. |
|
|
Delegation token cache maximum size. |
|
|
Use SSL when connecting to metastore. |
|
|
Path to the private key and client certificate (key store). |
|
|
Password for the private key. |
|
|
Path to the server certificate chain (trust store). Required when SSL is enabled. |
|
|
Password for the trust store. |
|
|
Enable fetching tables and views from all schemas in a single request. |
|
|
The Kerberos principal of the Hive metastore service. |
|
|
The Kerberos principal that Trino uses when connecting to the Hive metastore service. |
|
|
Hive metastore client keytab location. |
|
|
Actively delete the files for managed tables when performing drop table or partition operations, for cases when the metastore does not delete the files. |
|
|
Allow the metastore to assume that the values of partition columns can be
converted to string values. This can lead to performance improvements in
queries which apply filters on the partition columns. Partition keys with a
|
|
|
SOCKS proxy to use for the Thrift Hive metastore. |
|
|
Maximum number of retry attempts for metastore requests. |
|
|
Scale factor for metastore request retry delay. |
|
|
Total allowed time limit for a metastore request to be retried. |
|
|
Minimum delay between metastore request retries. |
|
|
Maximum delay between metastore request retries. |
|
|
Maximum time to wait to acquire a Hive transaction lock. |
|
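A minimal sketch of a Hive catalog file that uses a Thrift metastore with one fallback is shown below. The host names are placeholders, and port 9083 is assumed because it is the customary Hive metastore Thrift port.

```properties
connector.name=hive
hive.metastore=thrift
# The first URI is used by default; the second URI is a fallback metastore.
hive.metastore.uri=thrift://metastore-1.example.net:9083,thrift://metastore-2.example.net:9083
```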
Use the following configuration properties for HTTP client transport mode, that is, when the hive.metastore.uri property uses the http:// or https:// protocol.
Property name |
Description |
---|---|
|
The authentication type to use with the HTTP client transport mode. When set
to the only supported value of |
|
Bearer token to use for authentication with the metastore service when HTTPS
transport mode is used by using a |
|
Additional headers to send with metastore service requests. These headers
must be comma-separated and delimited using |
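The HTTP client transport mode could be configured along the following lines. The three hive.metastore.http.client.* property names and the key:value header syntax are assumptions based on upstream Trino documentation and are not confirmed by this page. The URI, token, and header values are placeholders.

```properties
connector.name=hive
hive.metastore=thrift
# An https:// URI selects the HTTP client transport mode.
hive.metastore.uri=https://metastore.example.net:8443
# Assumed property names; verify against the reference for your Trino version.
hive.metastore.http.client.authentication.type=BEARER
hive.metastore.http.client.bearer-token=<token>
# Hypothetical headers, comma-separated, each written as name:value.
hive.metastore.http.client.additional-headers=X-Example-Header:value1,X-Other-Header:value2
```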
Thrift metastore authentication#
In a Kerberized Hadoop cluster, Trino connects to the Hive metastore Thrift service using SASL and authenticates using Kerberos. Kerberos authentication for the metastore is configured in the connector’s properties file using the following optional properties:
Property name |
Description |
Default |
---|---|---|
|
Hive metastore authentication type. One of When set to |
|
|
Enable Hive metastore end user impersonation. See KERBEROS authentication with impersonation for more information. |
|
|
The Kerberos principal of the Hive metastore service. The coordinator uses this to authenticate the Hive metastore. The Example: |
|
|
The Kerberos principal that Trino uses when connecting to the Hive metastore service. Example: The Unless KERBEROS authentication with impersonation is enabled, the principal
specified by Warning: If the principal does not have sufficient permissions, only the metadata is removed, and the data continues to consume disk space. This occurs because the Hive metastore is responsible for deleting the internal table data. When the metastore is configured to use Kerberos authentication, all of the HDFS operations performed by the metastore are impersonated. Errors deleting data are silently ignored. |
|
|
The path to the keytab file that contains a key for the principal
specified by |
The following sections describe the configuration properties and values required for each authentication configuration when using the Hive metastore Thrift service with the Hive connector.
Default NONE authentication without impersonation#
hive.metastore.authentication.type=NONE
The default authentication type for the Hive metastore is NONE. When the authentication type is NONE, Trino connects to an unsecured Hive metastore. Kerberos is not used.
KERBEROS authentication with impersonation#
hive.metastore.authentication.type=KERBEROS
hive.metastore.thrift.impersonation.enabled=true
hive.metastore.service.principal=hive/hive-metastore-host.example.com@EXAMPLE.COM
hive.metastore.client.principal=trino@EXAMPLE.COM
hive.metastore.client.keytab=/etc/trino/hive.keytab
When the authentication type for the Hive metastore Thrift service is KERBEROS, Trino connects as the Kerberos principal specified by the property hive.metastore.client.principal. Trino authenticates this principal using the keytab specified by the hive.metastore.client.keytab property, and verifies that the identity of the metastore matches hive.metastore.service.principal.
When using KERBEROS metastore authentication with impersonation, the principal specified by the hive.metastore.client.principal property must be allowed to impersonate the current Trino user, as discussed in the section Impersonation in Hadoop.
Keytab files must be distributed to every node in the Trino cluster.
AWS Glue catalog configuration properties#
In order to use an AWS Glue catalog, you must configure your catalog file with hive.metastore=glue and provide further details with the following properties:
Property Name |
Description |
Default |
---|---|---|
|
AWS region of the Glue Catalog. This is required when not running in EC2, or
when the catalog is in a different region. Example: |
|
|
Glue API endpoint URL (optional). Example:
|
|
|
AWS region of the STS service to authenticate with. This is required when
running in a GovCloud region. Example: |
|
|
The ID of the Glue Proxy API, when accessing Glue via a VPC endpoint in API Gateway. |
|
|
STS endpoint URL to use when authenticating to Glue (optional). Example:
|
|
|
Pin Glue requests to the same region as the EC2 instance where Trino is running. |
|
|
Max number of concurrent connections to Glue. |
|
|
Maximum number of error retries for the Glue client. |
|
|
Default warehouse directory for schemas created without an explicit
|
|
|
Fully qualified name of the Java class to use for obtaining AWS credentials. Can be used to supply a custom credentials provider. |
|
|
AWS access key to use to connect to the Glue Catalog. If specified along
with |
|
|
AWS secret key to use to connect to the Glue Catalog. If specified along
with |
|
|
The ID of the Glue Catalog in which the metadata database resides. |
|
|
ARN of an IAM role to assume when connecting to the Glue Catalog. |
|
|
External ID for the IAM role trust policy when connecting to the Glue Catalog. |
|
|
Number of segments for partitioned Glue tables. |
|
|
Number of threads for parallel partition fetches from Glue. |
|
|
Number of threads for parallel statistic fetches from Glue. |
|
|
Number of threads for parallel statistic writes to Glue. |
|
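To make the Glue properties more concrete, here is a sketch of a Hive catalog file backed by AWS Glue. The hive.metastore.glue.* property names are assumed from upstream Trino documentation, and the region, bucket, account ID, and role name are placeholders rather than values from this page.

```properties
connector.name=hive
hive.metastore=glue
# Assumed property names and placeholder values; verify for your Trino version.
# Region of the Glue Catalog, needed when not running in EC2.
hive.metastore.glue.region=us-east-1
# Default location for schemas created without an explicit location.
hive.metastore.glue.default-warehouse-dir=s3://example-bucket/warehouse/
# IAM role to assume when connecting to the Glue Catalog.
hive.metastore.glue.iam-role=arn:aws:iam::123456789012:role/example-glue-access
```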
Iceberg-specific Glue catalog configuration properties#
When using the Glue catalog, the Iceberg connector supports the same general Glue configuration properties as previously described with the following additional property:
Property name |
Description |
Default |
---|---|---|
|
Skip archiving an old table version when creating a new version in a commit. See AWS Glue Skip Archive. |
|
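An Iceberg catalog that uses Glue and skips archiving old table versions might look like the sketch below. The iceberg.glue.skip-archive and hive.metastore.glue.region property names are assumptions based on upstream Trino documentation, and the region is a placeholder.

```properties
connector.name=iceberg
iceberg.catalog.type=glue
# General Glue property, assumed to apply to the Iceberg connector as well.
hive.metastore.glue.region=us-east-1
# Assumed property name: do not archive the old table version on commit.
iceberg.glue.skip-archive=true
```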
Iceberg-specific metastores#
The Iceberg table format manages most metadata in metadata files in the object storage itself. A small amount of metadata, however, still requires the use of a metastore. In the Iceberg ecosystem, these smaller metastores are called Iceberg metadata catalogs, or just catalogs.
You can use a general metastore such as an HMS or AWS Glue, or you can use the Iceberg-specific REST, Nessie or JDBC metadata catalogs, as discussed in this section.
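For the general metastore case, a minimal sketch of an Iceberg catalog file that uses a Hive metastore service follows. The value hive_metastore for iceberg.catalog.type is an assumption based on upstream Trino documentation, and the host name and port are placeholders.

```properties
connector.name=iceberg
# Assumed value that selects a Hive Thrift metastore as the Iceberg metadata catalog.
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://metastore.example.net:9083
```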
REST catalog#
In order to use the Iceberg REST catalog, configure the catalog type with iceberg.catalog.type=rest, and provide further details with the following properties:
Property name |
Description |
---|---|
|
REST server API endpoint URI (required). Example:
|
|
Warehouse identifier/location for the catalog (optional). Example:
|
|
The type of security to use (default: |
|
Session information included when communicating with the REST Catalog.
Options are |
|
The bearer token used for interactions with the server. A |
|
The credential to exchange for a token in the OAuth2 client credentials flow
with the server. A |
The following example shows a minimal catalog configuration using an Iceberg REST metadata catalog:
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://iceberg-with-rest:8181
The REST catalog does not support view management or materialized view management.
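A REST catalog that requires authentication could be configured roughly as follows. The warehouse, security, and OAuth2 token property names are assumptions based on upstream Trino documentation, and the URI, warehouse location, and token are placeholders.

```properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=https://iceberg-rest.example.net:8181
# Assumed property names and placeholder values below; verify for your Trino version.
iceberg.rest-catalog.warehouse=s3://example-bucket/warehouse/
# Use OAuth2 and present a bearer token on each request.
iceberg.rest-catalog.security=OAUTH2
iceberg.rest-catalog.oauth2.token=<bearer-token>
```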
JDBC catalog#
The Iceberg JDBC catalog is supported for the Iceberg connector. At a minimum, iceberg.jdbc-catalog.driver-class, iceberg.jdbc-catalog.connection-url and iceberg.jdbc-catalog.catalog-name must be configured. When using any database besides PostgreSQL, a JDBC driver jar file must be placed in the plugin directory.
Warning
The JDBC catalog may have compatibility issues if Iceberg introduces breaking changes in the future. Consider the REST catalog as an alternative solution.
The JDBC catalog requires the metadata tables to already exist. Refer to the Iceberg repository for how to create those tables.
The following example shows a minimal catalog configuration using an Iceberg JDBC metadata catalog:
connector.name=iceberg
iceberg.catalog.type=jdbc
iceberg.jdbc-catalog.catalog-name=test
iceberg.jdbc-catalog.driver-class=org.postgresql.Driver
iceberg.jdbc-catalog.connection-url=jdbc:postgresql://example.net:5432/database
iceberg.jdbc-catalog.connection-user=admin
iceberg.jdbc-catalog.connection-password=test
iceberg.jdbc-catalog.default-warehouse-dir=s3://bucket
The JDBC catalog does not support view management or materialized view management.
Nessie catalog#
In order to use a Nessie catalog, configure the catalog type with iceberg.catalog.type=nessie and provide further details with the following properties:
Property name |
Description |
---|---|
|
Nessie API endpoint URI (required). Example:
|
|
The branch/tag to use for Nessie. Defaults to |
|
Default warehouse directory for schemas created without an explicit
|
|
The read timeout duration for requests to the Nessie
server. Defaults to |
|
The connection timeout duration for connection
requests to the Nessie server. Defaults to |
|
Configure whether compression should be enabled or not for requests to the
Nessie server. Defaults to |
|
The authentication type to use. Available value is |
|
The token to use with |
The following example shows a minimal catalog configuration using a Nessie catalog:
connector.name=iceberg
iceberg.catalog.type=nessie
iceberg.nessie-catalog.uri=https://localhost:19120/api/v1
iceberg.nessie-catalog.default-warehouse-dir=/tmp
The Nessie catalog does not support view management or materialized view management.
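A Nessie catalog that pins a reference and authenticates with a bearer token might be sketched like this. The ref and authentication property names are assumptions based on upstream Trino documentation, and the host, warehouse location, and token are placeholders.

```properties
connector.name=iceberg
iceberg.catalog.type=nessie
iceberg.nessie-catalog.uri=https://nessie.example.net:19120/api/v1
iceberg.nessie-catalog.default-warehouse-dir=s3://example-bucket/warehouse/
# Assumed property names and placeholder values below; verify for your Trino version.
# Branch or tag to read from and commit to.
iceberg.nessie-catalog.ref=main
iceberg.nessie-catalog.authentication.type=BEARER
iceberg.nessie-catalog.authentication.token=<token>
```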
Access tables with Athena partition projection metadata#
Partition projection is a feature of AWS Athena often used to speed up query processing with highly partitioned tables when using the Hive connector.
Trino supports partition projection table properties stored in the Hive metastore or Glue catalog, and it reimplements this functionality. Currently, there is a limitation in comparison to AWS Athena for date projection, as it only supports intervals of DAYS, HOURS, MINUTES, and SECONDS.
If there are any compatibility issues blocking access to a requested table when partition projection is enabled, set the partition_projection_ignore table property to true for a table to bypass any errors.
Refer to Table properties and Column properties for configuration of partition projection.
Configure metastore for Avro#
For catalogs using the Hive connector, you must add the following property definition to the Hive metastore configuration file hive-site.xml and restart the metastore service to enable first-class support for Avro tables when using Hive 3.x:
<property>
    <!-- https://community.hortonworks.com/content/supportkb/247055/errorjavalangunsupportedoperationexception-storage.html -->
    <name>metastore.storage.schema.reader.impl</name>
    <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
</property>