Object storage file formats

Note

Below is the original Trino documentation. We will soon translate it into Russian and supplement it with useful examples.

Object storage connectors support one or more file formats specified by the underlying data source.

ORC format configuration properties

The following properties are used to configure the read and write operations with ORC files performed by supported object storage connectors:

orc.time-zone
    Sets the default time zone for legacy ORC files that did not declare a time zone.
    Default: JVM default

orc.bloom-filters.enabled
    Enable bloom filters for predicate pushdown.
    Default: false

orc.read-legacy-short-zone-id
    Allow reads on ORC files with short zone ID in the stripe footer.
    Default: false

File compression and decompression are performed automatically, and some details can be configured.
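These properties are set in the catalog properties file of a connector that supports them. The snippet below is a minimal sketch, assuming a hypothetical Hive catalog defined in etc/catalog/example.properties with an example metastore address; the values are illustrative only, and exact property support can vary by connector and Trino version:

    # etc/catalog/example.properties -- hypothetical Hive catalog
    connector.name=hive
    hive.metastore.uri=thrift://metastore.example.net:9083

    # Read legacy ORC files without a declared time zone as UTC
    # instead of falling back to the JVM default
    orc.time-zone=UTC

    # Use ORC bloom filters for predicate pushdown
    orc.bloom-filters.enabled=true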

Parquet format configuration properties

The following properties are used to configure the read and write operations with Parquet files performed by supported object storage connectors:

parquet.time-zone
    Adjusts timestamp values to a specific time zone. For Hive 3.1+, set this to UTC.
    Default: JVM default

parquet.writer.validation-percentage
    Percentage of Parquet files to validate after write by re-reading the whole file. The equivalent catalog session property is parquet_optimized_writer_validation_percentage. Validation can be turned off by setting this property to 0.
    Default: 5

parquet.writer.page-size
    Maximum size of pages written by the Parquet writer.
    Default: 1 MB

parquet.writer.page-value-count
    Maximum number of values per page written by the Parquet writer.
    Default: 80000

parquet.writer.block-size
    Maximum size of row groups written by the Parquet writer.
    Default: 128 MB

parquet.writer.batch-size
    Maximum number of rows processed by the Parquet writer in a batch.
    Default: 10000

parquet.use-bloom-filter
    Whether bloom filters are used for predicate pushdown when reading Parquet files. Set this property to false to disable the usage of bloom filters by default. The equivalent catalog session property is parquet_use_bloom_filter.
    Default: true

parquet.use-column-index
    Skip reading Parquet pages by using Parquet column indices. The equivalent catalog session property is parquet_use_column_index. Only supported by the Delta Lake and Hive connectors.
    Default: true

parquet.ignore-statistics
    Ignore statistics from Parquet to allow querying files with corrupted or incorrect statistics. The equivalent catalog session property is parquet_ignore_statistics.
    Default: false

parquet.max-read-block-row-count
    Sets the maximum number of rows read in a batch. The equivalent catalog session property is named parquet_max_read_block_row_count and supported by the Delta Lake, Hive, and Iceberg connectors.
    Default: 8192

parquet.small-file-threshold
    Data size below which a Parquet file is read entirely. The equivalent catalog session property is named parquet_small_file_threshold.
    Default: 3 MB

parquet.experimental.vectorized-decoding.enabled
    Enable using the Java Vector API (SIMD) for faster decoding of Parquet files. The equivalent catalog session property is parquet_vectorized_decoding_enabled.
    Default: true

File compression and decompression are performed automatically, and some details can be configured.
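Like the ORC properties above, these are set in a connector's catalog properties file. The following is a minimal sketch, again assuming a hypothetical Hive catalog in etc/catalog/example.properties; the values are illustrative, and exact property support can vary by connector and Trino version:

    # etc/catalog/example.properties -- hypothetical Hive catalog
    connector.name=hive
    hive.metastore.uri=thrift://metastore.example.net:9083

    # Interpret Parquet timestamps as UTC (recommended for Hive 3.1+)
    parquet.time-zone=UTC

    # Re-read and validate 10% of written Parquet files instead of the default 5%
    parquet.writer.validation-percentage=10

    # Allow row groups larger than the default 128 MB
    parquet.writer.block-size=256MB

Properties that list an equivalent catalog session property, such as parquet_use_bloom_filter, can also be overridden for a single session with SET SESSION using the catalog-qualified name, for example example.parquet_use_bloom_filter.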