BigQuery коннектор#

Примечание

Ниже приведена оригинальная документация Trino. Скоро мы ее переведем на русский язык и дополним полезными примерами.

The BigQuery connector allows querying the data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read the data from the tables.

BigQuery Storage API#

The Storage API streams data in parallel directly from BigQuery via gRPC without using Google Cloud Storage as an intermediary. It has a number of advantages over using the previous export-based read flow that should generally lead to better read performance:

Direct Streaming

It does not leave any temporary files in Google Cloud Storage. Rows are read directly from BigQuery servers using an Avro wire format.

Column Filtering

The new API allows column filtering to only read the data you are interested in. Backed by a columnar datastore, it can efficiently stream data without reading all columns.

Dynamic Sharding

The API rebalances records between readers until they all complete. This means that all Map phases will finish nearly concurrently. See this blog article on how dynamic sharding is similarly used in Google Cloud Dataflow.

Requirements#

To connect to BigQuery, you need:

  • To enable the BigQuery Storage Read API.

  • Network access from your Trino coordinator and workers to the Google Cloud API service endpoint. This endpoint uses HTTPS, or port 443.

  • To configure BigQuery so that the Trino coordinator and workers have permissions in BigQuery.

  • To set up authentication. Your authentiation options differ depending on whether you are using Dataproc/Google Compute Engine (GCE) or not.

    On Dataproc/GCE the authentication is done from the machine’s role.

    Outside Dataproc/GCE you have 3 options:

    • Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as described in the Google Cloud authentication getting started guide.

    • Set bigquery.credentials-key in the catalog properties file. It should contain the contents of the JSON file, encoded using base64.

    • Set bigquery.credentials-file in the catalog properties file. It should point to the location of the JSON file.

Конфигурация#

To configure the BigQuery connector, create a catalog properties file in etc/catalog named, for example, bigquery.properties, to mount the BigQuery connector as the bigquery catalog. Create the file with the following contents, replacing the connection properties as appropriate for your setup:

connector.name=bigquery
bigquery.project-id=<your Google Cloud Platform project id>

Multiple GCP projects#

The BigQuery connector can only access a single GCP project.Thus, if you have data in multiple GCP projects, You need to create several catalogs, each pointing to a different GCP project. For example, if you have two GCP projects, one for the sales and one for analytics, you can create two properties files in etc/catalog named sales.properties and analytics.properties, both having connector.name=bigquery but with different project-id. This will create the two catalogs, sales and analytics respectively.

Configuring partitioning#

By default the connector creates one partition per 400MB in the table being read (before filtering). This should roughly correspond to the maximum number of readers supported by the BigQuery Storage API. This can be configured explicitly with the bigquery.parallelism property. BigQuery may limit the number of partitions based on server constraints.

Arrow serialization support#

This is an experimental feature which introduces support for using Apache Arrow as the serialization format when reading from BigQuery. Please note there are a few caveats:

  • Using Apache Arrow serialization is disabled by default. In order to enable it, set the bigquery.experimental.arrow-serialization.enabled configuration property to true and add --add-opens=java.base/java.nio=ALL-UNNAMED to the Trino JVM config.

Reading from views#

The connector has a preliminary support for reading from BigQuery views. Please note there are a few caveats:

  • Reading from views is disabled by default. In order to enable it, set the bigquery.views-enabled configuration property to true.

  • BigQuery views are not materialized by default, which means that the connector needs to materialize them before it can read them. This process affects the read performance.

  • The materialization process can also incur additional costs to your BigQuery bill.

  • By default, the materialized views are created in the same project and dataset. Those can be configured by the optional bigquery.view-materialization-project and bigquery.view-materialization-dataset properties, respectively. The service account must have write permission to the project and the dataset in order to materialize the view.

Configuration properties#

Property

Description

Default

bigquery.project-id

The Google Cloud Project ID where the data reside

Taken from the service account

bigquery.parent-project-id

The project ID Google Cloud Project to bill for the export

Taken from the service account

bigquery.parallelism

The number of partitions to split the data into

The number of executors

bigquery.views-enabled

Enables the connector to read from views and not only tables. Please read this section before enabling this feature.

false

bigquery.view-expire-duration

Expire duration for the materialized view.

24h

bigquery.view-materialization-project

The project where the materialized view is going to be created

The view’s project

bigquery.view-materialization-dataset

The dataset where the materialized view is going to be created

The view’s dataset

bigquery.skip-view-materialization

Use REST API to access views instead of Storage API. BigQuery BIGNUMERIC and TIMESTAMP types are unsupported.

false

bigquery.views-cache-ttl

Duration for which the materialization of a view will be cached and reused. Set to 0ms to disable the cache.

15m

bigquery.metadata.cache-ttl

Duration for which metadata retrieved from BigQuery is cached and reused. Set to 0ms to disable the cache.

0ms

bigquery.max-read-rows-retries

The number of retries in case of retryable server issues

3

bigquery.credentials-key

The base64 encoded credentials key

None. See the requirements section.

bigquery.credentials-file

The path to the JSON credentials file

None. See the requirements section.

bigquery.case-insensitive-name-matching

Match dataset and table names case-insensitively

false

bigquery.query-results-cache.enabled

Enable query results cache

false

bigquery.experimental.arrow-serialization.enabled

Enable using Apache Arrow serialization when reading data from BigQuery. Please read this section before enabling this feature.

false

Type mapping#

Because Trino and BigQuery each support types that the other does not, this connector modifies some types when reading or writing data. Data types may not map the same way in both directions between Trino and the data source. Refer to the following sections for type mapping in each direction.

BigQuery type to Trino type mapping#

The connector maps BigQuery types to the corresponding Trino types according to the following table:

BigQuery type to Trino type mapping#

BigQuery type

Trino type

Notes

BOOLEAN

BOOLEAN

INT64

BIGINT

INT, SMALLINT, INTEGER, BIGINT, TINYINT, and BYTEINT are aliases for INT64 in BigQuery.

FLOAT64

DOUBLE

NUMERIC

DECIMAL(P,S)

The default precision and scale of NUMERIC is (38, 9).

BIGNUMERIC

DECIMAL(P,S)

Precision > 38 is not supported. The default precision and scale of BIGNUMERIC is (77, 38).

DATE

DATE

DATETIME

TIMESTAMP(6)

STRING

VARCHAR

BYTES

VARBINARY

TIME

TIME(6)

TIMESTAMP

TIMESTAMP(6) WITH TIME ZONE

Time zone is UTC

GEOGRAPHY

VARCHAR

In Well-known text (WKT) format

ARRAY

ARRAY

RECORD

ROW

No other types are supported.

Trino type to BigQuery type mapping#

The connector maps Trino types to the corresponding BigQuery types according to the following table:

Trino type to BigQuery type mapping#

Trino type

BigQuery type

Notes

BOOLEAN

BOOLEAN

VARBINARY

BYTES

DATE

DATE

DOUBLE

FLOAT

BIGINT

INT64

INT, SMALLINT, INTEGER, BIGINT, TINYINT, and BYTEINT are aliases for INT64 in BigQuery.

VARCHAR

STRING

TIMESTAMP(6)

DATETIME

No other types are supported.

System tables#

For each Trino table which maps to BigQuery view there exists a system table which exposes BigQuery view definition. Given a BigQuery view customer_view you can send query SELECT * customer_view$view_definition to see the SQL which defines view in BigQuery.

Special columns#

In addition to the defined columns, the BigQuery connector exposes partition information in a number of hidden columns:

  • $partition_date: Equivalent to _PARTITIONDATE pseudo-column in BigQuery

  • $partition_time: Equivalent to _PARTITIONTIME pseudo-column in BigQuery

You can use these columns in your SQL statements like any other column. They can be selected directly, or used in conditional statements. For example, you can inspect the partition date and time for each record:

SELECT *, "$partition_date", "$partition_time"
FROM bigquery.web.page_views;

Retrieve all records stored in the partition _PARTITIONDATE = '2022-04-07':

SELECT *
FROM bigquery.web.page_views
WHERE "$partition_date" = date '2022-04-07';

Примечание

Two special partitions __NULL__ and __UNPARTITIONED__ are not supported.

SQL support#

The connector provides read and write access to data and metadata in the BigQuery database. In addition to the globally available and read operation statements, the connector supports the following features:

Table functions#

The connector provides specific table functions to access BigQuery.

query(varchar) -> table#

The query function allows you to query the underlying BigQuery directly. It requires syntax native to BigQuery, because the full query is pushed down and processed by BigQuery. This can be useful for accessing native features which are not available in Trino or for improving query performance in situations where running a query natively may be faster.

Примечание

Polymorphic table functions may not preserve the order of the query result. If the table function contains a query with an ORDER BY clause, the function result may not be ordered as expected.

For example, group and concatenate all employee IDs by manager ID:

SELECT
  *
FROM
  TABLE(
    bigquery.system.query(
      query => 'SELECT
        manager_id, STRING_AGG(employee_id)
      FROM
        company.employees
      GROUP BY
        manager_id'
    )
  );

FAQ#

What is the Pricing for the Storage API?#

See the BigQuery pricing documentation.