Trino on Kubernetes with Helm#
Kubernetes is a container orchestration platform that allows you to deploy Trino and other applications in a repeatable manner across different types of infrastructure. This can range from deploying on your laptop using tools like kind, to running on a managed Kubernetes service on cloud services like Amazon Elastic Kubernetes Service, Google Kubernetes Engine, Azure Kubernetes Service, and others.
The fastest way to run Trino on Kubernetes is to use the Trino Helm chart. Helm is a package manager for Kubernetes applications that allows for simpler installation and versioning by templating Kubernetes configuration files. This allows you to prototype on your local or on-premise cluster and use the same deployment mechanism to deploy to the cloud to scale up.
Requirements#
A Kubernetes cluster with a supported version of Kubernetes.
If you don’t have a Kubernetes cluster, you can run one locally using kind.
kubectl with a version that adheres to the Kubernetes version skew policy installed on the machine managing the Kubernetes deployment.
helm with a version that adheres to the Helm version skew policy installed on the machine managing the Kubernetes deployment.
Running Trino using Helm#
Run the following commands from the system with helm and kubectl installed and configured to connect to your running Kubernetes cluster:
Validate that kubectl is pointing to the correct cluster by running the command:
kubectl cluster-info
You should see output that shows the correct Kubernetes control plane address.
Add the Trino Helm chart repository to Helm if you haven’t done so already. This tells Helm where to find the Trino charts. You can name the repository whatever you want; trino is a good choice.
helm repo add trino https://trinodb.github.io/charts
Install Trino on the Kubernetes cluster using the Helm chart. Start by running the install command to use all default values and create a cluster called example-trino-cluster.
helm install example-trino-cluster trino/trino
This generates the Kubernetes configuration files by inserting properties into Helm templates. The Helm chart contains default values that you can override with a YAML file.
(Optional) To override the default values, create your own YAML configuration to define the parameters of your deployment. To run the install command using example.yaml, add the -f parameter to your install command. Be sure to follow best practices and naming conventions for your configuration files.
helm install -f example.yaml example-trino-cluster trino/trino
You should see output as follows:
NAME: example-trino-cluster
LAST DEPLOYED: Tue Sep 13 14:12:09 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Get the application URL by running these commands:
  export POD_NAME=$(kubectl get pods --namespace default -l "app=trino,release=example-trino-cluster,component=coordinator" -o jsonpath="{.items[0].metadata.name}")
  echo "Visit http://127.0.0.1:8080 to use your application"
  kubectl port-forward $POD_NAME 8080:8080
This output depends on your configuration and cluster name. For example, the port 8080 is set by .service.port in example.yaml.
Run the following command to check that all pods, deployments, and services are running properly.
kubectl get all
You should expect to see output that shows running pods, deployments, and replica sets. A good indicator that everything is running properly is that all pods show a ready status in the READY column.
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/example-trino-cluster-coordinator-bfb74c98d-rnrxd   1/1     Running   0          161m
pod/example-trino-cluster-worker-76f6bf54d6-hvl8n       1/1     Running   0          161m
pod/example-trino-cluster-worker-76f6bf54d6-tcqgb       1/1     Running   0          161m

NAME                            TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
service/example-trino-cluster   ClusterIP   10.96.25.35   <none>        8080/TCP   161m

NAME                                                READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/example-trino-cluster-coordinator   1/1     1            1           161m
deployment.apps/example-trino-cluster-worker        2/2     2            2           161m

NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/example-trino-cluster-coordinator-bfb74c98d   1         1         1       161m
replicaset.apps/example-trino-cluster-worker-76f6bf54d6       2         2         2       161m
The output shows running pods. These include the actual Trino containers. To better understand this output, see the Kubernetes documentation for the kubectl get command.
If all pods, deployments, and replica sets are running and in the ready state, Trino has been successfully deployed.
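Instead of re-running kubectl get all until every pod shows as ready, you can block until the pods report the Ready condition. The following is a minimal sketch, assuming the release name example-trino-cluster and the app and release labels shown in the install output above. Label keys can differ between chart versions, and the function name wait_for_trino and the default timeout are illustrative choices rather than part of the chart.

```shell
#!/bin/sh
# Sketch: block until all pods of a Trino release report Ready.
# Assumptions: pods carry the labels "app=trino,release=<name>" as in the
# install output above; wait_for_trino and the 300s default are hypothetical.
wait_for_trino() {
  release="${1:-example-trino-cluster}"
  timeout="${2:-300s}"
  # kubectl wait exits non-zero if the condition is not met within the timeout.
  kubectl wait pods -l "app=trino,release=${release}" \
    --for=condition=Ready --timeout="${timeout}"
}

# Usage: wait_for_trino example-trino-cluster 600s
```

If the command times out, kubectl get all shows which pods are still pending or restarting.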
Note
Unlike some Kubernetes applications, where it’s better to have many small pods, Trino works best with fewer pods that each have more resources available. We strongly recommend against running multiple Trino pods on a single physical host, to avoid contention for resources.
Executing queries#
The pods running the Trino containers all run on a private network internal to Kubernetes. To access them, specifically the coordinator, you need to create a tunnel between the coordinator pod and your computer. You can do this by running the commands generated upon installation.
Store the coordinator pod name in a shell variable called POD_NAME.
POD_NAME=$(kubectl get pods -l "app=trino,release=example-trino-cluster,component=coordinator" -o name)
Create the tunnel from the coordinator pod to the client.
kubectl port-forward $POD_NAME 8080:8080
Now you can connect to the Trino coordinator at http://localhost:8080.
To connect to Trino, you can use the command-line interface, a JDBC client, or any of the other clients. For this example, install the command-line interface, and connect to Trino in a new console session.
trino --server http://localhost:8080
Using the sample data in the tpch catalog, type and execute a query on the nation table using the tiny schema:
trino> select count(*) from tpch.tiny.nation;
 _col0
-------
    25
(1 row)

Query 20181105_001601_00002_e6r6y, FINISHED, 1 node
Splits: 21 total, 21 done (100.00%)
0:06 [25 rows, 0B] [4 rows/s, 0B/s]
Try other SQL queries to explore the data set and test your cluster.
Once you are done with your exploration, enter the quit command in the CLI.
Kill the tunnel to the coordinator pod. The tunnel is only available while the kubectl process is running, so you can just kill the kubectl process that’s forwarding the port. In most cases that means pressing CTRL+C in the terminal where the port-forward command is running.
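The tunnel, query, and teardown steps above can also be combined for one-off queries. This is a minimal sketch rather than a recommended workflow. It assumes the trino CLI is on the PATH, that a single coordinator pod matches the label selector used above, and that local port 8080 is free. The helper name run_trino_query and the fixed three-second wait are hypothetical.

```shell
#!/bin/sh
# Sketch: open a temporary tunnel, run one query, then close the tunnel.
# Assumptions: the trino CLI is installed; run_trino_query is a hypothetical
# helper name; exactly one coordinator pod matches the selector.
run_trino_query() {
  query="$1"
  release="${2:-example-trino-cluster}"

  # "-o name" yields "pod/<name>", which port-forward accepts directly.
  pod=$(kubectl get pods \
    -l "app=trino,release=${release},component=coordinator" -o name)

  # Run the tunnel in the background and remember its process ID.
  kubectl port-forward "$pod" 8080:8080 >/dev/null 2>&1 &
  tunnel_pid=$!
  sleep 3  # crude wait for the tunnel to come up

  status=0
  trino --server http://localhost:8080 --execute "$query" || status=$?

  # Stopping the kubectl process closes the tunnel.
  kill "$tunnel_pid" 2>/dev/null || true
  return "$status"
}

# Usage: run_trino_query "select count(*) from tpch.tiny.nation"
```

For interactive exploration, the manual steps remain simpler, because the tunnel stays open for as long as you need it.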
Configuration#
The Helm chart uses the Trino container image. The Docker image already contains a default configuration to get started, and some catalogs to allow you to explore Trino. Kubernetes allows you to mimic a traditional deployment by supplying configuration in YAML files. It’s important to understand how files such as the Trino configuration, JVM, and various catalog properties are configured in Trino before updating the values.
Creating your own YAML configuration#
When you use your own YAML Kubernetes configuration, you only override the values you specify.
The remaining properties use their default values. Add an example.yaml with the following configuration:
image:
  tag: "458"
server:
  workers: 3
coordinator:
  jvm:
    maxHeapSize: "8G"
worker:
  jvm:
    maxHeapSize: "8G"
These values are higher than the defaults and allow Trino to use more memory and run more demanding queries. If the values are too high, Kubernetes might not be able to schedule some Trino pods, depending on other applications deployed in this cluster and the size of the cluster nodes.
.image.tag is set to the current version, 458. Set this value if you need to use a specific version of Trino. The default is latest, which is not recommended. With latest, each subsequent Kubernetes deployment picks up whatever Trino version was released most recently.
.server.workers is set to 3. This value sets the number of workers. In this case, a coordinator and three worker nodes are deployed.
.coordinator.jvm.maxHeapSize is set to 8G. This sets the maximum heap size in the JVM of the coordinator. See JVM config.
.worker.jvm.maxHeapSize is set to 8G. This sets the maximum heap size in the JVM of the worker. See JVM config.
Warning
Some memory settings need to be tuned carefully as setting some values outside of the range of the maximum heap size will cause Trino startup to fail. See the warnings listed on Resource management properties.
Reference the full list of properties that can be overridden in the Helm chart.
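To see which properties exist before overriding them, you can dump the chart defaults and compare them with your own file. The sketch below assumes the chart repository was added under the name trino, as in the installation steps. The file name default-values.yaml and both helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: inspect the chart defaults and apply an override file to a release.
# Assumptions: the repository is registered as "trino"; default-values.yaml,
# save_default_values, and apply_values are hypothetical names.
save_default_values() {
  # Write every default value the chart defines to a file for comparison.
  helm show values trino/trino > "${1:-default-values.yaml}"
}

apply_values() {
  values_file="${1:-example.yaml}"
  release="${2:-example-trino-cluster}"
  # --install creates the release if it does not exist yet; otherwise the
  # existing release is upgraded with the new values.
  helm upgrade --install -f "$values_file" "$release" trino/trino
}

# Usage:
#   save_default_values
#   apply_values example.yaml example-trino-cluster
```

Using helm upgrade --install instead of helm install is a design choice in this sketch: the same command then works for the first deployment and for later changes to example.yaml.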
Note
Although example.yaml is used to refer to the Kubernetes configuration file in this document, you should use clear naming guidelines for the cluster and deployment you are managing. For example, cluster-example-trino-etl.yaml might refer to a Trino deployment for a cluster used primarily for extract-transform-load queries deployed on the example Kubernetes cluster. See Configuration Best Practices for more tips on configuring Kubernetes deployments.
Adding catalogs#
A common use case is to add custom catalogs. You can do this by adding values to the additionalCatalogs property in the example.yaml file.
additionalCatalogs:
  lakehouse: |-
    connector.name=iceberg
    hive.metastore.uri=thrift://example.net:9083
  rdbms: |-
    connector.name=postgresql
    connection-url=jdbc:postgresql://example.net:5432/database
    connection-user=root
    connection-password=secret
This adds both lakehouse and rdbms catalogs to the Kubernetes deployment configuration.
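Once the updated configuration is deployed and a tunnel to the coordinator is open, you can check that a catalog is actually registered by listing the catalogs through the CLI. This is a minimal sketch, assuming the trino CLI is installed and the coordinator is reachable at http://localhost:8080. The helper name has_catalog is hypothetical.

```shell
#!/bin/sh
# Sketch: check whether a catalog name appears in SHOW CATALOGS.
# Assumptions: the trino CLI is installed and a port-forward to the
# coordinator is running; has_catalog is a hypothetical helper name.
has_catalog() {
  catalog="$1"
  server="${2:-http://localhost:8080}"
  # TSV output prints one unquoted catalog name per line, so an exact
  # whole-line match (-x) on a fixed string (-F) is sufficient.
  trino --server "$server" --output-format TSV --execute "SHOW CATALOGS" \
    | grep -Fqx -- "$catalog"
}

# Usage:
#   has_catalog lakehouse && echo "lakehouse is available"
#   has_catalog rdbms && echo "rdbms is available"
```

A catalog that is listed is registered with Trino. Whether the underlying metastore or database is reachable only becomes clear when you query it.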
Running a local Kubernetes cluster with kind#
For local deployments, you can use kind (Kubernetes in Docker). Follow the steps below to run kind on your system.
kind runs on Docker, so first check if Docker is installed:
docker --version
If this command fails, install Docker by following Docker installation instructions.
Install kind by following the kind installation instructions.
Run a Kubernetes cluster in kind by running the command:
kind create cluster --name trino
Note
The name parameter is optional but is used here to show how the cluster name carries over into later commands. The cluster name defaults to kind if no parameter is added. Use trino to make the application on this cluster obvious.
Verify that kubectl is running against the correct Kubernetes cluster.
kubectl cluster-info --context kind-trino
If you have multiple Kubernetes clusters already configured within ~/.kube/config, you need to pass the context parameter to the kubectl commands to operate with the local kind cluster. kubectl uses the default context if this parameter isn’t supplied. Notice the context is the name of the cluster with the kind- prefix added. Now you can look at all the Kubernetes objects running on your kind cluster.
Set up Trino by following the Running Trino using Helm steps. When running the kubectl get all command, add the context parameter.
kubectl get all --context kind-trino
Run some queries by following the Executing queries steps.
Once you are done with the cluster using kind, you can delete the cluster.
kind delete cluster -n trino
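When several clusters are configured, the Helm commands need to target the kind cluster as well. The sketch below derives the context from the cluster name and passes it to both tools. It assumes the cluster was created with the name trino as above. The variable names KIND_CLUSTER and KUBE_CONTEXT and the helper name are hypothetical.

```shell
#!/bin/sh
# Sketch: derive the kubectl context from the kind cluster name and reuse it.
# Assumptions: KIND_CLUSTER, KUBE_CONTEXT, and install_trino_on_kind are
# hypothetical names; the chart repository is registered as "trino".
KIND_CLUSTER="trino"
KUBE_CONTEXT="kind-${KIND_CLUSTER}"  # kind prefixes the cluster name with "kind-"

install_trino_on_kind() {
  # Helm's global --kube-context flag plays the same role as kubectl's --context.
  helm install example-trino-cluster trino/trino --kube-context "$KUBE_CONTEXT"
  kubectl get all --context "$KUBE_CONTEXT"
}

# Usage: install_trino_on_kind
```

Passing the context explicitly on each command avoids changing the default context in ~/.kube/config, which matters if other tools on the same machine rely on it.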
Cleaning up#
To uninstall Trino from the Kubernetes cluster, run the following command:
helm uninstall example-trino-cluster
You should expect to see the following output:
release "example-trino-cluster" uninstalled
To validate that this worked, you can run this kubectl command to make sure there are no remaining Kubernetes objects related to the Trino cluster.
kubectl get all
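On a shared cluster, the plain kubectl get all output also lists unrelated workloads. The following sketch narrows the check to one release. It assumes the release was installed as example-trino-cluster and that its objects carry the release label seen in the install output, which can differ between chart versions. The helper name uninstall_trino is hypothetical.

```shell
#!/bin/sh
# Sketch: uninstall a release and list anything still labeled with it.
# Assumptions: objects carry a "release=<name>" label; uninstall_trino is a
# hypothetical helper name.
uninstall_trino() {
  release="${1:-example-trino-cluster}"
  helm uninstall "$release"
  # An empty listing means no labeled pods, services, deployments, or
  # replica sets are left. "get all" does not cover every resource type,
  # for example ConfigMaps and Secrets.
  kubectl get all -l "release=${release}"
}

# Usage: uninstall_trino example-trino-cluster
```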