
I am working on a project on Kubernetes where I use Spark SQL to create tables, and I would like to register partitions and schemas in a Hive Metastore. However, I have not found any proper documentation on installing a Hive Metastore on Kubernetes. Is this possible, given that I already have a PostgreSQL database installed? If so, could you point me to any official documentation?

Thanks in advance.

Yassir S

2 Answers


Hive on MR3 allows you to run Metastore in a Pod on Kubernetes. The instructions may look complicated, but once the Pod is properly configured, it is easy to start Metastore on Kubernetes. You can also find a pre-built Docker image on Docker Hub, and a Helm chart is provided as well.

https://mr3docs.datamonad.com/docs/k8s/guide/run-metastore/

https://mr3docs.datamonad.com/docs/k8s/helm/run-metastore/

The documentation assumes MySQL, but we have tested it with PostgreSQL as well.

glapark

To install a Hive Metastore on Kubernetes, you need a Docker image that runs the Metastore service and a Kubernetes deployment configuration. The Metastore service stores its metadata in a relational database. Here are the steps:

  1. Setting up Hive Metastore on Kubernetes:

    • Create a Kubernetes Pod for metastore-standalone using a YAML file. This Pod should include:
      • An init container to download the dependencies your application needs. This container uses the busybox:1.28 image and runs a shell command to download the hadoop-aws and aws-java-sdk-bundle JAR files from the Maven repository.
      • The main container, which runs the apache/hive:3.1.3 image and is configured to run the Hive Metastore service. The Metastore service manages metadata for Hive tables and partitions.
      • Several environment variables for the Hive Metastore service: SERVICE_NAME is set to metastore, and AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set for accessing AWS S3.
      • A command that first moves the downloaded JAR files from the /jars directory to the /opt/hadoop/share/hadoop/common directory, then initializes the Metastore schema using Derby as the database, and finally runs the Hive Metastore service.
  2. Using a remote database: The Hive Metastore in this setup uses a local/embedded Metastore database (Derby), which is fine for testing or development. For production use, a remote Metastore database is recommended. The Metastore service supports several database types, including MySQL, PostgreSQL, Oracle, and MS SQL Server. To use a remote database, you need to adjust the schematool command and provide additional connection configurations; since you already have a PostgreSQL database, you can point the Metastore at it (see the sketch after this list).

  3. Configuring Spark to Use the Remote Hive Metastore and S3 as a Warehouse:

    • To point Spark at your Hive Metastore service, pass the relevant configurations when submitting your Spark application. These include your AWS credentials, the S3 bucket name, and the hadoop-aws package, which lets Spark interact with S3 (a spark-submit sketch is shown after the Pod example below).
    • The configurations spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key provide the access keys needed to interact with AWS S3, and spark.hadoop.fs.s3a.endpoint sets the endpoint used to reach the S3 service.
    • Setting spark.sql.catalogImplementation to hive makes the application use Hive's catalog implementation, and spark.hadoop.hive.metastore.uris sets the URI for connecting to the Hive Metastore.
  4. Solving common issues:

    • Issue 1: If there is a bug in the Docker image entrypoint, the most effective workaround is to override the default command with a custom one in the containers section of your Kubernetes configuration. This command initializes the schema for the Derby database and then starts the Hive Metastore service.
    • Issue 2: If the default Docker image for Hive does not include the AWS-related JAR files needed to connect the Metastore service to an S3 bucket, add an init container in your Kubernetes configuration. This init container, based on a busybox image, downloads the necessary AWS JAR files from a Maven repository and stores them in a shared volume; the JARs are then added to the classpath of the main Hive Metastore container. Keep in mind that these instructions are fairly general, and you may need to adjust them to fit your environment.
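
For step 2, here is a rough sketch of initializing the schema against an existing PostgreSQL database and then starting the Metastore with the same connection settings. The hostname postgres.default.svc, the database name metastore, and the hive user and password are placeholders for your own setup, and it assumes the PostgreSQL JDBC driver JAR is already on the Metastore classpath (it can be downloaded in the init container the same way as the AWS JARs):

# Initialize the Metastore schema in an existing PostgreSQL database
# (hostname, database name, user name and password are placeholders).
/opt/hive/bin/schematool -dbType postgres -initSchema \
  -url "jdbc:postgresql://postgres.default.svc:5432/metastore" \
  -driver org.postgresql.Driver \
  -userName hive \
  -passWord hive-password

# Start the Metastore with the same connection settings instead of Derby.
exec /opt/hive/bin/hive --service metastore \
  --hiveconf javax.jdo.option.ConnectionURL="jdbc:postgresql://postgres.default.svc:5432/metastore" \
  --hiveconf javax.jdo.option.ConnectionDriverName=org.postgresql.Driver \
  --hiveconf javax.jdo.option.ConnectionUserName=hive \
  --hiveconf javax.jdo.option.ConnectionPassword=hive-password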

Also, remember to replace the AWS Access Key and Secret Key placeholders in the YAML configuration and Spark-submit command with your actual AWS credentials.

Here is a simple example:

apiVersion: v1
kind: Pod
metadata:
  name: metastore-standalone
spec:
  initContainers:
  # Downloads the hadoop-aws and aws-java-sdk-bundle JARs into a shared volume.
  - name: download-dependencies
    image: busybox:1.28
    command:
    - /bin/sh
    - -c
    - |
      wget -P /jars https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.0/hadoop-aws-3.1.0.jar https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar
    volumeMounts:
    - name: jar-volume
      mountPath: /jars
  containers:
  # Runs the Hive Metastore service on the default Thrift port 9083.
  - name: hive-metastore
    image: apache/hive:3.1.3
    ports:
    - containerPort: 9083
    env:
    - name: SERVICE_NAME
      value: metastore
    # Replace the placeholders below with your actual AWS credentials.
    - name: AWS_ACCESS_KEY_ID
      value: your-aws-access-key
    - name: AWS_SECRET_ACCESS_KEY
      value: your-aws-secret-key
    # Moves the downloaded JARs onto the Hadoop classpath, initializes the
    # Derby schema, then starts the Metastore service.
    command: ["/bin/bash", "-c", "mv /jars/* /opt/hadoop/share/hadoop/common && /opt/hive/bin/schematool -dbType derby -initSchema && exec /opt/hive/bin/hive --skiphadoopversion --skiphbasecp --service metastore --hiveconf fs.s3a.endpoint=your-s3-endpoint"]
    volumeMounts:
    - name: jar-volume
      mountPath: /jars
  volumes:
  - name: jar-volume
    emptyDir: {}
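
For step 3, a minimal spark-submit sketch along the lines described above could look like this. It assumes the Pod is exposed through a Kubernetes Service named metastore on port 9083; the bucket name, endpoint, credentials, and the your_spark_sql_job.py script are placeholders for your own values:

# Submit a Spark SQL job that uses the remote Hive Metastore and S3 as the warehouse.
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.1.0 \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://metastore.default.svc:9083 \
  --conf spark.sql.warehouse.dir=s3a://your-bucket/warehouse \
  --conf spark.hadoop.fs.s3a.access.key=your-aws-access-key \
  --conf spark.hadoop.fs.s3a.secret.key=your-aws-secret-key \
  --conf spark.hadoop.fs.s3a.endpoint=your-s3-endpoint \
  your_spark_sql_job.py

Tables created through Spark SQL will then have their schemas and partitions registered in the shared Metastore.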

I hope this helps! Let me know if you have any questions.

Eugene Lopatkin