
The Apache Drill features list mentions that it can query data from Google Cloud Storage, but I can't find any information on how to do that. I've got it working fine with S3, but suspect I'm missing something very simple in terms of Google Cloud Storage.

Does anyone have an example Storage Plugin configuration for Google Cloud Storage?

Thanks

M

MJM
  • Have you found out how to do this? I've tried the mailing list and Twitter but no answer :/ – Johan Apr 07 '16 at 08:03
  • Did you find any solution for this? I am trying to run a Drill cluster on Compute Engine instances that I need to connect to Google Cloud Storage. – sanjay Oct 05 '16 at 13:44

2 Answers


This is quite an old question, so I imagine you either found a solution or moved on with your life, but for anyone looking to do this without Dataproc, here's a solution:

  1. Add the GCS connector JAR file from the GCP connectors to the jars/3rdparty directory (see the sketch after this list for one way to fetch it).
  2. Add the following properties, inside the existing <configuration> element, to the core-site.xml file in the conf directory (change the upper-case values such as YOUR_PROJECT_ID to your own details):
    <property>
      <name>fs.gs.project.id</name>
      <value>YOUR_PROJECT_ID</value>
      <description>
        Optional. Google Cloud Project ID with access to GCS buckets.
        Required only for list buckets and create bucket operations.
      </description>
    </property>
    <property>
      <name>fs.gs.auth.service.account.private.key.id</name>
      <value>YOUR_PRIVATE_KEY_ID</value>
    </property>
    <property>
      <name>fs.gs.auth.service.account.private.key</name>
      <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
    </property>
    <property>
      <name>fs.gs.auth.service.account.email</name>
      <value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
      <description>
        The email address associated with the service account used for GCS
        access when fs.gs.auth.service.account.enable is true. Required
        when the authentication key specified in the configuration file
        (Method 1) or a PKCS12 certificate (Method 3) is being used.
      </description>
    </property>
    <property>
      <name>fs.gs.working.dir</name>
      <value>/</value>
      <description>
        The directory inside the default bucket that relative gs: URIs
        resolve against.
      </description>
    </property>
    <property>
      <name>fs.gs.implicit.dir.repair.enable</name>
      <value>true</value>
      <description>
        Whether or not to create objects for the parent directories of objects
        with / in their path, e.g. creating gs://bucket/foo/ upon deleting or
        renaming gs://bucket/foo/bar.
      </description>
    </property>
    <property>
      <name>fs.gs.glob.flatlist.enable</name>
      <value>true</value>
      <description>
        Whether or not to prepopulate potential glob matches in a single list
        request to minimize calls to GCS in nested glob cases.
      </description>
    </property>
    <property>
      <name>fs.gs.copy.with.rewrite.enable</name>
      <value>true</value>
      <description>
        Whether or not to perform copy operations using Rewrite requests. This
        allows copying files between different locations and storage classes.
      </description>
    </property>

  3. Start Apache Drill.
  4. Add a custom storage plugin to Drill that points at your gs:// bucket (the second answer below has a full example configuration).
  5. You're good to go.
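
For step 1, here's a minimal sketch of one way to get the connector JAR onto a plain (non-Dataproc) machine; the download URL, connector version, and Drill install path are assumptions of mine, so check the GCP connectors documentation for the artifact that matches your Hadoop libraries:

    # Assumed Drill install location; adjust to your setup.
    DRILL_HOME=/opt/apache-drill-1.6.0

    # Fetch the Hadoop 2 build of the GCS connector into Drill's 3rd-party JAR directory.
    # The "latest" URL is an assumption; pin a specific version if you prefer.
    wget -O "${DRILL_HOME}/jars/3rdparty/gcs-connector-hadoop2-latest.jar" \
      https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar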

The solution is from here, where I detail some more about what we do around data exploration with Apache Drill.

user1369384
  • As I shared in a comment to this [answer](https://stackoverflow.com/a/60744100/6678903), this answer and the linked blog worked for me w/o the need to use DataProc. The only caveat was that when I tried to specify the service account email and private key ID/value, as shown above, I would get the following error: `Error: SYSTEM ERROR: IOException: Invalid PKCS8 data.` However, copying the JSON file provided by GCP to my Drill instance and specifying the `fs.gs.auth.service.account.json.keyfile` instead, per the blog's reference to the format specified in the Hadoop connector repo, worked great – dev8675309 Mar 31 '20 at 14:33
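
As a side note on the comment above, here's a minimal sketch of that JSON-keyfile alternative, assuming you've copied the key file downloaded from GCP to a path readable by the Drill process (the path below is a placeholder, not from the original answer):

    <property>
      <name>fs.gs.auth.service.account.enable</name>
      <value>true</value>
    </property>
    <property>
      <name>fs.gs.auth.service.account.json.keyfile</name>
      <!-- Placeholder path; point this at the JSON key file for your service account -->
      <value>/opt/apache-drill-1.6.0/conf/service-account.json</value>
    </property>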

I managed to query Parquet data in Google Cloud Storage (GCS) using Apache Drill (1.6.0) running on a Google Dataproc cluster. To set that up, I took the following steps:

  1. Install Drill and make the GCS connector accessible (this can be used as an init action for Dataproc; just note it wasn't really tested and relies on a local ZooKeeper instance; see the gcloud sketch after this list):

    #!/bin/sh
    set -x -e

    # Download and unpack Apache Drill 1.6.0 under /opt
    BASEDIR="/opt/apache-drill-1.6.0"
    mkdir -p ${BASEDIR}
    cd ${BASEDIR}
    wget http://apache.mesi.com.ar/drill/drill-1.6.0/apache-drill-1.6.0.tar.gz
    tar -xzvf apache-drill-1.6.0.tar.gz
    mv apache-drill-1.6.0/* .
    rm -rf apache-drill-1.6.0 apache-drill-1.6.0.tar.gz

    # Reuse the GCS connector and Hadoop core-site.xml that Dataproc already provides
    ln -s /usr/lib/hadoop/lib/gcs-connector-1.4.5-hadoop2.jar ${BASEDIR}/jars/gcs-connector-1.4.5-hadoop2.jar
    mv ${BASEDIR}/conf/core-site.xml ${BASEDIR}/conf/core-site.xml.old
    ln -s /etc/hadoop/conf/core-site.xml ${BASEDIR}/conf/core-site.xml

    # Start the Drill daemon (relies on a local ZooKeeper instance)
    drillbit.sh start

    set +x +e
    
  2. Connect to the Drill console, create a new storage plugin (call it, say, gcs), and use the following configuration (note that I copied most of it from the S3 config and made minor changes):

    {
      "type": "file",
      "enabled": true,
      "connection": "gs://myBucketName",
      "config": null,
      "workspaces": {
        "root": {
          "location": "/",
          "writable": false,
          "defaultInputFormat": null
        },
        "tmp": {
          "location": "/tmp",
          "writable": true,
          "defaultInputFormat": null
        }
      },
      "formats": {
        "psv": {
          "type": "text",
          "extensions": [
            "tbl"
          ],
          "delimiter": "|"
        },
        "csv": {
          "type": "text",
          "extensions": [
            "csv"
          ],
          "delimiter": ","
        },
        "tsv": {
          "type": "text",
          "extensions": [
            "tsv"
          ],
          "delimiter": "\t"
        },
        "parquet": {
          "type": "parquet"
        },
        "json": {
          "type": "json",
          "extensions": [
            "json"
          ]
        },
        "avro": {
          "type": "avro"
        },
        "sequencefile": {
          "type": "sequencefile",
          "extensions": [
            "seq"
          ]
        },
        "csvh": {
          "type": "text",
          "extensions": [
            "csvh"
          ],
          "extractHeader": true,
          "delimiter": ","
        }
      }
    }
    
  3. Query using the following syntax (note the backticks):

    select * from gs.`root`.`path/to/data/*` limit 10;
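
If you do want to run the script from step 1 as a Dataproc initialization action, a minimal sketch of the cluster creation command looks like the following; the script name, bucket, and cluster name are placeholders of mine, not from the original answer:

    # Upload the init script (names are placeholders), then reference it at cluster creation.
    gsutil cp install-drill.sh gs://YOUR_BUCKET/init/install-drill.sh
    gcloud dataproc clusters create drill-cluster \
        --initialization-actions gs://YOUR_BUCKET/init/install-drill.sh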
    
  • How can we do this if I create a Drill cluster on Compute Engine instances and try to access Google Cloud Storage? – sanjay Oct 05 '16 at 13:48
  • I believe you need the [GCS connector](https://cloud.google.com/hadoop/google-cloud-storage-connector). It is pre-installed in Dataproc, and I haven't tried to use it in other scenarios such as "bare" compute instances with Drill. Dataproc takes care of a lot of the management headache for you, with the downside that other unnecessary services run on the cluster, but in my experience the overhead is negligible. – Dan Edelstein Oct 06 '16 at 19:35