3

I have created a Dataproc cluster with an updated init action to install Datalab.

Everything works fine, except that when I query a Hive table from the Datalab notebook:

hc.sql("""select * from invoices limit 10""")

I run into a "java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found" exception.

Create cluster

gcloud beta dataproc clusters create ds-cluster \
--project my-exercise-project \
--region us-west1 \
--zone us-west1-b \
--bucket dataproc-datalab \
--scopes cloud-platform  \
--num-workers 2  \
--enable-component-gateway  \
--initialization-actions gs://dataproc_mybucket/datalab-updated.sh,gs://dataproc-initialization-actions/connectors/connectors.sh  \
--metadata 'CONDA_PACKAGES="python==3.5"'  \
--metadata gcs-connector-version=1.9.11  

datalab-updated.sh

  -v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
    mkdir -p ${HOME}/datalab
    gcloud source repos clone datalab-notebooks ${HOME}/datalab/notebooks

In the Datalab notebook:

from pyspark.sql import HiveContext
hc=HiveContext(sc)
hc.sql("""show tables in default""").show()
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
      (SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
      STORED AS PARQUET
      LOCATION 'gs://my-exercise-project-ds-team/datasets/invoices'""")
hc.sql("""select * from invoices limit 10""")

UPDATE: I also tried setting the GCS connector configuration explicitly in the notebook, but I still get the same exception:

spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "~/Downloads/my-exercise-project-f47054fc6fd8.json")
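
A quick way to check whether the connector class is visible to the notebook's Spark JVM at all (a diagnostic sketch, not part of the original setup; it assumes the sc SparkContext used above):

# Diagnostic sketch: ask the driver JVM, via the py4j gateway that PySpark
# exposes, whether the GCS connector class can be loaded at all.
try:
    sc._jvm.java.lang.Class.forName(
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    print("GCS connector class is on the driver classpath")
except Exception as e:
    print("GCS connector class NOT found:", e)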

UPDATE 2 (datalab-updated.sh):

function run_datalab(){
  if docker run -d --restart always --net=host  \
      -v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
    mkdir -p ${HOME}/datalab
    gcloud source repos clone datalab-notebooks ${HOME}/datalab/notebooks
    echo 'Cloud Datalab Jupyter server successfully deployed.'
  else
    err 'Failed to run Cloud Datalab'
  fi
}
  • Can you post the entire contents of datalab-updated.sh? Are you conda installing or pip installing anything else? FYI when running on dataproc you shouldn't need to run the `spark._jsc.hadoopConfiguration()` commands, and in fact they might just cause problems – Dennis Huo May 02 '19 at 20:18
  • Thank you Dennis. I just updated (UPDATE 2) my original post with the changes to datalab-updated.sh. I am not installing any new packages, either through conda or through pip. – GCPEnthusiast May 03 '19 at 00:12

2 Answers

3

You should use the Datalab initialization action to install Datalab on a Dataproc cluster:

gcloud dataproc clusters create ${CLUSTER} \
    --image-version=1.3 \
    --scopes cloud-platform \
    --initialization-actions=gs://dataproc-initialization-actions/datalab/datalab.sh

After this, Hive works with GCS out of the box in Datalab:

from pyspark.sql import HiveContext
hc=HiveContext(sc)
hc.sql("""SHOW TABLES IN default""").show()

Output:

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+

Creating an external table on GCS using Hive in Datalab:

hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
      (SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
      STORED AS PARQUET
      LOCATION 'gs://<BUCKET>/datasets/invoices'""")

Output:

DataFrame[]

Querying the GCS table using Hive in Datalab:

hc.sql("""SELECT * FROM invoices LIMIT 10""")

Output:

DataFrame[SubmissionDate: date, TransactionAmount: double, TransactionType: string]
Igor Dvorzhak
  • Igor, thank you for your response. Like I said, creating the external table is not an issue; querying the table is. The following does not work: **hc.sql("""select * from invoices limit 10""").show()** fails with a "java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found" exception. – GCPEnthusiast May 02 '19 at 12:02
  • Added the result of the query execution - it still works out of the box if you are using the Datalab init action. – Igor Dvorzhak May 02 '19 at 14:28
  • Thank you Igor. I see that you are using version 1.3. Can you please check if it works for version "1.4-debian9" ? – GCPEnthusiast May 02 '19 at 22:49
  • Just checked, it works for Dataproc 1.4 too. In your case you may want to specify the `connectors.sh` init action before `datalab-updated.sh`, because the connectors init action will change the connector jar name. – Igor Dvorzhak May 02 '19 at 22:56
  • Yes, I had connectors.sh specified before my datalab script, however it failed with the same exception. I am using gcs-connector-version=1.9.11, should I be using a different version? Please advise. – GCPEnthusiast May 02 '19 at 23:38
  • You should use the latest Dataproc image that already has the latest GCS connector pre-installed (unless you are using Dataproc 1.0-1.2) - in this case you don't need to use the connectors init action at all. Why don't you want to use the default [Datalab init action](https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/datalab) that works out of the box? – Igor Dvorzhak May 03 '19 at 01:11
  • Yes, regular Dataproc works (even version 1.4) and I am able to query the Hive table. What I've noticed is that I encounter this issue when I use "beta dataproc", which is required for cluster auto-scaling. Do we currently have this functionality, or am I missing something here? Please advise. – GCPEnthusiast May 03 '19 at 03:18
  • I just tested it with `gcloud beta dataproc clusters create ${CLUSTER} --image-version=1.4 --metadata CONDA_PACKAGES=python==3.5 --initialization-actions=gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh,gs://dataproc-initialization-actions/datalab/datalab.sh` and querying GCS table using Hive in Datalab works out of the box. – Igor Dvorzhak May 03 '19 at 03:41
  • Thank you Igor. Can you please add the below and see if it works for you? ``--autoscaling-policy=your-autoscaling-policy --max-idle 2h `` – GCPEnthusiast May 03 '19 at 04:16
  • Yes, even with `--autoscaling-policy=your-autoscaling-policy --max-idle 2h` properties it still works out of the box. – Igor Dvorzhak May 03 '19 at 05:22
  • It does not work for me for some reason. Appreciate all your support. Let us lay this to rest now. Thanks again!! – GCPEnthusiast May 03 '19 at 17:04
  • It doesn't work for you with exactly the same command (uses default Datalab init action) that I have used, or with your own Datalab init action? – Igor Dvorzhak May 03 '19 at 17:23
  • Thank you for your help. I would have to say that your commands worked. I am having issues with the customizations in my datalab script. – GCPEnthusiast May 03 '19 at 23:06
  • In this case you can take a look at Datalab init action [source code](https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/datalab/datalab.sh) for inspiration. – Igor Dvorzhak May 03 '19 at 23:36
  • Thank you Igor. Can you please point me to the code where the Dataproc image version is handled? E.g. 1.3 vs 1.4. Thanks in advance. – GCPEnthusiast May 04 '19 at 17:30
  • The only Dataproc 1.4-specific part is [this one](https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/08d043a48bf52bf5288f68bac7f9cfa23b5edf92/datalab/datalab.sh#L40) - in Dataproc 1.4 the GCS connector was moved to the `/usr/local/share/google/dataproc/lib` directory. – Igor Dvorzhak May 05 '19 at 02:23
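
To verify which GCS connector jar an image actually ships, you can SSH to the cluster's master node and list it (a sketch based on the comment above; only the Dataproc 1.4 path is confirmed there, the older path is an assumption):

# Dataproc 1.4 keeps the pre-installed GCS connector here (per the comment above):
ls /usr/local/share/google/dataproc/lib/gcs-connector*.jar
# Earlier 1.x images are assumed to keep it under Hadoop's lib directory instead:
ls /usr/lib/hadoop/lib/gcs-connector*.jar
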
0

If you want to use Hive in Datalab, you have to enable the Hive metastore:

--properties hive:hive.metastore.warehouse.dir=gs://$PROJECT-warehouse/datasets \
--metadata "hive-metastore-instance=$PROJECT:$REGION:hive-metastore"

In your case it will be:

--properties hive:hive.metastore.warehouse.dir=gs://$PROJECT-warehouse/datasets \
--metadata "hive-metastore-instance=$PROJECT:$REGION:hive-metastore"

hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
      (SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
      STORED AS PARQUET
      LOCATION 'gs://$PROJECT-warehouse/datasets/invoices'""")
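
For reference, a sketch of the full cluster-create command with these flags merged into the command from the question, substituting the question's project and region (the gs://my-exercise-project-warehouse bucket and the hive-metastore Cloud SQL instance are assumptions - substitute your own):

gcloud beta dataproc clusters create ds-cluster \
    --project my-exercise-project \
    --region us-west1 \
    --zone us-west1-b \
    --bucket dataproc-datalab \
    --scopes cloud-platform \
    --num-workers 2 \
    --enable-component-gateway \
    --initialization-actions gs://dataproc_mybucket/datalab-updated.sh,gs://dataproc-initialization-actions/connectors/connectors.sh \
    --metadata 'CONDA_PACKAGES="python==3.5"' \
    --metadata gcs-connector-version=1.9.11 \
    --properties hive:hive.metastore.warehouse.dir=gs://my-exercise-project-warehouse/datasets \
    --metadata "hive-metastore-instance=my-exercise-project:us-west1:hive-metastore"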

And make sure to add the following settings to enable GCS access:

sc._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')

# This is required if you are using a service account JSON keyfile; set it to 'true' in that case
sc._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
sc._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "/path/to/keyfile")

# The following are required if you are using OAuth
sc._jsc.hadoopConfiguration().set('fs.gs.auth.client.id', 'YOUR_OAUTH_CLIENT_ID')
sc._jsc.hadoopConfiguration().set('fs.gs.auth.client.secret', 'OAUTH_SECRET')
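
With the configuration above in place, the query that previously failed can be retried (a short usage sketch, assuming the hc HiveContext from the question):

# Re-run the query that raised the ClassNotFoundException.
hc.sql("""SELECT * FROM invoices LIMIT 10""").show()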
howie
  • Howie, thank you for your response. However, it did not work for me, even after setting the above GCS settings as well as your other metadata/properties. I downloaded the JSON service account key file. Where does this JSON key file need to be? I am assuming on my local machine... correct? – GCPEnthusiast May 02 '19 at 03:06
  • Still the same exception? – howie May 02 '19 at 03:07