I am trying to access a Google Storage bucket from a Hadoop cluster deployed in Google Cloud using the bdutil script. Access fails if the bucket is read-only.
Here is what I am doing:
Deploy a cluster with:

    bdutil deploy -e datastore_env.sh
On the master:
    vgorelik@vgorelik-hadoop-m:~$ hadoop fs -ls gs://pgp-harvard-data-public 2>&1 | head -10
    14/08/14 14:34:21 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.8-hadoop1
    14/08/14 14:34:25 WARN gcsio.GoogleCloudStorage: Repairing batch of 174 missing directories.
    14/08/14 14:34:26 ERROR gcsio.GoogleCloudStorage: Failed to repair some missing directories.
    java.io.IOException: Multiple IOExceptions.
    java.io.IOException: Multiple IOExceptions.
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createCompositeException(GoogleCloudStorageExceptions.java:61)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:361)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:372)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listObjectInfo(GoogleCloudStorageImpl.java:914)
        at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.listObjectInfo(CacheSupplementedGoogleCloudStorage.java:455)
Looking at the GCS connector's Java source code, it seems that the Google Cloud Storage Connector for Hadoop needs empty "directory" objects, which it can create on its own if the bucket is writable; otherwise it fails. Setting fs.gs.implicit.dir.repair.enable=false leads to an "Error retrieving object" error.
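For reference, this is how I disable the repair behaviour; a minimal sketch, assuming the property can be passed per command via Hadoop's generic -D option (it could equally go into core-site.xml):

    # Pass the connector property for a single command via the generic -D option
    hadoop fs -D fs.gs.implicit.dir.repair.enable=false \
        -ls gs://pgp-harvard-data-public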
Is it possible to somehow use a read-only bucket as MapReduce job input?
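To make the goal concrete, this is the kind of job I would like to run; a sketch only, with the examples jar path, input path, and output bucket as hypothetical placeholders:

    # Hypothetical job: read input directly from the read-only bucket,
    # write output to a bucket I own (placeholder names).
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount \
        gs://pgp-harvard-data-public/some/input/path \
        gs://my-writable-bucket/wordcount-output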
I use gsutil to upload the files. Can it be forced to create these empty directory objects on upload?
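For context, my uploads look roughly like this (a sketch; the local directory and destination bucket are placeholders), and as far as I can tell only the file objects are created, with no zero-byte "directory" placeholders alongside them:

    # Hypothetical upload: recursive copy of a local directory into a bucket I own.
    # Only file objects are created; no empty "dir/" marker objects appear.
    gsutil cp -R ./local-data gs://my-writable-bucket/data/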