We are trying to load data, saved as sequence files, into BQ using the Google Dataflow SDK.
At the entry point, we are trying to read the data into the pipeline using the following code:
Read.Bounded<KV<LongWritable, BytesWritable>> results = HadoopFileSource.readFrom(
    "gs://raw-data/topic-name/dt=2017-02-28/1_0_00000000002956516884.gz",
    org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class,
    LongWritable.class, BytesWritable.class);
[1] We are using the "gcs-connector" so that Hadoop can treat gs:// paths as a file system.
[2] HadoopFileSource is from com.google.cloud.dataflow.contrib.hadoop.
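For context, this is roughly how we intend to wire the source into the pipeline and push the rows to BQ (the table spec, schema, options and decodeRecord(...) are placeholders for our own code; results is the Read.Bounded from the snippet above):

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;

Pipeline p = Pipeline.create(options);  // our PipelineOptions

p.apply(results)  // the Read.Bounded<KV<LongWritable, BytesWritable>> from above
 .apply(ParDo.of(new DoFn<KV<LongWritable, BytesWritable>, TableRow>() {
     @Override
     public void processElement(ProcessContext c) {
         // decodeRecord(...) is a placeholder for our own BytesWritable -> TableRow conversion
         c.output(decodeRecord(c.element().getValue()));
     }
 }))
 .apply(BigQueryIO.Write
         .to("our-project:our_dataset.our_table")  // placeholder table spec
         .withSchema(tableSchema)                  // placeholder TableSchema
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

p.run();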
Our core-site.xml file looks like this:
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The FileSystem for gs: (GCS) uris.</description>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>
      The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
    </description>
  </property>
</configuration>
But we keep getting "java.net.UnknownHostException: metadata".
I even added GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json" to the environment variables, but we are still getting the same exception.
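We also wondered whether we have to point the gcs-connector at the key file explicitly in core-site.xml instead of relying on the metadata server, something along these lines (the property names are our guess from the gcs-connector documentation):

  <property>
    <name>google.cloud.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>google.cloud.auth.service.account.json.keyfile</name>
    <value>/path/to/key.json</value>
  </property>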
We just need an easy way to read sequence files from GCS into a Google Dataflow pipeline.
Any help would be appreciated.
Thanks, Avi