
Is it possible to read a GRIB2 file from HDFS into an RDD via the Spark API? I found JavaSparkContext.binaryFiles, but the returned RDD contains cryptic data (not human-readable). I'm using Spark 1.6.1 and the Java API. Thank you!

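import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.InputStreamReader;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;
import scala.Tuple2;

String inputFile = "hdfs://hdfs:8020/data/testdata.bin";
SparkConf sparkConf = SparkConfFactory.createSparkConf("WeatherData");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// One (path, stream) pair per file
JavaPairRDD<String, PortableDataStream> inputRdd = sc.binaryFiles(inputFile);

List<Tuple2<String, PortableDataStream>> asList = inputRdd.collect();
for (Tuple2<String, PortableDataStream> a : asList) {
    System.out.println(a._1());  // Key = file path
    DataInputStream in = new DataInputStream(a._2().open());
    BufferedReader d = new BufferedReader(new InputStreamReader(in));

    // Decodes the binary content as text lines
    while (d.ready()) {
        System.out.println(d.readLine());  // Cryptic output
    }
}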
D. Müller
  • Maybe you can use `sc.binaryRecords()`? It loads data from a flat binary file, assuming the length of each record is constant. – Avihoo Mamka Jul 21 '16 at 08:59
  • Good idea. I just tried it, but the result is the same: the data is still in binary format... – D. Müller Jul 21 '16 at 09:40
  • After trying a few options, I found the best solution was to use cdo to convert GRIB2 to NetCDF, and then use spark-xarray to read it into an RDD. https://ncar.github.io/PySpark4Climate/ – Paul Bendevis Apr 29 '20 at 15:27
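
The output looks cryptic because GRIB2 is a binary format, so decoding it as text lines cannot produce readable values. GRIB2 messages also generally differ in length, which is why the fixed-length `binaryRecords` approach does not fit well either. As a minimal sketch (not a full decoder), the Java below walks a byte array and reads only the 16-byte indicator section of each GRIB2 message: the `GRIB` marker, the discipline, the edition number, and the total message length. The names `GribIndicatorScan`, `MessageInfo`, and `scan` are made up for illustration, and the byte array is assumed to come from something like `PortableDataStream.toArray()`, which loads the whole file into memory.

```java
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: locates GRIB2 messages by their indicator section. */
public class GribIndicatorScan {

    /** Position and header fields of one GRIB2 message (illustrative type). */
    public static final class MessageInfo implements Serializable {
        public final int offset;        // byte offset of "GRIB" in the input
        public final int discipline;    // octet 7 of the indicator section
        public final int edition;       // octet 8, expected to be 2
        public final long totalLength;  // octets 9-16, big-endian

        MessageInfo(int offset, int discipline, int edition, long totalLength) {
            this.offset = offset;
            this.discipline = discipline;
            this.edition = edition;
            this.totalLength = totalLength;
        }
    }

    private static final int INDICATOR_LENGTH = 16;

    /** Returns one entry per complete GRIB edition 2 message found in data. */
    public static List<MessageInfo> scan(byte[] data) {
        List<MessageInfo> result = new ArrayList<>();
        int pos = 0;
        while (pos + INDICATOR_LENGTH <= data.length) {
            boolean magic = data[pos] == 'G' && data[pos + 1] == 'R'
                    && data[pos + 2] == 'I' && data[pos + 3] == 'B';
            int edition = data[pos + 7] & 0xFF;
            if (!magic || edition != 2) {
                pos++;  // skip padding, junk, or other GRIB editions
                continue;
            }
            int discipline = data[pos + 6] & 0xFF;
            // ByteBuffer is big-endian by default, matching the GRIB2 layout
            long totalLength = ByteBuffer.wrap(data, pos + 8, 8).getLong();
            if (totalLength < INDICATOR_LENGTH || totalLength > data.length - pos) {
                break;  // truncated or corrupt message: stop scanning
            }
            result.add(new MessageInfo(pos, discipline, edition, totalLength));
            pos += (int) totalLength;  // jump to the start of the next message
        }
        return result;
    }
}
```

With the RDD from the question, this could presumably be applied per file, for example `inputRdd.mapValues(stream -> GribIndicatorScan.scan(stream.toArray()))` when running on Java 8. Turning the remaining sections into actual field values still needs a GRIB-aware library (NetCDF-Java is one candidate), or a conversion step like the one described in the last comment.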

0 Answers