
I have an RDD transformed into a DataFrame with the following structure:

+-------------+--------------------+
|          key|               value|
+-------------+--------------------+
|1556110998000|{"eventId":"55108...|
|1556110998000|{"eventId":"558ac...|
|1556110998000|{"eventId":"553c0...|
|1556111001600|{"eventId":"56886...|
|1556111001600|{"eventId":"569ad...|
|1556111001600|{"eventId":"56b34...|
|1556110998000|{"eventId":"55d1b...|
...

The key is a timestamp rounded down to the hour, and the value is a JSON string.

What I want is to store the values in different buckets according to the timestamp, so that the directory structure looks like this:

...
/datalake/2019/03/31/03
/datalake/2019/03/31/04
/datalake/2019/03/31/05
...
/datalake/2019/04/25/08
/datalake/2019/04/25/09
...

Simply storing the RDD with eventsRdd.saveAsTextFile("/datalake"); does not do the trick, as all events end up in one single file. Additionally, this file is overwritten in the next "round".

So how would I go about this? I read some articles about partitioning, but they didn't really help. I am actually considering switching to Kafka Connect and not using Spark at all for this.

Below is some code I tried for storing the events (just on the local file system for now):

private void saveToDatalake(JavaRDD<E> eventsRdd) {
    // Pair each event with its hour-rounded timestamp (the bucket key) and its JSON representation
    JavaPairRDD<Long, String> longEJavaPairRdd = eventsRdd
            .mapToPair(event -> new Tuple2<>(calculateRoundedDownTimestampFromSeconds(event.getTimestamp()), serialize(event)));

    SparkSession sparkSession = SparkSession.builder()
            .appName("Build a DataFrame from Scratch")
            .master("local[*]")
            .getOrCreate();

    StructType dataFrameSchema = DataTypes.createStructType(new StructField[]{
            DataTypes.createStructField("key", DataTypes.LongType, false),
            DataTypes.createStructField("value", DataTypes.StringType, false)
    });

    JavaRDD<Row> rowRdd = longEJavaPairRdd.map(pair -> RowFactory.create(pair._1, pair._2));
    Dataset<Row> dataFrame = sparkSession.createDataFrame(rowRdd, dataFrameSchema);

    // Collect the distinct keys to the driver: the DataFrame cannot be referenced
    // inside an executor-side foreach lambda.
    List<Row> buckets = dataFrame.select("key").dropDuplicates().collectAsList();
    for (Row bucket : buckets) {
        long timestamp = bucket.getLong(0);
        Dataset<Row> valuesPerBucket = dataFrame
                .where(dataFrame.col("key").equalTo(timestamp))
                .select("value");
        //valuesPerBucket.show();
        valuesPerBucket.rdd().saveAsTextFile("/data/datalake/" + calculateSubpathFromTimestamp(timestamp));
    }
}

private String calculateSubpathFromTimestamp(long timestamp) {
    // Build the yyyy/MM/dd/HH sub-path from an epoch-millisecond timestamp
    String FORMAT = "yyyy/MM/dd/HH";
    ZoneId zone = ZoneId.systemDefault();
    DateTimeFormatter df = DateTimeFormatter.ofPattern(FORMAT).withZone(zone);
    String time = df.format(Instant.ofEpochMilli(timestamp));
    System.out.println("Formatted Date " + time);
    return time;
}
  • Have you considered using Kafka Connect HDFS to do this? – Robin Moffatt Apr 25 '19 at 10:47
  • What kind of datalake? Azure? If so then it is possible using Kafka Connect HDFS. I managed to implement Kafka Connect HDFS for Azure Data Lake a couple of weeks ago. It is non-trivial, but you would not have to write any code. – sil Apr 25 '19 at 11:16
  • At a high level, you'd have to do `dataFrame.withColumn()` a few times, converting the timestamp to years, months, days. Then you'd use a method like `saveAsTable` instead... Or create the table and `dataFrame.sql("INSERT INTO...")` (see the sketch after these comments). – OneCricketeer Apr 25 '19 at 14:09
  • Indeed I am pursuing the Kafka Connect HDFS approach now. So far I managed to read the data from Kafka in Protobuf format and store it in HDFS in the right folders (YYYY/MM/DD/HH), in Protobuf format as well. What's left now is to convert / de-serialize the Protobuf events into proper JSON and then store them. For now HDFS will be sufficient, but later on we will move to S3, and I know there are connectors available as well, so that should not be too big of an issue. – El Shotodore Apr 26 '19 at 09:30
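
To illustrate the withColumn() suggestion above, here is a minimal sketch of a Spark-native alternative: derive year/month/day/hour columns from the key and let the DataFrameWriter create one directory per hour via partitionBy. This is only a sketch under a few assumptions: Spark 2.x, key holding epoch milliseconds, and append semantics being acceptable; the derived column names are illustrative.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.date_format;
import static org.apache.spark.sql.functions.from_unixtime;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Derive partition columns from the epoch-millisecond key ...
Dataset<Row> withTime = dataFrame
        .withColumn("ts", from_unixtime(col("key").divide(1000).cast("long")).cast("timestamp"))
        .withColumn("year", date_format(col("ts"), "yyyy"))
        .withColumn("month", date_format(col("ts"), "MM"))
        .withColumn("day", date_format(col("ts"), "dd"))
        .withColumn("hour", date_format(col("ts"), "HH"));

// ... and let Spark write one directory per (year, month, day, hour) combination.
withTime.select("year", "month", "day", "hour", "value")
        .write()
        .mode(SaveMode.Append)                       // keep earlier "rounds" instead of overwriting them
        .partitionBy("year", "month", "day", "hour")
        .text("/data/datalake");

Note that partitionBy produces Hive-style directories (e.g. /data/datalake/year=2019/month=04/day=25/hour=08); if the plain yyyy/MM/dd/HH layout is a hard requirement, the per-bucket loop from the question or a downstream rename step would still be needed.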

1 Answer


We got it done by using the Kafka Connect HDFS connector and providing a custom serializer class that converts the Protobuf messages from Kafka into JSON.
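
The converter class itself is not shown here, so below is only a minimal sketch of what the Protobuf-to-JSON step could look like, assuming a generated Protobuf message class named Event (a placeholder) and the protobuf-java-util JsonFormat helper. It is written against Kafka's generic Deserializer interface purely for illustration; the exact hook point (value.converter, format.class, or a deserializer) depends on how the connector is configured.

import java.util.Map;

import org.apache.kafka.common.serialization.Deserializer;

import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.util.JsonFormat;

// Sketch: turn raw Protobuf bytes from Kafka into a JSON string.
// "Event" stands in for the generated Protobuf message class.
public class ProtobufJsonDeserializer implements Deserializer<String> {

    private static final JsonFormat.Printer PRINTER = JsonFormat.printer();

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // no configuration needed for this sketch
    }

    @Override
    public String deserialize(String topic, byte[] data) {
        if (data == null) {
            return null;
        }
        try {
            Event event = Event.parseFrom(data);   // generated Protobuf class (placeholder)
            return PRINTER.print(event);           // render the message as JSON
        } catch (InvalidProtocolBufferException e) {
            throw new RuntimeException("Could not convert Protobuf message to JSON", e);
        }
    }

    @Override
    public void close() {
        // nothing to close
    }
}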