JavaRDD<String> hbaseFile = jsc.textFile(HDFS_MASTER + HBASE_FILE);
JavaPairRDD<ImmutableBytesWritable, KeyValue> putJavaRDD = hbaseFile.mapToPair(line -> convertToKVCol1(line, COLUMN_AGE));
// sortByKey returns a new RDD; the result must be assigned, otherwise the output is written unsorted
JavaPairRDD<ImmutableBytesWritable, KeyValue> sortedJavaRDD = putJavaRDD.sortByKey(true);
sortedJavaRDD.saveAsNewAPIHadoopFile(stagingFolder, ImmutableBytesWritable.class, KeyValue.class, HFileOutputFormat2.class, conf);

private static Tuple2<ImmutableBytesWritable, KeyValue> convertToKVCol1(String beanString, byte[] column) {
    InspurUserEntity inspurUserEntity = gson.fromJson(beanString, InspurUserEntity.class);
    // Row key: department_level1_department_level2_id
    String rowKey = inspurUserEntity.getDepartment_level1() + "_" + inspurUserEntity.getDepartment_level2() + "_" + inspurUserEntity.getId();
    // Emits exactly one KeyValue (a single cell) per input line
    return new Tuple2<>(new ImmutableBytesWritable(Bytes.toBytes(rowKey)),
            new KeyValue(Bytes.toBytes(rowKey), COLUMN_FAMILY, column, Bytes.toBytes(inspurUserEntity.getAge())));
}

The above is my code; it only writes a single column per row key. Any ideas on how to create an HFile with multiple columns for one row key?

徐琮杰

2 Answers


You must use an array of KeyValues instead of a single ImmutableBytesWritable/KeyValue pair in the declaration.

Umais Jan
  • Thanks for helping me. I am a newbie with MapReduce and Spark. Do you have an example of how to use an array instead of ImmutableBytesWritable? Thanks a lot. – 徐琮杰 Sep 22 '17 at 07:27
  • This is my code: return new Tuple2<>(new ImmutableBytesWritable(rowKyeBytes), new KeyValue(xxxx)); how do I use an array here? – 徐琮杰 Sep 22 '17 at 07:28

You can create multiple Tuple2<ImmutableBytesWritable, KeyValue> entries for one row, where the key stays the same and each KeyValue represents an individual cell value. Make sure to order your columns lexicographically as well. You then invoke saveAsNewAPIHadoopFile on the resulting JavaPairRDD<ImmutableBytesWritable, KeyValue>.

    final JavaPairRDD<ImmutableBytesWritable, KeyValue> writables = myRdd.flatMapToPair(record -> {
        final List<Tuple2<ImmutableBytesWritable, KeyValue>> listToReturn = new ArrayList<>();
        final byte[] rowKey = Bytes.toBytes(record.getRowKey());
        // Add first column to the collection
        listToReturn.add(new Tuple2<>(
                new ImmutableBytesWritable(rowKey),
                new KeyValue(rowKey, Bytes.toBytes("CF"),
                        Bytes.toBytes("COL1"), System.currentTimeMillis(),
                        Bytes.toBytes(record.getCol1()))));
        // Add subsequent columns
        listToReturn.add(new Tuple2<>(
                new ImmutableBytesWritable(rowKey),
                new KeyValue(rowKey, Bytes.toBytes("CF"),
                        Bytes.toBytes("COL2"), System.currentTimeMillis(),
                        Bytes.toBytes(record.getCol2()))));
        // flatMapToPair (Spark 2.x) expects an Iterator over the emitted pairs
        return listToReturn.iterator();
    });

NOTE: This is a major gotcha: you must also add your columns to the RDD in lexicographic order.

Essentially, the combination of row key + column family + column qualifier must be sorted before you proceed to write out the HFiles; a sketch of that sort step follows.

Sujay Anjankar