We are using Scalding for ETL and writing the output as a partitioned Hive table. For Hive to recognize the partitions, the partition directories need names like "state=CA". We are using TemplatedTsv as follows:
pipe
  // some other ETL
  .map('STATE -> 'hdfs_state) { state: Int => "State=" + state }
  .groupBy('hdfs_state) { _.pass }
  .write(TemplatedTsv(baseOutputPath, "%s", 'hdfs_state,
    writeHeader = false,
    sinkMode = SinkMode.UPDATE,
    fields = ('all except 'hdfs_state)))
We adapted the code sample from "How to bucket outputs in Scalding". We have two issues:
- IntelliJ can't resolve except. Am I missing an import? We don't want to list every field explicitly in "fields = ()", because the fields are derived from the code inside the groupBy statement, and an explicit list could easily get out of sync with it.
- This approach feels hacky: we create an extra column just so that the directory names can be read by Hive/HCatalog. What is the right way to accomplish this?
Many thanks!