2

We are using Scalding to do ETL and generate the output as a Hive table with partitions. Consequently, we want the directory names for partitions to be something like "state=CA" for example. We are using TemplatedTsv as follows:

pipe
   // some other ETL
   .map('STATE -> 'hdfs_state) { state: Int => "State=" + state }
   .groupBy('hdfs_state) { _.pass }
   .write(TemplatedTsv(baseOutputPath, "%s", 'hdfs_state,
          writeHeader = false,
          sinkMode = SinkMode.UPDATE,
          fields = ('all except 'hdfs_state)))

We adopt the code sample from How to bucket outputs in Scalding. Here are two issues we have:

  • except can't be resolved by IntelliJ: Am I missing some imports? We don't want to explicitly enter all the fields within the "fields = ()" statement as fields are derived from the code inside the groupBy statement. If entering explicitly, they could be easily out of sync.
  • This approach looks too hacky as we are creating an extra column so that the directory names can be processed by Hive/Hcatalog. We are wondering what should be the right way to accomplish it?

Many thanks!

Community
  • 1
  • 1
Chung
  • 21
  • 2

1 Answers1

0

Sorry previous example was a pseudocode. Below I will give a small code with input data example.

Please note that this only works with Scalding version 0.12.0 or above

Let's image we have input as below which define some purchase data,

user1   1384034400  6   75
user1   1384038000  6   175
user2   1383984000  48  3
user3   1383958800  48  281
user3   1384027200  9   7
user3   1384027200  9   11
user4   1383955200  37  705
user4   1383955200  37  15
user4   1383969600  36  41
user4   1383969600  36  21

Tab separated and the 3rd column is a State number. Here we have integer but for string based States you can easily adapt.

This code will read the input and put them in 'State=stateid' output folder buckets.

class TemplatedTsvExample(args: Args) extends Job(args) {

  val purchasesPath = args("purchases")
  val outputPath    = args("output")

  // defines both input & output schema, you can also make separate for each of them
  val ioSchema = ('USERID, 'TIMESTAMP, 'STATE, 'PURCHASE)

  val Purchases =
     Tsv(purchasesPath, ioSchema)
     .read
     .map('STATE -> 'STATENAME) { state: Int => "State=" + state } // here you can make necessary changes
     .groupBy('STATENAME) { _.pass } // this is optional
     .write(TemplatedTsv(outputPath, "%s", 'STATENAME, false, SinkMode.REPLACE, ioSchema))
} 

I hope this is helpful. Please ask me if anything is not clear.

You can find full code here.

morazow
  • 251
  • 1
  • 8
  • Thanks morazow! I need the hive directory structure following its partition naming convention, without listing all fields explicitly. If I do, ` .groupBy('STATENAME) {_.size('user_count)..max('PURCHASE -> 'max_purchase)} ` Please note that we need 'STATE value in the file and 'STATENAME is the directory. We can list out explicitly as shown in your example. But I'm looking for the code, simply excluding the fields created only for directory naming purpose, i.e. 'STATENAME in this case. – Chung Feb 27 '15 at 01:52
  • In that case, just add 'STATE to your grouping, ` .groupBy('STATENAME, 'STATE){_.size('user_count)..max('PURCHASE -> 'max_purchase)} .write(TemplatedTsv(outputPath, "%s", 'STATENAME, false, SinkMode.REPLACE, ('STATE, 'user_count, 'max_purchase)))` Grouping will only keep grouping and aggregate fields, so you can add 'STATE to grouping. Because 'STATE and 'STATENAME are one-to-one mapping it will not change the logic. – morazow Feb 27 '15 at 08:32
  • Thanks morazow! Is there a way minic your pseudocode: fields = ('all except 'STATENAME) without listing the fields explicitly? – Chung Mar 02 '15 at 18:48
  • hey, you welcome! I do not know. Even if there is one, I am not aware. – morazow Mar 05 '15 at 10:32