I have a RichPipe with 3 fields: name: String, time: Long and value: Int. I need to get the value for a specific (name, time) pair. How can I do it? I can't figure it out from the Scalding documentation, which is very cryptic, and I can't find any examples that do this.

Savage Reader

1 Answer

Well, a RichPipe is not a key-value store, which is why there is no documentation on using it as one :) A RichPipe should be thought of as a pipe: you can't get at data in the middle without first going in at one end and traversing the pipe until you find the element you're looking for. Furthermore, this is a little painful in Scalding because you have to write your results to disk (because it's built on top of Hadoop) and then read the result back from disk in order to use it in your application. So the code will be something like:

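// filter takes a single type parameter, so the two fields are read as one tuple
myPipe.filter[(String, Long)](('name, 'time)) { case (name, time) => name == specificName && time == specificTime }
  .write(Tsv("tmp/location"))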

Then you'll need some higher level code to run the job and read the data back into memory to get at the result. Rather than write out all the code to do this (it's pretty straightforward), why don't you give some more context about what your use case is and what you are trying to do - maybe you can solve your problem under the Map-Reduce programming model.

Alternatively, use Spark. You'll have the same problem of having to traverse a distributed dataset, but you don't have the faff of writing to disk and reading back again. Furthermore, you can use a custom partitioner in Spark, which could result in near key-value-store-like behaviour. Anyway, naively, the code would be:

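val theValueYouWant =
  myRDD.filter {
    // backticks match against the existing values instead of binding new names
    case (`specificName`, `specificTime`, _) => true
    case _ => false
  }
  .collect().head._3 // collect() replaces the deprecated toArray; head throws if nothing matched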
samthebest
  • I have several rich pipes with fields: name: String, time: Long, value: Int. The times are of different granularity: 1h, 2h, 4h and so on. I need to find the values for all the name-time pairs of corresponding time and get the maximum value from all of them. So, for one rich pipe with the 4h granularity there will be one value, and for the corresponding 1h and 2h pipes there will be 4 and 2 values respectively. After I have the maximum from those RichPipes I need to add them to a different rich pipe with name and time, where time is in milliseconds. I was thinking of using join. – Savage Reader Jul 17 '14 at 13:12
  • ... I was thinking of using join, but can't figure out how to join without comparing the values directly, but using a function, and also how to get the maximum during join. I figured not to try to get the value directly after what you have said:) – Savage Reader Jul 17 '14 at 13:14
  • By replicating rows with multiple keys (the 1h granularity as the key) you could perform the join and do what you want. For example, a 4h granularity record (a, b, c) which spans, let's say, hours 3, 4, 5, 6 gets mapped to `(3, (a, b, c)), (4, (a, b, c)), (5, (a, b, c)), (6, (a, b, c))`; then flatten that out and perform the join on the 1h key. You might want another field saying which pipe it originally came from. If you don't get the idea from this comment, post a new question and I'll give a fuller answer. – samthebest Jul 17 '14 at 13:51
  • Thank you, I posted a question about the join. – Savage Reader Jul 17 '14 at 13:54
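
The row-replication idea from the comments can be sketched with plain Scala collections standing in for the pipes. This is only an illustration under some assumptions: `Rec`, `explode`, `GranularityMax` and the sample data are hypothetical names, `time` is taken to be the start of each interval in milliseconds, and a union followed by a group-by is used in place of a literal join, since only the maximum per 1h key is needed.

```scala
object GranularityMax {
  // Hypothetical record type mirroring the three fields in the question.
  final case class Rec(name: String, time: Long, value: Int)

  val HourMs: Long = 60L * 60L * 1000L

  // Replicate a record covering `hours` hours into one row per 1h bucket,
  // keyed by (name, start of that hour in milliseconds).
  def explode(r: Rec, hours: Int): Seq[((String, Long), Int)] =
    (0 until hours).map(i => ((r.name, r.time + i * HourMs), r.value))

  // Made-up sample data: one 4h record, two 2h records, four 1h records.
  val oneHour: Seq[Rec] = Seq(
    Rec("a", 0L, 5), Rec("a", HourMs, 9), Rec("a", 2 * HourMs, 2), Rec("a", 3 * HourMs, 4))
  val twoHour: Seq[Rec] = Seq(Rec("a", 0L, 7), Rec("a", 2 * HourMs, 11))
  val fourHour: Seq[Rec] = Seq(Rec("a", 0L, 8))

  // Union the replicated rows, group on the 1h key, keep the maximum per key.
  val maxPerHour: Map[(String, Long), Int] =
    (oneHour.flatMap(explode(_, 1)) ++
      twoHour.flatMap(explode(_, 2)) ++
      fourHour.flatMap(explode(_, 4)))
      .groupBy(_._1)
      .map { case (key, rows) => key -> rows.map(_._2).max }
}
```

With this sample data, `GranularityMax.maxPerHour(("a", 0L))` is 8 (the 4h value wins over 5 and 7) and `GranularityMax.maxPerHour(("a", GranularityMax.HourMs))` is 9 (the 1h value wins). In Scalding or Spark the same shape would be a flatMap on each input followed by a grouped max on the 1h key.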