I know that Scalding's default serialization uses Kryo. So for this example, let's say I have a pipe of student objects.
case class Student(name: String, id: String)
val pipe: TypedPipe[Student] = //....
Then I write that pipe to a TextDelimited file…
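Kryo handles the intermediate tuples automatically, but a TextDelimited-style sink wants flat fields, so the case class has to be mapped to a tuple first. A minimal sketch of that flattening step, with the Scalding call shown in comments ("students.tsv" is a hypothetical path):

```scala
// In a Scalding job (typed API) the write would look roughly like:
//   pipe.map(s => (s.name, s.id)).write(TypedTsv[(String, String)]("students.tsv"))
// The flattening itself is just case class -> tuple:
case class Student(name: String, id: String)

def toRow(s: Student): (String, String) = (s.name, s.id)

val rows = Seq(Student("ada", "1"), Student("bob", "2")).map(toRow)
```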
We have many small files that need combining. In Scalding you can use TextLine to read files as text lines, but that gives one mapper per file; we want multiple files combined so that they are processed by a single mapper.
I understand we need…
I'm using Scalding to process records with many (> 22) fields. At the end of the process, I'd like to write out the final Pipe's field names to a file. I know this is possible, as the Mapper and Reducer logs show this information. I'd like to get this…
I am trying to write Scalding jobs which have to connect to HBase, but I have trouble using the HBase tap. I have tried using the tap provided by Twitter Maple, following this example project, but it seems that there is some incompatibility between…
I need to join 2 pipes with the same set of fields, i.e. ('id, 'groupName, 'name), the same way SQL UNION works. How is it possible to do this in Twitter Scalding?
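For reference, Scalding's typed API merges two pipes of the same element type with `++`, which behaves like SQL UNION ALL (add `.distinct` for UNION semantics). A sketch using in-memory Seqs to show the semantics, with the hypothetical pipes `left`/`right` in the comments:

```scala
// With TypedPipes:
//   val merged: TypedPipe[(String, String, String)] = left ++ right  // UNION ALL
//   val unioned = merged.distinct                                    // UNION
// The same semantics on plain collections:
case class Row(id: String, groupName: String, name: String)

val left  = Seq(Row("1", "g1", "x"), Row("2", "g1", "y"))
val right = Seq(Row("2", "g1", "y"), Row("3", "g2", "z"))

val unionAll = left ++ right        // duplicates kept
val union    = unionAll.distinct    // duplicates removed
```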
I've run into an issue: I am trying to read from multiple files using Scalding and produce a single output file. My code is this:
def getFilesSource(paths: Seq[String]) = {
  new MultipleTextLineFiles(paths: _*) {
    override…
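One common way to get a single output file, sketched in comments since it depends on the rest of the job (Fields API; `TextLine("output")` is a hypothetical sink): funnel all tuples through one reducer with `groupAll`, so the sink emits one part file. Note this serializes the final stage, so it only makes sense for modest output sizes.

```scala
// Sketch: force one reducer so the sink writes a single part file.
//   lines.groupAll { _.pass }
//        .write(TextLine("output"))
```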
I have a copy of Programming MapReduce with Scalding by Antonios Chalkiopoulos. In the book he discusses the External Operations design pattern for Scalding code. You can see an example on his website here. I have made a choice to use the Type…
I need to read an Avro file in Scalding but have no idea how to work with it. I have worked with straightforward Avro files, but this one is a little more complicated. The schema looks like this:
{"type":"record",
"name":"features",
…
I am trying to figure out how to create a build.sbt file for my own Scalding-based project.
The Scalding source tree has no build.sbt file; instead it has a project/Build.scala build definition.
What would be the right way to integrate my own sbt…
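You don't need to mirror Scalding's project/Build.scala; a plain build.sbt that depends on the published Scalding artifacts is enough. A minimal sketch (project name and versions are illustrative; pick versions matching your cluster):

```scala
name := "my-scalding-job"  // hypothetical project name

scalaVersion := "2.11.12"  // illustrative

libraryDependencies ++= Seq(
  "com.twitter" %% "scalding-core" % "0.17.4",                  // illustrative version
  "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "provided"  // match your cluster
)
```

Marking hadoop-client as "provided" keeps it out of the assembled jar, since the cluster supplies it at runtime.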
I'm doing a groupBy to calculate a value, but it seems that when I group, I lose all the fields that are not among the aggregation keys:
filtered.filterNot('site) { s: String => ... }
  .filterNot('date) { s: String => ... }
aggr =…
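In the Fields API a groupBy keeps only the grouping keys plus the aggregated outputs; the other fields are dropped by design. Two common fixes: include the needed fields in the groupBy keys, or compute the aggregate and join it back on the key. A sketch of the join-back idea on in-memory data (class and field names are illustrative):

```scala
case class Rec(site: String, date: String, value: Int)

val recs = Seq(Rec("a", "d1", 1), Rec("a", "d2", 2), Rec("b", "d1", 3))

// Aggregate per key, then "join" the result back so every field survives.
val totals: Map[String, Int] =
  recs.groupBy(_.site).map { case (site, rs) => site -> rs.map(_.value).sum }

val joined: Seq[(Rec, Int)] = recs.map(r => (r, totals(r.site)))
```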
Could someone point me to a link that explains how to read and write simple case classes in Scalding? Is there some default serialization scheme?
For example, I have jobs that create pipes of com.twitter.algebird.Moments.
I wish to write the pipes…
I'm trying to build a fat jar with sbt for a simple Hadoop job so that I can run it on Amazon EMR. However, when I run sbt assembly I get the following error:
[error] (*:assembly) deduplicate: different file contents found in the…
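The usual fix is a merge strategy in the sbt-assembly settings that discards the conflicting META-INF entries (a hedged sketch; the exact case clause depends on which file the full error names):

```scala
// build.sbt (sbt-assembly): discard conflicting META-INF files, keep defaults otherwise.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
```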
Could anybody recommend a good solution (framework) for accessing HBase on a Hadoop cluster from a Scala (or Java) application?
So far I'm leaning toward Scalding. The prototypes I built allowed me to combine the Scalding library with Maven and…
I am using CDH (Cloudera Hadoop) version 5.12.0 (which uses Hadoop 2.6.0 and Oozie 4.1.0) and Scalding 2.11
I am using a shaded jar with my dependencies built in.
I can run all my jobs properly without any error using a hadoop jar command as…
I have a Spark job whose final output is an Algebird bloom filter, and I need to reuse this bloom filter in another Spark job.
Is there a way to store this bloom filter in a kv store (eg: redis) using Twitter Storehaus and retrieve it in the…