Consider the following code in Scalding. Let's say I have these tuples in a TypedPipe[(Int, Int)]:
(1, 2)
(1, 3)
(2, 1)
(2, 2)
On this pipe I can call groupBy(t => t._1) to generate a Grouped[Int, (Int, Int)], which will still…
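Put concretely, a minimal self-contained sketch of that setup (job, aggregation, and output names are made up for illustration):

import com.twitter.scalding._

class GroupExample(args: Args) extends Job(args) {
  val pairs: TypedPipe[(Int, Int)] =
    TypedPipe.from(List((1, 2), (1, 3), (2, 1), (2, 2)))

  // groupBy on the first element yields Grouped[Int, (Int, Int)];
  // each value still carries the full original tuple.
  pairs.groupBy(t => t._1)
    .mapValues(_._2)   // keep only the second element per key
    .sum               // e.g. (1, 5) and (2, 3)
    .toTypedPipe
    .write(TypedTsv[(Int, Int)](args("output")))
}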
I have an Elastic MapReduce job which uses elasticsearch-hadoop via scalding-taps to transfer data from Amazon S3 to Amazon Elasticsearch Service. For a long time this job ran successfully. However, it has recently started failing with the following…
We have a Scalding job that I want to run on AWS Elastic MapReduce using release label 4.2.0.
This job ran successfully on AMI 2.4.2. When we upgraded to AMI 3.7.0, we ran into a java.lang.VerifyError caused by incompatible jars. Our project…
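A common cause of VerifyError on EMR is shipping Hadoop or logging jars that clash with the ones the cluster already provides. A hedged sbt sketch of that kind of fix (versions and excluded modules are purely illustrative, not taken from this job):

// build.sbt sketch; versions and module names are illustrative
libraryDependencies ++= Seq(
  // provided: rely on the cluster's own Hadoop jars at runtime
  "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "provided",
  // drop a transitive jar that conflicts with what EMR ships
  ("com.twitter" %% "scalding-core" % "0.15.0")
    .exclude("org.slf4j", "slf4j-log4j12")
)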
Is there a way in Scalding to write to a SQL table that has more than 22 columns? The problem I am facing is as follows. I have a table with 28 columns, each row of which I am representing with a case class. Something like
case class…
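The underlying limit is that Scala 2.10 case classes and Scalding's generated tuple converters top out at arity 22. One commonly suggested direction, sketched here with invented field names, is to nest smaller case classes so no single class crosses that limit (a custom TupleSetter/TupleConverter for the flat row is the other usual route):

// Sketch: split the 28 columns across nested case classes,
// keeping every individual class at 22 fields or fewer
case class Address(street: String, city: String, state: String, zip: String)
case class Metrics(m1: Long, m2: Long, m3: Long)
case class Row(id: Long, name: String, address: Address, metrics: Metrics)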
In the following example, I was trying to create an implicit conversion between MySource and TypedPipe[T]. I own MySource; in fact, I have many such sources, so I wanted to use a Porable[T] trait to mark which type argument T I want for the output…
Scalding has a great utility for running an integration test of a job flow. That way, the inputs and outputs are in-memory buffers:
val input = List("0" -> "This a a day")
val expectedOutput = List(("This", 1),("a", 2),("day", 1))
…
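A typical JobTest harness wired to those buffers looks roughly like the following; the word-count job itself is assumed here, since the question does not show it:

import com.twitter.scalding._

// Hypothetical job under test (not shown in the question)
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size('count) }
    .write(Tsv(args("output")))
}

// JobTest feeds the in-memory input and asserts on the in-memory output
JobTest(new WordCountJob(_))
  .arg("input", "inputFile")
  .arg("output", "outputFile")
  .source(TextLine("inputFile"), List("0" -> "This a a day"))
  .sink[(String, Int)](Tsv("outputFile")) { buffer =>
    assert(buffer.toSet == Set(("This", 1), ("a", 2), ("day", 1)))
  }
  .run
  .finish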
I have this job:
import com.twitter.scalding.{Args, Csv, Job}
class ManagersAndTeams(args: Args) extends Job(args) {
  val managersPipe = Csv(args("managers"), skipHeader = true)
    .project('managerID, 'teamID)
  val teamsPipe =…
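The snippet is cut off above. Purely as an illustration of where such a job usually goes (not the question's actual code), the second pipe is read the same way and the two are joined on the shared field:

// Illustrative continuation only: read the teams Csv and join on 'teamID
val teamsPipe = Csv(args("teams"), skipHeader = true)
  .project('teamID, 'teamName)

managersPipe
  .joinWithSmaller('teamID -> 'teamID, teamsPipe)
  .write(Csv(args("output")))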
We are using Scalding to do ETL and generate the output as a Hive table with partitions. Consequently, we want the directory names for partitions to be something like "state=CA" for example. We are using TemplatedTsv as follows:
pipe
  // some…
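For reference, the usual TemplatedTsv pattern (assuming the pipe carries a 'state field) substitutes the path fields into the template, which is what produces directories like state=CA:

// Sketch: 'state fills the %s in the template, giving .../state=CA/part-*
pipe.write(TemplatedTsv(args("output"), "state=%s", 'state))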
I have a simple scalding program to transform some data which I execute using com.twitter.scalding.Tool in local mode.
val start = System.nanoTime
val inputPaths = args("input").split(",").toList
val pipe = Tsv(inputPaths(0))
// standard pipe…
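As written, only the first of the listed paths is read. If the remaining paths are meant to be processed too, one sketch (inside a Job, where the implicit Pipe conversions are in scope) is to union a Tsv pipe per path:

// Sketch: fold every comma-separated input into one pipe,
// rather than using just inputPaths(0)
val allInputs = inputPaths
  .map(p => Tsv(p).read)
  .reduce(_ ++ _)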
I see this:
Scalding: How to retain the other field, after a groupBy('field){ _.size }?
It's a real pain and a mess compared to Apache Pig... What am I doing wrong? Can I do the same as Pig's GENERATE(FLATTEN())?
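The two patterns usually suggested for this (field names assumed) are to group on both fields, or to keep the extra field explicitly during the reduce:

// 1. Put the extra field in the grouping key so it survives the groupBy
pipe.groupBy(('field, 'other)) { _.size('count) }

// 2. Or carry one value of it alongside the aggregate
pipe.groupBy('field) { _.size('count).head('other) }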
I'm confused. Here is my scalding code:
…
How can I create a Scalding Source that will handle conversions between Avro and Parquet?
The solution should:
1. Read from Parquet format and convert to an Avro in-memory representation
2. Write Avro objects into a Parquet file (see the sketch below)
Note: I noticed…
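This is not a Scalding Source by itself, but the parquet-avro library underneath demonstrates both directions; a custom Source would wrap this in a Cascading scheme. A sketch (the package is org.apache.parquet.avro in recent versions, parquet.avro in older ones):

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroParquetWriter}

// 1. Parquet rows come back as Avro GenericRecords
val reader = AvroParquetReader.builder[GenericRecord](new Path("in.parquet")).build()
val record: GenericRecord = reader.read()

// 2. Avro records are written back out as Parquet
val writer = AvroParquetWriter.builder[GenericRecord](new Path("out.parquet"))
  .withSchema(record.getSchema)
  .build()
writer.write(record)
writer.close()
reader.close()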
There are obvious speed benefits from not having to read records that would fail a filter. I see that Spark supports it, but I haven't found any documentation on how to do it with Scalding.
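Parquet's side of this is the FilterPredicate API; if I read the scalding-parquet code right, its sources expose a withFilter hook (HasFilterPredicate) for pushing one down. A sketch of the predicate itself, with an invented column name:

import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}

// Records failing this are skipped at read time once the source applies it
val onlyAdults: FilterPredicate =
  FilterApi.gt(FilterApi.intColumn("age"), Integer.valueOf(18))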
I have data in this format:
"header1","header2","header3",...
"value11","value12","value13",...
"value21","value22","value23",...
....
What is the best way to parse it in Scalding? I have over 50 columns altogether, but I am only interested in some of…
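A sketch of the usual approach (job and column names invented): declare the full schema once, skip the header row, then project only the columns of interest:

import com.twitter.scalding._

class ParseCsvJob(args: Args) extends Job(args) {
  // name every column once; skipHeader drops the header row
  Csv(args("input"),
      fields = ('header1, 'header2, 'header3), // ...extend to all 50+ columns
      skipHeader = true)
    .read
    .project('header1, 'header3) // keep only what you need
    .write(Tsv(args("output")))
}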
Is there a way to run a Scalding job that needs classpath entries without using -libjars and writing out each jar explicitly, comma-separated?
I would like to put all my jars in a lib directory and then just write -libjars=./lib/* rather than listing every jar.
Is there a classic…
I have a bunch of sequence files that I want to read using Scalding, and I am having some trouble. This is my code:
class ReadSequenceFileApp(args: Args) extends ConfiguredJob(args) {
  SequenceFile(args("in"), ('_, 'wbytes))
    .read
…
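For comparison, a minimal sketch that reads a sequence file of Hadoop Writables with the key/value types pinned up front via WritableSequenceFile (the type parameters here are assumptions, not taken from the question):

import com.twitter.scalding._
import org.apache.hadoop.io.{LongWritable, Text}

class ReadSeqJob(args: Args) extends Job(args) {
  // WritableSequenceFile fixes the key/value Writable types explicitly
  WritableSequenceFile[LongWritable, Text](args("in"), ('key, 'value))
    .read
    .write(Tsv(args("out")))
}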