Questions tagged [scalding]

Scalding is a Scala DSL for Cascading, running on Hadoop.

See https://github.com/twitter/scalding

181 questions

2 votes • 0 answers

Is there a class to signify "grouped by and reduced"?

Consider the following code in Scalding: let's say I have the following tuples in a Scalding TypedPipe[(Int, Int)]: (1, 2) (1, 3) (2, 1) (2, 2). On this pipe I can call groupBy(t => t._1) to generate a Grouped[Int, (Int, Int)], which will still…
lezebulon • 7,607 • 11 • 42 • 73

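A minimal sketch of the situation being described, using Scalding's typed API; the reduce shown here (keeping the tuple with the larger second element) is only an illustrative stand-in for whatever reduction follows the groupBy.

    import com.twitter.scalding.TypedPipe

    object GroupedSketch {
      // The tuples from the question, as an in-memory TypedPipe.
      val pipe: TypedPipe[(Int, Int)] =
        TypedPipe.from(Seq((1, 2), (1, 3), (2, 1), (2, 2)))

      // groupBy yields a Grouped[Int, (Int, Int)]; the type does not yet say
      // "at most one value per key".
      val grouped = pipe.groupBy { case (k, _) => k }

      // After a reduce the result collapses back to a plain TypedPipe, so the
      // "grouped by and reduced" property is no longer visible in the type,
      // which is what the question is asking about.
      val reduced: TypedPipe[(Int, (Int, Int))] =
        grouped.reduce((a, b) => if (a._2 >= b._2) a else b).toTypedPipe
    }
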
2 votes • 0 answers

What is causing "org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: null"?

I have an Elastic MapReduce job which uses elasticsearch-hadoop via scalding-taps to transfer data from Amazon S3 to Amazon Elasticsearch Service. For a long time this job ran successfully. However, it has recently started failing with the following…

2 votes • 1 answer

Scalding job failing with VerifyError on EMR version 4.2.0

We have a Scalding job which I want to run on AWS Elastic MapReduce using release label 4.2.0. This job ran successfully on AMI 2.4.2. When we upgraded to AMI 3.7.0, we ran into a java.lang.VerifyError caused by incompatible jars. Our project…
fblundun • 987 • 7 • 19

2 votes • 0 answers

Scalding: writing to a JDBCSource with more than 22 columns

Is there a way in Scalding to write to a SQL table that has more than 22 columns? The problem I am facing is as follows: I have a table with 28 columns, each row of which I am representing using a case class. Something like case class…
rmathews7 • 175 • 8

2 votes • 2 answers

Why can't Scala infer the type argument when it's obvious?

In the following example, I was trying to create an implicit conversion between MySource and TypedPipe[T]. I own MySource, in fact I have a lot of such sources, so I wanted to use a Porable[T] trait to mark what type argument T I want for the output…
Roy • 880 • 1 • 12 • 27

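The excerpt is cut off, so the sketch below only reconstructs the general shape of the problem; the trait and class names come from the excerpt, but their definitions and the conversion body are assumptions. The core issue is an implicit conversion whose element type T must be inferred from a subtype bound, which Scala 2's inference often cannot do.

    import scala.language.implicitConversions
    import com.twitter.scalding.TypedPipe

    // Hypothetical stand-ins reconstructed from the names in the excerpt.
    trait Porable[T]                        // marks the element type a source yields
    class MySource extends Porable[String]

    object ConversionSketch {
      // Scala must infer T from the bound S <: Porable[T]; with two independent
      // type parameters the compiler typically fails (or infers T = Nothing),
      // which is roughly the failure mode the question describes.
      implicit def sourceToPipe[S <: Porable[T], T](src: S): TypedPipe[T] =
        TypedPipe.empty // placeholder body; a real conversion would read the source

      // One common workaround: mention Porable[T] directly in the parameter type,
      // so T is pinned by the argument rather than by a bound.
      def pipeOf[T](src: Porable[T]): TypedPipe[T] = TypedPipe.empty
    }
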
2 votes • 1 answer

How to run a Scalding test in local mode with a local input file

Scalding has a great utility for running an integration test of a job flow, in which the inputs and outputs are in-memory buffers: val input = List("0" -> "This a a day") val expectedOutput = List(("This", 1),("a", 2),("day", 1)) …
Julias • 5,752 • 17 • 59 • 84

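For context, the in-memory JobTest pattern the excerpt refers to looks roughly like the sketch below. The WordCountJob, its argument names, and the driver object are all hypothetical; the question itself is about swapping these in-memory buffers for a real local file, which typically means running the job in local mode rather than through JobTest.

    import com.twitter.scalding._

    // Hypothetical word-count job, only here so the test sketch is self-contained.
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }

    // The in-memory pattern: sources are fed from Lists and sinks are checked
    // against expected values, with no files involved.
    object WordCountJobSpec {
      def main(argv: Array[String]): Unit = {
        JobTest(new WordCountJob(_))
          .arg("input", "in.txt")
          .arg("output", "out.tsv")
          .source(TextLine("in.txt"), List((0, "This a a day")))
          .sink[(String, Long)](Tsv("out.tsv")) { buf =>
            assert(buf.toSet == Set(("This", 1L), ("a", 2L), ("day", 1L)))
          }
          .run
          .finish
      }
    }
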
2 votes • 1 answer

Testing a Scalding job with JobTest and Csv(skipHeader = true) input

I have this job: import com.twitter.scalding.{Args, Csv, Job} class ManagersAndTeams(args: Args) extends Job(args) { val managersPipe = Csv(args("managers"), skipHeader = true) .project('managerID, 'teamID) val teamsPipe =…
kostas.kougios • 945 • 10 • 21

2 votes • 1 answer

How to output data with Hive-style directory structure in Scalding?

We are using Scalding to do ETL and generate the output as a Hive table with partitions. Consequently, we want the directory names for partitions to be something like "state=CA" for example. We are using TemplatedTsv as follows: pipe // some…
Chung • 21 • 2

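The TemplatedTsv usage in question looks roughly like the following sketch; the field names and the exact template string are assumptions, with the template producing the Hive-style state=CA directories mentioned above.

    import com.twitter.scalding._

    class HivePartitionedOutput(args: Args) extends Job(args) {
      // Hypothetical columns; the interesting part is the template string, which
      // is expanded per tuple into a subdirectory such as <output>/state=CA/.
      Tsv(args("input"), ('state, 'city, 'population))
        .read
        .write(TemplatedTsv(args("output"), "state=%s", 'state))
    }
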
2 votes • 2 answers

How to measure the running time of a Scala Scalding program?

I have a simple scalding program to transform some data which I execute using com.twitter.scalding.Tool in local mode. val start = System.nanoTime val inputPaths = args("input").split(",").toList val pipe = Tsv(inputPaths(0)) // standard pipe…
Yuri Brovman • 1,093 • 2 • 12 • 17

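One thing worth sketching here: a Scalding Job's constructor only builds the flow, so a System.nanoTime pair placed inside the job measures planning rather than execution. A rough, hypothetical way to time the whole run is to wrap the Tool invocation in a small driver:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.util.ToolRunner

    // Hypothetical driver: runs the Scalding Tool (which plans and executes the
    // job) and measures wall-clock time around the whole run.
    object TimedMain {
      def main(cmdArgs: Array[String]): Unit = {
        val start = System.nanoTime
        val exitCode = ToolRunner.run(new Configuration, new com.twitter.scalding.Tool, cmdArgs)
        val elapsedSec = (System.nanoTime - start) / 1e9
        println(f"Exit code $exitCode, elapsed $elapsedSec%.1f s")
        sys.exit(exitCode)
      }
    }
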
2 votes • 1 answer

Scalding: flatten fields after groupBy

I have seen this: "Scalding: How to retain the other field, after a groupBy('field){.size}?" It's a real pain and a mess compared to Apache Pig... What am I doing wrong? Can I do the same as Pig's GENERATE(FLATTEN())? I'm confused. Here is my Scalding code: …
Capacytron • 3,425 • 6 • 47 • 80

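One common fields-API pattern for this, sketched with hypothetical field names and not necessarily what the asker's code needs, is to compute the per-group aggregate and then join it back onto the original pipe; the join plays a role similar to Pig's FLATTEN after a GROUP.

    import com.twitter.scalding._

    class RetainFieldsAfterGroupBy(args: Args) extends Job(args) {
      val events = Tsv(args("input"), ('user, 'item, 'ts)).read

      // Per-user counts: fields ('user, 'count).
      val counts = events.groupBy('user) { _.size('count) }

      // Join the counts back so every original row keeps its other fields
      // plus the aggregate for its group.
      events
        .joinWithSmaller('user -> 'user, counts)
        .write(Tsv(args("output")))
    }
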
2 votes • 1 answer

Adding parquet-avro support to Scalding

How can I create a Scalding Source that will handle conversions between Avro and Parquet? The solution should: 1. read from the Parquet format and convert to the Avro in-memory representation; 2. write Avro objects into a Parquet file. Note: I noticed…
beefyhalo • 1,691 • 2 • 21 • 33

2 votes • 2 answers

Does Scalding support record filtering via predicate pushdown with Parquet?

There are obvious speed benefits from not having to read records that would fail a filter. I see that Spark supports it, but I haven't found any documentation on how to do it with Scalding.
Nick • 1,012 • 2 • 13 • 29

2 votes • 2 answers

Scalding: parsing comma-separated data with header

I have data in this format: "header1","header2","header3",... "value11","value12","value13",... "value21","value22","value23",... .... What is the best way to parse it in Scalding? I have over 50 columns altogether, but I am only interested in some of…
Savage Reader • 387 • 1 • 4 • 16

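A hedged sketch of one straightforward approach with the fields API, assuming for illustration a file with just three quoted, comma-separated columns: declare the field names, skip the header line, and project only the columns of interest.

    import com.twitter.scalding._

    class SelectColumnsJob(args: Args) extends Job(args) {
      // The field names stand in for the real headers of the file.
      Csv(args("input"), ",", ('header1, 'header2, 'header3), skipHeader = true)
        .read
        .project('header1, 'header3)   // keep only the columns we actually need
        .write(Tsv(args("output")))
    }
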
2 votes • 1 answer

How to run a slim jar in a Scalding/Hadoop job without writing the full classpath in libjars

Is there a way to run a Scalding job that needs a classpath without listing each jar explicitly, comma-separated, in libjars? I would like to put all my jars in a lib directory and then just write -libjars=./lib/* rather than all the jars. Is there a classic…
Ehud Lev • 2,461 • 26 • 38

2 votes • 0 answers

Reading SequenceFile written by Spark

I have a bunch of sequence files that I want to read using Scalding, and I am having some trouble. This is my code: class ReadSequenceFileApp(args:Args) extends ConfiguredJob(args) { SequenceFile(args("in"), ('_, 'wbytes)) .read …
Rob Schneider • 679 • 4 • 13 • 27