Questions tagged [scalding]

Scalding is a Scala DSL for Cascading, running on Hadoop.

See https://github.com/twitter/scalding

181 questions
0
votes
1 answer

Scalding (older versions) counters based on cascading

Older versions of Scalding had no counters in their API. "Hadoop Counters In Scalding" suggests how to fall back to Cascading counters in Scalding: def addCounter(pipe : Pipe, group : String, counter : String) = { …
Jas
  • 14,493
  • 27
  • 97
  • 148
0
votes
1 answer

Scala/Scalding: Pivoting data

I have a dataset which is the output of a pipe in scalding that looks like this: 'Var1, 'Var2, 'Var3, 'Var4 = a,x,1,2 a,y,3,4 b,x,1,2 b,y,3,4 I'm trying to turn it into something like: 'Var1, 'Var3x, 'Var4x, 'Var3y, 'Var4y…
J Calbreath
  • 2,665
  • 4
  • 22
  • 31
0
votes
1 answer

Reading ctrl a delimiter in scalding

I'm trying to read a ctrl-a delimited file in scalding. I'm getting an error that says it found the wrong number of fields (expecting 166, found 142) and then it displays the line it is trying to read. For some reason, it does not read the…
J Calbreath
  • 2,665
  • 4
  • 22
  • 31
0
votes
1 answer

How do I log to file in Scalding?

In my Scalding map reduce code, I want to log out certain steps that are happening so that I can debug the map-reduce jobs if something goes wrong. How can I add logging to my scalding job? E.g. import com.twitter.scalding._ class WordCountJob(args:…
jcm
  • 5,499
  • 11
  • 49
  • 78
0
votes
1 answer

Hadoop-Cascading: Partial directory source tap

My data has a structure like this: +data |-2014080700_00.txt |-2014080700_01.txt |-2014080701_00.txt |- ... |-2014080723_00.txt |-2014080800_00.txt |- ... |-2014090800_00.txt I know I can use all the files inside the data directory with a Tap like…
dieend
  • 2,231
  • 1
  • 24
  • 29
0
votes
1 answer

Scalding overriding args in subclass

I have two Scalding jobs, where one inherits from the other. Something like this class BaseJob(args : Args) extends Job(args) { val verbose = args.boolean("verbose") if(verbose){ // do stuff }else{ // do other stuff } } class…
arno_v
  • 18,410
  • 3
  • 29
  • 34
0
votes
1 answer

How is mapTo more efficient than map in Scalding

The Scalding reference on Github (https://github.com/twitter/scalding/wiki/Fields-based-API-Reference#map-functions) says the following: MapTo is equivalent to mapping and then projecting to the new fields, but is more efficient. Thus, the …
Chidu
  • 330
  • 2
  • 10
0
votes
1 answer

Scalding, can't use more than one trait in Job

I have a Scalding job. I've created two traits, A and B; each trait has a companion object (A, B) with an implicit wrapper for the trait and Pipe. The job compiles successfully when I use only one trait. When I import both traits, compilation fails. It says that all…
Capacytron
  • 3,425
  • 6
  • 47
  • 80
0
votes
0 answers

Loading extremely long lines with TextLine in Cascading

I'm using TextLine in Cascading to load files with very long lines: around 30 MB on average, some much longer. When I run the job locally to test it, it runs fine, but when I run it on the cluster it fails after…
Savage Reader
  • 387
  • 1
  • 4
  • 16
0
votes
0 answers

Scalding: How to reduce in-memory computations on lists?

With Scalding I am trying to find edit distances between pairs of similar strings. All in all I have 10 000 000 strings in a CSV file. To reduce computations I use the following algorithm: Split all strings into groups by using the first three chars as a…
DarqMoth
  • 603
  • 1
  • 13
  • 31
0
votes
0 answers

Selecting max value when joining RichPipes

I have a list of RichPipes with the following fields: name: String joinTime: Long value: Int I want to join them sequentially using reduce. When joining the RichPipes I only want to retain one field, value, and I want it to contain the max value…
Savage Reader
  • 387
  • 1
  • 4
  • 16
0
votes
1 answer

How Scalding DSL translates into regular Scala code?

Please help to find out how Scalding DSL translates into regular Scala code. https://github.com/twitter/scalding/wiki/Fields-based-API-Reference#sortBy For example: val fasterBirds = birds.map('speed -> 'doubledSpeed) { speed : Int => speed * 2…
DarqMoth
  • 603
  • 1
  • 13
  • 31
0
votes
1 answer

Scalding: How to change default tuple comparison function?

When doing Scalding MapReduce operations, I need to compare tuples using my own comparison function on tuple fields. Questions: How do I define my own tuple comparison function? What are the rules for extending Scalding with custom Scala code in general?…
DarqMoth
  • 603
  • 1
  • 13
  • 31
0
votes
2 answers

Scalding Tutorial with HDFS: Data is missing from one or more paths in: List(tutorial/data/hello.txt)

After configuring ssh and rsync, when I try to run the Scalding tutorial (https://github.com/Cascading/scalding-tutorial/) with the command: $ scripts/scald.rb --hdfs tutorial/Tutorial0.scala I get the following…
DarqMoth
  • 603
  • 1
  • 13
  • 31
0
votes
1 answer

Scalding Tutorial: HDFS rsync errors

Please help me understand the output of an unsuccessful Scalding run on Hadoop. I got the latest Scalding distribution from git: git clone https://github.com/twitter/scalding.git After running sbt assembly from the scalding directory, I tried to run the tutorial with…
DarqMoth
  • 603
  • 1
  • 13
  • 31