Older versions of Scalding had no counters in their API. The question "Hadoop Counters In Scalding" suggests how to fall back to Cascading counters in Scalding:
def addCounter(pipe : Pipe, group : String, counter : String) = {
…
I have a dataset, the output of a pipe in Scalding, that looks like this:
'Var1, 'Var2, 'Var3, 'Var4 =
a,x,1,2
a,y,3,4
b,x,1,2
b,y,3,4
I'm trying to turn it into something like:
'Var1, 'Var3x, 'Var4x, 'Var3y, 'Var4y…
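To make the target shape concrete, here is a minimal sketch of the long-to-wide reshape using plain Scala collections rather than a Scalding pipe. The names `PivotSketch`, `Row`, `widen`, and `sample` are hypothetical, and the generated column names simply append the 'Var2 value to "Var3" and "Var4", which is an assumption based on the sample output above.

```scala
// Plain-Scala sketch of the reshape; no Scalding API is used here.
object PivotSketch {
  // One input row: (Var1, Var2, Var3, Var4)
  type Row = (String, String, Int, Int)

  // For each Var1, build a map from generated column name to value,
  // e.g. "Var3x" -> 1 and "Var4x" -> 2 for the row (a, x, 1, 2).
  def widen(rows: Seq[Row]): Map[String, Map[String, Int]] =
    rows.groupBy(_._1).map { case (var1, group) =>
      val columns = group.flatMap { case (_, var2, var3, var4) =>
        Seq(s"Var3$var2" -> var3, s"Var4$var2" -> var4)
      }.toMap
      var1 -> columns
    }

  // The four sample rows from the question.
  val sample: Seq[Row] = Seq(
    ("a", "x", 1, 2), ("a", "y", 3, 4),
    ("b", "x", 1, 2), ("b", "y", 3, 4)
  )
}
```

Calling `PivotSketch.widen(PivotSketch.sample)` yields one entry per 'Var1 value, each holding the four widened columns. In a real job the same grouping idea would have to be expressed with the pipe's own group-by operations.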
I'm trying to read a Ctrl-A-delimited file in Scalding. I'm getting an error saying it found the wrong number of fields (expecting 166, found 142), and then it displays the line it is trying to read. For some reason, it does not read the…
In my Scalding MapReduce code, I want to log certain steps as they happen so that I can debug the jobs if something goes wrong.
How can I add logging to my Scalding job?
E.g.
import com.twitter.scalding._
class WordCountJob(args:…
My data has a structure like this:
+data
|-2014080700_00.txt
|-2014080700_01.txt
|-2014080701_00.txt
|- ...
|-2014080723_00.txt
|-2014080800_00.txt
|- ...
|-2014090800_00.txt
I know I can use all the files inside the data directory with a Tap like…
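The question is cut off, so the following rests on an assumption: that the goal is to read only the files for a range of hours instead of the whole directory. Under that assumption, a minimal plain-Scala sketch of the selection step could look like this. `HourRangeSketch`, `inRange`, and `select` are hypothetical names; the sketch relies only on the fact that fixed-width yyyyMMddHH prefixes sort lexicographically in time order, and it does not touch any Tap or source API.

```scala
// Selects file names whose 10-digit yyyyMMddHH prefix falls in a range.
object HourRangeSketch {
  // True when the name starts with a 10-digit stamp between fromHour
  // and toHour (both inclusive, both given as yyyyMMddHH strings).
  def inRange(fileName: String, fromHour: String, toHour: String): Boolean = {
    val stamp = fileName.take(10)
    stamp.length == 10 &&
      stamp.forall(_.isDigit) &&
      stamp.compareTo(fromHour) >= 0 &&
      stamp.compareTo(toHour) <= 0
  }

  // Keeps only the names inside the requested range, preserving order.
  def select(names: Seq[String], fromHour: String, toHour: String): Seq[String] =
    names.filter(inRange(_, fromHour, toHour))
}
```

For example, `HourRangeSketch.select(names, "2014080700", "2014080723")` would keep only the files for 7 August 2014. The resulting paths would then have to be passed to whatever source the job uses.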
I have two Scalding jobs, where one inherits from the other. Something like this:
class BaseJob(args : Args) extends Job(args) {
  val verbose = args.boolean("verbose")
  if (verbose) {
    // do stuff
  } else {
    // do other stuff
  }
}
class…
The Scalding reference on GitHub (https://github.com/twitter/scalding/wiki/Fields-based-API-Reference#map-functions) says the following:
MapTo is equivalent to mapping and then projecting to the new fields, but is more efficient. Thus, the …
I have a Scalding job. I've created two traits, A and B; each trait has a companion object (A, B) with an implicit wrap for the trait and Pipe.
The job compiles successfully when I use only one trait. When I import both traits, compilation fails. It says that all…
I'm using TextLine in Cascading to load files with very long lines: around 30 MB on average, some much longer. When I run the job locally to test it, it runs fine, but when I run it on the cluster it fails after…
I'm using Scalding to find edit distances between pairs of similar strings. In total I have 10,000,000 strings in a CSV file. To reduce the amount of computation I use the following algorithm:
Split all strings in groups by using first three chars as a…
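The rest of the algorithm is cut off, but the first step (grouping by the first three characters so that only strings in the same group are compared) can be sketched in plain Scala. This is an illustration on in-memory collections, not a Scalding job. `EditDistanceSketch`, `levenshtein`, and `pairDistances` are hypothetical names, and Levenshtein distance is assumed as the edit-distance measure since the question does not say which one is meant.

```scala
object EditDistanceSketch {
  // Dynamic-programming Levenshtein distance, keeping two rows at a time.
  def levenshtein(a: String, b: String): Int = {
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        // Cheapest of insertion, deletion, and substitution (or match).
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Groups strings by their first three characters and computes the
  // distance only for pairs that share a group.
  def pairDistances(strings: Seq[String]): Seq[(String, String, Int)] =
    strings.groupBy(_.take(3)).values.toSeq.flatMap { group =>
      group.combinations(2).map { pair =>
        (pair(0), pair(1), levenshtein(pair(0), pair(1)))
      }.toSeq
    }
}
```

The trade-off of this kind of blocking is that pairs differing within the first three characters are never compared, so it reduces work at the cost of missing some near matches.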
I have a list of RichPipes with the following fields:
name: String
joinTime: Long
value: Int
I want to join them sequentially using reduce. When joining the RichPipes I only want to retain one field, value, and I want it to contain the max value…
Please help me understand how the Scalding DSL translates into regular Scala code.
https://github.com/twitter/scalding/wiki/Fields-based-API-Reference#sortBy
For example:
val fasterBirds = birds.map('speed -> 'doubledSpeed) { speed : Int => speed * 2…
In Scalding MapReduce operations, I need to compare tuples using my own comparison function on tuple fields.
Questions:
How do I define my own tuple comparison function?
What are the rules for extending Scalding with custom Scala code in general?…
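On the plain Scala side, a custom comparison is usually expressed as an `Ordering`. The sketch below shows only that part; how a particular Scalding operation picks up a comparator is not covered here. `TupleOrderingSketch`, `Bird`, `bySpeedThenName`, and `ranked` are hypothetical names chosen for illustration.

```scala
object TupleOrderingSketch {
  // Hypothetical record standing in for the fields of a tuple.
  final case class Bird(name: String, speed: Int)

  // Custom comparison: fastest first, ties broken by name ascending.
  // Negating the speed reverses the numeric order (assumes speed is
  // never Int.MinValue, whose negation overflows).
  implicit val bySpeedThenName: Ordering[Bird] =
    Ordering.by[Bird, (Int, String)](b => (-b.speed, b.name))

  // Sorting resolves the implicit Ordering[Bird] defined above.
  def ranked(birds: Seq[Bird]): Seq[Bird] = birds.sorted
}
```

Any standard-library operation that takes an implicit `Ordering` (such as `sorted`, `min`, or `max`) will then use this comparison for `Bird` values when it is in scope.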
After configuring ssh and rsync, when I try to run the Scalding tutorial (https://github.com/Cascading/scalding-tutorial/) with the command:
$ scripts/scald.rb --hdfs tutorial/Tutorial0.scala
I get the following…
Please help me understand the output of an unsuccessful Scalding run on Hadoop.
I got the latest Scalding distribution from git:
git clone https://github.com/twitter/scalding.git
After running sbt assembly from the scalding directory, I tried to run the tutorial with…