Questions tagged [scalding]

Scalding is a Scala DSL for Cascading, running on Hadoop.

See https://github.com/twitter/scalding

181 questions
2
votes
2 answers

java.lang.NullPointerException when reading s3 with Hadoop (Scalding)

Getting a strange NPE when trying to read S3 with Scalding / Hadoop. The paths are 100% correct. Asking this question because it's surprisingly hard to Google, and every time I get this error I forget how I solved it. So posting on SO so I can Google…
samthebest
  • 30,803
  • 25
  • 102
  • 142
2
votes
2 answers

Compress Output Scalding / Cascading TsvCompressed

So people, including myself, have been having problems compressing the output of Scalding jobs. After googling I get the odd whiff of an answer in some obscure forum somewhere, but nothing suitable for people's copy-and-paste needs. I would like an…
samthebest
  • 30,803
  • 25
  • 102
  • 142
2
votes
3 answers

scalding how to map on all fields with '* keyword?

I want to apply an operation to all fields of my Pipe. I saw on https://github.com/twitter/scalding/wiki/Fields-based-API-Reference that "You can use '* (here and elsewhere) to mean all fields." but somehow I have not managed to make it work. Would…
Mr Renard
  • 43
  • 4
2
votes
3 answers

Scalding Sample WordCount local mode

I am trying to run the Scalding sample word count example. I followed the steps at this GitHub link: https://github.com/twitter/scalding/wiki/Getting-Started But I am getting a ClassNotFoundException. Below is my stack trace: [cloudera@localhost…
neham
  • 341
  • 5
  • 18
2
votes
1 answer

Scalding MongoDB connector

I am using Scalding for ETL implementation and I am looking for a simple way to forward Scalding output to MongoDB instead of HDFS. Any suggestions appreciated. Thanks.
2
votes
1 answer

scalding compare consecutive records

Does anyone know how to compare consecutive records in Scalding when creating a schema? I am looking at tutorial 6, and suppose that I want to print the age of the person if the data in record #2 is greater than in record #1 (for all records) for…
CruncherBigData
  • 1,112
  • 3
  • 14
  • 34
2
votes
2 answers

How does scalding pass user functions to remote MapReduce nodes

When working with Scalding, you have the ability to provide a function. I was wondering how Scalding passes these functions to the remote map/reduce tasks. Is this using something in Scala or something generic that can be done with anonymous…
ekaqu
  • 2,038
  • 3
  • 24
  • 38
2
votes
2 answers

Calculate sums of even/odd pairs on Hadoop?

I want to create a parallel scanLeft (computes prefix sums for an associative operator) function for Hadoop (Scalding in particular; see below for how this is done). Given a sequence of numbers in an HDFS file (one per line), I want to calculate a new…
John Salvatier
  • 3,077
  • 4
  • 26
  • 31
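
For readers unfamiliar with the operation named in the question above, here is a minimal sketch of a prefix sum using plain Scala `scanLeft`. It is sequential and uses only the standard library, so it is not the parallel Hadoop/Scalding version the asker is after. The object name `PrefixSumSketch` and the sample values are made up for illustration.

```scala
object PrefixSumSketch {
  // Running totals of the input. scanLeft emits the seed value (0) first,
  // so .tail drops it and leaves one total per input element.
  def prefixSums(xs: Seq[Int]): Seq[Int] =
    xs.scanLeft(0)(_ + _).tail

  def main(args: Array[String]): Unit =
    // Hypothetical sample input, standing in for one number per line of a file.
    println(prefixSums(Seq(1, 2, 3, 4)).mkString(", "))
}
```

Because addition is associative, the same totals could in principle be computed per chunk and then combined, which is what makes a parallel version possible.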
2
votes
5 answers

How to implement OR join in hadoop(scalding/cascading)

It is easy to join datasets by a single key simply by sending the join field as the reducer key. But joining records by several keys, where at least one should be the same, is not that easy for me. Example: I have logs and I want to group them by user…
yura
  • 14,489
  • 21
  • 77
  • 126
2
votes
1 answer

Reading from HBase with scalding

I'm very new to Cascading/Scalding, and cannot figure out how to read data from HBase. I have a table in HBase, where the hand history of poker games is stored (in a very straightforward manner: id -> hand, serialized with ProtoBuf). The job below…
Vasil Remeniuk
  • 20,519
  • 6
  • 71
  • 81
1
vote
1 answer

How to mock a TextLine for Scalding using the type safe API?

I am trying to mock a TextLine for a Scalding job, but the offset appears to be getting mixed in with the line, whether I express the offset explicitly or implicitly. Here is my job: package changed import com.twitter.scalding._ import…
Ellen Spertus
  • 6,576
  • 9
  • 50
  • 101
1
vote
0 answers

Scalding Execution Monad - What is it & how to use it

I am working on Big Data technologies using Java-based MR. But recently my company has moved to the Scalding framework. I am not able to get my head around the Scalding Execution Monad: what it is and how it works. I cannot find much material on it on…
1
vote
0 answers

scald.rb results in error (could not find or load main class)

I am trying to run the tutorial files from https://github.com/twitter/scalding/tree/develop/tutorial. I cloned the 0.17.x branch and the current develop branch and haven't had much success with either. I have also already run "sbt update" and "sbt…
DBD
  • 11
  • 4
1
vote
1 answer

How do I use HyperLogLogMonoid from Algebird to carry out arbitrary intersections and unions

I'd like to aggregate a bunch of values that belong to a particular category into an HLL data structure so I can carry out intersections and unions later and count the resulting cardinality of such computations. I was able to get to the point where I…
harshsinghal
  • 3,720
  • 8
  • 35
  • 32
1
vote
2 answers

How to override setup and cleanup methods in spark map function

Suppose there is the following MapReduce job. Mapper: setup() initializes some state; map() adds data to the state, no output; cleanup() outputs the state to the context. Reducer: aggregates all states into one output. How could such a job be implemented in Spark?…
Julias
  • 5,752
  • 17
  • 59
  • 84