Questions tagged [scalding]

Scalding is a Scala DSL for Cascading, running on Hadoop.

See https://github.com/twitter/scalding

181 questions
4
votes
0 answers

How do you deserialize Kryo into case classes using Scalding?

I know that Scalding's default serialization uses Kryo. So for this example, let's say I have a pipe of student objects. case class Student(name:String, id:String) val pipe: Pipe[Student] = //.... Then I write that pipe to a TextDelimited file…
user3335040
  • 649
  • 1
  • 7
  • 17
4
votes
2 answers

Create Scalding Source like TextLine that combines multiple files into single mappers

We have many small files that need combining. In Scalding you can use TextLine to read files as text lines. The problem is we get 1 mapper per file, but we want to combine multiple files so that they are processed by 1 mapper. I understand we need…
samthebest
  • 30,803
  • 25
  • 102
  • 142
4
votes
1 answer

Programmatically determine Field names of Scalding/Cascading Pipe

I'm using Scalding to process records with many (> 22) fields. At the end of the process, I'd like to write out the final Pipe's field names to a file. I know this is possible as Mapper and Reducer logs show this information. I'd like to get this…
Ben Sidhom
  • 1,548
  • 16
  • 25
4
votes
1 answer

Cascading HBase Tap

I am trying to write Scalding jobs which have to connect to HBase, but I have trouble using the HBase tap. I have tried using the tap provided by Twitter Maple, following this example project, but it seems that there is some incompatibility between…
Andrea
  • 20,253
  • 23
  • 114
  • 183
4
votes
3 answers

SQL Union equivalent in Twitter Scalding

I need to join 2 pipes with the same set of fields, i.e. ('id, 'groupName, 'name), the same way SQL UNION works. How is it possible to do this in Twitter Scalding?
3
votes
1 answer

Read multiple files using scalding and output a SINGLE file

I have been experiencing an issue these days: I am trying to read from multiple files using Scalding and create an output with a single file. My code is this: def getFilesSource (paths: Seq[String]) = { new MultipleTextLineFiles(paths: _*) { override…
George Lica
  • 1,798
  • 1
  • 12
  • 23
3
votes
2 answers

Scalding TypedPipe API External Operations pattern

I have a copy of Programming MapReduce with Scalding by Antonios Chalkiopoulos. In the book he discusses the External Operations design pattern for Scalding code. You can see an example on his website here. I have made a choice to use the Type…
PhillipAMann
  • 887
  • 1
  • 10
  • 19
3
votes
0 answers

Scalding: Trouble reading avro file with nested structure

I need to read in an Avro file in Scalding but have no idea how to work with it. I have worked with straightforward Avro files, but this one is a little more complicated. The schema looks like this: {"type":"record", "name":"features", …
J Calbreath
  • 2,665
  • 4
  • 22
  • 31
3
votes
1 answer

How to declare dependency on Scalding in sbt project?

I am trying to figure out how to create a build.sbt file for my own Scalding-based project. The Scalding source tree has no build.sbt file; instead it has a project/Build.scala build definition. What would be the right way to integrate my own sbt…
DarqMoth
  • 603
  • 1
  • 13
  • 31
3
votes
1 answer

Scalding: retaining all fields after groupBy

I'm doing a groupBy for calculating a value, but it seems that when I group by, I lose all the fields that are not in the aggregation keys: filtered.filterNot('site) {s:String => ...} .filterNot('date) {s:String => ...} aggr =…
Miguel Ping
  • 18,082
  • 23
  • 88
  • 136
3
votes
1 answer

Reading and Writing Case Classes in Scalding

Could someone point me to a link that explains how to read and write simple case classes in scalding? Is there some default serialization scheme? For example, I have jobs that create pipes of com.twitter.algebird.Moments. I wish to write the pipes…
3
votes
1 answer

Dependency issue with Scalding and Hadoop with sbt-assembly

I'm trying to build a jar with sbt for a simple Hadoop job so that I can run it on Amazon EMR. However, when I run sbt assembly I get the following error: [error] (*:assembly) deduplicate: different file contents found in the…
tshauck
  • 20,746
  • 8
  • 36
  • 36
3
votes
5 answers

Alternatives to scalding for HBase access from Scala (or Java)

Could anybody please recommend a good solution (framework) to access HBase on a Hadoop cluster from a Scala (or Java) application? For now I'm moving in the Scalding direction. The prototypes I built allowed me to combine the Scalding library with Maven and…
Roman Nikitchenko
  • 12,800
  • 7
  • 74
  • 110
2
votes
0 answers

Oozie workflow intermittently fails on Java action for Scalding

I am using CDH (Cloudera Hadoop) version 5.12.0 (which uses Hadoop 2.6.0 and Oozie 4.1.0) and Scalding 2.11. I am using a shaded jar with my dependencies built in. I can run all my jobs properly without any error using a hadoop jar command as…
Murium
  • 183
  • 7
2
votes
1 answer

Store algebird Bloom Filter with Storehaus

I have a Spark job whose final output is an Algebird Bloom filter, and I need to reuse this Bloom filter in another Spark job. Is there a way to store this Bloom filter in a key-value store (e.g. Redis) using Twitter Storehaus and retrieve it in the…
arnaud briche
  • 1,479
  • 3
  • 20
  • 25