Questions tagged [cascading]

Cascading is a Query API, Query Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on a Hadoop cluster.

Cascading is a thin Java library that sits on top of Hadoop's MapReduce layer and is executed from the command line like any other Hadoop application. It is not a new text-based query syntax (like Pig) or another complex system that must be installed and maintained on a cluster (like Hive). Cascading is, however, both complementary to and a valid alternative to either application.

Cascading lets the developer quickly assemble complex distributed data-processing applications without having to "think" in MapReduce, and efficiently schedule them based on their dependencies. Simple data-processing applications are supported as well, since complex applications tend to start simple.

Cascading is Open Source and dual licensed under the GPL and OEM/Commercial Licenses. OEM/Commercial Licenses and Developer Support can be obtained through Concurrent, Inc.

Cascading has a strong community of users and contributors; see our Cascading modules page for related projects and extensions.

Cascading, extensions, and related libraries are also hosted in the Conjars Maven repository maintained by Concurrent, Inc. The repository is open to the public.

[Figure: Cascading application-stack overview]

364 questions
0
votes
1 answer

Loading data from Hadoop Cascading Source into MySQL Sink

I'm trying to write data from a Cascading source into MySQL, so I wonder if there's an easy sink available to take the tab-delimited data that's coming from the source and just run a couple of SQL statements to update a table. I'm new…
0
votes
2 answers

Cascading + libjars = ClassNotFoundException. Sometimes

I am running a Cascading (actually Scalding) Hadoop job that uses DistributedCache for dependent jars. The first time it works fine (meaning that the classpath is set up correctly), but then it starts failing with…
Sasha O
  • 3,710
  • 2
  • 35
  • 45
0
votes
1 answer

How can I read and write binary files in Cascading?

I want to load some files in binary format (for example JPEGs, but it could be any binary format), manipulate them somehow, and write them back. I want to do that on Hadoop, and I would like to write it on top of the Cascading framework. Are there binary sinks /…
polo
  • 1,352
  • 2
  • 16
  • 35
0
votes
1 answer

How can I pass cascading parameters from ASP.NET to SSRS

I am trying to build a web application (ASP.NET) that will be used to display an SSRS report. My report has 4 cascading parameters - A, B, C and D. C and D "depend" logically on the value of A (this means that the DataSets of C and D are filtered based…
0
votes
1 answer

Hadoop Cascading framework to Update specific column data

I have a MongoDB collection which looks like this: Id Name createTime updateTime Age Country verificationStatus Id1 Abc 10-7-2013 10-7-2013 21 Xxxx INITIAL_MAIL Id2 Efg 9-7-2013 10-7-2013 22 Xxxx FIRST_REMINDER Id3 Hij…
vinoth
  • 11
  • 4
0
votes
1 answer

Is a Cascading function executed in a single thread as a Hadoop mapper function?

I'm reading the Cascading documentation, chapter 5.2 Functions, and I wonder what will happen with the following code. Should it work OK in a multithreaded environment? The more general question is: could the Function be multithreaded? As I know the…
Julias
  • 5,752
  • 17
  • 59
  • 84
0
votes
2 answers

Combining outputs in Cascading

I am analyzing log files with various domain names using Cascading. Here is an example of the output report after it has been filtered: www.google.nl 3 www.google.it 3 www.google.com.co 3 www.google.com.hk 3 www.google.co.jp 3 I would like to group…
cevallos.valtira
  • 191
  • 1
  • 1
  • 8
0
votes
1 answer

What tools exist for benchmarking Cascading for Hadoop routines?

I have been given a multi-step Cascading program that takes about ten times as long to run as an equivalent M/R job. How do I go about figuring out which of the steps is running the slowest so I can target it for optimization?
Robert Rapplean
  • 672
  • 1
  • 9
  • 30
0
votes
1 answer

Ignoring outputs in Cascading

I am analyzing log files with various domain names. I want to exclude/ignore from the output report any domain that has the word "macys". Here is an example output: l.macys.com 87516 www.google.com 3016 search.yahoo.com 584 www.bing.com…
cevallos.valtira
  • 191
  • 1
  • 1
  • 8
0
votes
1 answer

Cascading - regex parser - wrong number of fields

Starting to play with Cascading on Amazon EMR, have managed to get it running BUT falling at a fairly simple hurdle and I was hoping someone could shed some light on it. My code: import java.util.Properties; import cascading.flow.Flow; import…
Duncan
  • 10,218
  • 14
  • 64
  • 96
0
votes
2 answers

Hadoop Cascading: how to get top N tuples

New to Cascading, trying to find a way to get the top N tuples based on a sort/order. For example, I'd like to know the top 100 first names people are using. Here's something similar that I can do in Teradata SQL: select top 100 first_name, num_records …
Kartrace
  • 1
  • 1
0
votes
2 answers

Getting cascading.tap.hadoop.io.MultiInputSplit class not found exception while running a Hadoop program using the Cascading framework

Here is my code that connects to a Hadoop machine, performs a set of validations, and writes to another directory. public class Main{ public static void main(String...strings){ System.setProperty("HADOOP_USER_NAME", "root"); …
Mohammad Adnan
  • 6,527
  • 6
  • 29
  • 47
0
votes
1 answer

How to rename Pipe fields in Cascading?

On two separate occasions, I've had to rename all the fields in a Pipe to join (using Merge or CoGroup). What I have done recently is: //These two pipes contain similar values but different Field Names Pipe papa = new Retain(papa, fieldsFrom); Pipe…
Engineiro
  • 1,146
  • 7
  • 10
0
votes
2 answers

Cascading(buffer) implementation

I need to create a buffer in Cascading on Hadoop. Suppose I have the fields: member_id,amountpaid,diadnosis_id,diagnosis_description,superGrouper_id,superGrouper_descriptiion,grouperId,grouperDescription I need to group the fields from member_id and…
Rach
  • 3
  • 1
0
votes
0 answers

Prevent cascading refreshes

I have a header.js that includes in its ready section the following: var auto_refresh = setInterval(function () { var theToken = $('#token').text(); $('#error-div').text(''); …
John Wooten
  • 685
  • 1
  • 6
  • 21