Questions tagged [apache-crunch]

Simple and Efficient MapReduce Pipelines

Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.

http://crunch.apache.org/
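For orientation, a minimal sketch of what a Crunch pipeline looks like, loosely modeled on the word-count example from the getting-started guide (the input and output paths are placeholders):

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.conf.Configuration;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(WordCount.class, new Configuration());

        // Read lines of text (path is a placeholder).
        PCollection<String> lines = pipeline.readTextFile("/path/to/input");

        // Split each line into words.
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        }, Writables.strings());

        // Count occurrences of each word and write the counts out as text.
        PTable<String, Long> counts = words.count();
        pipeline.writeTextFile(counts, "/path/to/output");

        // Nothing executes until run()/done() is called.
        pipeline.done();
      }
    }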

52 questions
0
votes
0 answers

Testing DoFn Apache Crunch

I am very new to Apache Crunch. This is the first test case I have written. Currently I am writing test cases for a DoFn, but it throws a NullPointerException. import static org.mockito.Mockito.verify; import static…
Spell Blade
  • 109
  • 2
  • 6
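A DoFn is a plain Java object, so one way to unit-test it without mocking is to call process() directly with an emitter that simply collects the output (the UpperCaseFn below is a hypothetical stand-in for the function under test; Crunch's own InMemoryEmitter can play the same role as the hand-rolled emitter here, if it is available in your version):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.junit.Assert;
    import org.junit.Test;

    public class UpperCaseFnTest {

      // Hypothetical DoFn under test: upper-cases every input string.
      static class UpperCaseFn extends DoFn<String, String> {
        @Override
        public void process(String input, Emitter<String> emitter) {
          emitter.emit(input.toUpperCase());
        }
      }

      // A trivial Emitter that collects emitted values into a list.
      static class CollectingEmitter<T> implements Emitter<T> {
        final List<T> emitted = new ArrayList<T>();
        @Override public void emit(T value) { emitted.add(value); }
        @Override public void flush() { }
      }

      @Test
      public void emitsUpperCasedStrings() {
        CollectingEmitter<String> emitter = new CollectingEmitter<String>();
        new UpperCaseFn().process("crunch", emitter);
        Assert.assertEquals(Arrays.asList("CRUNCH"), emitter.emitted);
      }
    }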
0
votes
1 answer

Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)

Which credentials should be provided in Kerberos to resolve this exception when running an Apache Crunch MapReduce pipeline? There is no difference after logging in through the kinit command. Logs are as follows: WARN [main]…
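The logs above are truncated, so this is only a general note: for pipelines launched outside an interactive kinit session (e.g. from a scheduler), a common pattern is to log in from a keytab explicitly before building the pipeline. A sketch with a hypothetical principal and keytab path:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLogin {
      public static void login() throws IOException {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Hypothetical principal and keytab path; replace with your own.
        UserGroupInformation.loginUserFromKeytab(
            "etl-user@EXAMPLE.COM", "/etc/security/keytabs/etl-user.keytab");
      }
    }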
0
votes
2 answers

How to execute one particular workflow action in Oozie if I killed the Oozie workflow manually?

I have the Oozie workflow below. Suppose I manually killed the job while action "Do_task1" was executing, but I still want to execute action "Do_task2" in spite of having killed the Oozie job manually (while action "Do_task1" was running). How can I do that? …
0
votes
1 answer

Hadoop java.lang.RuntimeException: java.lang.NoSuchMethodException

I am using Apache Crunch to write some MapReduce code. I have the following class, which holds some data that is passed around in the MapReduce code, but I get an exception and am not sure why. Here is the class interface: package…
Rookie
  • 5,179
  • 13
  • 41
  • 65
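Without seeing the class it is only a guess, but a NoSuchMethodException naming a constructor usually means the framework is instantiating the class reflectively and cannot find a no-argument constructor. A sketch of the shape such a value class typically needs (all names here are hypothetical):

    import java.io.Serializable;

    // Hypothetical value class passed between stages of a Crunch/MapReduce job.
    public class MyRecord implements Serializable {
      private String id;
      private long timestamp;

      // Reflection-based instantiation (e.g. Writable or Avro reflect
      // serialization) needs a visible no-argument constructor.
      public MyRecord() { }

      public MyRecord(String id, long timestamp) {
        this.id = id;
        this.timestamp = timestamp;
      }

      public String getId() { return id; }
      public long getTimestamp() { return timestamp; }
    }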
0
votes
1 answer

Apache crunch unable to write output

It might be an oversight, but I am unable to spot why Apache Crunch won't write output to a file for a very simple program I am writing to learn Crunch. Here's the code: import org.apache.crunch.Pipeline; import org.apache.hadoop.conf.Configuration; …
Rookie
  • 5,179
  • 13
  • 41
  • 65
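The code above is truncated, so this is only a guess, but a detail that often trips up new Crunch users is that pipelines are lazy: nothing is written until run() or done() is called. A minimal sketch with placeholder paths:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class CopyLines {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(CopyLines.class, new Configuration());
        PCollection<String> lines = pipeline.readTextFile("/path/to/input");
        pipeline.writeTextFile(lines, "/path/to/output");
        // The write only happens when the pipeline is actually executed.
        pipeline.done();
      }
    }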
0
votes
0 answers

Using enum, Error: org.apache.crunch.CrunchRuntimeException: java.lang.NoSuchMethodException:

When I use a custom enum in the Crunch parallelDo (Avros.reflects(TestEnumType.class)) map function, I get the error below. Error: org.apache.crunch.CrunchRuntimeException: java.lang.NoSuchMethodException: EntityChangeType.<init>() at…
Shravan Ramamurthy
  • 3,896
  • 5
  • 30
  • 44
0
votes
0 answers

Migrating hive collect_set query to apache crunch

How can I write an Apache Crunch job equivalent to this Hive query: select A, collect_set(B) as C from table group by A?
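One way to approximate collect_set, sketched under the assumption that the data is already a PTable<String, String> of (A, B) pairs: de-duplicate the pairs with Distinct, then gather the surviving values per key with collectValues() (the helper name is hypothetical):

    import java.util.Collection;

    import org.apache.crunch.PTable;
    import org.apache.crunch.lib.Distinct;

    public class CollectSetExample {
      // (A, B) pairs in, (A, collection of distinct B values) out.
      public static PTable<String, Collection<String>> collectSet(
          PTable<String, String> table) {
        return Distinct.distinct(table).collectValues();
      }
    }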
0
votes
1 answer

Apache Crunch: How to set multiple input paths?

I have a problem: I can't set multiple input paths when I use Apache Crunch. How can I solve this?
杜少云
  • 1
  • 1
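One approach is simply to read each path as its own source and union the resulting PCollections; a sketch with hypothetical paths:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class MultipleInputs {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(MultipleInputs.class, new Configuration());
        // Read each input path separately, then union them into one PCollection.
        PCollection<String> first = pipeline.readTextFile("/data/input1");
        PCollection<String> second = pipeline.readTextFile("/data/input2");
        PCollection<String> all = first.union(second);
        pipeline.writeTextFile(all, "/data/output");
        pipeline.done();
      }
    }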
0
votes
0 answers

Stopping scanner timeout when large number of cells

I have a Crunch job where a row can contain hundreds of thousands of cells (the data is split into rows keyed by location+time; for certain locations and times there can be lots of cells). The job processes each cell, but I get a scanner timeout…
Tam Toucan
  • 137
  • 10
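Without the job code it is hard to be specific, but the usual knobs for wide rows are the scan batch/caching sizes and the client scanner timeout. A sketch of adjusting them on whatever Scan and Configuration the job hands to its HBase source (the numbers are arbitrary):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.Scan;

    public class ScannerSettings {
      public static Scan configure(Configuration conf) {
        // Raise the client-side scanner timeout (milliseconds).
        conf.setInt("hbase.client.scanner.timeout.period", 300000);

        Scan scan = new Scan();
        // Limit how many cells of a wide row come back per RPC, so a single
        // next() call cannot run past the timeout.
        scan.setBatch(1000);
        // Fetch a modest number of rows per RPC as well.
        scan.setCaching(100);
        return scan;
      }
    }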
0
votes
1 answer

What happens when calling Apache Crunch pipeline read twice on two different sources?

When making the following calls: PCollection data1 = pipeline.read(source1); PCollection data2 = pipeline.read(source2); PCollection data3 = data1.union(data2); According to Apache Crunch read…
0
votes
1 answer

How to run an Apache Crunch application without Hadoop?

I heard that Apache Crunch is a facade and that it can run applications without Hadoop. Is this true? If yes, then how do I do that? In the Apache Crunch Getting Started guide, the very first example includes the hadoop command: $ hadoop jar…
Dims
  • 47,675
  • 117
  • 331
  • 600
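For local, in-process execution Crunch provides MemPipeline, which runs everything in memory without submitting a job to a cluster (the Hadoop client libraries still need to be on the classpath, though). A sketch with made-up data:

    import org.apache.crunch.MapFn;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.impl.mem.MemPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class InMemoryExample {
      public static void main(String[] args) {
        // Build an in-memory PCollection; no cluster is involved.
        PCollection<String> words =
            MemPipeline.typedCollectionOf(Writables.strings(), "apache", "crunch");

        PCollection<Integer> lengths = words.parallelDo(new MapFn<String, Integer>() {
          @Override
          public Integer map(String word) {
            return word.length();
          }
        }, Writables.ints());

        // materialize() hands back an Iterable that can be inspected directly.
        for (Integer length : lengths.materialize()) {
          System.out.println(length);
        }
      }
    }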
0
votes
1 answer

Iterating over PTable in crunch

I have the following PTables: PTable<String, String> somePTable1 = somePCollection1.parallelDo(new SomeClass(), Writables.tableOf(Writables.strings(), Writables.strings())); PTable<…> somePTable2 =…
Vivek Rai
  • 73
  • 5
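For inspecting a PTable's contents, materialize() returns an Iterable of Pair<K, V> (and materializeToMap() a Map<K, V>); a sketch assuming a PTable<String, String> like somePTable1 above:

    import org.apache.crunch.PTable;
    import org.apache.crunch.Pair;

    public class IteratePTable {
      // Iterate over the materialized contents of a PTable; note that
      // materializing triggers execution of the pipeline up to this point.
      public static void printAll(PTable<String, String> table) {
        for (Pair<String, String> entry : table.materialize()) {
          System.out.println(entry.first() + " -> " + entry.second());
        }
      }
    }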
0
votes
0 answers

Scaling Oozie Map Reduce Job: Does splitting into smaller jobs reduce overall runtime and memory usage?

I have an Oozie workflow that runs a MapReduce job within a particular queue on the cluster. I have to add more input sources/clients to this job, so it will be processing n times more data than it does today. My question is: if instead of…
Gadam
  • 2,674
  • 8
  • 37
  • 56
0
votes
1 answer

java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/mapreduce/MultiTableInputFormat

While running a test for a MapReduce job on a Hadoop minicluster, I am getting the error: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/mapreduce/MultiTableInputFormat at …
0
votes
1 answer

java.lang.UnsatisfiedLinkError when writing using crunch MemPipeline

I am using com.cloudera.crunch version '0.3.0-3-cdh-5.2.1'. I have a small program that reads some Avro files and filters out invalid data based on some criteria. I am using pipeline.write(PCollection, AvroFileTarget) to write the invalid data output.…
Yogesh
  • 63
  • 1
  • 10