Questions tagged [apache-crunch]

Simple and Efficient MapReduce Pipelines

Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.

http://crunch.apache.org/
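For orientation, a minimal sketch of what a Crunch pipeline looks like, loosely modeled on the word-count example from the getting-started guide (the input and output paths are placeholders):

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.conf.Configuration;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(WordCount.class, new Configuration());

        // Read lines of text (path is a placeholder).
        PCollection<String> lines = pipeline.readTextFile("/path/to/input");

        // Split each line into words.
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        }, Writables.strings());

        // Count occurrences of each word and write the counts out as text.
        PTable<String, Long> counts = words.count();
        pipeline.writeTextFile(counts, "/path/to/output");

        // Nothing executes until run()/done() is called.
        pipeline.done();
      }
    }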

52 questions
0
votes
0 answers

Testing DoFn Apache Crunch

I am very new to Apache Crunch. This is the first test case I have written. Currently I am writing test cases for a DoFn, but it throws a NullPointerException. import static org.mockito.Mockito.verify; import static…
Spell Blade
  • 109
  • 2
  • 6
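A DoFn is a plain Java object, so one way to unit-test it without mocking is to call process() directly with an emitter that simply collects the output (the UpperCaseFn below is a hypothetical stand-in for the function under test; Crunch's own InMemoryEmitter can play the same role as the hand-rolled emitter here, if it is available in your version):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.junit.Assert;
    import org.junit.Test;

    public class UpperCaseFnTest {

      // Hypothetical DoFn under test: upper-cases every input string.
      static class UpperCaseFn extends DoFn<String, String> {
        @Override
        public void process(String input, Emitter<String> emitter) {
          emitter.emit(input.toUpperCase());
        }
      }

      // A trivial Emitter that collects emitted values into a list.
      static class CollectingEmitter<T> implements Emitter<T> {
        final List<T> emitted = new ArrayList<T>();
        @Override public void emit(T value) { emitted.add(value); }
        @Override public void flush() { }
      }

      @Test
      public void emitsUpperCasedStrings() {
        CollectingEmitter<String> emitter = new CollectingEmitter<String>();
        new UpperCaseFn().process("crunch", emitter);
        Assert.assertEquals(Arrays.asList("CRUNCH"), emitter.emitted);
      }
    }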
0
votes
1 answer

Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)

Which credentials should be provided in Kerberos to resolve this exception when running an Apache Crunch MapReduce pipeline? There is no difference after logging in through the kinit command. Logs are as follows: WARN [main]…
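The logs above are truncated, so this is only a general note: for pipelines launched outside an interactive kinit session (e.g. from a scheduler), a common pattern is to log in from a keytab explicitly before building the pipeline. A sketch with a hypothetical principal and keytab path:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLogin {
      public static void login() throws IOException {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Hypothetical principal and keytab path; replace with your own.
        UserGroupInformation.loginUserFromKeytab(
            "etl-user@EXAMPLE.COM", "/etc/security/keytabs/etl-user.keytab");
      }
    }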
0
votes
2 answers

How to execute one particular workflow action in Oozie if I killed the Oozie workflow manually?

I have the Oozie workflow below. Suppose I manually killed the job while action "Do_task1" was executing, but I still want to execute action "Do_task2" in spite of having killed the Oozie job manually (while action "Do_task1" was running). How can I do that? …
0
votes
1 answer

Hadoop java.lang.RuntimeException: java.lang.NoSuchMethodException

I am using Apache Crunch to write some MapReduce code. I have the following class, which holds some data that is passed around in the MapReduce code, but I get an exception and am not sure why. Here is the class interface: package…
Rookie
  • 5,179
  • 13
  • 41
  • 65
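Without seeing the class it is only a guess, but a NoSuchMethodException naming a constructor usually means the framework is instantiating the class reflectively and cannot find a no-argument constructor. A sketch of the shape such a value class typically needs (all names here are hypothetical):

    import java.io.Serializable;

    // Hypothetical value class passed between stages of a Crunch/MapReduce job.
    public class MyRecord implements Serializable {
      private String id;
      private long timestamp;

      // Reflection-based instantiation (e.g. Writable or Avro reflect
      // serialization) needs a visible no-argument constructor.
      public MyRecord() { }

      public MyRecord(String id, long timestamp) {
        this.id = id;
        this.timestamp = timestamp;
      }

      public String getId() { return id; }
      public long getTimestamp() { return timestamp; }
    }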
0
votes
1 answer

Apache crunch unable to write output

It might be an oversight, but I am unable to spot why Apache Crunch won't write output to a file for a very simple program I am writing to learn Crunch. Here's the code: import org.apache.crunch.Pipeline; import org.apache.hadoop.conf.Configuration; …
Rookie
  • 5,179
  • 13
  • 41
  • 65
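The code above is truncated, so this is only a guess, but a detail that often trips up new Crunch users is that pipelines are lazy: nothing is written until run() or done() is called. A minimal sketch with placeholder paths:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class CopyLines {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(CopyLines.class, new Configuration());
        PCollection<String> lines = pipeline.readTextFile("/path/to/input");
        pipeline.writeTextFile(lines, "/path/to/output");
        // The write only happens when the pipeline is actually executed.
        pipeline.done();
      }
    }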
0
votes
0 answers

Using enum, Error: org.apache.crunch.CrunchRuntimeException: java.lang.NoSuchMethodException:

When I use a custom enum in the Crunch parallelDo (Avros.reflects(TestEnumType.class)) map function, I get the error below. Error: org.apache.crunch.CrunchRuntimeException: java.lang.NoSuchMethodException: EntityChangeType.<init>() at…
Shravan Ramamurthy
  • 3,896
  • 5
  • 30
  • 44
0
votes
0 answers

Migrating hive collect_set query to apache crunch

How can I write an Apache Crunch job equivalent to this Hive query: select A, collect_set(B) as C from table group by A?
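One way to approximate collect_set, sketched under the assumption that the data is already a PTable<String, String> of (A, B) pairs: de-duplicate the pairs with Distinct, then gather the surviving values per key with collectValues() (the helper name is hypothetical):

    import java.util.Collection;

    import org.apache.crunch.PTable;
    import org.apache.crunch.lib.Distinct;

    public class CollectSetExample {
      // (A, B) pairs in, (A, collection of distinct B values) out.
      public static PTable<String, Collection<String>> collectSet(
          PTable<String, String> table) {
        return Distinct.distinct(table).collectValues();
      }
    }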
0
votes
1 answer

Apache Crunch: How to set multiple input paths?

I have a problem: I can't set multiple input paths when I use Apache Crunch. How can I solve this?
杜少云
  • 1
  • 1
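One approach is simply to read each path as its own source and union the resulting PCollections; a sketch with hypothetical paths:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class MultipleInputs {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(MultipleInputs.class, new Configuration());
        // Read each input path separately, then union them into one PCollection.
        PCollection<String> first = pipeline.readTextFile("/data/input1");
        PCollection<String> second = pipeline.readTextFile("/data/input2");
        PCollection<String> all = first.union(second);
        pipeline.writeTextFile(all, "/data/output");
        pipeline.done();
      }
    }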
0
votes
0 answers

Stopping scanner timeout when large number of cells

I have a Crunch job where a row can contain hundreds of thousands of cells (the data is split into rows keyed by location+time; for certain locations and times there can be lots of cells). The job processes each cell, but I get a scanner timeout…
Tam Toucan
  • 137
  • 10
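Without the job code it is hard to be specific, but the usual knobs for wide rows are the scan batch/caching sizes and the client scanner timeout. A sketch of adjusting them on whatever Scan and Configuration the job hands to its HBase source (the numbers are arbitrary):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.Scan;

    public class ScannerSettings {
      public static Scan configure(Configuration conf) {
        // Raise the client-side scanner timeout (milliseconds).
        conf.setInt("hbase.client.scanner.timeout.period", 300000);

        Scan scan = new Scan();
        // Limit how many cells of a wide row come back per RPC, so a single
        // next() call cannot run past the timeout.
        scan.setBatch(1000);
        // Fetch a modest number of rows per RPC as well.
        scan.setCaching(100);
        return scan;
      }
    }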
0
votes
1 answer

What happens when calling Apache Crunch pipeline read twice on two different sources?

When making the following calls: PCollection data1 = pipeline.read(source1); PCollection data2 = pipeline.read(source2); PCollection data3 = data1.union(data2); According to Apache Crunch read…
0
votes
1 answer

How to run an Apache Crunch application without Hadoop?

I heard that Apache Crunch is a facade and that it can run applications without Hadoop. Is this true? If yes, then how do I do that? In the Apache Crunch Getting Started guide, the very first example includes the hadoop command: $ hadoop jar…
Dims
  • 47,675
  • 117
  • 331
  • 600
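For local, in-process execution Crunch provides MemPipeline, which runs everything in memory without submitting a job to a cluster (the Hadoop client libraries still need to be on the classpath, though). A sketch with made-up data:

    import org.apache.crunch.MapFn;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.impl.mem.MemPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class InMemoryExample {
      public static void main(String[] args) {
        // Build an in-memory PCollection; no cluster is involved.
        PCollection<String> words =
            MemPipeline.typedCollectionOf(Writables.strings(), "apache", "crunch");

        PCollection<Integer> lengths = words.parallelDo(new MapFn<String, Integer>() {
          @Override
          public Integer map(String word) {
            return word.length();
          }
        }, Writables.ints());

        // materialize() hands back an Iterable that can be inspected directly.
        for (Integer length : lengths.materialize()) {
          System.out.println(length);
        }
      }
    }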
0
votes
1 answer

Iterating over PTable in crunch

I have the following PTables: PTable<String, String> somePTable1 = somePCollection1.parallelDo(new SomeClass(), Writables.tableOf(Writables.strings(), Writables.strings())); PTable<…> somePTable2 =…
Vivek Rai
  • 73
  • 5
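For inspecting a PTable's contents, materialize() returns an Iterable of Pair<K, V> (and materializeToMap() a Map<K, V>); a sketch assuming a PTable<String, String> like somePTable1 above:

    import org.apache.crunch.PTable;
    import org.apache.crunch.Pair;

    public class IteratePTable {
      // Iterate over the materialized contents of a PTable; note that
      // materializing triggers execution of the pipeline up to this point.
      public static void printAll(PTable<String, String> table) {
        for (Pair<String, String> entry : table.materialize()) {
          System.out.println(entry.first() + " -> " + entry.second());
        }
      }
    }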
0
votes
0 answers

Scaling Oozie Map Reduce Job: Does splitting into smaller jobs reduce overall runtime and memory usage?

I have an Oozie workflow that runs a MapReduce job within a particular queue on the cluster. I have to add more input sources/clients to this job, so it will be processing n times more data than it does today. My question is: if instead of…
Gadam
  • 2,674
  • 8
  • 37
  • 56
0
votes
1 answer

java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/mapreduce/MultiTableInputFormat

While running a test for a MapReduce job on a Hadoop minicluster, I am getting the error: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/mapreduce/MultiTableInputFormat at …
0
votes
1 answer

java.lang.UnsatisfiedLinkError when writing using crunch MemPipeline

I am using com.cloudera.crunch version '0.3.0-3-cdh-5.2.1'. I have a small program that reads some Avro files and filters out invalid data based on some criteria. I am using pipeline.write(PCollection, AvroFileTarget) to write the invalid data output.…
Yogesh
  • 63
  • 1
  • 10