Questions tagged [apache-crunch]

Simple and Efficient MapReduce Pipelines

Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into a relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.

http://crunch.apache.org/

52 questions

vote

0 answers

Can Apache Crunch be used to create a graph-like data structure?

I have two Crunch PCollections, one of Edge and one of Node. I need to convert the structure into a graph that can facilitate smooth traversal through it. Are there any Apache Crunch methods or classes that I can use to create such a structure? The data is huge,…

asked Feb 01 '21 at 06:05

Faiz Kidwai

vote

1 answer

Write an Apache Crunch PCollection to multiple output files

I have a Crunch DoFn which generates a PCollection. Currently I am writing the PCollection to a single Avro file; I want to write the PCollection to multiple files. PCollection generatedResults = results.parallelDo(new…

apache-crunch

asked Jan 07 '21 at 09:13

Sneha

vote

2 answers

Could not find or load main class while trying to run project from IntelliJ

I downloaded the project with git clone http://github.com/jwills/crunch-demo and then imported it into IntelliJ as an existing Maven project. Now I am trying to run the main function, but it fails with the error message Error: Could not find or load main class…

maven intellij-idea classpath apache-crunch

asked May 23 '18 at 09:58

Dims


vote

1 answer

How can I define a DoFn in Apache Crunch with a "void" data type?

Basically, I don't need any output from the DoFn; I just want to update a MySQL database for each record the DoFn receives. So how can I define a DoFn with a void data type? I don't want to emit anything from the DoFn.

apache-crunch

asked Sep 23 '17 at 13:10

Vivek Rai

vote

0 answers

org.apache.crunch.CrunchRuntimeException: java.io.NotSerializableException

I have a PTable> which is generated at an intermediate stage of the program, and on which I am running a transformation job. Sample PTable entry: ["0067b4c054d14fe2-ACC8D37", [{ "unique_id": "0067b4c054d14fe2-ACC8D37", …

java hadoop mapreduce apache-crunch

asked Sep 09 '17 at 05:23

mukul

vote

1 answer

How to use Counters in Apache Crunch

In Apache Crunch, there is a method named increment("any enum"). I used increment(TOTAL_IDS);, but where can I see the result of the counters? The counters do not appear in the logs after the job completes. What am I missing there?

apache-crunch

asked Aug 15 '17 at 13:46

Vivek Rai

vote

1 answer

Link crunch spark pipeline with spark application beginning with SparkSession instance

A Crunch pipeline can take a Java Spark context as a parameter, but what if the Spark application starts with a SparkSession instance (as the Spark Java program includes Datasets and requires Spark SQL)? How do I add another layer of abstraction (Crunch pipeline)…

apache-spark apache-crunch

asked Mar 15 '17 at 10:57

devastrix

vote

1 answer

Writing Parquet file in Apache Crunch

I am new to Apache Crunch and am looking to read and write Parquet files with it. I followed the documentation and API but did not find a straightforward approach or method for doing so. PCollection pipeLine =…

mapreduce hadoop2 parquet apache-crunch

asked Mar 01 '17 at 07:06

Khan F

vote

0 answers

Crunch SparkPipeline does not work as expected

I am trying to migrate our code from Crunch MRPipeline to SparkPipeline. I tried a simple example like this: SparkConf sc = new SparkConf().setAppName("Crunch Spark Count").setMaster("local"); JavaSparkContext jsc = new…

hadoop apache-spark apache-crunch

asked Feb 05 '16 at 07:29

qingpan

vote

0 answers

Writable type family resolution in Scrunch vs Crunch

I have a Scrunch Spark pipeline, and when I try to save its output to Avro format using: data.write(to.avroFile(path)) I get the following Exception: java.lang.ClassCastException: org.apache.crunch.types.writable.WritableType cannot be cast to…

scala apache-crunch

asked Dec 09 '15 at 18:25

djsecilla

vote

2 answers

Crunch Debug Logging

Anyone who has used the Crunch pipelines knows that nothing is actually performed until the pipeline.run() or pipeline.done() method is called. Traditionally in most languages, we can put log statements to print out intermediate variable values, but…

logging apache-crunch

asked May 14 '15 at 15:57

Kesh


vote

1 answer

How to read a Hive partition into an Apache Crunch pipeline?

I am able to read text files in HDFS into an Apache Crunch pipeline. But now I need to read Hive partitions. The problem is that, per our design, I am not supposed to access the files directly. Hence, I now need some way by which I can access the…

hadoop hive pipeline hcatalog apache-crunch

asked Oct 20 '14 at 08:20

Jijo Mathew

votes

0 answers

Apache Crunch map reduce job setting input split size not working

I have the following scenario: multiple MapReduce jobs using Apache Crunch. These jobs are scheduled using Oozie. Let's consider only one job for simplicity. What I want to achieve is reducing the number of mappers of that job. The number of mappers…

hadoop mapreduce apache-crunch input-split

asked Mar 14 '23 at 12:17

Stefan Ss

votes

0 answers

Is it possible to convert an Apache Crunch PCollection to an Apache Spark JavaRDD?

I want to convert a PCollection to a JavaRDD. Is it possible? If yes, then how?

apache-spark apache-crunch

asked Sep 10 '22 at 12:29

Ankit Singh

votes

1 answer

How to write output of Apache Crunch to Amazon S3 bucket

Is there a way to write our Apache Crunch output to an S3 bucket? The Crunch pipeline has a write method which takes a Target as a parameter. Is there a way to pass S3 as the Target to that write method?

amazon-s3 apache-crunch

asked Jan 27 '21 at 12:22

Sam
