Questions tagged [apache-crunch]

Simple and Efficient MapReduce Pipelines

Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into a relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.

http://crunch.apache.org/

52 questions
1 vote, 0 answers

Can Apache Crunch be used to create a graph-like data structure?

I have two Crunch PCollections, of Edge and Node. I need to convert the structure into a graph that can facilitate smooth traversal through it. Are there any Apache Crunch methods or classes that I can use to create such a structure? The data is huge,…
Faiz Kidwai
1 vote, 1 answer

Write an Apache Crunch PCollection to multiple output files

I have a Crunch DoFn that generates a PCollection. Currently I am writing the PCollection to a single Avro file; I want to write it to multiple files. PCollection generatedResults = results.parallelDo(new…
Sneha
1 vote, 2 answers

Could not find or load main class while trying to run project from IntelliJ

I downloaded the project with git clone http://github.com/jwills/crunch-demo, then imported it into IntelliJ as an existing Maven project. Now I am trying to run the main function, but it fails with the error message Error: Could not find or load main class…
Dims
1 vote, 1 answer

How can I define a DoFn in Apache Crunch with a "void" data type?

I don't need any output from the DoFn; I just want to update a MySQL database for each record the DoFn receives. How can I define a DoFn with a void data type? In other words, I don't want to emit anything from the DoFn.
Vivek Rai
1 vote, 0 answers

org.apache.crunch.CrunchRuntimeException: java.io.NotSerializableException

I have a PTable> which is generated at an intermediate stage of the program, and on which I am running a transformation job. Sample PTable entry: ["0067b4c054d14fe2-ACC8D37", [{ "unique_id": "0067b4c054d14fe2-ACC8D37", …
mukul
1 vote, 1 answer

How to use counters in Apache Crunch

In Apache Crunch, there is a method named increment("any enum"). I used increment(TOTAL_IDS);, but where can I see the counter values? The counters do not appear in the logs after the job completes. What am I missing?
Vivek Rai
1 vote, 1 answer

Link a Crunch Spark pipeline with a Spark application that begins with a SparkSession instance

A Crunch pipeline can take a Java Spark context as a parameter, but what if the Spark application starts with a SparkSession instance (as the Spark Java program includes Datasets and requires Spark SQL)? How do I add another layer of abstraction (Crunch pipeline)…
devastrix
1 vote, 1 answer

Writing Parquet file in Apache Crunch

I am new to Apache Crunch and am looking for a way to read and write Parquet files with it. I followed the documentation and API but did not find a straightforward approach or method for doing so. PCollection pipeLine =…
Khan F
1 vote, 0 answers

Crunch SparkPipeline does not work as expected

I am trying to migrate our code from Crunch MRPipeline to SparkPipeline. I tried a simple example like this: SparkConf sc = new SparkConf().setAppName("Crunch Spark Count").setMaster("local"); JavaSparkContext jsc = new…
qingpan
1 vote, 0 answers

Writable type family resolution in Scrunch vs Crunch

I have a Scrunch Spark pipeline, and when I try to save its output in Avro format using data.write(to.avroFile(path)), I get the following exception: java.lang.ClassCastException: org.apache.crunch.types.writable.WritableType cannot be cast to…
djsecilla
1 vote, 2 answers

Crunch debug logging

Anyone who has used Crunch pipelines knows that nothing is actually performed until the pipeline.run() or pipeline.done() method is called. In most languages, we can add log statements to print out intermediate variable values, but…
Kesh
1 vote, 1 answer

How to read a Hive partition into an Apache Crunch pipeline?

I am able to read text files in HDFS into an Apache Crunch pipeline, but now I need to read Hive partitions. The problem is that, per our design, I am not supposed to access the files directly. Hence, I need some way by which I can access the…
Jijo Mathew
0 votes, 0 answers

Apache Crunch MapReduce job: setting input split size not working

I have the following scenario: multiple MapReduce jobs using Apache Crunch, scheduled with Oozie. Let's consider only one job for simplicity. What I want to achieve is to reduce the number of mappers of that job. The number of mappers…
Stefan Ss
0 votes, 0 answers

Is it possible to convert an Apache Crunch PCollection to an Apache Spark JavaRDD?

I want to convert a PCollection to a JavaRDD. Is that possible? If so, how?
0 votes, 1 answer

How to write Apache Crunch output to an Amazon S3 bucket

Is there a way to write Apache Crunch output to an S3 bucket? The Crunch pipeline has a write method that takes a Target as a parameter. Is there a way to pass S3 as the Target to that write method?
Sam