Questions tagged [apache-crunch]

Simple and Efficient MapReduce Pipelines

Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into a relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.

http://crunch.apache.org/

52 questions
0
votes
1 answer

How does Apache Crunch PTable collectValues work internally

I was going through some documentation on the HDFS architecture and the Apache Crunch PTable. As I understand it, when we generate a PTable, the data is stored internally across the DataNodes in HDFS. This means that if I have a PTable with…
shubh586
  • 23
  • 4
0
votes
1 answer

Hadoop Job: Error injecting constructor, JAXBException

A MapReduce job implemented in an Apache Crunch pipeline is failing with the error message Error injecting constructor, javax.xml.bind.JAXBException: property "retainReferenceToInfo" is not supported. The Crunch pipeline is very similar to other…
Suriname0
  • 527
  • 1
  • 8
  • 21
0
votes
1 answer

How to convert existing MapReduce applications to Crunch?

I have several (about a dozen) MapReduce tasks implemented, each of which functions as part of a workflow executed by a simple bash script. For a variety of reasons, I would like to move the workflow to Apache Crunch. However, it's not clear to…
Suriname0
  • 527
  • 1
  • 8
  • 21
0
votes
1 answer

How to do a map-side full outer join in Apache Crunch (Join type FULL_OUTER_JOIN not supported by MapsideJoinStrategy)

Hi, I am trying to do a map-side join in Crunch using the MapsideJoinStrategy class. It works fine for an inner join, but for a full outer join it gives this error: "Join type FULL_OUTER_JOIN not supported by MapsideJoinStrategy"
0
votes
1 answer

Does Apache Crunch come with the Hadoop MapReduce API?

When you download Apache Crunch from their website (it comes as source code), it comes without the related MapReduce classes it's based on. Two questions: 1- How is this possible? Apache Crunch is an abstraction on top of MapReduce. How come it…
Aviv Cohn
  • 15,543
  • 25
  • 68
  • 131
0
votes
2 answers

In Apache Crunch, how to find out if a PCollection or PTable has any elements in it? And if so, how many?

I tried setting a breakpoint and doing the following in the watch window: checking .getSize(), which is supposed to return the size in bytes, and calling .materialize() to see if I can look at the Java objects. The .getSize() does show a number >0 but I doubt if that…
Gadam
  • 2,674
  • 8
  • 37
  • 56
0
votes
1 answer

Writing data into MongoDB with Crunch

We're going to use Apache Crunch to implement our new solutions. We'd like to extract data from HBase, apply some logic to filter out the records that do not qualify, and finally write the data in a structured way into MongoDB for further…
wei
  • 3,312
  • 4
  • 23
  • 33