Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization which enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig Latin statements can be run in two ways: interactively (in the Grunt shell) or in batch mode (from a script file).

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
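
As a rough illustration of the "data flow sequence" style described above, here is a minimal Pig Latin sketch. The input file `reviews.tsv`, its field names, and the output directory `city_summary` are hypothetical and chosen only for this example; the operators and built-in functions (LOAD, FILTER, GROUP, FOREACH, STORE, COUNT, AVG, PigStorage) are standard Pig Latin.

```pig
-- Hypothetical input: tab-separated lines of (user, city, rating).
-- One way to try it in local mode: pig -x local example.pig
reviews = LOAD 'reviews.tsv' USING PigStorage('\t')
          AS (user:chararray, city:chararray, rating:double);

-- Step 1: keep only rows that actually carry a rating.
rated = FILTER reviews BY rating IS NOT NULL;

-- Step 2: group rows by city; each group holds a bag named 'rated'.
by_city = GROUP rated BY city;

-- Step 3: produce one summary row per city.
summary = FOREACH by_city GENERATE
              group             AS city,
              COUNT(rated)      AS num_reviews,
              AVG(rated.rating) AS avg_rating;

-- Step 4: write the result (output directory name is an assumption).
STORE summary INTO 'city_summary' USING PigStorage('\t');
```

Each statement names an intermediate relation, so the script reads as a chain of transformations, and Pig is free to plan how those steps are compiled into jobs.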

5199 questions
9
votes
3 answers

In Apache Pig, select DISTINCT rows based on a single column

Let's say I have a table such as the one below, that may or may not contain duplicates for a given field:
ID   URL
---  ------------------
001  http://example.com/adam
002  http://example.com/beth
002  …
Arel
  • 1,339
  • 17
  • 22
9
votes
1 answer

Group by multiple fields and output tuple

I have a feed in the following format:
Hour  Key  ID   Value
1     K1   001  3
1     K1   002  2
2     K1   005  4
1     K2   002  1
2     K2   003  5
2     K2   004  6
and I want to group the feed by (Hour, Key) then sum the Value but…
Rock
  • 2,827
  • 8
  • 35
  • 47
9
votes
2 answers

Self cross-join in pig is disregarded

If one has data like this:
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
And then a cross-join is done on A, A:
B = CROSS A, A;
DUMP B;
(1,2,3)
(4,2,1)
Why is the second A optimized out of the query? info: pig version…
Artem Oboturov
  • 4,344
  • 2
  • 30
  • 48
9
votes
18 answers

What language could I use for fast execution of this database summarization task?

So I wrote a Python program to handle a little data processing task. Here's a very brief specification in a made-up language of the computation I want: parse "%s %lf %s" aa bb cc | group_by aa | quickselect --key=bb 0:5 | \ flatten | format "%s…
9
votes
2 answers

Removing duplicates using PigLatin

I'm using Pig Latin to filter some records.
User1  8  NYC
User1  9  NYC
User1  7  LA
User2  4  NYC
User2  3  DC
The script should remove the duplicates for each user and keep one of these records. Something like the uniq command in Linux. The output…
aalsum
  • 103
  • 1
  • 1
  • 4
8
votes
2 answers

Pig non-aggregated warnings output location?

Pig: 0.8.1-cdh3u2 Hadoop: 0.20.2-cdh3u0 Debugging FIELD_DISCARDED_TYPE_CONVERSION_FAILED warnings, but I can't seem to get individual warnings printed anywhere. Disabling aggregation via the -w or aggregate.warnings=false switch removes the summary…
andrew
  • 406
  • 2
  • 7
8
votes
1 answer

How does Pig use Hadoop Globs in a 'load' statement?

As I've noted previously, Pig doesn't cope well with empty (0-byte) files. Unfortunately, there are lots of ways that these files can be created (even within Hadoop utilities). I thought that I could work around this problem by explicitly loading…
Chris Phillips
  • 11,607
  • 3
  • 34
  • 45
8
votes
1 answer

Max/Min for whole sets of records in PIG

I have a set of records that I am loading from a file, and the first thing I need to do is get the max and min of a column. In SQL I would do this with a subquery like this: select c.state, c.population, (select max(c.population) from…
Winter
  • 1,490
  • 3
  • 13
  • 21
8
votes
5 answers

Can I generate nested bags using nested FOREACH statements in Pig Latin?

Let's say I have a data set of restaurant reviews: User,City,Restaurant,Rating Jim,New York,Mecurials,3 Jim,New York,Whapme,4.5 Jim,London,Pint Size,2 Lisa,London,Pint Size,4 Lisa,London,Rabbit Whole,3.5 And I want to produce a list by user and…
PP.
  • 10,764
  • 7
  • 45
  • 59
8
votes
4 answers

Pig keeps trying to connect to job history server (and fails)

I'm running a Pig job that fails to connect to the Hadoop job history server. The task (usually any task with GROUP BY) runs for a while and then it starts with a message like: 2015-04-21 19:05:22,825 [main] INFO …
badroit
  • 1,316
  • 15
  • 28
8
votes
1 answer

Pig - ERROR 1045: AVG as multiple or none of them fit. Please use an explicit cast

I have a comma-separated .txt file, and I want to DUMP the AVG age of all males. records = LOAD 'file:/home/gautamshaw/Documents/PigDemo_CommaSep.txt' USING PigStorage(',') AS…
user182944
  • 7,897
  • 33
  • 108
  • 174
8
votes
1 answer

Check if an element is present in a bag?

How can I check in Pig Latin whether a bag contains an element? Example: in a bag of chararray, how can I check if a token is present?
Nitish Upreti
  • 6,312
  • 9
  • 50
  • 92
8
votes
0 answers

Unable to find region for hello_world

Versions: Hadoop 2.2, HBase 0.96.1, Pig 0.12 Whenever I run this Pig script raw_data = LOAD 'sample_data.csv' USING PigStorage( ',' ) AS ( listing_id: chararray, fname: chararray, lname: chararray ); STORE raw_data INTO…
fsi
  • 1,319
  • 1
  • 23
  • 51
8
votes
4 answers

ERROR 1066: Unable to open iterator for alias - Pig

Just started with Pig; trying to load the data from a file and then dump it. Loading seems to work, and no error is thrown. Below is the query: NYSE = LOAD '/root/Desktop/Works/NYSE-2000-2001.tsv' USING PigStorage() AS (exchange:chararray,…
knowone
  • 840
  • 2
  • 16
  • 37
8
votes
2 answers

Hadoop and Stata

Does anyone have any experience using Stata and Hadoop? Stata 13 now has a Java Plugin API, so I think it should be straightforward to get them to play nice. I am particularly interested in being able to parse weblog data to get it into a form…
dimitriy
  • 9,077
  • 2
  • 25
  • 50