Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be run in two ways: interactively (from the Grunt shell) or in batch mode (from a script file).

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
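As a rough illustration of the extensibility point, Pig lets user-defined functions be written in Python as well as Java. The sketch below is a hypothetical UDF module (the file name `udfs.py` and the function `url_domain` are made up for this example). It assumes the Pig runtime supplies an `outputSchema` decorator, as Pig's UDF documentation describes, and adds a stand-in so the same file can be tried outside Pig.

```python
# Sketch of a Pig Python UDF module (hypothetical file name: udfs.py).
# Assumption: when the script runs inside Pig, the runtime provides an
# `outputSchema` decorator. The fallback below is only a stand-in so the
# module can also be imported and tested with plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        """Stand-in decorator that records the declared schema string."""
        def decorate(func):
            func.output_schema = schema
            return func
        return decorate


@outputSchema("domain:chararray")
def url_domain(url):
    """Return the lower-cased host part of a URL, or None for empty input."""
    if not url:
        return None
    # Drop an optional scheme such as "http://", then cut at the first "/".
    without_scheme = url.split("://", 1)[-1]
    return without_scheme.split("/", 1)[0].lower()
```

In a Pig script, a file like this would typically be registered with `REGISTER 'udfs.py' USING jython AS myudfs;` and then called as `myudfs.url_domain(url)` inside a `FOREACH ... GENERATE`; the alias `myudfs` is illustrative.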

5199 questions
11
votes
3 answers

how to include external jar file using PIG

When I run a MapReduce job using the hadoop command, I use -libjars to add my jar to the cache and the classpath. How can I do something similar in Pig?
root1982
  • 470
  • 2
  • 4
  • 10
10
votes
3 answers

How do I store gzipped files using PigStorage in Apache Pig?

Apache Pig v0.7 can read gzipped files with no extra effort on my part, e.g.: MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url); I can process that data and output it to disk okay: PerUser = GROUP MyData BY…
PP.
  • 10,764
  • 7
  • 45
  • 59
10
votes
2 answers

Can I split a command over multiple lines in Apache Pig Latin?

I have some very long lines as Apache Pig (Latin) expressions. Is there a way of splitting these over multiple lines? I've tried a trailing backslash to no avail; as soon as I press enter the (incomplete) command executes...
PP.
  • 10,764
  • 7
  • 45
  • 59
10
votes
1 answer

How to remove duplicate columns after a JOIN in Pig?

Let's say I JOIN two relations like: -- part looks like: -- 1,5.3 -- 2,4.9 -- 3,4.9 -- original looks like: -- 1,Anju,3.6,IT,A,1.6,0.3 -- 2,Remya,3.3,EEE,B,1.6,0.3 -- 3,akhila,3.3,IT,C,1.3,0.3 jnd = JOIN part BY $0, original BY $0; The output…
USB
  • 6,019
  • 15
  • 62
  • 93
10
votes
2 answers

how to deploy and run oozie job?

I'm trying to run a simple job using Oozie. It will be a single Pig action. I have a file, FirstScript.pig, containing: dual = LOAD 'default.dual' USING org.apache.hcatalog.pig.HCatLoader(); store dual into 'dummy_file.txt' using…
psmith
  • 1,769
  • 5
  • 35
  • 60
10
votes
1 answer

Pig Conditional Operators

Consider the relation below: test = LOAD 'input' USING PigStorage(',') as (a:chararray, b:chararray); Is there a way to achieve the following? if (b == 1) { a = 'abc'; } else if (b == 2) { a = 'xyz'; } else // retain whatever is there in…
rahul
  • 1,423
  • 3
  • 18
  • 28
10
votes
3 answers

Pig Batch mode: how to set logging level to hide INFO log messages?

Using Apache Pig version 0.10.1.21 (rexported). When I execute a Pig script, there are lots of INFO logging lines that look like this: 2013-05-18 14:30:12,810 [Thread-28] INFO org.apache.hadoop.mapred.Task - Task…
Polymerase
  • 6,311
  • 11
  • 47
  • 65
10
votes
1 answer

Pig, how to refer to a field after a join and a group by

I have this code in Pig (win, request and response are just tables loaded directly from filesystem): win_request = JOIN win BY bid_id, request BY bid_id; win_request_response = JOIN win_request BY win.bid_id, response BY bid_id; win_group = GROUP…
Jorge González Lorenzo
  • 1,722
  • 1
  • 19
  • 28
10
votes
2 answers

Apache Pig: strip namespace prefix (::) after group operation

A common pattern in my data processing is to group by some set of columns, apply a filter, then flatten again. For example: my_data_grouped = group my_data by some_column; my_data_grouped = filter my_data_grouped by ; my_data =…
Nick
  • 21,555
  • 18
  • 47
  • 50
9
votes
3 answers

Could not infer COUNT function

I'm trying to write a pig latin script to pull the count of a dataset that I've filtered. Here's the script so far: /* scans by title */ scans = LOAD '/hive/scans/*' USING PigStorage(',') AS…
JasonA
  • 314
  • 2
  • 4
  • 11
9
votes
1 answer

How to get the value for a variable key from a pig map?

Is there a way to get the value from a map for a variable key, using a field as the key? For example, my company data has locale and name fields like this: {"en_US", (["en_US" : "English Name"], ["fr_FR" : "French Name"])} What I want essentially is to…
TommyT
  • 1,707
  • 3
  • 17
  • 26
9
votes
1 answer

Using hive table over parquet in Pig

I am trying to create a Hive table with schema string,string,double on a folder containing two Parquet files. The first parquet file schema is string,string,double and the schema of the second file is string,double,string. CREATE EXTERNAL TABLE…
SaurabhG
  • 91
  • 2
9
votes
2 answers

Pig Script without load

I am a newbie to Pig. I am trying to figure out how to define a bag or tuple with hard-coded values, without loading data from a file. Every example that I have encountered starts with: a = LOAD '/file/name' using PigStorage(','); or something…
9
votes
6 answers

PIG: ERROR 1000: Error during parsing

I have installed Pig 0.12 on my machine. When I run darwin$ pig grunt> ls /data/ hdfs://Nmame:10001/data/pg20417.txt 674570 hdfs://Nname:10001/data/pg4300.txt 1573150 hdfs:/Nname:10001/data/pg5000.txt
brain storm
  • 30,124
  • 69
  • 225
  • 393
9
votes
2 answers

Storing data to SequenceFile from Apache Pig

Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader: REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar; DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader(); log = LOAD…
asquithea
  • 575
  • 5
  • 8