Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization which enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be run in two ways: interactive mode (the Grunt shell) and batch mode.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
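
As a rough illustration of the data-flow style described above, here is a minimal Pig Latin sketch. The file name `visits.csv`, its three-column layout, and all alias names are assumptions made up for this example; they do not come from any question on this page.

```pig
-- Hypothetical input: 'visits.csv', one record per line, e.g. joe,http://example.com,12
visits  = LOAD 'visits.csv' USING PigStorage(',')
          AS (user:chararray, url:chararray, seconds:int);

-- Each statement names one step of the data flow.
by_user = GROUP visits BY user;

-- After GROUP, 'group' holds the key and 'visits' is a bag of the matching records.
totals  = FOREACH by_user GENERATE
              group               AS user,
              COUNT(visits)       AS n_visits,
              SUM(visits.seconds) AS total_seconds;

DUMP totals;
```

Saved as a script (say, `totals.pig`, again a made-up name), this could be tried in local mode with `pig -x local totals.pig`, or entered line by line in the Grunt shell.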

5199 questions
6 votes, 1 answer

What is the difference between GROUP and COGROUP in PIG?

I understood Group didn't work with multiple tuples and hence we had COGROUP in PIG. However, while checking today the GROUP command works for me. I am using PIG-0.12.0. My commands and outputs are as follows. grunt> grpvar = GROUP C by $2, B by…
proutray
  • 1,943
  • 3
  • 30
  • 48
6 votes, 3 answers

StrSplit in Pig functions

Can someone explain how to get the output below in a Pig script? My input file, a.txt, is: aaa.kyl,data,data bbb.kkk,data,data cccccc.hj,data,data qa.dff,data,data I am writing the Pig script like this: A = LOAD 'a.txt' USING PigStorage(',')…
Surender Raja
  • 3,553
  • 8
  • 44
  • 80
6 votes, 3 answers

Is it possible to manage a NO FILE error in Pig?

I'm trying to load a simple file: log = load 'file_1.gz' using TextLoader AS (line:chararray); dump log And I get an error: 2014-04-08 11:46:19,471 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception…
psmith
  • 1,769
  • 5
  • 35
  • 60
6 votes, 3 answers

Type conversion pig hcatalog

I use HCatalog version 0.4. I have a table in Hive, 'abc', which has a column with datatype 'timestamp'. When I try to run a Pig script like this "raw_data = load 'abc' using org.apache.hcatalog.pig.HCatLoader();" I get an error saying…
kris433
  • 414
  • 5
  • 17
6 votes, 1 answer

How do I add a column, preserving the existing columns, without listing them all?

I want to add a new column to an alias, preserving all the existing ones. A = foreach A generate A.id as id, A.date as date, A.foo as foo, A.bar as bar, A.foo / A.bar as foobar; Can I do that without listing all of them explicitly?
sds
  • 58,617
  • 29
  • 161
  • 278
6 votes, 1 answer

Using pig, how do I parse a mixed format line into tuples and a bag of tuples?

I'm new to Pig, and I'm having an issue parsing my input and getting it into a format that I can use. The input file contains lines that have both fixed fields and KV pairs as follows: FF1|FF2|FF3|FF4|KVP1|KVP2|...|KVPn My goal here is to count the…
6 votes, 2 answers

pig - split, lack of default or if/else

Since there are no else or default statements in Pig's SPLIT operation, what would be the most elegant way to do the following? I'm not a big fan of having to copy-paste code. SPLIT rawish_data INTO good_rawish_data IF ( (uid > 0L) AND …
warbaque
  • 583
  • 1
  • 8
  • 18
6 votes, 3 answers

Pig UDF for iso to yyyy-mm-dd hh:mm:ss.000

I am looking to convert the ISO time format to yyyy-mm-dd hh:mm:ss.SSS. However, I'm not able to achieve the conversion. I am new to Pig and I'm trying to write a UDF to handle the conversion from ISO format to yyyy-mm-dd hh:mm:ss.SSS. Kindly guide me I…
user2667326
  • 505
  • 1
  • 5
  • 7
6 votes, 1 answer

Hadoop Pig count number

I am learning how to use Hadoop Pig now. If I have an input file like this: a,b,c,true s,c,v,false a,s,b,true ... The last field is the one I need to count, so I want to know how many 'true' and 'false' values are in this file. I tried: records = LOAD…
user2597504
  • 1,503
  • 3
  • 23
  • 32
6 votes, 2 answers

Pig 0.11.1 - Count groups in a time range

I have a dataset, A, that has timestamp, visitor, URL: (2012-07-21T14:00:00.000Z, joe, hxxp:///www.aaa.com) (2012-07-21T14:01:00.000Z, mary, hxxp://www.bbb.com) (2012-07-21T14:02:00.000Z, joe, hxxp:///www.aaa.com) I want to measure number of…
Joe Nate
  • 159
  • 10
6 votes, 1 answer

Pig: Force one mapper per input line/row

I have a Pig Streaming job where the number of mappers should equal the number of rows/lines in the input file. I know that setting set mapred.min.split.size 16 set mapred.max.split.size 16 set pig.noSplitCombination true will ensure that each…
sergeyf
  • 1,004
  • 11
  • 10
6 votes, 3 answers

Is there a common place to store data schemas in Hadoop?

I've been doing some investigation lately around using Hadoop, Hive, and Pig to do some data transformation. As part of that I've noticed that the schema of data files doesn't seem to be attached to the files at all. The data files are just flat files…
Bryan Kyle
  • 13,361
  • 4
  • 40
  • 45
6 votes, 2 answers

Pig local mode, group, or join = java.lang.OutOfMemoryError: Java heap space

Using Apache Pig version 0.10.1.21 (reported), CentOS release 6.3 (Final), jdk1.6.0_31 (The Hortonworks Sandbox v1.2 on Virtualbox, with 3.5 GB RAM) $ cat data.txt 11,11,22 33,34,35 47,0,21 33,6,51 56,6,11 11,25,67 $ cat GrpTest.pig A = LOAD…
Polymerase
  • 6,311
  • 11
  • 47
  • 65
6 votes, 1 answer

Loading json with varying schema into PIG

I ran into an issue loading a set of JSON documents into Pig. What I have is a lot of JSON documents that all vary in the fields they have. The fields that I need are in most documents, and where they are missing I would like to get a null value. I just…
Niels Basjes
  • 10,424
  • 9
  • 50
  • 66
6 votes, 2 answers

Hadoop PIG Max of Tuple

How do I find the MAX of a tuple in Pig? My code looks like this: A,20 B,10 C,40 D,5 data = LOAD 'myData.txt' USING PigStorage(',') AS key, value; all = GROUP data ALL; maxKey = FOREACH all GENERATE MAX(data.value); DUMP maxKey; This returns 40,…
supyo
  • 3,017
  • 2
  • 20
  • 35