Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be written in two modes: interactive mode and batch mode.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
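
As a rough illustration of the "data flow sequence" idea above, here is a minimal Python sketch in which each step consumes the output of the previous one. It is not Pig itself: the sample rows, field layout, threshold, and function names are all hypothetical, and the Pig Latin statements in the comments are only an approximate analogue of each step.

```python
from collections import defaultdict

# Hypothetical sample rows: (user, url, response_ms). Not real data.
RECORDS = [
    ("alice", "/home", 120),
    ("bob", "/home", 80),
    ("alice", "/cart", 300),
    ("carol", "/home", 150),
    ("bob", "/cart", 210),
]


def filter_slow(rows, threshold_ms=100):
    # Rough Pig Latin analogue: slow = FILTER records BY response_ms > 100;
    return [row for row in rows if row[2] > threshold_ms]


def group_by_user(rows):
    # Rough Pig Latin analogue: by_user = GROUP slow BY user;
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return dict(groups)


def count_per_group(groups):
    # Rough Pig Latin analogue: counts = FOREACH by_user GENERATE group, COUNT(slow);
    return {user: len(rows) for user, rows in groups.items()}


def slow_requests_per_user(rows):
    # The whole task is a chain of named transformations, one feeding the next.
    return count_per_group(group_by_user(filter_slow(rows)))


if __name__ == "__main__":
    print(slow_requests_per_user(RECORDS))
```

In Pig the equivalent chain would be compiled into MapReduce jobs by the infrastructure layer; this sketch only shows the shape of the data flow, not how it is parallelized.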

5199 questions
1
vote
1 answer

Cannot use -tagPath and schema at the same time in PigStorage LOAD

I'm having an interesting behaviour with PigStorage and its -tagPath option, where I do not know if I am doing something wrong (wrong schema definition?) or if this is a limitation/bug in Pig. My file looks like this (the most basic, I was able to…
aufziehvogel
  • 7,167
  • 5
  • 34
  • 56
1
vote
2 answers

Run PIG in local mode from oozie

I want to run Pig in local mode, which is very easy: pig -x local file.pig. My requirement is to run Pig in local mode from Oozie. Is that possible? I think Oozie will automatically launch a map task first.
1
vote
2 answers

Manipulating a data structure in Pig/Hive

I'm not really sure how to phrase this question, so please redirect me if there is a better place for this question. Right now I have a data structure, more or less organized like this: I want my data to look like this: Sorry for the images,…
wugology
  • 193
  • 1
  • 4
  • 13
1
vote
0 answers

Distributed cache with Pig and Python

I know there are a lot of resources for using the distributed cache in Pig scripts with Java UDFs. But I haven't found anything that would explain the same with Python UDFs. Also, I have not found any detailed explanation of distributed cache usage…
Roger
  • 2,823
  • 3
  • 25
  • 32
1
vote
1 answer

YARN error: TaskAttempt killed because it ran on unusable node ... Container released on a *lost* node

I am using CDH 5.4 with Pig 0.12. I am getting a lot of this error from all nodes: TaskAttempt killed because it ran on unusable nodename:portnumber Container released on a *lost* node What does this mean? In particular what does "lost" mean here?…
kee
  • 10,969
  • 24
  • 107
  • 168
1
vote
1 answer

Pig udf on Filter

I have a use case in which I need to take in a date and return the last date of the previous month. Ex: input: 20150331, output: 20150228. I will be using this previous month's last date to filter a daily partition (in a Pig script). B = filter A…
Pratik
  • 1,216
  • 11
  • 18
1
vote
1 answer

Unable to avoid duplicate deletion in Apache Pig

I am new to Apache Pig. I want to split and flatten the following input into my required output, which shows all the users who viewed each product. My input (UserId, ProductId): 12345 123456,23456,987653 23456 23456,123456,234567 34567 …
Karthick S
  • 25
  • 4
1
vote
1 answer

Pig filter fails due to unexpected data

I am running Cassandra and have about 20k records in it to play with. I am trying to run a filter in Pig on this data but am getting the following message back: 2015-07-23 13:02:23,559 [Thread-4] WARN org.apache.hadoop.mapred.LocalJobRunner -…
Brett McLain
  • 2,000
  • 2
  • 14
  • 32
1
vote
1 answer

Loading XML to PIG : Error 2998

I'm using piggybank-0.12.0.jar, and the Pig version is 0.12 (CDH): pig --version Apache Pig version 0.12.0-cdh5.3.2 (rexported). I am trying to load an XML file using the XMLLoader from the piggybank jar. While doing so, I get the error below: REGISTER…
shankar
  • 93
  • 2
  • 8
1
vote
1 answer

Apache Pig - Want to generate 10 gb sample data with known cardinality and sample values for all the columns

I want to generate around 10 GB of sample data, where I have columns with sample values and known cardinality, using a Pig script. Example: A B C 1 10/10/2011 abc-xyz 2 10/11/2012 assd-asd 3 10/12/2011 asd-asd 1 10/13/2013 …
Nikita
  • 13
  • 5
1
vote
2 answers

How to split string data to array using pipe delimiter in pig?

I'm trying to write a Pig script that gets string data like this: abc|def|xyz and tries to put these values into an array of strings. How do I split this string to get an array of strings like [abc,def,xyz]? I tried using the STRSPLIT function, but the no…
user3335722
  • 82
  • 3
  • 9
1
vote
0 answers

pig - walk hdfs directory and tika-parse documents into hive?

What is the best way to walk a directory structure in HDFS? Is there any way to do this in Pig? My reason for asking is that I have an HDFS directory tree with multiple sub-directories and many different document types such as xls, doc, docx,…
joefromct
  • 1,506
  • 13
  • 33
1
vote
1 answer

Accessing a File from Distributed Cache in Pig UDF Java class, Amazon EMR

I am trying to access a file (sample.txt) in a UDF. I want to put that file in the distributed cache and use it from there. I am using Amazon EMR to run the Pig job. I am copying the file (sample.txt) to HDFS using an EMR bootstrap action while creating…
nero
  • 143
  • 2
  • 8
1
vote
1 answer

Relative path in absolute URI error while using PIG via RDP in HDInsight

I'm trying to run a Pig query using RDP in HDInsight. The query is LOGS = LOAD 'wasb://containerName@storageAccountName.blob.core.windows.net/' as unparsedString:chararray; where containerName & storageAccountName are my containerName and…
Arnab
  • 2,324
  • 6
  • 36
  • 60
1
vote
0 answers

How to return datetime from Pig python UDF

I'm trying to return a datetime object from my Python UDF for use in a Pig script (note I'm simplifying the problem here; my actual UDF does something a lot more complex than returning the current time, but the object returned is the same): Pig…
undershock
  • 754
  • 1
  • 6
  • 26