Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization which enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be run in two ways: interactive mode (the Grunt shell) and batch mode (a script file).

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
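As a rough illustration of the data flow style described above, here is a minimal Pig Latin sketch. The input path, output path, and field names are assumptions made up for the example, not something taken from the Pig documentation. Each statement names one step of the flow, and Pig decides how to parallelize the resulting plan.

```pig
-- Hypothetical input: a comma-separated file of (year, value) pairs.
records = LOAD 'input/records.csv' USING PigStorage(',') AS (year:int, value:int);

-- Drop rows whose value could not be parsed as an int.
valid = FILTER records BY value IS NOT NULL;

-- Group the remaining rows by year and sum the values in each group.
by_year = GROUP valid BY year;
totals = FOREACH by_year GENERATE group AS year, SUM(valid.value) AS total;

-- Write the result as comma-separated text (the output path is also hypothetical).
STORE totals INTO 'output/totals' USING PigStorage(',');
```

Saved as a script (say, totals.pig, a hypothetical file name), this could be run in batch mode with `pig -x local totals.pig` or `pig -x mapreduce totals.pig`. The same statements can also be typed one at a time into the Grunt shell.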

5199 questions
1
vote
1 answer

MultiStorage in pig

I have run the below Pig script in the grunt shell: Register D:\Pig\contrib\piggybank\java\piggybank.jar; a = load '/part' using PigStorage(',') as…
wazza
  • 770
  • 5
  • 17
  • 42
1
vote
2 answers

Partitioning the data based on column values

Hi, I have a data source as follows: ID Date Page 100 27-10-2015 google 102 27-10-2015 facebook 102 27-10-2015 instagram 104 28-10-2015 yahoo 105 30-10-2015 bing I want to store this…
wazza
  • 770
  • 5
  • 17
  • 42
1
vote
2 answers

Hive unable to read decimal value from hdfs

My Hive version is 0.13. I have a file that contains decimal values and a few other data types. This file is obtained after performing some Pig transformations. I created a Hive table on top of this HDFS file. When I try to do a select * from…
Neethu Lalitha
  • 3,031
  • 4
  • 35
  • 60
1
vote
4 answers

Pig Script: STORE command not working

This is my first time posting to StackOverflow and I'm hoping someone can assist. I'm fairly new to Pig scripts and have encountered a problem I can't solve. Below is a Pig script that fails when I attempt to write results to a file: register…
JaneQDoe
  • 13
  • 4
1
vote
1 answer

Error when executing PIG script

I'm running a Pig script in a Google Cloud Hadoop environment: pig -useHCatalog -x mapreduce -f profile.pig I have two tables, each with 50,000 records, that will be crossed and joined with a table with 10,00,000. I ran the same script with less…
Maharaj
  • 87
  • 3
  • 11
1
vote
0 answers

How to load multiple har files on one pig load command?

I have many har files that are archived by hour, and I would like to analyze the data based on month and day. I tried different wildcard matching methods which are supported by Pig load and work well for non-archived folders and files, but none…
KaneWang
  • 11
  • 3
1
vote
1 answer

Sort tuples in a bag based on multiple fields

I am trying to sort tuples inside a bag based on three fields in descending order. Example: Suppose I have the following bag created by grouping: {(s,3,my),(w,7,pr),(q,2,je)} I want to sort the tuples in the above grouped bag based on $0,$1,$2…
USY
  • 61
  • 8
1
vote
1 answer

Hadoop Pig: Show entries using STARTSWITH

I am having issues using the STARTSWITH string function. I want to display all records whose System_Period begins with 20040. transactions = LOAD '/home/cloudera/datasets/assignment2/Transactions.csv' USING PigStorage(',') AS (Branch_Number:int,…
Joe
  • 13
  • 3
1
vote
2 answers

Query regarding Pig: how to put an if-like condition in FOREACH

I have a query about writing a Pig script: RESULT_SOMETYPE = FOREACH SOMETYPE_DATA_GROUPED GENERATE flatten(group) , SUM(SOMETYPEDATA.DURATION) as duration, COUNT(SOMETYPEDATA.DURATION) as cnt; Here I want to replace SUM(SOMETYPEDATA.DURATION) with…
Argho Chatterjee
  • 579
  • 2
  • 9
  • 26
1
vote
1 answer

Pig ReadTimeOut Exception

I've installed the Hortonworks sandbox on VirtualBox (6092 MB of RAM). I'm following this tutorial. When I try to execute one simple script using the arguments -useHCatalog and "Execute on Tez", I get this error: java.net.SocketTimeoutException: Read timed…
user3791321
1
vote
1 answer

How to convert fields into bags and tuples in PIG?

I have a dataset which has comma-separated values as: 10,4,21,9,50,9,4,50 50,78,47,7,4,7,4,50 68,25,43,13,11,68,10,9 I want to convert this into bags and tuples as shown below: ({(10),(4),(21),(9),(50)},{(9),(4),(50)}) …
Jahar tyagi
  • 91
  • 13
1
vote
2 answers

Apache Pig: How to load a sequence file which is stored in hdfs?

My sequence files are stored directly in hdfs e.g.: grunt> ls grunt> ls /blabla hdfs://namenode1:54310/blabla/0411f03a-db7f-48d0-9542-5203304e3e81.seq 185284523 hdfs://namenode1:54310/blabla/05be8fc0-e967-42e1-b76a-0d7108a69d17.seq
mr.proton
  • 993
  • 2
  • 10
  • 24
1
vote
0 answers

Apache Pig - Transform data bag to set of rows

I have Pig data (123,{(1),(2),(3)},{(0.5),(0.6),(0.7)}) I want to generate records in the format below: 123,1,0.5 123,2,0.6 123,3,0.7 I am able to do this when the above data has one bag, but I can't work out how to generate the required output when we have…
Ajay
  • 783
  • 3
  • 16
  • 37
1
vote
1 answer

Pig sum on data

I have a file like this: (1950,10) (1951,33) (1952,15) (1953,17) (1954,17) (1955,14) (1956,60) (1957,98) (1958,73) (1959,87) (1960,123) I want to get the sum of the second field through Pig, e.g. the output should be like (547). Please help.
sat
  • 11
  • 2
1
vote
0 answers

PIG - retrieve data from XML using XPATH

I have n XML files of this type. abc m 2014 100 100
Ajay
  • 783
  • 3
  • 16
  • 37