I am trying to load a data file in a Pig Latin script. The data has 2 columns, but the 2nd column uses a text qualifier (double quotes) around values that themselves contain commas. Sample data is below:

DEVICE_ID,SUPPORTED_TECH
a2334,"GSM900,GSM1500,GSM200"
a54623,"GSM900,GSM1500"
a86646,"GSM1500,GSM200"

When I try loading the data as below, the 2nd column is not recognized as a single column, because PigStorage splits on every comma, including the ones inside the quotes:

deviceList = LOAD 'deviceList.csv' USING PigStorage(',') AS (DEVICE_ID:chararray, SUPPORTED_TECH:chararray);

How can I define the text qualifier while loading the data set?

Jørgen R
FIDIL

1 Answer


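Try this, and let me know if you need a different output format. The script loads each row as a single line and splits it with a regex at the first comma only, so the quoted value stays in one field.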

input.txt

DEVICE_ID,SUPPORTED_TECH
a2334,"GSM900,GSM1500,GSM200"
a54623,"GSM900,GSM1500"
a86646,"GSM1500,GSM200"

PigScript:

A = LOAD 'input.txt' AS line;
deviceList = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^(\\w+),(.*)$')) as (DEVICE_ID:chararray, SUPPORTED_TECH:chararray );
DUMP deviceList;

Output:

(DEVICE_ID,SUPPORTED_TECH)
(a2334,"GSM900,GSM1500,GSM200")
(a54623,"GSM900,GSM1500")
(a86646,"GSM1500,GSM200")
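
Note that the output above still carries the double quotes around the second field, and the header row is kept as a regular tuple. If the quotes are unwanted, one possible follow-up step is sketched below. It assumes the deviceList relation from the script above; the alias cleaned is hypothetical and used only for illustration.

```pig
-- Sketch: strip the double quotes from SUPPORTED_TECH.
-- REPLACE treats its second argument as a regular expression;
-- a plain double quote needs no escaping there.
cleaned = FOREACH deviceList GENERATE
    DEVICE_ID,
    REPLACE(SUPPORTED_TECH, '"', '') AS SUPPORTED_TECH;
DUMP cleaned;
```

This removes every double quote in the field, so it is only appropriate when the values never contain a literal quote character.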
Sivasakthi Jayaraman
Thanks for the answer. This works for 2 columns, but my original file has 625 columns. Can you recommend anything that does not require defining each column? – FIDIL Nov 12 '14 at 12:48
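
For a wide file, one option that may help is the CSVExcelStorage loader from Piggybank, which understands double-quoted fields, combined with a LOAD that has no AS clause so that columns are addressed by position instead of by name. The sketch below is not tested against the original data. The jar path is a placeholder, the alias firstTwo is hypothetical, and the 'SKIP_INPUT_HEADER' option is assumed to be available in the installed Pig version (older releases may not accept the fourth argument).

```pig
-- Assumption: adjust this path to wherever piggybank.jar lives on your system.
REGISTER /path/to/piggybank.jar;

-- No AS clause: the relation has no declared schema, so fields are
-- referenced positionally as $0, $1, ... rather than by name.
deviceList = LOAD 'deviceList.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(
        ',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');

-- Name only the columns that are actually needed.
firstTwo = FOREACH deviceList GENERATE $0 AS DEVICE_ID, $1 AS SUPPORTED_TECH;
DUMP firstTwo;
```

With this approach the quoted value is read as one field and only the columns used downstream need a name, at the cost of losing the column names from the header row.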