6

I read in a csv-file that contains fields with numbers like that: "3". Can I convert this fields from "3" to 3 with PigLatin? I need it to use the SUM() - Function.

Thanks for your help!

Christoph
  • 1,113
  • 5
  • 17
  • 35
  • 3
    Areway ouyay uresay it'sway otnay "eethray"? (I know what you are actually referring to, I'm just messing around :P) – BoltClock Dec 08 '10 at 16:09
  • 1
    I wonder if Pig Latin (the "language", not the Hadoop subproject) is primarily an American thing, or primarily something that native English speakers know about? Looks like Christoph is in Germany. Anyway, Christoph, Pig Latin is a play language in which English words are altered by moving the intial sound to the end of the word and adding "ay". – Jacob Mattison Dec 08 '10 at 17:05
  • haha, yes I'm sure it's not "three" :-D – Christoph Dec 08 '10 at 18:45

3 Answers3

9

What about just removing the " with REPLACE?

For example:

data =
    LOAD 'data.txt' AS (num:CHARARRAY);

numbers =
    FOREACH data
    GENERATE
        (INT) REPLACE(num, '\\"', '');

Then you can GROUP and SUM.

One advantage is that you can cast the returned string directly to a number (no need to deal with bags). REGEX_EXTRACT could be used to do the same too.

user3662937
  • 245
  • 7
  • 16
Romain
  • 7,022
  • 3
  • 30
  • 30
  • I had to do a filter where I need to pick all the values which was greater than 1, I did this: `inputData = FILTER inputData BY (INT) REPLACE((chararray)value#'val', '\\"', '')>1;`. Looks good? – zengr Apr 25 '13 at 21:15
6

The TOKENIZE function will split a string on various characters considered to be word separators, one of which is a quote mark. So if you tokenize "3" and take the middle item, it should be just 3.

Jacob Mattison
  • 50,258
  • 9
  • 107
  • 126
3

You could write a UDF that strips the quotes around it OR use JacobM's approach.

However, afterwards, you should cast the chararray '3' to an int: (int)$1 or (int)myvalue. This way you can use sum.

http://pig.apache.org/docs/r0.5.0/piglatin_reference.html#Cast+Operators

Donald Miner
  • 38,889
  • 8
  • 95
  • 118