I have a Pig Latin problem.

I have the following data (a single row):

A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;

(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)

I also have a second dataset:

B = LOAD 'values' AS (f1:chararray, f2:chararray);
DUMP B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)

I would like to join these two datasets: for each value in dataset A, look up the corresponding value in dataset B and place it next to the value from A. The expected output is:

FITKA 0.124411, FINVA 0.454535 and so on ..
(It could also look like: FITKA, 0.124411, FINVA, 0.454535 and so on ..)

Then I would be able to multiply the values (0.124411 x 0.454535 .. and so on), because they would all be on the same row, which is what I want.

Of course I can join column by column, but then the looked-up values end up at the end of the row and I have to rearrange them with another FOREACH ... GENERATE. I would prefer a simpler solution without so many joins, which may cause performance issues.

Dataset A is text (each row is, in a sense, a sentence).

What are my options to achieve this? Any help would be appreciated.

Manjunath Ballur

1 Answer

A sentence can be represented as a tuple that contains a bag of (word, count) tuples.

Therefore, I suggest you change the way you store your data to the following format:

sentence:tuple(words:bag{wordcount:tuple(word, count)})
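
A minimal sketch of one way to get close to that shape with a single join, under a few assumptions that are not in the question: the second column of 'values' is loaded as double rather than chararray, Pig 0.11 or later is available (for RANK), and all values are strictly positive. The aliases ranked, tokens, joined, scored, sentences and result are made up for illustration. As far as I know Pig has no built-in product aggregate, so the product is computed here as EXP(SUM(LOG(val))).

```pig
-- Inputs as in the question; val is loaded as double (an assumption)
-- so that it can be used in arithmetic.
A = LOAD 'records' AS (f1:chararray, f2:chararray, f3:chararray,
                       f4:chararray, f5:chararray, f6:chararray);
B = LOAD 'values' AS (word:chararray, val:double);

-- Give every sentence an id so its words can be regrouped later.
-- RANK without BY prepends a sequential number as the first field.
ranked = RANK A;

-- One row per (sentence id, word): TOBAG packs the six columns into
-- a bag and FLATTEN un-nests it.
tokens = FOREACH ranked GENERATE $0 AS id,
             FLATTEN(TOBAG(f1, f2, f3, f4, f5, f6)) AS word;

-- A single join against the small, static lookup table. 'replicated'
-- asks Pig to keep B in memory; LEFT OUTER keeps words missing from B
-- (their val is null).
joined = JOIN tokens BY word LEFT OUTER, B BY word USING 'replicated';

scored = FOREACH joined GENERATE tokens::id   AS id,
                                 tokens::word AS word,
                                 B::val       AS val,
                                 LOG(B::val)  AS logval;

-- Regroup by sentence: each group carries a bag of its scored words.
sentences = GROUP scored BY id;

-- Product of the values via exp(sum(log(val))); only valid for val > 0.
-- SUM skips nulls, so words without a value in B do not affect it.
result = FOREACH sentences GENERATE group AS id,
             scored.(word, val) AS pairs,
             EXP(SUM(scored.logval)) AS product;

DUMP result;
```

For the sample row this should give one record whose pairs bag holds each word next to its value (null for FINVU and FETKA, which are not in 'values') and whose product is 0.124411 x 0.454535 x 0.123133 x 0.454535, since FINVA occurs twice. The order of words inside the bag is not guaranteed.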
glefait
  • Hello, and thanks for your quick reply. Could you give me a short Pig script example? The problem remains that the dataset with single words and values, like (Peter 0.454523), is static, while the dataset with sentences changes quite often. I understand your schema, but I lose track when I try to fit my data into the schema you described .. – Petri Koski Jun 10 '15 at 20:11
  • If you want more than a hint, provide more information and data that can be tested. – glefait Jun 11 '15 at 20:21
  • Ok, here comes the data: – Petri Koski Jun 13 '15 at 06:54
  • Provide the input datasets and the expected output. – glefait Jun 15 '15 at 18:58
  • I have added sample rows from both datasets to the original question, along with the desired output. – Petri Koski Jun 15 '15 at 19:43