Is there a way to extract certain words from a file in Pig Latin? For example, I have a large file of tweets and want every word that begins with a #.

Input :  What a lovely day! #Sunshine
Output : Sunshine
Himanshu
Kaizzen

2 Answers

Okay, using FILTER worked for me (substitute your own relation and field names for the placeholders):

startswithHash = FILTER <relation> BY <field> MATCHES '#.*';
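
FILTER keeps or drops whole records, and matches has to cover the entire field, so for full tweets the line probably needs to be split into words first. Here is a minimal sketch of that, assuming a hypothetical file tweets.txt with one tweet per line; the relation and field names are made up for illustration:

```pig
-- Hypothetical input: tweets.txt, one tweet per line.
tweets = LOAD 'tweets.txt' USING TextLoader() AS (line:chararray);

-- TOKENIZE splits each line into a bag of words (whitespace is one of its
-- default delimiters); FLATTEN turns that bag into one row per word.
words = FOREACH tweets GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Capture the text after a leading '#'; REGEX_EXTRACT gives null for
-- words that do not match.
tags = FOREACH words GENERATE REGEX_EXTRACT(word, '^#(\\w+)', 1) AS tag;

-- Drop the words that were not hashtags.
hashtags = FILTER tags BY tag IS NOT NULL;

DUMP hashtags;
```

For the sample line in the question this should leave a single row, (Sunshine).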

Kaizzen

Take a look at REGEX_EXTRACT: http://pig.apache.org/docs/r0.12.1/func.html#regex-extract

This should work (extracts the last word with a # in front of it from your_field):

REGEX_EXTRACT(your_field, '.*#(\\w+)($|\\s.*)', 1)
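
In a full script the call would sit inside a FOREACH. A rough sketch, again assuming a hypothetical tweets.txt with one tweet per line and made-up relation names:

```pig
-- Hypothetical input: tweets.txt, one tweet per line.
tweets = LOAD 'tweets.txt' USING TextLoader() AS (line:chararray);

-- The greedy '.*' runs up to the last '#' in the line, so group 1 is
-- the last hashtag; tweets without a hashtag produce null.
last_tags = FOREACH tweets GENERATE
    REGEX_EXTRACT(line, '.*#(\\w+)($|\\s.*)', 1) AS tag;

-- Keep only the tweets that actually contained a hashtag.
found = FILTER last_tags BY tag IS NOT NULL;

DUMP found;
```

This gives at most one hashtag per tweet. If a tweet can carry several, splitting it into words first is probably the simpler route.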
user2303197