Is there a way to extract certain words from a file in Pig Latin? For example, I have a large file of tweets and want every word that begins with a #.
Input : What a lovely day! #Sunshine
Output : Sunshine
Take a look at REGEX_EXTRACT: http://pig.apache.org/docs/r0.12.1/func.html#regex-extract
This should work. It extracts only the last word with a # in front of it from your_field, because the leading .* is greedy:
REGEX_EXTRACT(your_field, '.*#(\\w+)($|\\s.*)', 1)
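If you need every hashtag in a tweet rather than just the last one, one possible approach is to split each tweet into words first and then apply REGEX_EXTRACT to each word. The following is only a sketch: the file name tweets.txt and the aliases tweets, words, tags and hashtags are hypothetical, and it assumes one tweet per line.

```pig
-- Hypothetical input: 'tweets.txt' with one tweet per line.
tweets = LOAD 'tweets.txt' USING TextLoader() AS (line:chararray);

-- TOKENIZE returns a bag of words (by default it splits on whitespace
-- and a few punctuation characters); FLATTEN yields one row per word.
words = FOREACH tweets GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Keep the text after a leading '#'. The pattern covers the whole word,
-- so words that do not start with '#' produce null.
tags = FOREACH words GENERATE REGEX_EXTRACT(word, '^#(\\w+).*$', 1) AS tag;

-- Drop the words that were not hashtags.
hashtags = FILTER tags BY tag IS NOT NULL;

DUMP hashtags;
```

For the input "What a lovely day! #Sunshine" this should leave a single row containing Sunshine. Note that \\w only matches letters, digits and underscores, so anything after the first other character in a tag is cut off.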