
I am following the Cascading guide on its website. My input is in the following TSV format:

doc_id  text
doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]

I use the following code to process it:

Tap docTap = new Hfs(new TextDelimited(true, "\t"), inPath);
...
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
// only returns "token"
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

It looks like this splits only the second part of each line (the doc_id part is ignored). How does Cascading ignore the doc_id column and process only the text column? Is that because of TextDelimited?

user2597504

2 Answers


Look at the pipe statement:

Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

The second argument selects the only field that is sent to the splitter function. Here that is the 'text' field, so only the text is passed to the splitter, which returns the tokens. The doc_id field never reaches the function.

The Javadoc for this Each constructor explains it clearly:

Each

@ConstructorProperties(value={"name","argumentSelector","function","outputSelector"})
public Each(String name,
            Fields argumentSelector,
            Function function,
            Fields outputSelector)

Only pass argumentFields to the given function, only return fields selected by the outputSelector.

Parameters:
    name - name for this branch of Pipes
    argumentSelector - field selector that selects Function arguments from the input Tuple
    function - Function to be applied to each input Tuple
    outputSelector - field selector that selects the output Tuple from the input and Function results Tuples
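
To make the selector behaviour concrete, here is a rough plain-Java sketch of the idea. It does not use Cascading at all: the class ArgumentSelectorSketch and the methods each and split are hypothetical stand-ins, and the tuple is modelled as a simple map from field name to value. Only the regex is taken from the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names are Cascading classes.
public class ArgumentSelectorSketch {

    // Same regex as the question's RegexSplitGenerator.
    static final String SPLIT_REGEX = "[ \\[\\]\\(\\),.]";

    // Hypothetical stand-in for the function: it receives one value,
    // never the whole tuple.
    static List<String> split(String argument) {
        List<String> tokens = new ArrayList<>();
        for (String token : argument.split(SPLIT_REGEX)) {
            tokens.add(token);
        }
        return tokens;
    }

    // Hypothetical stand-in for an Each with an argument selector and
    // Fields.RESULTS: pick one field by name, pass it to the function,
    // and keep only what the function returns.
    static List<String> each(Map<String, String> tuple, String argumentField) {
        String argument = tuple.get(argumentField);
        if (argument == null) {
            throw new IllegalArgumentException("no such field: " + argumentField);
        }
        return split(argument);
    }

    public static void main(String[] args) {
        Map<String, String> tuple = new LinkedHashMap<>();
        tuple.put("doc_id", "doc05");
        tuple.put("text", "Two Women. Secrets. A Broken Land. [DVD Australia]");
        // Only the "text" value is split; "doc05" is never seen by split().
        for (String token : each(tuple, "text")) {
            System.out.println("token=" + token);
        }
    }
}
```

In this sketch, adjacent delimiters such as ". " produce empty tokens, because String.split keeps empty strings between consecutive matches.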
Naveen

The answer is in these two lines:

1. The way the Tap was created tells the program that the first line of the file is a header (the "true" argument), so the column names are taken from that line:

Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );    

2. Second, in this line the field name is given as "text". If you look closely at your input file, "text" is the name of the column that holds the data you are basing your word count on:

 Fields text = new Fields( "text" );
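
As a rough illustration of how a header line can turn into field names that are later looked up by name, here is a small plain-Java sketch. It is only an assumption about the general idea, not Cascading's actual TextDelimited code; HeaderFieldsSketch, fieldNames and valueOf are made-up names.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; it does not use or reproduce TextDelimited.
public class HeaderFieldsSketch {

    // Treat the first line as the header and return its column names.
    static List<String> fieldNames(String headerLine) {
        return Arrays.asList(headerLine.split("\t"));
    }

    // Look up one column of a data line by a name taken from the header.
    static String valueOf(String fieldName, List<String> names, String dataLine) {
        int index = names.indexOf(fieldName);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + fieldName);
        }
        // Limit -1 keeps trailing empty columns instead of dropping them.
        String[] values = dataLine.split("\t", -1);
        return values[index];
    }

    public static void main(String[] args) {
        List<String> names = fieldNames("doc_id\ttext");
        String line = "doc05\tTwo Women. Secrets. A Broken Land. [DVD Australia]";
        // Selecting "text" by name skips the doc_id column entirely.
        System.out.println(valueOf("text", names, line));
    }
}
```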
Unit1