I am following the guide of Cascading on its website. I have the following TSV format input:
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
I use the following code to process it:
Tap docTap = new Hfs(new TextDelimited(true, "\t"), inPath);
...
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
// only returns "token"
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);
It looks like just split the second part of each line (ignore doc_id part). How does Cascading ignore the first doc_id part and just process the second part? is that because of TextDelimited ??