How do you extract the line index of any given line in mrjob?
import re

from mrjob.job import MRJob

# assumed definition: WORD_RE is not shown in the original snippet
WORD_RE = re.compile(r"[\w']+")
index_words = ["before", "remove"]

class MRWordInvertedIndex(MRJob):
    # how to make the key (index) the line index of the corresponding value (line) in the input text file?
    def mapper(self, index, line):
        words = WORD_RE.findall(line.lower())
        for word in words:
            # obtain the line index where 'word' occurs
            if word in index_words:
                yield word, index  # where index should be the line number
Is it possible to make the key (aka index) parameter of the mapper the actual line index of the line in the corresponding input text file, or to obtain the line index in some other way?
For example, say the input text file is:
# copyright laws for your country before downloading or redistributing
# this or any other Project Gutenberg eBook. BLANK LINE BELOW.
# This header should be the first thing seen when viewing this Project
# Gutenberg file. Please do not remove it.
Then the line index for the index words "before" and "remove" would be:
"before": 1
"remove": 4
This is because "before" occurs on line 1 and "remove" occurs on line 4.
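To pin down the expected result, here is a minimal sketch in plain Python (no mrjob) that computes the same mapping on a single machine. The function name `line_index` and the `WORD_RE` pattern are assumptions made for illustration; they are not part of mrjob. It only shows what the output should look like, not how to obtain it inside a mapper, where the input may be split across tasks.

```python
import re

# Assumed tokenizer pattern; the original snippet does not show WORD_RE.
WORD_RE = re.compile(r"[\w']+")
index_words = ["before", "remove"]


def line_index(lines, targets):
    """Map each target word to the 1-based number of the first line containing it."""
    found = {}
    for line_number, line in enumerate(lines, start=1):
        for word in WORD_RE.findall(line.lower()):
            if word in targets and word not in found:
                found[word] = line_number
    return found


# The four sample lines from the question.
sample = [
    "# copyright laws for your country before downloading or redistributing",
    "# this or any other Project Gutenberg eBook. BLANK LINE BELOW.",
    "# This header should be the first thing seen when viewing this Project",
    "# Gutenberg file. Please do not remove it.",
]

print(line_index(sample, index_words))  # {'before': 1, 'remove': 4}
```

This relies on reading the file sequentially from the top with `enumerate`, which is exactly the information the mapper above does not appear to receive through its key.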