I would like to create an inverted index using MapReduce techniques with MrJob. The inverted index for a given word x is defined as the line index or indices where x occurs in a given input text file. For example, say x is the word "this" and the input text file text.txt is:
# copyright laws for your country before downloading or redistributing
# this or any other Project Gutenberg eBook. BLANK LINE BELOW.

# This header should be the first thing seen when viewing this Project
# Gutenberg file. Please do not remove it.
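Then the inverted index for "this" would be:
"this": 2, 4, 4
since "this" occurs once on line 2 and twice on line 4. Matching is case-insensitive and line numbers start at 1; line 3 is the blank line flagged by "BLANK LINE BELOW", which is why the second match falls on line 4.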
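To make the expected result concrete, here is a minimal plain-Python sketch of the map, shuffle, and reduce steps. It does not use MrJob itself; the names `mapper`, `reducer`, `inverted_index`, and `WORD_RE` are hypothetical, and the word pattern is only one possible tokenization. It also assumes that each line reaches the mapper already paired with its 1-based line number, which plain text input does not provide by itself, and that line 3 of text.txt is the blank line.

```python
import re
from collections import defaultdict

# Assumed tokenization: runs of letters and apostrophes, applied to lowercased text.
WORD_RE = re.compile(r"[a-z']+")


def mapper(line_no, line):
    """Emit (word, line_no) once per word occurrence, ignoring case."""
    for word in WORD_RE.findall(line.lower()):
        yield word, line_no


def reducer(word, line_nos):
    """Emit the word with its sorted line numbers (duplicates are kept)."""
    yield word, sorted(line_nos)


def inverted_index(lines):
    """Run map, shuffle, and reduce in-process over an iterable of lines."""
    grouped = defaultdict(list)  # shuffle step: group line numbers by word
    for line_no, line in enumerate(lines, start=1):
        for word, n in mapper(line_no, line):
            grouped[word].append(n)
    return dict(kv for word, nums in grouped.items() for kv in reducer(word, nums))


# Sample input from the question; the empty string is the assumed blank line 3.
text = [
    "# copyright laws for your country before downloading or redistributing",
    "# this or any other Project Gutenberg eBook. BLANK LINE BELOW.",
    "",
    "# This header should be the first thing seen when viewing this Project",
    "# Gutenberg file. Please do not remove it.",
]

index = inverted_index(text)
print(index["this"])  # [2, 4, 4]
```

The open part for a real MapReduce job is where the line number comes from: the sketch gets it from `enumerate`, which only works because all lines pass through one process in order.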