I would like to create an inverted index using MapReduce techniques with MrJob. The inverted index for a given word x is defined as the list of line numbers (1-based) at which x occurs in a given input text file. For example, say x is the word "this" and the input text file text.txt is:

# copyright laws for your country before downloading or redistributing 
# this or any other Project Gutenberg eBook. BLANK LINE BELOW.

# This header should be the first thing seen when viewing this Project 
# Gutenberg file.  Please do not remove it.  

Then the inverted index for "this" would be:

"this": 2, 4, 4

This is because "this" occurs once on line 2 and twice on line 4. Note that matching is case-insensitive and that the blank line counts as line 3.
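To make the expected behaviour concrete, here is a minimal sketch of the map and reduce steps in plain Python, not a working MrJob job. The names `mapper`, `reducer`, `inverted_index`, `WORD_RE` and `SAMPLE` are hypothetical, and the shuffle step is simulated in memory. In MrJob the two functions would roughly correspond to the `mapper` and `reducer` methods of an `MRJob` subclass. One caveat: a MrJob mapper receives only the line text, not its position in the file, so this sketch assumes the line number is supplied some other way (for example by prefixing each line with its number in a preprocessing step).

```python
import re
from collections import defaultdict

# Assumption: a "word" is a run of letters or apostrophes, compared in lowercase.
WORD_RE = re.compile(r"[a-z']+")


def mapper(line_no, line):
    """Map phase: emit (word, line_no) once per occurrence of each word."""
    for word in WORD_RE.findall(line.lower()):
        yield word, line_no


def reducer(word, line_nos):
    """Reduce phase: gather all line numbers seen for one word, sorted."""
    yield word, sorted(line_nos)


def inverted_index(lines):
    """Simulate map -> shuffle -> reduce in memory for a list of lines."""
    grouped = defaultdict(list)  # stands in for the shuffle/sort step
    for line_no, line in enumerate(lines, start=1):
        for word, n in mapper(line_no, line):
            grouped[word].append(n)
    index = {}
    for word, line_nos in grouped.items():
        for key, value in reducer(word, line_nos):
            index[key] = value
    return index


# The five lines of text.txt from the example above (line 3 is blank).
SAMPLE = [
    "# copyright laws for your country before downloading or redistributing ",
    "# this or any other Project Gutenberg eBook. BLANK LINE BELOW.",
    "",
    "# This header should be the first thing seen when viewing this Project ",
    "# Gutenberg file.  Please do not remove it.  ",
]

if __name__ == "__main__":
    print(inverted_index(SAMPLE)["this"])  # prints [2, 4, 4]
```

Emitting one pair per occurrence, rather than one per line, is what keeps the duplicate 4 in the result.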

Animo
  • What is your question? All you have done is to state a general desire -- there is no Stack Overflow question in this. Please repeat the [intro tour](https://stackoverflow.com/tour). SO is neither a coding service nor a tutorial resource. We will help you diagnose a *specific* problem with *your* code -- which you have not posted. – Prune May 01 '20 at 14:51
  • @Prune I don't have any relevant code as I don't even know how to approach the problem in MrJob. I have searched online for possible solutions to no avail. – Animo May 01 '20 at 15:01
  • This typically is not something for which you "find" a solution -- you work through tutorials for the needed programming tools and *create* a solution. That process is well outside the scope of Stack Overflow's stated purpose. In general, when you can't write *any* code towards solving your problem, then it's likely not a SO question. Rather, you need tutorial resources or an open-ended help site. – Prune May 01 '20 at 16:36
  • As side notes: (1) Why do you need to do this with MapReduce? (2) Why do you need to do this through Hadoop? There are many available indexing tools -- look for text-processing and NLP (natural-language processing) packages -- that might do the task for you without resorting to the overhead of learning extra implementation tools. – Prune May 01 '20 at 16:38
  • In response to both (1) and (2): the goal is to efficiently compute inverted indices for, say, a large library of documents (on the order of GBs) using the shared computing resources of a cluster, as enabled by Hadoop. Also, chances are those other indexing tools end up interfacing with Hadoop in the background anyway, if they are doing it efficiently. – Animo May 01 '20 at 16:51

0 Answers