Inverted Index in Python not returning desired results

Question

I'm having trouble returning proper results for an inverted index in Python. I'm trying to load a list of strings into the variable 'strlist' and then have my inverted index loop over the strings to return each word plus where it occurs. Here is what I have so far:

def inverseIndex(strlist):
  d={}
  for x in range(len(strlist)):
    for y in strlist[x].split():
      for index, word in set(enumerate([y])):
        if word in d:
          d=d.update(index)
        else:
          d._setitem_(index,word)
        break
      break
    break
  return d

Now when I run inverseIndex(strlist)

all it returns is {0:'This'}, whereas what I want is a dictionary mapping all the words in 'strlist' to the set d.

Is my initial approach wrong? Am I tripping up in the if/else? Any and all help to point me in the right direction is greatly appreciated.

Are the indentation levels of the `break` correctly shown here? — Sukrit Kalra, Jul 09 '13 at 17:54
Why are you `break`ing out of all your loops after the first iteration? — Tim Pietzcker, Jul 09 '13 at 17:54
Perhaps it would help if you showed some example input and desired output. I have no idea what you're trying to accomplish. — Tim Pietzcker, Jul 09 '13 at 17:56
Please ignore the downvotes - you have obviously given it your best shot here. Downvotes are for people who don't try and expect us to do their homework/jobs for them. — 2rs2ts, Jul 09 '13 at 18:00
For one thing, `d=d.update(index)` is equivalent to `d=None`, because that's what `.update()` returns. But that's not the problem here. — Wooble, Jul 09 '13 at 18:03
Anyway, your code *is* confusing, but I think you are trying to just get all the words in `strlist`, right? For starters, `d={}` isn't making `d` a set, it's making it a dictionary. Secondly, `set(enumerate([y]))` is going to give you something like `set([(0, 'foo')])`, which I doubt you want. — 2rs2ts, Jul 09 '13 at 18:05
You should also consider the fact that a word can occur in more than one string in `strlist`. — 2rs2ts, Jul 09 '13 at 18:10
Thanks for the comments so far. I'm new to Python (as if that wasn't obvious). The indent levels are shown as I have them in my code. As an example, strlist contains the following: strlist=["Here is a test of what is going on", "Here is another string"]. This is why I was breaking things apart with my for loops, to try and get at the words within each string. — user2565601, Jul 09 '13 at 19:04

score 2 · Accepted Answer · answered Jul 09 '13 at 18:22

Based on what you're saying, I think you're trying to get some data like this:

input = ["hello world", "foo bar", "red cat"]
data_wanted = {
    "foo" : 1,
    "hello" : 0,
    "cat" : 2,
    "world" : 0,
    "red" : 2,
    "bar" : 1
}

So what you should be doing is adding the words as keys to a dictionary, and have their values be the index of the substring in strlist in which they are located.

def locateWords(strlist):
    d = {}
    for i, substr in enumerate(strlist):   # gives you the index and the item itself
        for word in substr.split():
            d[word] = i
    return d
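
As a quick check of what this first version returns, here is a small sketch run against the sample strlist the asker posted in the comments. The copy of locateWords below just restates the function above so the snippet runs on its own; the variable name index is an assumption for illustration.

```python
def locateWords(strlist):
    d = {}
    for i, substr in enumerate(strlist):
        for word in substr.split():
            d[word] = i  # a later occurrence overwrites an earlier one
    return d

strlist = ["Here is a test of what is going on", "Here is another string"]
index = locateWords(strlist)

print(index["test"])  # 0: "test" appears only in the first string
print(index["Here"])  # 1: the entry for string 0 was overwritten by string 1
```

Because each assignment replaces the previous value, a word that appears in several strings keeps only the index of the last one.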

If the word occurs in more than one string in strlist, you should change the code to the following:

def locateWords(strlist):
    d = {}
    for i, substr in enumerate(strlist):
        for word in substr.split():
            if word not in d:
                d[word] = [i]
            else:
                d[word].append(i)
    return d

This changes the values to lists, which contain the indices of the substrings in strlist which contain that word.
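
Since the question asks for a mapping from words to sets, here is one possible variant, not the code from this answer: it swaps the list for a set by using collections.defaultdict. The function name inverted_index is an assumption for illustration.

```python
from collections import defaultdict

def inverted_index(strlist):
    """Map each word to the set of indices of the strings containing it."""
    index = defaultdict(set)  # missing keys start out as an empty set
    for i, substr in enumerate(strlist):
        for word in substr.split():
            index[word].add(i)  # adding the same index twice has no effect
    return dict(index)  # plain dict, so later lookups of missing words raise KeyError

strlist = ["Here is a test of what is going on", "Here is another string"]
index = inverted_index(strlist)

print(index["is"])      # {0, 1}
print(index["string"])  # {1}
```

With the list version, "is" would map to [0, 0, 1], because the word occurs twice in the first string. The set keeps each string index once, which may or may not be what you want.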

Some of your code's problems explained

{} is not a set, it's a dictionary.
break forces a loop to terminate immediately - you didn't want to end the loop early because you still had data to process.
d.update(index) will give you a TypeError: 'int' object is not iterable. This method actually takes an iterable object and updates the dictionary with it. Normally you would use a list of tuples for this: [("foo",1), ("hello",0)]. It just adds the data to the dictionary.
You don't normally want to use d.__setitem__ (which you typed wrong anyway). You'd just use d[key] = value.
You can iterate using a "for each" style loop instead, like my code above shows. Looping over the range means you are looping over the indices. (Not exactly a problem, but it could lead to extra bugs if you're not careful to use the indices properly).
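
To make the first point concrete, here is a short sketch of how dictionaries and sets are written in Python. All variable names are made up for illustration.

```python
empty_dict = {}                    # empty braces always create a dictionary
empty_set = set()                  # an empty set needs the set() constructor
words = {"foo", "bar"}             # braces with bare values create a set
positions = {"foo": 1, "bar": 1}   # braces with key: value pairs create a dict

print(type(empty_dict).__name__)   # dict
print(type(empty_set).__name__)    # set
print(type(words).__name__)        # set
print(type(positions).__name__)    # dict
```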

It looks like you are coming from another programming language in which braces indicate sets and there is a keyword which ends control blocks (like if, fi). It's easy to confuse syntax when you're first starting - but if you run into trouble running the code, look at the exceptions you get and search them on the web!

P.S. I'm not sure why you wanted a set - if there are duplicates, you probably want to know all of their locations, not just the first or the last one or anything in between. Just my $0.02.

Wow, thank you so much for the breakdown. You guessed it at the end: I am trying to get the locations of duplicates as well. I was getting lost along the way and was simply trying to get some sort of result out of the code to help point me in the right direction. — user2565601, Jul 09 '13 at 19:10
I'm re-reading the code you gave, and it's so clear. Thank you for your help. — user2565601, Jul 09 '13 at 19:18

score 0 · Answer 2 · answered Jul 09 '13 at 18:11

break is not an end-of-block marker; it means "if you hit this line of code, exit the loop immediately". You probably don't want all those break statements.
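
A minimal sketch of the effect, using two hypothetical helper functions that differ only in the break:

```python
def first_only(items):
    seen = []
    for item in items:
        seen.append(item)
        break  # leaves the loop right after the first iteration
    return seen

def all_items(items):
    seen = []
    for item in items:
        seen.append(item)  # no break, so every item is processed
    return seen

words = "This is a test".split()
print(first_only(words))  # ['This']
print(all_items(words))   # ['This', 'is', 'a', 'test']
```

This is why the original function never gets past the first word of the first string.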

I'm not sure what you think the update method does.

d.update(index)

will try to treat index as a dict or a sequence of key-value pairs and add all the mappings in index to d. Since index is a number, this doesn't seem to be what you expect update to do. Also, update returns None, which is the Python equivalent of not returning anything, so you probably don't want to assign its value to d.
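
For comparison, here is a small sketch of what update accepts and what it returns. The keys and values are made up for illustration.

```python
d = {"hello": 0}

d.update({"foo": 1})                 # merge in another dictionary
d.update([("bar", 1), ("cat", 2)])   # or a sequence of (key, value) pairs

result = d.update({"red": 2})        # update changes d in place...
print(result)                        # None, so never assign it back to d

try:
    d.update(0)                      # a bare number is neither a dict nor pairs
    raised = False
except TypeError:
    raised = True
print(raised)                        # True
```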

I'm not sure what you expect

for index, word in set(enumerate([y])):

to do. Let's go over what it does. [y] creates a 1-element list whose only element is y. enumerate([y]) will then return an iterator yielding a single element, the tuple (0, y). set(enumerate([y])) will then take all the items from that iterator (so just one item) and make a set containing those items. Finally, for index, word in set(enumerate([y])): will iterate over that one-item set, executing a single loop iteration with index == 0 and word == y. This is probably not what you were trying to do.
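
The same walk-through as a runnable sketch, followed by the more usual use of enumerate over a whole list of words. The sample values are assumptions for illustration.

```python
y = "foo"

pairs = list(enumerate([y]))  # one (index, item) tuple: [(0, 'foo')]
as_set = set(enumerate([y]))  # the same single tuple, in a set: {(0, 'foo')}

for index, word in as_set:    # runs exactly once
    print(index, word)        # 0 foo

# enumerate is more useful over a full list, where the index actually varies.
words = "Here is another string".split()
numbered = list(enumerate(words))
print(numbered)  # [(0, 'Here'), (1, 'is'), (2, 'another'), (3, 'string')]
```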

The __setitem__ special method (which has two underscores on each side) is called by Python to implement element assignment.

d.__setitem__(index, word)

is better written as

d[index] = word

If you want to iterate over strlist, then instead of using range(len(strlist)), you can iterate over strlist directly.

  for x in range(len(strlist)):
    for y in strlist[x].split():

is equivalent to

  for string in strlist:
    for y in string.split():

since looping over strlist will give the items of strlist.
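
A short sketch that checks the two loops visit the same words, using the sample strlist from the comments. The last loop is an addition, not part of this answer: it uses enumerate for the case where the position of each string is also needed.

```python
strlist = ["Here is a test of what is going on", "Here is another string"]

by_index = []
for x in range(len(strlist)):
    for y in strlist[x].split():
        by_index.append(y)

direct = []
for string in strlist:
    for y in string.split():
        direct.append(y)

print(by_index == direct)  # True

# If the position of each string matters, enumerate supplies it directly.
with_positions = []
for i, string in enumerate(strlist):
    for y in string.split():
        with_positions.append((i, y))
print(with_positions[0])   # (0, 'Here')
```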

I hope that helps.

Thank you for your help and for explaining my many mistakes. It's helpful to have this breakdown and see where I went wrong and how to do things better. — user2565601, Jul 10 '13 at 12:12
