I have a string (or a list of words). I would like to create tuples of every possible word pair so that I can pass them to a Counter to build a dictionary and calculate frequencies. The frequency is calculated as follows: if both words of a pair occur in a string (regardless of their order or of any other words between them), the frequency of the pair is 1. Even if word1 occurs 7 times and word2 occurs 3 times, the frequency of the pair (word1, word2) is still 1.

I am using loops to create tuples of all pairs, but I got stuck:

tweetList = ('I went to work but got delayed at other work and got stuck in a traffic and I went to drink some coffee but got no money and asked for money from work', 'We went to get our car but the car was not ready. We tried to expedite our car but were told it is not ready')

words = set(tweetList.split())
n = 10
for tweet in tweetList:

    for word1 in words:
        for word2 in words:
            pairW = [(word1, word2)]

            c1 = Counter(pairW for pairW in tweet)

c1.most_common(n)

However, the output is very bizarre:

[('k', 1)]

It seems that it is iterating over letters instead of words.

How can this be addressed? By converting the string into a list of words using split()?

Another question: how can I avoid creating duplicate tuples such as (word1, word2) and (word2, word1)? With enumerate?

As output I expect a dictionary where each key is a word pair (see the question about duplicates, though) and each value is the frequency of that pair in the list.

Thank you!

Toly
  • You should indicate what you are expecting as output. – Anand S Kumar Oct 17 '15 at 16:35
  • `for tweet in tweetlist` would iterate over characters in the original string, which seems pointless. Calling `split` on it doesn't cause it to become a list – John Coleman Oct 17 '15 at 16:54
  • tweetlist is a list of 2 strings – Toly Oct 17 '15 at 17:07
  • what about pairs such as `('work','work')`? That *is* an ordered pair of words -- do you want to count such pairs? – John Coleman Oct 17 '15 at 17:07
  • @Toly -- I see, I scrolled to the end but missed the comma in the middle. To quibble, it is a *tuple* of strings. But then -- calling `split` on it makes no sense. Tuples don't have a split method. – John Coleman Oct 17 '15 at 17:08
  • A final question -- if you want to count pairs of words regardless of order and spaces in between then why not just count the individual words and multiply the counts? – John Coleman Oct 17 '15 at 17:27
  • @JohnColeman - no, pairs made of the same word twice should be excluded – Toly Oct 17 '15 at 18:22
  • @JohnColeman - because I need an occurrence for both words in any string. Also this will be a cleaner way – Toly Oct 17 '15 at 18:23

2 Answers

I wonder if that's what you want:

import itertools, collections

tweets = ['I went to work but got delayed at other work and got stuck in a traffic and I went to drink some coffee but got no money and asked for money from work',
          'We went to get our car but the car was not ready. We tried to expedite our car but were told it is not ready']

words = set(word.lower() for tweet in tweets for word in tweet.split())
_pairs = list(itertools.permutations(words, 2))
# We need to clean up similar pairs: sort words in each pair and then convert
# them to tuple so we can convert whole list into set.
pairs = set(map(tuple, map(sorted, _pairs)))

c = collections.Counter()

for tweet in tweets:
    for pair in pairs:
        if pair[0] in tweet and pair[1] in tweet:
            c.update({pair: 1})

print(c.most_common(10))  # parenthesized so it runs on both Python 2 and 3

Result is: [(('a', 'went'), 2), (('a', 'the'), 2), (('but', 'i'), 2), (('i', 'the'), 2), (('but', 'the'), 2), (('a', 'i'), 2), (('a', 'we'), 2), (('but', 'we'), 2), (('no', 'went'), 2), (('but', 'went'), 2)]
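
One caveat about the check above: `pair[0] in tweet` is a substring test on the whole string, so a short word such as 'a' matches any tweet that merely contains that letter. A minimal sketch of a whole-word variant follows, assuming Python 3 and that whitespace splitting is good enough (punctuation such as "ready." is not stripped). The function name `pair_frequencies` and the shortened sample tweets are made up for illustration.

```python
import itertools
from collections import Counter


def pair_frequencies(tweets):
    """Count, for each unordered word pair, how many tweets contain both words."""
    counts = Counter()
    for tweet in tweets:
        # Unique lowercase words of this tweet, sorted so pairs come out ordered.
        words = sorted(set(tweet.lower().split()))
        # combinations() on a sorted list of unique words yields every unordered
        # pair exactly once and never pairs a word with itself.
        counts.update(itertools.combinations(words, 2))
    return counts


# Shortened, hypothetical sample input.
tweets = ["I went to work but got no money",
          "We went to get our car but the car was not ready"]
freq = pair_frequencies(tweets)
print(freq[("but", "went")])  # 2: both tweets contain "but" and "went"
print(freq[("i", "work")])    # 1: only the first tweet contains both
```

Pairs that never occur together are simply absent from the Counter, so looking them up returns 0. Each pair is stored in alphabetical order, which also rules out the (word1, word2) / (word2, word1) duplicates.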

Alexander Solovyov
  • it looks right. Since there are only 2 strings the max frequency is 2. Need to check with N = 100 to see if the rest are ones. I am just curious why my way was so much off. – Toly Oct 17 '15 at 18:35
  • Well, compare your version with mine and you will see why. For a start, `pairW for pairW in tweet` just generated the letters of tweet (shadowing the `pairW` variable you defined previously). Then `c1` was replaced with a new Counter on every iteration, instead of being the same one for the whole loop. You also tried to split a tuple - `tweetList.split()` - while generating a list of words is a bit more involved. – Alexander Solovyov Oct 17 '15 at 18:37
  • Excellent! I checked and it works. The frequencies are from 2 to 0 as it should since there are only 2 strings. Thank you, Sasha:) – Toly Oct 17 '15 at 18:44

tweet is a string, so Counter(pairW for pairW in tweet) will compute the frequency of the letters in tweet, which is probably not what you want.
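
A small illustration of the difference, assuming Python 3 (the sample string is made up):

```python
from collections import Counter

tweet = "to work to"

# Iterating over a str yields single characters, so this counts letters.
letters = Counter(ch for ch in tweet)

# Splitting first yields whole words, so this counts words.
words = Counter(tweet.split())

print(letters["o"])     # 3: one "o" in each of "to", "work", "to"
print(words["to"])      # 2
print("to" in letters)  # False: the keys are single characters
```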

Maxime Chéramy