I have just started to learn Python. I want to write a program with NLTK that breaks a text into unigrams and bigrams. For example, if the input text is...

"I am feeling sad and disappointed due to errors"

... my function should generate text like:

I am-->am feeling-->feeling sad-->sad and-->and disappointed-->disappointed due-->due to-->to errors

I have written code to input text into the program. Here's the function I'm trying:

def gen_bigrams(text):
    token = nltk.word_tokenize(review)
    bigrams = ngrams(token, 2)
    #print Counter(bigrams)
    bigram_list = ""
    for x in range(0, len(bigrams)):
        words = bigrams[x]
        bigram_list = bigram_list + words[0]+ " " + words[1]+"-->"
    return bigram_list

The error I'm getting is...

for x in range(0, len(bigrams)):

TypeError: object of type 'generator' has no len()

Since the ngrams function returns a generator, I tried using len(list(bigrams)), but it returns 0, so I still cannot loop over the bigrams by index. I have looked at other questions on StackExchange but still can't work out how to resolve this. Any workaround or suggestion?

smci
Vishal Kharde
  • If `len(list(bigrams))` returns `0`, then that's presumably the issue... you probably need to figure out why `ngrams(token, 2)` isn't returning any values. – jmetz Apr 28 '16 at 11:42
  • ... and update the question title and text accordingly; at the moment the title is misleading – jmetz Apr 28 '16 at 11:43
  • `for x in bigrams` should work. Then no need for `words = bigrams[x]`. Why? `x` will be your `words` – Marek Apr 28 '16 at 11:43
  • "tried calling a list function on it to extract the content and then using len(list(bigrams))" is (one of) your problem(s), if I understand you correctly. You first exhaust the generator with `list` and then try to call `len(list(gen))` on it again. As the generator is already exhausted, it'll result in an empty list. – Ilja Everilä Apr 28 '16 at 11:45
  • This code is not minimal complete for anyone to reproduce it – MohitC Apr 28 '16 at 11:54
  • @llja Everila ...i hav updated the statement now..i hv used it once only. – Vishal Kharde Apr 28 '16 at 11:56
  • output of Print Counter(bigrams) is :-- Counter({('am', 'feeling'): 1, ('i', 'am'): 1, ('feeling', 'sad'): 1, ('sad', 'and'): 1, ('due', 'to'): 1, ('to', 'errors'): 1, ('disappointed', 'due'): 1, ('and', 'disappointed'): 1}) – Vishal Kharde Apr 28 '16 at 11:59
  • Thanks @Marek Czaplicki.... i made changes according to wat u said and it working...thanks.. – Vishal Kharde Apr 28 '16 at 12:04
  • @VishalKharde, Welcome. BTW. bigram_list is not a list, but string. Should it be? For string you can use: `bigram_list += words[0]+ " " + words[1]+"-->"` – Marek Apr 28 '16 at 12:12
  • Your code just references an `ngrams` function without saying where it comes from. I think it's `nltk.util.ngrams`, not your own function. Please edit the code to make that clear (MCVE). – smci Jun 28 '20 at 06:17
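
The exhaustion described in the comments above can be reproduced without NLTK. This is a minimal sketch, assuming Python 3; `pairs` is a hypothetical stand-in for `ngrams(token, 2)`, not the NLTK function itself. It shows that a generator yields its items only once, so a second `list(...)` call on the same object comes back empty:

```python
def pairs(tokens):
    """Lazily yield adjacent token pairs (hypothetical stand-in for ngrams(tokens, 2))."""
    for left, right in zip(tokens, tokens[1:]):
        yield (left, right)

tokens = "I am feeling sad".split()
gen = pairs(tokens)

first_pass = list(gen)    # consumes every pair the generator can produce
second_pass = list(gen)   # the generator is already exhausted at this point

print(len(first_pass))    # 3
print(len(second_pass))   # 0
```

If the pairs are needed more than once, storing `list(gen)` in a variable a single time and reusing that list avoids the problem.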

2 Answers

Constructing strings by concatenating values separated by a separator is best done by str.join:

import nltk

def gen_bigrams(text):
    token = nltk.word_tokenize(text)
    bigrams = nltk.ngrams(token, 2)
    # each bigram is a tuple of two tokens; " ".join glues them with a space
    # ("{} {}".format(*bigram) in a generator expression would also work)
    return "-->".join(map(" ".join, bigrams))

Note that there will be no trailing "-->", so append one if you need it. This way you don't even have to think about the length of the iterable you're using, which is almost always the case in Python. If you want to iterate through an iterable, use `for x in iterable:`. If you do need the indexes, use `enumerate`:

for i, x in enumerate(iterable):
    ...
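
As a rough illustration of both points that runs without NLTK installed, the sketch below assumes plain whitespace tokenization via `str.split` and builds the bigrams with `zip`. The name `gen_bigrams_plain` is hypothetical, and `nltk.word_tokenize` would treat punctuation differently:

```python
def gen_bigrams_plain(text):
    """Join adjacent word pairs with '-->' (whitespace tokenization is an assumption)."""
    tokens = text.split()
    bigrams = zip(tokens, tokens[1:])          # lazy iterator of (word, next_word)
    return "-->".join(" ".join(pair) for pair in bigrams)

sentence = "I am feeling sad and disappointed due to errors"
result = gen_bigrams_plain(sentence)
print(result)

# enumerate supplies the index without ever calling len() on the iterator
tokens = sentence.split()
indexed = [(i, pair) for i, pair in enumerate(zip(tokens, tokens[1:]))]
print(indexed[0])  # (0, ('I', 'am'))
```

Neither loop needs the number of bigrams up front, so it does not matter that the source is a one-shot iterator.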
Ilja Everilä

bigrams is a generator, and next(bigrams) (bigrams.next() in Python 2 only) is what gives you the next tuple of your tokens. You can call len() on that tuple but not on the generator itself. The following code fetches the tuples manually to build the string you are trying to produce.

>>> import nltk
>>> review = "i am feeling sad and disappointed due to errors"
>>> token = nltk.word_tokenize(review)
>>> bigrams = nltk.ngrams(token, 2)
>>> output = ""
>>> try:
...   while True:
...     temp = next(bigrams)  # raises StopIteration once the generator is exhausted
...     output += "%s %s-->" % (temp[0], temp[1])
... except StopIteration:
...   pass
... 
>>> output
'i am-->am feeling-->feeling sad-->sad and-->and disappointed-->disappointed due-->due to-->to errors-->'
MohitC
  • This should be just `for w1, w2 in bigrams:` or `next(bigrams)`, if manually fetching values from an iterator (to get a sentinel value instead, for example). `except StopIteration:` is almost always a sign that something is amiss. – Ilja Everilä Apr 28 '16 at 12:35
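
The two alternatives named in that comment might look like the sketch below. It assumes Python 3 and uses `zip` over whitespace-split tokens as a stand-in for the NLTK bigram generator, so it runs without NLTK; the variable names are illustrative only:

```python
tokens = "i am feeling sad".split()

# Alternative 1: unpack each pair directly in the for loop.
parts = []
for w1, w2 in zip(tokens, tokens[1:]):
    parts.append("%s %s" % (w1, w2))
output = "-->".join(parts)
print(output)  # i am-->am feeling-->feeling sad

# Alternative 2: fetch values manually with next(); the second argument is a
# sentinel returned instead of raising StopIteration at the end.
bigrams = zip(tokens, tokens[1:])
first = next(bigrams, None)      # ('i', 'am')
rest = list(bigrams)             # the two remaining pairs
after_end = next(bigrams, None)  # None, because nothing is left
print(first, rest, after_end)
```

With the sentinel form there is no need for a `try`/`except StopIteration` block at all.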