
I am using NLTK to do some analysis of a number of distinct documents. Because of their content, the documents all tend to start and end with the same tokens.

I tokenize the documents into a list of lists and then use BigramCollocationFinder.from_documents to create the finder. When I score the ngrams by raw frequency, I notice that the most frequent bigram is the end-character/start-character pair. This suggests that the finder is running all the documents together and finding ngrams across the whole lot, which I don't want.

A sample of the code:

import nltk
from nltk.collocations import BigramCollocationFinder

line_tokenizer = nltk.RegexpTokenizer(r'\{|\}|[^,"}]+')
seqs = ["{B,C}", "{B,A}", "{A,B,C}"]
documents = [line_tokenizer.tokenize(s) for s in seqs]
finder = BigramCollocationFinder.from_documents(documents)
bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.score_ngrams(bigram_measures.raw_freq))

This results in the following output:

[(('B', 'C'), 0.15384615384615385), 
 (('C', '}'), 0.15384615384615385), 
 (('{', 'B'), 0.15384615384615385), 
 (('}', '{'), 0.15384615384615385), 
 (('A', 'B'), 0.07692307692307693), 
 (('A', '}'), 0.07692307692307693), 
 (('B', 'A'), 0.07692307692307693), 
 (('{', 'A'), 0.07692307692307693)]

The ngram }{ shows up in the list, which it shouldn't, since } and { never appear next to each other within any one document.

Is there an alternative way to approach this problem to avoid }{ showing up in the list?

Jennifer
  • It sounds crazy, but I think you can hack your way out of this; give me a min while I code the hack =) – alvas Sep 28 '13 at 15:30

1 Answer


I believe you want to keep bigrams like ('{', 'A') and ('C', '}'), because sometimes it's good to know that certain words always occur at the start or end of a sentence. So here's the hack:

Remove the ('}', '{') bigram from the scored results and then renormalize the probability of each remaining bigram by dividing by 1 - prob(('}', '{')).

import nltk

line_tokenizer = nltk.RegexpTokenizer(r'\{|\}|[^,"}]+')
seqs = ["{B,C}", "{B,A}", "{A,B,C}"]
documents = [line_tokenizer.tokenize(s) for s in seqs]
finder = nltk.collocations.BigramCollocationFinder.from_documents(documents)
bigram_measures = nltk.collocations.BigramAssocMeasures()

# Put the bigram scores into a dict for easy access.
x = dict(finder.score_ngrams(bigram_measures.raw_freq))

# The probability mass left after dropping ('}', '{').
newmax = 1 - x[('}', '{')]

# Remove the ('}', '{') bigram.
del x[('}', '{')]

# Renormalize each remaining bigram's score by newmax.
y = [(i, j / newmax) for i, j in x.items()]
print(y)

[(('B', 'C'), 0.18181818181818182), (('C', '}'), 0.18181818181818182), (('B', 'A'), 0.09090909090909091), (('{', 'A'), 0.09090909090909091), (('{', 'B'), 0.18181818181818182),  (('A', 'B'), 0.09090909090909091), (('A', '}'), 0.09090909090909091)]
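
If you don't need the remaining scores renormalized, a lighter-weight variant is a sketch using NLTK's apply_ngram_filter, which drops candidate ngrams for which the given function returns True: filter the boundary bigram out of the finder before scoring. Note this only removes the ('}', '{') entry; the other scores stay as they were rather than being rescaled.

import nltk
from nltk.collocations import BigramCollocationFinder

line_tokenizer = nltk.RegexpTokenizer(r'\{|\}|[^,"}]+')
seqs = ["{B,C}", "{B,A}", "{A,B,C}"]
documents = [line_tokenizer.tokenize(s) for s in seqs]
finder = BigramCollocationFinder.from_documents(documents)

# Drop the document-boundary bigram from the candidate set.
finder.apply_ngram_filter(lambda w1, w2: (w1, w2) == ('}', '{'))

bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.score_ngrams(bigram_measures.raw_freq))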
alvas
  • If I'm right (yeah, dealing with the same problem), this solution won't work with wider windows. I mean, you have to change the code manually for each `window_size`, so it's not a good answer, unfortunately – Nick Nov 18 '18 at 02:41
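
More generally, since from_documents chains all the documents together (which is exactly why the boundary bigram appears in the first place), you can avoid the problem at the source by accumulating the frequency distributions one document at a time, so no bigram ever spans a document boundary. A minimal sketch, assuming NLTK's BigramCollocationFinder constructor that takes a word FreqDist and a bigram FreqDist:

import nltk
from nltk.probability import FreqDist
from nltk.collocations import BigramCollocationFinder

line_tokenizer = nltk.RegexpTokenizer(r'\{|\}|[^,"}]+')
seqs = ["{B,C}", "{B,A}", "{A,B,C}"]
documents = [line_tokenizer.tokenize(s) for s in seqs]

# Count words and bigrams per document so that no bigram
# ever crosses a document boundary.
word_fd = FreqDist()
bigram_fd = FreqDist()
for doc in documents:
    word_fd.update(doc)
    bigram_fd.update(nltk.bigrams(doc))

finder = BigramCollocationFinder(word_fd, bigram_fd)
bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.score_ngrams(bigram_measures.raw_freq))

The same idea extends to wider windows: gather the windowed counts within each document (e.g. with from_words on each document separately) and merge them, rather than patching the scores afterwards.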