I am using NLTK to analyse a number of distinct documents. Because of their content, the documents all tend to start and end with the same tokens.
I tokenize the documents into a list of lists and then use BigramCollocationFinder.from_documents to create the finder. When I score the ngrams by raw frequency, I notice that one of the most common bigrams is the end token followed by the start token. This suggests that the finder is running all the documents together into one sequence and finding ngrams across the whole lot, which is not what I want.
A sample of the code:
import nltk
from nltk.collocations import BigramCollocationFinder

# Tokens are "{", "}", or any run of characters other than , " and }
line_tokenizer = nltk.RegexpTokenizer(r'\{|\}|[^,"}]+')
seqs = ["{B,C}", "{B,A}", "{A,B,C}"]
documents = [line_tokenizer.tokenize(s) for s in seqs]
finder = BigramCollocationFinder.from_documents(documents)
bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.score_ngrams(bigram_measures.raw_freq))
This results in the following output:
[(('B', 'C'), 0.15384615384615385),
(('C', '}'), 0.15384615384615385),
(('{', 'B'), 0.15384615384615385),
(('}', '{'), 0.15384615384615385),
(('A', 'B'), 0.07692307692307693),
(('A', '}'), 0.07692307692307693),
(('B', 'A'), 0.07692307692307693),
(('{', 'A'), 0.07692307692307693)]
The bigram ('}', '{') shows up in the list, but it shouldn't: } and { never appear next to each other within any single document.
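For comparison, here is a minimal sketch of the counting I would expect, where pairs are only formed inside each document. It uses only the standard library rather than NLTK: re.findall stands in for the tokenizer (I am assuming it matches the tokenizer's behaviour for this pattern), and the helper name per_document_bigrams is made up for illustration.

```python
import re
from collections import Counter

# Same pattern as the tokenizer above; re.findall is assumed to be an
# equivalent stand-in for RegexpTokenizer on this input.
TOKEN_PATTERN = re.compile(r'\{|\}|[^,"}]+')


def per_document_bigrams(documents):
    """Count adjacent token pairs inside each document only (hypothetical helper)."""
    counts = Counter()
    for tokens in documents:
        # zip pairs each token with its successor in the same document,
        # so no pair can span a document boundary.
        counts.update(zip(tokens, tokens[1:]))
    return counts


seqs = ["{B,C}", "{B,A}", "{A,B,C}"]
documents = [TOKEN_PATTERN.findall(s) for s in seqs]
counts = per_document_bigrams(documents)
total = sum(counts.values())

# Relative frequency of each bigram among all within-document bigrams.
for bigram, n in counts.most_common():
    print(bigram, n / total)
```

With this approach ('}', '{') is never counted. Note that the frequencies here are normalised by the number of bigrams, not by the number of tokens, so the values will not match the raw_freq scores above.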
Is there an alternative way to approach this problem to avoid }{ showing up in the list?