
I have two corpuses/documents of survey responses (as columns in Pandas dataframes). I would like to understand which n-grams distinguish one corpus from the other, among those n-grams that also distinguish these corpuses from the English language as a whole. In other words, I want to find the unusual words (jargon) which show up in these documents more often than in English generally. That is, how did changing the prompt which generated the survey responses, or the week the participants were surveyed, change the content of their responses? This is a similar concept to NLP data drift detection. Another way to look at it would be detecting trending n-grams over time.

I would like to do this in Python or R, or perhaps in AWS. Deep learning approaches are fine.

This is not cosine similarity. This is not the same problem as taking the top 10 words in the corpuses by TF-IDF. That still gives me very common English words, even with stopword removal. Also, that does not compare the two corpuses.

1 Answer


Perhaps some more conceptual points: this is something that is sometimes done in the context of evaluating political wording/framing in texts (at least the part about determining frequent words/n-grams in one text versus another). There is a relatively popular paper calling it "Fighting Words" ("Fightin' Words", Monroe, Colaresi & Quinn 2008), which has a few ready-to-use Python implementations, e.g.:

https://github.com/jmhessel/FightingWords

https://pypi.org/project/fightin-words/ (package)

https://convokit.cornell.edu/documentation/fightingwords.html (part of the ConvoKit package)

It presents some ideas on how to find these noticeable words that differ between two corpora, for example by assuming a certain distribution for the word counts and then looking at words whose z-value falls outside a confidence interval.
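As a rough illustration, here is a minimal sketch of that idea, the weighted log-odds with an informative Dirichlet prior from the "Fightin' Words" paper, rather than the exact code in the packages above. It uses scikit-learn's CountVectorizer; `responses_a` / `responses_b` are placeholders for your two dataframe columns.

```python
# Minimal sketch of the "Fightin' Words" log-odds-ratio with informative
# Dirichlet prior (Monroe, Colaresi & Quinn 2008). Not the exact code from the
# linked packages; `responses_a` / `responses_b` are placeholder column names.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def fighting_words(responses_a, responses_b, ngram_range=(1, 2), prior_strength=0.01):
    """Return n-grams sorted by z-score: positive favours corpus A, negative corpus B."""
    cv = CountVectorizer(ngram_range=ngram_range)
    counts = cv.fit_transform(list(responses_a) + list(responses_b)).toarray()
    y_a = counts[: len(responses_a)].sum(axis=0)   # n-gram counts in corpus A
    y_b = counts[len(responses_a):].sum(axis=0)    # n-gram counts in corpus B

    # Informative Dirichlet prior proportional to the pooled frequencies
    alpha = prior_strength * (y_a + y_b)
    alpha0, n_a, n_b = alpha.sum(), y_a.sum(), y_b.sum()

    # Smoothed log-odds-ratio of each n-gram in A vs. B, and its z-score
    delta = (np.log((y_a + alpha) / (n_a + alpha0 - y_a - alpha))
             - np.log((y_b + alpha) / (n_b + alpha0 - y_b - alpha)))
    z = delta / np.sqrt(1.0 / (y_a + alpha) + 1.0 / (y_b + alpha))

    vocab = np.array(cv.get_feature_names_out())
    order = np.argsort(-z)
    return list(zip(vocab[order], z[order]))

# N-grams with |z| > 1.96 fall outside a ~95% confidence interval, e.g.:
# scores = fighting_words(df["responses_week_1"], df["responses_week_2"])
```

Only the two ends of the sorted list (large |z|) are interesting; n-grams with z near zero are used similarly in both corpora.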

I think for your use case the more difficult part is comparing each corpus against "general English", as you would need data on how often each N-gram occurs on average in English as a whole. Perhaps you could try to find language data that is not too biased and relatively large, and use it as this third kind of corpus.
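One low-effort option along those lines: reuse the same two-corpus comparison, but with a generic reference corpus standing in for "general English". The sketch below uses NLTK's Brown corpus as that background purely as an example (it is small and dated, so treat the result as indicative only); `fighting_words` refers to the sketch above.

```python
# Use a reference corpus (here NLTK's Brown corpus, chosen as an example)
# as the "general English" background and rerun the same comparison against it.
import nltk

nltk.download("brown", quiet=True)
from nltk.corpus import brown

# Join the Brown sentences back into strings so they look like survey responses
background = [" ".join(sent) for sent in brown.sents()]

# N-grams with a large positive z-score are over-represented in your corpus
# relative to the background, i.e. candidate jargon terms.
# jargon_a = fighting_words(df["responses_week_1"], background)
```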

Or you make some kind of simplification: find a large list of English N-grams with their frequencies, and then assume that all the frequent N-grams in your two custom corpora that are not in that list's top X N-grams are the special ones. E.g. you take the top 10k English N-grams and the top 1k N-grams from each of your corpora, and for each of those 1k find the ones that are not contained in the 10k. Of course this kind of exclusion by intersection will not be as precise as looking at the difference in actual frequencies.
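A minimal sketch of that exclusion-by-intersection idea; `english_ngram_freqs` is a placeholder for whatever English N-gram frequency list you obtain (assumed to be a mapping from N-gram to count), and `corpus_counts` is a mapping of N-gram counts for one of your corpora (e.g. the summed rows of the CountVectorizer output above).

```python
# Exclusion by intersection: keep the corpus's top n-grams that do not appear
# among the top English n-grams. Both arguments are assumed to be mappings
# from n-gram to count; the English list is whatever source you find.
from collections import Counter

def top_exclusive_ngrams(corpus_counts, english_ngram_freqs, k_corpus=1000, k_english=10000):
    """Return the corpus's top-k_corpus n-grams that are absent from English's top-k_english."""
    top_english = {ng for ng, _ in Counter(english_ngram_freqs).most_common(k_english)}
    return [(ng, c) for ng, c in Counter(corpus_counts).most_common(k_corpus)
            if ng not in top_english]
```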

ewz93