I have two corpora of survey responses (stored as columns in Pandas dataframes). I would like to find which n-grams distinguish one corpus from the other, restricted to the n-grams that already distinguish both corpora from English as a whole. In other words, I want the unusual words (jargon) that show up in these documents more often than they do in general English. The underlying question is: how did changing the prompt that generated the survey responses, or the week in which participants were surveyed, change the content of their responses? This is similar in spirit to NLP data drift detection; another way to frame it is detecting trending n-grams over time.
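To make the goal concrete, here is a rough sketch of the two-step comparison I have in mind. I am using Dunning's log-likelihood (G²) keyness only as a placeholder scoring function, and NLTK's Brown corpus as a stand-in for "general English"; the dataframe, column names, and toy responses are all invented for illustration and would be replaced by my real data.

```python
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import brown
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("brown", quiet=True)

# Toy stand-in for my real data: one column per survey wave.
df = pd.DataFrame({
    "week1_responses": [
        "the onboarding flow was confusing",
        "single sign on kept logging me out",
        "onboarding took too long",
    ],
    "week2_responses": [
        "the dark mode update looks great",
        "dark mode is hard to read in sunlight",
        "love the new dashboard widgets",
    ],
})

def ngram_counts(texts, ngram_range=(1, 2)):
    """Total occurrence count of each n-gram across a collection of texts."""
    vec = CountVectorizer(ngram_range=ngram_range, stop_words="english")
    X = vec.fit_transform(texts)
    totals = np.asarray(X.sum(axis=0)).ravel()
    return dict(zip(vec.get_feature_names_out(), totals))

def g2(k1, n1, k2, n2):
    """Dunning's log-likelihood (G^2) for one n-gram observed k1/n1 vs k2/n2 times."""
    e1 = n1 * (k1 + k2) / (n1 + n2)
    e2 = n2 * (k1 + k2) / (n1 + n2)
    ll = (k1 * np.log(k1 / e1) if k1 else 0) + (k2 * np.log(k2 / e2) if k2 else 0)
    return 2 * ll

counts_a = ngram_counts(df["week1_responses"].dropna())
counts_b = ngram_counts(df["week2_responses"].dropna())
counts_bg = ngram_counts([" ".join(brown.words())])  # background "general English"
n_a, n_b, n_bg = sum(counts_a.values()), sum(counts_b.values()), sum(counts_bg.values())

rows = []
for gram in set(counts_a) | set(counts_b):
    k_a, k_b = counts_a.get(gram, 0), counts_b.get(gram, 0)
    # Step 1: keep only "jargon" n-grams, i.e. ones overrepresented in the
    # survey responses relative to the background corpus.
    rate_surveys = (k_a + k_b) / (n_a + n_b)
    rate_bg = counts_bg.get(gram, 0) / n_bg
    if rate_surveys <= rate_bg:
        continue
    # Step 2: among the jargon, score which n-grams distinguish week 1 from week 2.
    rows.append({
        "ngram": gram,
        "favors": "week1" if k_a / n_a > k_b / n_b else "week2",
        "g2_week1_vs_week2": g2(k_a, n_a, k_b, n_b),
        "g2_vs_background": g2(k_a + k_b, n_a + n_b, counts_bg.get(gram, 0), n_bg),
    })

keyness = pd.DataFrame(rows).sort_values("g2_week1_vs_week2", ascending=False)
print(keyness.head(20))
```

Something along these lines gives me a ranked list per corpus, but I am open to any principled scoring (deep learning included) that captures the same idea.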
I would like to do this in Python or R, or perhaps in AWS. Deep learning approaches are fine.
This is not cosine similarity, and it is not the same problem as taking the top 10 words in each corpus by TF-IDF. That approach still surfaces very common English words, even with stopword removal, and it does not compare the two corpora against each other.
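For reference, the TF-IDF baseline I mean is roughly the following (reusing the toy `df` from the sketch above); it only looks at one corpus in isolation, so common vocabulary still dominates:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Top 10 terms of one corpus by mean TF-IDF. Even with stop_words="english",
# this never consults the other corpus or a background English corpus.
vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vec.fit_transform(df["week1_responses"].dropna())
top = np.asarray(X.mean(axis=0)).ravel().argsort()[::-1][:10]
print(vec.get_feature_names_out()[top])
```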