4

I have two text datasets. Each dataset consists of multiple sequences and each sequence can have more than one sentence.

How do I measure whether both datasets come from the same distribution?

The purpose is to claim transfer learning from one distribution to another only if the difference between the distributions is statistically significant.

I am planning to use a chi-square test, but I am not sure whether it will help for text data, considering the high degrees of freedom.

Update (example): Suppose I want to train a sentiment classification model. I train a model on the IMDb dataset and evaluate it on both the IMDb and Yelp datasets. I found that my model trained on IMDb still does well on Yelp. But the question is: how different are these datasets?

Train Dataset : https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Train.csv

Eval 1: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format?select=Valid.csv

Eval 2: https://www.kaggle.com/omkarsabnis/sentiment-analysis-on-the-yelp-reviews-dataset

Now,

  1. How different are train and eval 1?
  2. How different are train and eval 2?
  3. Is the dissimilarity between train and eval 2 due to chance? What is the statistical significance and the p-value?
Krishan Subudhi
  • I guess an example would help to understand your problem. But I guess it depends on what distribution you are talking about. If you want to assess whether a sequence could have been generated from a dataset, you can train a language model per dataset and compute the difference between the two distributions. You can also compute the vocabulary of each dataset and compare their intersection. – ygorg Nov 02 '20 at 09:59
  • Could you possibly give an example for more context? – 325 Nov 02 '20 at 20:47
  • Added a sample use case with links to datasets. – Krishan Subudhi Nov 03 '20 at 03:41
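
To make the vocabulary-comparison suggestion from the first comment concrete, here is a minimal sketch; the tokenizer and the Jaccard overlap measure are my own illustration, not something proposed in the question or comments.

import re

TOKEN = re.compile(r'[^\W\d]+')

def vocab_overlap(texts_a, texts_b):
    """Jaccard overlap between the vocabularies of two collections of texts."""
    vocab_a = {w.lower() for t in texts_a for w in TOKEN.findall(t)}
    vocab_b = {w.lower() for t in texts_b for w in TOKEN.findall(t)}
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

# e.g. vocab_overlap(imdb_train_texts, yelp_texts); values close to 1 suggest
# very similar vocabularies, values close to 0 suggest very different ones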

2 Answers

4

The question "are text A and text B coming from the same distribution?" is somehow poorly defined. For example, these two questions (1,2) can be viewed as generated from the same distribution (distribution of all questions on StackExchange) or from different distributions (distribution of two different subdomains of StackExchange). So it's not clear what is the property that you want to test.

Anyway, you can come up with any test statistic of your choice, approximate its distribution under the "single source" (null) hypothesis by simulation, and calculate the p-value of your test.

As a toy example, let's take two small corpora: two random articles from English Wikipedia. I'll do it in Python

import requests
from bs4 import BeautifulSoup

urls = [
    'https://en.wikipedia.org/wiki/Nanjing_(Liao_dynasty)',
    'https://en.wikipedia.org/wiki/United_States_Passport_Card'
]
# download each article and keep only the text of its main content div
texts = [
    BeautifulSoup(requests.get(u).text, 'html.parser')
    .find('div', {'class': 'mw-parser-output'}).text
    for u in urls
]

Now I use a primitive tokenizer to count individual words in the texts, and I use the root mean squared difference in relative word frequencies as my test statistic. You can use any other statistic, as long as you compute it consistently.

import re
from collections import Counter

# words, numbers, and punctuation marks are treated as separate tokens
TOKEN = re.compile(r'([^\W\d]+|\d+|[^\w\s])')
counters = [Counter(TOKEN.findall(t)) for t in texts]
print([sum(c.values()) for c in counters])
# [5068, 4053]: texts are of approximately the same size

def word_freq_rmse(c1, c2):
    """Root mean squared difference in relative word frequencies."""
    result = 0
    vocab = set(c1.keys()).union(set(c2.keys()))
    n1, n2 = sum(c1.values()), sum(c2.values())
    n = len(vocab)
    for word in vocab:
        result += (c1[word]/n1 - c2[word]/n2)**2 / n
    return result**0.5

print(word_freq_rmse(*counters))
# rmse is 0.001178, but is this a small or large difference?

I get a value of 0.001178, but I don't know whether this is a small or large difference. So I need to simulate the distribution of this test statistic under the null hypothesis that both texts come from the same distribution. To simulate it, I merge the two texts into one, split the tokens randomly, and calculate my statistic on the two random parts.

import random
# pool all tokens from both texts, then repeatedly reshuffle and re-split them
# at the original boundary to simulate "both parts come from one source"
tokens = [tok for t in texts for tok in TOKEN.findall(t)]
split = sum(counters[0].values())
distribution = []
for i in range(1000):
    random.shuffle(tokens)
    c1 = Counter(tokens[:split])
    c2 = Counter(tokens[split:])
    distribution.append(word_freq_rmse(c1, c2))

Now I can see how unusual the value of my observed test statistic is under the null hypothesis:

observed = word_freq_rmse(*counters)
p_value = sum(x >= observed for x in distribution) / len(distribution)
print(p_value)  # it is 0.0
print(observed, max(distribution), sum(distribution) / len(distribution)) # 0.0011  0.0006 0.0004

We see that when the texts come from the same distribution, my test statistic is on average 0.0004 and almost never exceeds 0.0006, so the observed value of 0.0011 is very unusual, and the null hypothesis that my two texts originate from the same distribution should be rejected.
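
As a side note on the asker's chi-square idea: since the answer stresses that any consistently computed statistic can be plugged into this permutation scheme, a chi-square-style homogeneity statistic could be used in place of word_freq_rmse. This is my own sketch, not part of the original answer:

def word_freq_chi2(c1, c2):
    """Chi-square-style homogeneity statistic over the combined vocabulary."""
    vocab = set(c1.keys()).union(set(c2.keys()))
    n1, n2 = sum(c1.values()), sum(c2.values())
    stat = 0
    for word in vocab:
        o1, o2 = c1[word], c2[word]
        p = (o1 + o2) / (n1 + n2)   # pooled relative frequency under the null
        e1, e2 = p * n1, p * n2     # expected counts in each text
        stat += (o1 - e1)**2 / e1 + (o2 - e2)**2 / e2
    return stat

Because the null distribution is obtained from the same shuffling procedure rather than from the asymptotic chi-square distribution, the high degrees of freedom mentioned in the question are not a problem here.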

David Dale
  • Thanks David. I have some follow-up questions if you don’t mind. What would have been the correct question to ask in this case? Does the algorithm you mentioned have a name? I would like to learn more about it. Can I use relative entropy instead of RMSE? How do I know which one is better? What about the chi-square test, will that work here? – Krishan Subudhi Nov 07 '20 at 16:16
  • 1) I don't know what the correct question is; I just mention that your problem statement is somewhat ambiguous. – David Dale Nov 07 '20 at 17:16
  • 2) This algorithm doesn't have a name; I just made it up following the general scheme of statistical tests: invent a test statistic, compute it, and compare its observed value with its distribution under the null hypothesis. I used the mean squared difference of word frequencies simply because mean squared error works well in many other cases. By the way, the chi-square test also computes something like MSE, but with a different normalization. – David Dale Nov 07 '20 at 17:19
  • 3) I don't know how to prove which test is _generally_ better. I only know how to evaluate such a test on a particular set of pairs of corpora, where some pairs come from the same distribution and other pairs come from different distributions. If you have such a benchmarking dataset, you can use it to compare tests by evaluating their false accept and false reject rates. – David Dale Nov 07 '20 at 17:21
0

I wrote an article about a problem which is similar to yours, but not exactly the same: https://towardsdatascience.com/a-new-way-to-bow-analysis-feature-engineering-part1-e012eba90ef

The problem I was trying to solve was to check whether a word has significantly different distributions across categories or labels.

There are a few similarities between your problem and the one I had mentioned above.

  • You want to compare two sources of data, which can be treated as two different categories
  • Also, to compare the data sources, you will have to compare words, since sentences can't be compared directly

So, my proposed solution is as follows:

  • Create word features across the two datasets using a count vectorizer and take the top X words from each
  • Say you have N distinct words in total. Initialize count = 0 and compare the distribution of each word; if the difference is significant, increment the counter. There will also be cases where a word exists in only one of the datasets, and that is good news: it means the word is a distinguishing feature, so increment the count for these as well
  • Say the final count is n. The lower the n/N ratio, the more similar the two texts are, and vice versa

Also, to verify this methodology, split the data from a single source into two parts (by random sampling) and run the above analysis; the n/N ratio should come out close to 0, indicating that the two data sources are similar, which indeed is the case.
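
A rough sketch of this procedure, under my own interpretation of the steps above: a plain Counter stands in for a count vectorizer, a simple frequency-ratio criterion stands in for a proper per-word significance test, and all names and thresholds are hypothetical.

import re
from collections import Counter

TOKEN = re.compile(r'[^\W\d]+')

def top_word_counts(texts, top_x=500):
    """Count words over a list of documents and keep the top X."""
    counts = Counter(w.lower() for t in texts for w in TOKEN.findall(t))
    return Counter(dict(counts.most_common(top_x)))

def dissimilarity_ratio(texts_a, texts_b, top_x=500, ratio_threshold=2.0):
    """n/N: fraction of the considered words whose relative frequencies differ strongly."""
    ca, cb = top_word_counts(texts_a, top_x), top_word_counts(texts_b, top_x)
    na, nb = sum(ca.values()), sum(cb.values())
    vocab = set(ca) | set(cb)                      # N distinct words considered
    n = 0
    for w in vocab:
        if ca[w] == 0 or cb[w] == 0:               # word appears in only one dataset
            n += 1
        elif max(ca[w] / na, cb[w] / nb) / min(ca[w] / na, cb[w] / nb) >= ratio_threshold:
            n += 1
    return n / len(vocab)

# sanity check suggested above: splitting one source in half should give a ratio close to 0
# half = len(imdb_texts) // 2
# print(dissimilarity_ratio(imdb_texts[:half], imdb_texts[half:]))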

Please let me know whether this approach works. Also, if you think there are any flaws in it, I would love to think about them and try to evolve it.

Prateek Jain