I'm having a really hard time finding the correct test for statistical significance. I'm trying to measure whether a statement by an influential figure had an effect on a particular discussion on Twitter. One way I do this is by comparing frequencies of certain keywords before and after the statement, i.e. the treatment in my case. Keywords are also grouped under themes, such as violence, ethnicity, rights-based, etc., and the change in groups matters more to me than the change in particular words: I'd like to see whether the usage of racial terms was affected, rather than the usage of the word "arab". My data looks like this:
| Keywords | Group | Pre-Treatment Group Frequency | Post-Treatment Group Frequency |
|---|---|---|---|
| [indian, arab, white] | race | 150 | 100 |
| [killing, beating] | violence | 120 | 140 |
| [civil rights, human rights, law, legal] | rights | 50 | 80 |
So far I've been reporting the simple percent change in group frequencies, which feels lacking. It's been suggested that I take the means of each keyword group and use a dependent/paired-samples t-test, but I'm hesitant, as I can't claim that the pre-treatment and post-treatment tweets come from exactly the same users. Is it fine to do that, or is there a better way to test the treatment?
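For concreteness, this is roughly what I do now, plus a sketch of the suggested paired t-test. The group totals come from the table above; the per-keyword counts in the t-test part are made up for illustration, since my real data has one pre/post count per keyword:

```python
from scipy import stats

# Group-level frequencies from the table above: (pre, post)
groups = {
    "race":     (150, 100),
    "violence": (120, 140),
    "rights":   (50, 80),
}

# The simple percent change I currently report
for name, (pre, post) in groups.items():
    pct = 100 * (post - pre) / pre
    print(f"{name}: {pct:+.1f}%")

# The suggested paired t-test on per-keyword counts within one group.
# NOTE: these per-keyword numbers are hypothetical -- they only need to
# sum to the group totals (150 pre, 100 post for "race").
pre_race = [60, 50, 40]    # indian, arab, white (pre-treatment)
post_race = [45, 30, 25]   # indian, arab, white (post-treatment)
t, p = stats.ttest_rel(pre_race, post_race)
print(f"race group: t={t:.2f}, p={p:.3f}")
```

My worry is exactly what this sketch hides: the test pairs counts by keyword, but the tweets behind the pre and post counts are not from the same users.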
I've read that ANOVA is also not recommended because the independence assumption is violated, and that's pretty much the extent of my knowledge of statistics (ANOVA test on time series data).