
In general, I want to extract the words that several CSV files (2008.csv, 2009.csv, 2010.csv, ..., 2015.csv) have in common in their shared "word" column.


All files have the same format: 'word', 'count'.

The 'word' column contains all the frequent words in one document for a particular year.


Here is a snapshot of one of the files:

file 2008.csv


As long as at least two of the 8 files share a word, I want to know that word and which files it appears in. (This is quite similar to a tf-idf calculation, by the way.)

Anyway, my goal is to see trends in which frequent words appear in those files. (To my knowledge, a word can be in at most five files.)

I also want to know when each word first appears, that is, a word that is in file C but in neither file B nor file A.
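The "first appearance" requirement above can be sketched with a running set of words already seen. This is only a minimal illustration: the function name `first_appearances` and the in-memory sample data are hypothetical stand-ins for the 'word' columns of the real files.

```python
def first_appearances(words_by_year):
    """Map each year to the words that appear in none of the earlier years."""
    seen = set()      # every word found in the years processed so far
    result = {}
    for year in sorted(words_by_year):
        current = set(words_by_year[year])
        result[year] = current - seen   # new this year
        seen |= current
    return result

# Hypothetical sample data standing in for the 'word' column of each file.
sample = {
    "2008": {"cloud", "server", "data"},
    "2009": {"cloud", "data", "storage"},
    "2010": {"cloud", "storage", "mobile"},
}

new_words = first_appearances(sample)
for year in sorted(new_words):
    print(year, sorted(new_words[year]))
```

Every word in the earliest year counts as new, since there is nothing earlier to compare it with.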

I know that for + if might solve the problem, but it is quite tedious: to find the shared words I would need to compare 2 out of 8, 3 out of 8, 4 out of 8, ... columns.

This is the code I have worked out so far. It is far from what I need, since it only compares the words in two of the 8 files: code
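One way to avoid comparing every combination of files is to turn the data around and record, for each word, the set of years it appears in. The sketch below assumes the words of each year are already loaded into sets; the names `years_by_word` and `shared_words` and the sample data are hypothetical.

```python
from collections import defaultdict

def years_by_word(words_by_year):
    """Build an index from each word to the set of years containing it."""
    index = defaultdict(set)
    for year, words in words_by_year.items():
        for word in words:
            index[word].add(year)
    return index

def shared_words(words_by_year, min_years=2):
    """Keep only the words found in at least min_years files."""
    index = years_by_word(words_by_year)
    return {word: sorted(years)
            for word, years in index.items()
            if len(years) >= min_years}

# Hypothetical sample data standing in for the 'word' column of each file.
sample = {
    "2008": {"cloud", "server", "data"},
    "2009": {"cloud", "data", "storage"},
    "2010": {"cloud", "storage", "mobile"},
}

shared = shared_words(sample)
for word in sorted(shared):
    print(word, shared[word])
```

A single pass over the files covers the 2-of-8, 3-of-8 and larger cases at once, because the length of each word's year list says how many files share it.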

Can anyone help?

ShirleyWang
  • You forgot to post the code you have so far. – Tom Karzes Feb 16 '16 at 02:11
  • 1
    Please provide the relevant information in your question. Links can be removed and we are here to help *you*. We'd appreciate it if you would make it easy. – zondo Feb 16 '16 at 02:22
  • How is this like TFxIDF? You have on file the DF but it ends there. – tripleee Feb 16 '16 at 03:05
  • Please don't post images. We need to be able to copy/paste code and data. – tripleee Feb 16 '16 at 03:12
  • I want to know the tf-idf value of each word and, at the same time, how many files (years) and which files (years) the word appears in, so that I can tell which word is a keyword for which year and track the keyword trend. Actually, those words were crawled from the IBM website, and all of them are about the topic of cloud computing. – ShirleyWang Feb 16 '16 at 03:12
  • If the file names are correct, you only have 50 words per year. It's hardly meaningful to look for "trends" in this tiny amount of data. You could print all of it on a single sheet of paper and "analyze" it with your eyeballs. – tripleee Feb 16 '16 at 03:18
  • @tripleee those 50 words per year are the most frequent words. I just thought low-frequency words can't be representative. But I know I can track the frequency change of each word to see whether the number of appearances of some word increases dramatically... But it requires coding skills... – ShirleyWang Feb 16 '16 at 03:34
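The tf-idf idea raised in the comments above could be sketched roughly as follows, treating each year as one document. This uses one common variant (term frequency as a share of the year's counts, times the natural log of years divided by years containing the word); the function name and sample counts are hypothetical, and with only the top 50 words per year on file, the document frequencies are approximate at best.

```python
import math

def tfidf_by_year(counts_by_year):
    """Score each word per year as tf * log(N / df), one document per year."""
    n_years = len(counts_by_year)

    # Document frequency: in how many years does each word occur?
    doc_freq = {}
    for counts in counts_by_year.values():
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1

    scores = {}
    for year, counts in counts_by_year.items():
        total = sum(counts.values())
        scores[year] = {
            word: (count / total) * math.log(n_years / doc_freq[word])
            for word, count in counts.items()
        }
    return scores

# Hypothetical counts standing in for the 'word' and 'count' columns.
sample_counts = {
    "2008": {"cloud": 6, "server": 4},
    "2009": {"cloud": 5, "storage": 5},
}

scores = tfidf_by_year(sample_counts)
for year in sorted(scores):
    print(year, scores[year])
```

In this variant a word that occurs in every year scores zero, so the higher-scoring words are the ones that set a year apart from the others.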

2 Answers

0

Using set intersection may help:

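import pandas as pd

# Assumption: one file per year, 2008 through 2015, as described in the question.
year_list = [str(year) for year in range(2008, 2016)]

for i in range(len(year_list)):
    datai = set(pd.read_csv('filename_' + year_list[i] + '.csv')['word'])
    tocompare = []
    for j in range(i + 1, len(year_list)):
        dataj = set(pd.read_csv('filename_' + year_list[j] + '.csv')['word'])
        # Words shared by year i and year j.
        print("Two files:", i, j)
        print(datai.intersection(dataj))
        tocompare.append(dataj)
    # Words that year i shares with every later year at once.
    print("All compare:")
    print(datai.intersection(*tocompare))
    break  # stop after the first year; remove this to repeat for every year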
platinhom
  • Thanks! But this approach is still limited to comparing two years' (or files') keywords. Is there any way to compare across all eight files? – ShirleyWang Feb 16 '16 at 03:14
  • The `intersection` method can accept multiple arguments! So you just need to read the contents of the other files as sets and pass them all to the method, like this: `datai.intersection(dataj,datak,datam....)` – platinhom Feb 16 '16 at 03:18
  • There are still some problems with the code. "All compare" in your code only looks forward, which means 2012 is compared with the combined data of 2013 through 2015 but not with 2011. This causes problems when I try to find the unique words of a particular year. For example, words appearing in 2011 but not in 2013 would be considered unique to 2012. – ShirleyWang Feb 16 '16 at 20:00
  • Yes. If you want to get the unique words, you can check that a word is not among the common words. It depends on your needs. You'd better use a `set` and `set.add` rather than a `list` and `list.append`. The latter is slower. – platinhom Feb 16 '16 at 23:32
  • I just added another for loop to bring in the previous years' data when doing the comparison. The code worked but is quite long. Maybe I should have tried set.add. – ShirleyWang Feb 16 '16 at 23:43
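The forward-only problem discussed in these comments can be avoided by comparing a year against the union of all other years, earlier and later. A minimal sketch, with a hypothetical function name and sample data:

```python
def unique_to_year(words_by_year, year):
    """Words of the given year that occur in no other year, earlier or later."""
    others = set()
    for other_year, words in words_by_year.items():
        if other_year != year:
            others.update(words)
    return set(words_by_year[year]) - others

# Hypothetical sample data: "hybrid" occurs in 2011 and 2012 but not in 2013.
sample = {
    "2011": {"cloud", "hybrid"},
    "2012": {"cloud", "hybrid", "container"},
    "2013": {"cloud", "api"},
}

unique_2012 = unique_to_year(sample, "2012")
print(sorted(unique_2012))
```

Here "hybrid" is not reported as unique to 2012 because it also occurs in 2011, which a comparison against later years only would miss.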
0

The first answer worked well in general, but for some reason the intersection function did not return exactly the results I expected. So I modified the code provided to make it more accurate and to format the printouts better.

import pandas as pd

# Assumption: one file per year, 2008 through 2015, as described in the question.
year_list = [str(year) for year in range(2008, 2016)]

def load_words(year):
    # Set of frequent words listed for one year.
    return set(pd.read_csv("most_50_common_words_" + year + ".csv")["word"])

for i in range(len(year_list)):
    datai = load_words(year_list[i])
    otheryears = set()  # words seen in any year other than year i

    # Earlier years only feed the combined set.
    for y in range(i):
        otheryears |= load_words(year_list[y])

    print("\nCompare year %s with:\n" % year_list[i])
    # Later years: print the pairwise overlap, then add them to the combined set.
    for j in range(i + 1, len(year_list)):
        dataj = load_words(year_list[j])
        shared = sorted(datai & dataj)
        print(year_list[j], ":")
        print(shared)
        print("%d common words with year %s" % (len(shared), year_list[j]))
        otheryears |= dataj

    common = sorted(datai & otheryears)   # words also found in some other year
    uniquei = sorted(datai - otheryears)  # words found in year i only
    print("\nAll compare:")
    print("%s has %d words in common with other years. They are as follows:\n%s\n"
          % (year_list[i], len(common), common))
    print("%d frequent words unique to year %s:\n%s\n"
          % (len(uniquei), year_list[i], uniquei))
ShirleyWang