
I'm trying to create a dictionary file for a large CSV file that is divided into chunks for processing, but when I create the dictionary it only does it for one chunk, and when I try to append it, it passes an empty dataframe to the new df. This is the code I used:

wdata = pd.read_csv(fileinput, nrows=0,).columns[0]
skip = int(wdata.count(' ') == 0)
dic = pd.DataFrame()
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq'))
    dic.append(dic_tmp)
dic.to_csv('newwww.csv', index=False)

If I save dic_tmp, it is just a dictionary for one chunk, not the whole set, and dic takes a lot of time to process but returns empty dataframes at the end. Is there any error in my code?

input csv is like:

[image: sample input CSV]

output csv is like:

[image: actual output CSV]

expected output should be:

[image: expected output CSV]

So it's not adding the chunks together; it's just pasting the new chunk regardless of what is in the previous chunk or the csv.

programming freak

  • I'm sorry, you should really be cautious with the python terms (e.g. "dictionary"). What you're doing here is the chunk-wise processing of a dataframe that you merge back together into a single DF. Could you please provide an example of the table that you're loading? – Oleg O Jan 20 '20 at 10:02
  • @OlegO I added some examples, hope you can understand me better now – programming freak Jan 20 '20 at 11:21

2 Answers


In order to split the column into words and count the occurrences:

df['sentences'].apply(lambda x: pd.value_counts(x.split(" "))).sum(axis=0)

or

from collections import Counter
result = Counter(" ".join(df['sentences'].values.tolist()).split(" ")).items()

Both seem to be equally slow, but probably better than your approach. Taken from here: Count distinct words from a Pandas Data Frame
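For illustration, here is a self-contained sketch of both approaches on a small made-up dataframe (the sample sentences are placeholders, not the asker's data):

import pandas as pd
from collections import Counter

# made-up sample data, standing in for one chunk of the real file
df = pd.DataFrame({'sentences': ['hello world', 'hello there', 'world of pandas']})

# approach 1: count words per row, then sum the counts over all rows
counts1 = df['sentences'].apply(lambda x: pd.value_counts(x.split(" "))).sum(axis=0)
print(counts1)  # Series of word frequencies (as floats), e.g. hello 2.0, world 2.0, ...

# approach 2: join all rows into one string and let Counter do the counting
counts2 = Counter(" ".join(df['sentences'].values.tolist()).split(" ")).items()
print(dict(counts2))  # {'hello': 2, 'world': 2, 'there': 1, 'of': 1, 'pandas': 1}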

Oleg O
  • None of the code is doing the required output: the first one has value_counts in it, which won't work unless I put str replace and stack in front of it to change the type from dataframe, and the second one converts it to a list. What I want is: after taking each chunk, check the previous chunk; if the word is there, increase the frequency of the word; if it's new, append it to the end of the file and go to the next chunk – programming freak Jan 21 '20 at 05:53
  • @programmingfreak I checked the first method myself, and it does create a list of available words with the count (unsorted, though, but this is irrelevant, I guess), i.e. exactly what you have in "expected output should be". From your explanation I really don't understand what your issue with it is. – Oleg O Jan 21 '20 at 10:27

A couple of problems that I see:

  1. Why read the csv file twice? The first time here: wdata = pd.read_csv(fileinput, nrows=0,).columns[0], and a second time in the for loop.

  2. If you aren't using the combined data frame further, I think it is better to write the chunks to the csv file in append mode, as shown below:

for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq'))
    dic_tmp.to_csv('newwww.csv', mode='a', header=False)
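Note that dic.append(dic_tmp) in the question returns a new dataframe rather than extending dic in place, which is why dic stays empty. Also, the append-mode version above still writes each chunk's counts separately, so a word that appears in several chunks shows up once per chunk in the output. If a single merged count is the goal, a minimal sketch (reusing fileinput and skip from the question, and accumulating with a Counter across chunks) could look like this:

from collections import Counter
import pandas as pd

total = Counter()
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000):
    # add this chunk's words to the running counts
    total.update(" ".join(chunk['sentences'].astype(str)).split())

# write the merged counts once, after every chunk has been seen
pd.DataFrame(list(total.items()), columns=['word', 'freq']).to_csv('newwww.csv', index=False)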
abhilb
  • I'm reading it twice: first it checks if it has a header; if it has one, it will remove it and then name the header 'sentences', because it's user-based input, so I'm not assuming that the user input will have a fixed header – programming freak Jan 20 '20 at 08:32
  • Ok. Understood. Try the optimization in the second point. Maybe it helps – abhilb Jan 20 '20 at 08:35
  • It gives the wrong output, unfortunately: it doesn't add to a word's count if it's already available in the previous chunks, it just prints it again as a new one. – programming freak Jan 20 '20 at 08:39
  • I have modified the question, if you can help more @abhilb – programming freak Jan 20 '20 at 11:34