I need to export a Dask DataFrame with 10,000,000 rows and 11 columns to a .txt file. This is my code:
import glob
import re
import gc
import dask.dataframe as dd

csv_files = glob.glob("xxx_*.csv")
# the 11 columns expected in each CSV after dropping its index column
used_cols = ["word", "word_freq", "doc_freq", "advis_word_freq", "advis_doc_freq", "story_word_freq", "story_doc_freq", "multi_word_freq", "multi_doc_freq", "other_word_freq", "other_doc_freq"]
pattern = re.compile('[\u4E00-\u9FFF]+_[\u4E00-\u9FFF]+_[\u4E00-\u9FFF]+')

dfs = []
for file in csv_files:
    df = dd.read_csv(file, encoding="UTF-8").drop("Unnamed: 0", axis=1)
    # downcast the frequency columns to int16 to save memory
    df = df.astype({"word_freq": "int16", "doc_freq": "int16", "advis_word_freq": "int16", "advis_doc_freq": "int16", "story_word_freq": "int16", "story_doc_freq": "int16", "multi_word_freq": "int16", "multi_doc_freq": "int16", "other_word_freq": "int16", "other_doc_freq": "int16"})
    # keep only the rows whose "word" matches the pattern
    df = df[df["word"].str.contains(pattern)]
    dfs.append(df)

df = dd.concat(dfs)
result = df.groupby("word").sum().reset_index()
result = result.repartition(npartitions=1000)
print(f"result npartitions = {result.npartitions}")
gc.collect()

print("Saving files...")
with open("aaa.txt", "w", encoding="UTF-8") as file:
    print("Writing data...")
    # write the result one row at a time, computing each row separately
    for i in range(len(result)):
        row = result.loc[i].compute()
        row_list = row.values.tolist()[0]
        row_str_list = [str(ele) for ele in row_list]
        text = ", ".join(row_str_list) + "\n"
        file.write(text)
        print(i)
        gc.collect()
print("Done!")
I am summing the frequencies from 200 files, and then I repartition the result into 1000 partitions, hoping that this reduces memory usage during the computation. However, the code does not run successfully: it fails with a pynvml "unknown error". How can I fix that? Also, how can I speed up the code and reduce its memory usage?
I tried exporting a 10,000,000 × 11 Dask DataFrame to a .txt file.
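For reference, this is the kind of Dask-native export I am wondering whether I should switch to instead of the row-by-row loop. It is only a minimal sketch: I am assuming dd.read_csv can take the glob pattern directly and that my Dask version supports to_csv(single_file=True).

import re
import dask.dataframe as dd

pattern = re.compile('[\u4E00-\u9FFF]+_[\u4E00-\u9FFF]+_[\u4E00-\u9FFF]+')
int_cols = ["word_freq", "doc_freq", "advis_word_freq", "advis_doc_freq",
            "story_word_freq", "story_doc_freq", "multi_word_freq",
            "multi_doc_freq", "other_word_freq", "other_doc_freq"]

# read all files in one call; dtype= applies the same int16 downcast at read time
df = dd.read_csv("xxx_*.csv", encoding="UTF-8",
                 dtype={col: "int16" for col in int_cols})
df = df.drop("Unnamed: 0", axis=1)
df = df[df["word"].str.contains(pattern)]

result = df.groupby("word").sum().reset_index()

# let Dask stream the result to disk instead of computing one row at a time;
# single_file=True concatenates the partitions into a single output file
result.to_csv("aaa.txt", single_file=True, index=False,
              header=False, sep=",", encoding="UTF-8")

Would something like this be faster and lighter on memory than computing result.loc[i] inside a Python loop?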