
My problem is exporting a Dask DataFrame with 10,000,000 rows and 11 columns to a .txt file. This is my code:

import glob
import re
import gc

import dask.dataframe as dd

csv_files = glob.glob("xxx_*.csv")
used_cols = ["word", "word_freq", "doc_freq", "advis_word_freq", "advis_doc_freq", "story_word_freq", "story_doc_freq", "multi_word_freq", "multi_doc_freq", "other_word_freq", "other_doc_freq"]
# match words made of three CJK segments joined by underscores
pattern = re.compile('[\u4E00-\u9FFF]+_[\u4E00-\u9FFF]+_[\u4E00-\u9FFF]+')

dfs = []
for file in csv_files:
    # drop the unnamed index column left over from an earlier to_csv
    df = dd.read_csv(file, encoding="UTF-8").drop("Unnamed: 0", axis=1)
    # downcast every frequency column to int16 to save memory
    df = df.astype({col: "int16" for col in used_cols if col != "word"})
    df = df[df["word"].str.contains(pattern)]
    dfs.append(df)

# combine the per-file frames into one lazy Dask DataFrame
df = dd.concat(dfs)

# sum the frequencies of each word across all files
result = df.groupby("word").sum().reset_index()
# split into 1000 partitions, hoping to lower peak memory at compute time
result = result.repartition(npartitions=1000)
print(f"result npartitions = {result.npartitions}")
gc.collect()

print("Saving files...")
with open("aaa.txt", "w", encoding="UTF-8") as file:

    print("Writing data...")
    # fetch each row individually; every .loc[...].compute() call
    # executes the task graph again for just that one row
    for i in range(len(result)):
        row = result.loc[i].compute()
        row_list = row.values.tolist()[0]
        row_str_list = [str(ele) for ele in row_list]
        text = ", ".join(row_str_list) + "\n"
        file.write(text)
        print(i)
        gc.collect()

print("Done!")

I sum the frequencies from 200 files, and then I split the result into 1,000 partitions, hoping to reduce the memory needed at compute time. However, I couldn't execute the code successfully because of a pynvml unknown error. How can I fix it? Also, how can I speed up my code and reduce its memory usage?
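In case it matters: my guess (I'm not sure, since I don't fully understand the traceback) is that the error comes from Dask's GPU diagnostics, which poll NVML through pynvml. Would turning that polling off via the distributed config be the right fix? A minimal sketch of what I mean:

import dask

# assumption on my part: the pynvml error comes from Dask's NVML-based
# GPU diagnostics; this config flag disables that polling entirely
dask.config.set({"distributed.diagnostics.nvml": False})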

I tried exporting a 10,000,000 × 11 Dask DataFrame to a .txt file.

  • pynvml error: pynvml.NVMLError_Unknown: Unknown Error – Kevin.M.t Apr 08 '23 at 15:25
  • 1
    Please show your complete exception. – mdurant Apr 08 '23 at 17:15
  • Do you mind producing a [mcve]? – rpanai Apr 17 '23 at 19:56
  • df.groupby("word").sum() should be a massive reduction, there's probably no need to to repartitionning after, and anyway this won't change the memory usage in the step before. Also your way of writing looks really not optimized, you should just use a Dask or Pandas built-in function. But in any case, as @mdurant said, we need the comple stack trace. – Guillaume EB Apr 19 '23 at 06:39

0 Answers