I want to generate an Avro file from a PySpark DataFrame, and currently I coalesce to a single partition before writing, as below:
df = df.coalesce(1)
df.write.format('avro').save('file:///mypath')
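(For context: if I drop the coalesce entirely, Spark writes one part file per partition and there is no memory pressure, but then I end up with many Avro files instead of one:

df.write.format('avro').save('file:///mypath')  # one part-*.avro file per partition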
The coalesce(1) approach is now causing memory issues, since it funnels all of the data through a single task before writing, and my data size is growing every day. So I want to write the data partition by partition, so that it goes to disk in chunks and does not raise OOM errors. I found that toLocalIterator should help with this, but I am not sure how to use it. I tried the usage below, and it returns individual rows:
it = df.toLocalIterator()
for row in it:
    print('writing some data')
    # write the row into disk/file
So the iterator yields one row at a time rather than one partition at a time. How can I write the data out in per-partition (or at least fixed-size) chunks instead?
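To make it concrete, this is roughly what I am imagining: pulling rows through toLocalIterator() (which only fetches about one partition at a time to the driver) and flushing them in fixed-size batches. Here batch_size and write_batch are placeholders I made up, not a real API:

from itertools import islice

def write_batch(rows, index):
    # placeholder: serialize this batch of rows to an Avro file,
    # e.g. with fastavro using a schema derived from df.schema
    print(f'writing batch {index} with {len(rows)} rows')

batch_size = 10000  # arbitrary; would be tuned to available driver memory
it = df.toLocalIterator()
index = 0
while True:
    batch = list(islice(it, batch_size))  # at most batch_size rows in memory
    if not batch:
        break
    write_batch(batch, index)
    index += 1

Is something like this the right direction, or is there a way to get toLocalIterator (or another API) to hand me whole partitions?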