Every day I calculate some stats and store them in a file (about 40 rows of data); the DataFrame df below is computed daily. The issue is that each day's write creates a new file, and I don't want that because Hadoop doesn't deal well with lots of small files. I can't just overwrite the file either, because I need the historic data as well.
How do I end up with one large file, i.e. write to the same master file every day instead of writing a new file daily?
I know I can use coalesce(1), I think, but I've read that it has poor performance, so I'm not sure.
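For what it's worth, with only ~40 rows a day I'd guess coalesce(1) is cheap here, so my rough plan is to append daily and compact periodically. A sketch of what I have in mind, assuming my SparkSession is called spark and mypath is the target directory:

# daily write: one small file appended per day
df.coalesce(1).write.parquet(mypath, mode='append')

# periodic compaction (say weekly): read everything back and rewrite
# it as one large file to keep the small-file count down
compacted = spark.read.parquet(mypath).coalesce(1)
compacted.write.parquet(mypath + '_compacted', mode='overwrite')
# Spark can't overwrite a path it is still reading from, so write to a
# side directory and then swap it in (e.g. with hdfs dfs -mv)

Is that a reasonable pattern, or is there a better way?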
I also want to index this file by a time column within the data. How do I achieve that? My current write is below.
df.repartition(1).write.save(mypath, format='parquet', mode='append')  # note: I removed header='true'; that option only applies to CSV, not Parquet
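For the time "index": as far as I can tell, Parquet has no index as such, and the closest equivalent is partitioning on a date column so readers filtering on it get partition pruning. Is something like this the right approach? (event_time is a stand-in for my actual timestamp column):

from pyspark.sql import functions as F

# derive a date partition key from the timestamp column
daily = df.withColumn('event_date', F.to_date('event_time'))

# one subdirectory per date; queries that filter on event_date
# will only scan the matching directories
daily.coalesce(1).write.partitionBy('event_date').parquet(mypath, mode='append')

One worry: with ~40 rows per day this still produces one tiny file per partition, so the compaction step above would presumably still be needed.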