
On a daily basis I calculate some stats and store them in a file (about 40 rows of data). The DataFrame `df` below is computed daily. The issue is that each day's run produces a new file, which I want to avoid because Hadoop doesn't deal well with many small files. I can't simply overwrite the file, as I also need the historic data.

  1. How do I write to one large master file every day, i.e. append to the same file instead of creating a new file daily?

  2. I know I could use `coalesce(1)`, but I've read that it has poor performance, so I'm not sure it's the right approach.

  3. I want to index this file by a time column within the file. How do I achieve that?

    # note: 'header' applies only to CSV, not Parquet, so it is dropped here
    df.repartition(1).write.save(mypath, format='parquet', mode='append')
    
SecretAgent
  • 40 rows of data is still small. Are you sure you need Hadoop for storing this? – OneCricketeer Jun 04 '18 at 23:52
  • Probably not. But I have the rest of the data in Hadoop, and I haven't figured out how to keep them in separate locations while still using them easily in the same problem. If you have any reference architecture I would be happy to read up :) The issue is I am storing stats at microsecond, second, and 1-hour granularity. The one-hour file is very small, but the microseconds file is large, so that probably needs Hadoop. I wanted to keep everything in the same file system without using another DB for this. Hence the question – SecretAgent Jun 05 '18 at 08:02

1 Answer


You can overwrite the same file daily by doing this (Scala): `DF.write.mode(SaveMode.Overwrite)`

  • I cannot overwrite the file, as I need the data in the old file as well. So on day 1 I will have 40 data points; on day 2 I have another 40 data points that need to be appended to the master file (40 + 40 = 80 data points) – SecretAgent Jun 05 '18 at 08:03
  • @Secret you can `df.write.partitionBy("datetime")...` to get partitioned folders – OneCricketeer Jun 05 '18 at 12:06
  • @cricket_007 : Thank you. I don't want to partition it; I want to write one large file, if that makes sense. So on a daily basis I would be appending to the same master file instead of creating a folder for each day. Hence `partitionBy` will not work – SecretAgent Jun 05 '18 at 19:10
  • @SecretAgent No, if you partition by a day folder, then you would be creating and overwriting just that date folder. For example, `data/day=20180530`, then the following day with `data/day=20180531` – OneCricketeer Jun 06 '18 at 00:34
  • @cricket_007 : Thank you. I am currently doing something similar. But I guess what I want is to append to the same file. If I do the above, I may end up with several daily files of around 30 rows each, which isn't going to be very efficient. – SecretAgent Jun 06 '18 at 14:34