I am in a situation similar to the one mentioned here, but that question was not answered satisfactorily, and I have less data to handle (around 1 GB a day).
My situation: I have a certain amount of data (~500 GB) already available as parquet (that is the "storage format" that was agreed on), and I get periodic incremental updates. I want to be able to handle the ETL part as well as the analytics part afterwards.
In order to also be able to efficiently produce updates on certain "intermediate data products", I see three options:
- save with append mode, keeping a diff dataset around until all data products have been created
- save with append mode, adding an extra `upload_timestamp` column (a rough sketch of this is at the end of the post)
- save each update to a separate folder, e.g.:

  ```
  data
  +- part_001
  |   +- various_files.parquet
  +- part_002
  |   +- various_files.parquet
  +- ...
  ```

  This way the entire dataset can be read using `data/*` as the path for `read.parquet()`.
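To make this concrete, this is roughly what I have in mind (PySpark; the paths, part names and schema below are just placeholders, not my real setup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("per-update-folders").getOrCreate()

# Write an incremental update into its own subfolder (placeholder data/schema).
update_df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "value"])
update_df.write.parquet("data/part_003")  # default save mode errors out if the folder already exists

# Read the whole dataset back over all parts ...
full_df = spark.read.parquet("data/*")

# ... or re-process only one update when rebuilding an intermediate data product.
latest_df = spark.read.parquet("data/part_003")
```

As far as I can tell, a single update can then be re-read (or dropped) in isolation without touching the rest of the dataset.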
I am tending towards the last approach, also because there are comments (evidence?) that append mode leads to problems when the number of partitions grows (see for example this SO question).
So my question: is there some serious drawback in creating a dataset structure like this? Obviously, Spark needs to do "some" merging/sorting when reading over multiple folders, but besides that?
I am using Spark 2.1.0.
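
For reference, here is a rough sketch of what I mean by the second option (append mode plus an `upload_timestamp` column); again, paths, schema and the cut-off timestamp are just placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("append-with-timestamp").getOrCreate()

# Stamp an incremental update with its upload time and append it to the one big dataset.
update_df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "value"])
(update_df
 .withColumn("upload_timestamp", F.current_timestamp())
 .write.mode("append")
 .parquet("data_appended"))

# A data product job would then only pick up rows newer than its last run.
last_processed = "2017-01-01 00:00:00"  # placeholder cut-off
fresh_rows = (spark.read.parquet("data_appended")
              .filter(F.col("upload_timestamp") > F.lit(last_processed)))
```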