I am in a situation similar to the one mentioned here, but that question was not answered satisfactorily, and I have less data to handle (around 1 GB a day).
My situation: I have a certain amount of data (~500 GB) already available as parquet (that is the "storage format" that was agreed on), and I get periodic incremental updates. I want to be able to handle the ETL part as well as the analytics part afterwards.
In order to also be able to efficiently produce updates on certain "intermediate data products", I see three options:
- save with append mode, keeping a diff dataset around until all data products have been created
- save with append mode, adding an extra `upload_timestamp` column (a rough sketch of this is at the end of the post)
- save each update to a separate folder, e.g.:

  ```
  data
  +- part_001
  |   +- various_files.parquet
  +- part_002
  |   +- various_files.parquet
  +- ...
  ```

  This way the entire dataset can be read using `data/*` as the path for `read.parquet()`.
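To make this concrete, this is roughly what I have in mind (PySpark; the paths, part names and schema below are just placeholders, not my real setup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("per-update-folders").getOrCreate()

# Write an incremental update into its own subfolder (placeholder data/schema).
update_df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "value"])
update_df.write.parquet("data/part_003")  # default save mode errors out if the folder already exists

# Read the whole dataset back over all parts ...
full_df = spark.read.parquet("data/*")

# ... or re-process only one update when rebuilding an intermediate data product.
latest_df = spark.read.parquet("data/part_003")
```

As far as I can tell, a single update can then be re-read (or dropped) in isolation without touching the rest of the dataset.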
I am tending towards the last approach, also because there are comments (evidence?) that append mode leads to problems when the number of partitions grows (see for example this SO question).
So my question: is there some serious drawback in creating a dataset structure like this? Obviously, Spark needs to do "some" merging/sorting when reading over multiple folders, but besides that?
I am using Spark 2.1.0.
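
For reference, here is a rough sketch of what I mean by the second option (append mode plus an `upload_timestamp` column); again, paths, schema and the cut-off timestamp are just placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("append-with-timestamp").getOrCreate()

# Stamp an incremental update with its upload time and append it to the one big dataset.
update_df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "value"])
(update_df
 .withColumn("upload_timestamp", F.current_timestamp())
 .write.mode("append")
 .parquet("data_appended"))

# A data product job would then only pick up rows newer than its last run.
last_processed = "2017-01-01 00:00:00"  # placeholder cut-off
fresh_rows = (spark.read.parquet("data_appended")
              .filter(F.col("upload_timestamp") > F.lit(last_processed)))
```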