
I have roughly 100 GB of data that I'm trying to process. The data has the form:

| timestamp | social_profile_id | json_payload     |
|-----------|-------------------|------------------|
| 123       | 1                 | {"json":"stuff"} |
| 124       | 2                 | {"json":"stuff"} |
| 125       | 3                 | {"json":"stuff"} |

I'm trying to split this data frame into folders in S3 by social_profile_id. There are roughly 430,000 social_profile_ids.

I've loaded the data no problem into a Dataset. However when I'm writing it out, and trying to partition it, it takes forever! Here's what I've tried:

messagesDS
      .write
      .partitionBy("socialProfileId")
      .mode(sparkSaveMode)
      .parquet(outputPath) // outputPath: the S3 destination prefix
I don't really care how many files are in each folder at the end of the job. My theory is that each node can group by the social_profile_id, then write out to its respective folder without having to do a shuffle or communicate with other nodes. But this isn't happening as evidenced by the long job time. Ideally the end result would look a little something like this:
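One thing worth knowing about `partitionBy`: each task opens a separate output file for every distinct key it holds, so with ~430,000 ids a single task can end up doing hundreds of thousands of tiny S3 writes. A common workaround is to accept one explicit shuffle and cluster rows by the partition column first, so each id lands on a single task and each task writes only its own folders. A minimal sketch, reusing the placeholder names from the snippet above (`messagesDS`, `sparkSaveMode`, `outputPath`), not a tested configuration:

    import org.apache.spark.sql.functions.col

    messagesDS
      // Shuffle once so all rows for a given id sit on the same task;
      // afterwards each task writes to far fewer partition folders.
      .repartition(col("socialProfileId"))
      .write
      .partitionBy("socialProfileId")
      .mode(sparkSaveMode)
      .parquet(outputPath)

With this many distinct ids you may also want to cap the shuffle width, e.g. `.repartition(numPartitions, col("socialProfileId"))`, to avoid one output partition per id.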

├── social_id_1 (only two partitions had id_1 data)
│   ├── partition1_data.parquet
│   └── partition3_data.parquet
├── social_id_2 (more partitions have this data in it)
│   ├── partition3_data.parquet
│   ├── partition4_data.parquet
│   └── ...
├── social_id_3
│   ├── partition2_data.parquet
│   ├── partition4_data.parquet
│   └── ...
└── ...

I've tried increasing the compute resources a few times, both increasing instance sizes and the number of instances. What I've been able to see from the Spark UI is that the majority of the time is being taken by the write operation. All of the executors appear to be in use, but they take an absurdly long time to execute (3-5 hours to write ~150 MB). Any help would be appreciated! Sorry if I mixed up some of the Spark terminology.

tlanigan
  • What are your Spark configs now? Have you tried increasing executor resources? What did you infer from the Spark App UI? – Constantine Jan 15 '19 at 03:10
  • I've tried increasing the compute resources a few times, both increasing instance sizes and the number of instances. What I've been able to see from the Spark UI is that the majority of the time is being taken by the write operation. It seems that all of the executors are being used, but they take an absurdly long time to execute (3-5 hours to write ~150 MB) – tlanigan Jan 15 '19 at 18:36
  • Dated response (https://stackoverflow.com/questions/36927918/using-spark-to-write-a-parquet-file-to-s3-over-s3a-is-very-slow/36992096#36992096), but have you tried setting mapreduce.fileoutputcommitter.algorithm.version to 2? – David Jan 15 '19 at 18:42
  • S3 is slow in general – thebluephantom Jan 15 '19 at 18:54
  • @thebluephantom If I don't add this partitioning, then it takes ~1hour to do the whole job, so I don't think the issue is with S3 – tlanigan Jan 15 '19 at 18:59
  • @David I'll give this a shot – tlanigan Jan 15 '19 at 19:20
  • The link proves my point to a large degree – thebluephantom Jan 15 '19 at 19:49
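For reference, David's suggested setting from the comments can be applied like this. This is a hedged sketch: the property name comes from the linked answer, and whether v2 helps depends on your Hadoop/EMR version; verify against your release. Version 2 has tasks commit output directly to the destination rather than going through a job-level rename pass, which is notoriously slow on S3.

    // Set the v2 output committer on the Hadoop configuration before the
    // write. Property name taken from the linked Stack Overflow answer.
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

    messagesDS
      .write
      .partitionBy("socialProfileId")
      .mode(sparkSaveMode)
      .parquet(outputPath) // outputPath is a placeholder S3 URI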

0 Answers