I'm relatively new to Spark and currently I'm wrapping my head around how to (re-)partition my data when importing log files from S3 into Spark (Parquet files).
I have a bunch of gzipped log files in S3 with the following key format: {bucket}/{YYYY-MM-DD}/{CustomerId}.log.gz. The log files range in size from <1 MB up to 500 MB.
For the import I'm running a PySpark script that does the following:
from pyspark.sql import functions as F

# load, unpack and parse the files from S3 into an RDD of Rows
# using just Python modules: boto, cStringIO and gzip
# then:
rdd = rdd.distinct()
df = sqlContext.createDataFrame(rdd, schema)
df = df.withColumn("timeDay", F.date_format(df.time, "yyyy-MM-dd"))
df.write.parquet("PATH/", mode="append", partitionBy=["timeDay"])
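For completeness, the load/unpack/parse step mentioned in the comment looks roughly like this (simplified sketch; the bucket name "my-bucket" and the parse_line helper are placeholders, not my real code):

import gzip
from cStringIO import StringIO
import boto
from pyspark.sql import Row

def load_keys(key_names):
    # runs on the executors: download, gunzip and parse each S3 object
    bucket = boto.connect_s3().get_bucket("my-bucket")  # placeholder bucket name
    for key_name in key_names:
        data = bucket.get_key(key_name).get_contents_as_string()
        for line in gzip.GzipFile(fileobj=StringIO(data)):
            yield Row(**parse_line(line))  # parse_line: my own parser, omitted here

# one S3 key per element, e.g. "2015-12-01/12345.log.gz"
key_names = [k.name for k in boto.connect_s3().get_bucket("my-bucket").list()]
rdd = sc.parallelize(key_names).mapPartitions(load_keys)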
The problems I have are (I think):
- distinct will compare everything with everything. Logically this is not an issue, but it produces a lot of shuffle. How can I remove duplicated rows WITHIN one customer and day?
- Also, distinct creates (for me) exactly 200 partitions with mixed data per customer and day. So if I import one day I get 200 partitions with very small files, but if I try to import one month I also get 200 partitions and run into exceptions: Missing an output location for shuffle 0
- I could define the number of partitions on distinct (see the snippet after this list), but how would this solve my issue?
- Without distinct I get one partition per input file, which means I have some very big partitions and some very small ones, which is already bad for parallelizing my tasks.
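To illustrate the third point, this is what I mean by setting the number of partitions on distinct (the 500 is just an arbitrary number I picked, and getNumPartitions is how I check the counts):

rdd = rdd.distinct(numPartitions=500)  # instead of the plain rdd.distinct() above
print rdd.getNumPartitions()           # 500 here, 200 with the default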
It would be really helpful if someone could show me how to split/merge my files on import to get much better parallelism, and how to solve the issue that I need unique rows per customer and day.
PS: I'm running Spark 1.5.2 with Python 2.
I would really like to keep partitionBy=["timeDay"] as the first partition level, because sometimes I need to re-import (overwrite) only some days later.
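To make that concrete: with timeDay as the top-level partition, I imagine re-importing a single day would look something like this (sketch only, the date is just an example):

# re-parse only that day from S3, then replace just that day's partition directory
day_df = df.filter(df.timeDay == "2015-12-01")
day_df.drop("timeDay").write.parquet("PATH/timeDay=2015-12-01/", mode="overwrite")

I'm not sure whether writing straight into the timeDay=... directory like this is the right approach, which is part of why I'm asking.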
Thanks in advance!