
I'm relatively new to Spark, and I'm currently wrapping my head around how to (re-)partition my data when importing log files from S3 into Spark (as Parquet files).

I have a bunch of gzipped log files in S3 in the following format: {bucket}/{YYYY-MM-DD}/{CustomerId}.log.gz. The log files range in size from under 1 MB up to 500 MB.

On import I run a pyspark script that does the following:

from pyspark.sql import functions as F

# load, unpack and parse the files from S3 into an RDD of Rows
# using just Python modules: boto, cStringIO and gzip

# then:
rdd = rdd.distinct()                          # drop exact duplicate rows (full shuffle)
df = sqlContext.createDataFrame(rdd, schema)  # schema: an explicit StructType for the log rows
df = df.withColumn("timeDay", F.date_format(df.time, "yyyy-MM-dd"))  # day derived from the timestamp
df.write.parquet("PATH/", mode="append", partitionBy=["timeDay"])    # append, partitioned by day
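
Roughly, the load/unpack/parse step looks like this — the bucket name and the parse_line helper are simplified placeholders, not my real code:

import gzip
import boto
from cStringIO import StringIO

def fetch_and_parse(key_name):
    # runs on the executors: download one .log.gz, unpack it in memory, yield Rows
    conn = boto.connect_s3()
    key = conn.get_bucket("my-log-bucket").get_key(key_name)   # placeholder bucket name
    raw = gzip.GzipFile(fileobj=StringIO(key.get_contents_as_string())).read()
    for line in raw.splitlines():
        yield parse_line(line)                                 # parse_line: builds a Row (placeholder)

def load_day(day_prefix):
    # list one day's log files on the driver, fetch/parse them on the executors
    conn = boto.connect_s3()
    keys = [k.name for k in conn.get_bucket("my-log-bucket").list(prefix=day_prefix)]
    return sc.parallelize(keys).flatMap(fetch_and_parse)

rdd = load_day("2016-01-01/")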

The problems I have are (I think):

  • distinct compares everything with everything. Logically that's not an issue, but it produces a lot of shuffle. How can I remove duplicated rows within one customer and day?
  • distinct also creates (for me) exactly 200 partitions with mixed data per customer and day. So if I import one day I get 200 partitions with very small files, but if I try to import one month I also get 200 partitions and run into exceptions like Missing an output location for shuffle 0
    • I could define the number of partitions on distinct (see the sketch after this list), but how would that solve my issue?
  • Without distinct I get one partition per input file, which means I have some very big partitions and some very small ones, which is already bad for parallelizing my tasks
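
For illustration, this is what I mean by defining the number of partitions on distinct (400 is just an arbitrary value):

# distinct() takes an explicit partition count for its shuffle,
# so the result is not stuck at 200 partitions regardless of data size
rdd = rdd.distinct(numPartitions=400)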

It would be really helpful if someone could show me how to split/merge my files on import to get much better parallelization, and how to solve the issue that I need unique rows per customer and day.

PS: I'm running Spark 1.5.2 with Python 2.

I would really like to keep partitionBy=["timeDay"] as the first partition level, as sometimes I need to re-import (overwrite) only certain days later.
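
For example, for re-importing a single day I have something like this in mind — writing straight into that day's partition directory (the exact path layout is just my assumption):

# overwrite only one day: filter it, drop the partition column
# (it is encoded in the directory name) and write into that directory
one_day = df.filter(df.timeDay == "2016-01-01").drop("timeDay")
one_day.write.parquet("PATH/timeDay=2016-01-01/", mode="overwrite")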

Thanks in advance!

mabe.berlin

1 Answer


You can use repartition on your DataFrame to either partition by columns or repartition into a specified number of partitions.

I'm not entirely sure what you're asking for re: distinct. There is no way to remove all duplicate rows without comparing every single row against every single other row.
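
A rough sketch of both variants (untested; the column names are assumed from your schema, and the column-based variant requires Spark 1.6+):

# repartition into a fixed number of partitions (full shuffle)
df = df.repartition(200)

# Spark 1.6+: repartition by columns so rows for one day/customer end up together
df = df.repartition(df.timeDay, df.CustomerId)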

Galen Long
  • Thanks for your answer! I noticed that partitioning by columns was only added in Spark 1.6 :( – mabe.berlin Apr 25 '16 at 08:40
  • With `distinct` I mean: is it possible to run it on just one input file instead of comparing across all input files, since rows of different files are never duplicated? Also, if I run distinct + repartition with 1.6, it will shuffle a lot of data twice. – mabe.berlin Apr 25 '16 at 08:44
  • If you want to call distinct on only one input file, load it into a separate RDD, call distinct, then union it with the other RDD containing the rest of the files. If you want to remove all rows in the other files that occur in the first file, you could probably do some kind of reduceByKey. As for shuffling, is there a reason you want to repartition, too? It might be best to try running your code to see if it runs fast enough that optimization isn't needed. – Galen Long Apr 25 '16 at 15:04
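
For example, the per-file distinct + union suggestion from the last comment would look roughly like this (load_file is a hypothetical helper standing in for the S3 loading code above):

# dedupe only the newly imported file, then append it to the rest
new_rdd = load_file("2016-01-01/{CustomerId}.log.gz").distinct()  # load_file: hypothetical loader
all_rows = new_rdd.union(rest_rdd)                                # rest_rdd: the already-loaded files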