
I am using Spark 2.4 on AWS EMR. I use PySpark and Spark SQL for my ELT/ETL, with DataFrames and Parquet input and output on AWS S3.

As of Spark 2.4, as far as I know, there is no way to tag or to customize the file names of output (Parquet) files. Please correct me if I am wrong.

When I store parquet output files on S3 I end up with file names which look like this:

part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet

The middle part of the file name appears to contain an embedded GUID/UUID:

part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet

I would like to know if I can obtain this GUID/UUID value from PySpark or Spark SQL at run-time, so that I can log/save/display it in a text file.

I need to log this GUID/UUID value because I may later need to remove the files that have this value in their names, for manual rollback purposes (for example, I may discover a day or a week later that this data is somehow corrupt and needs to be deleted, in which case all files tagged with this GUID/UUID can be identified and removed).

I know that I can manually partition the table on a GUID column, but then I end up with too many partitions, which hurts performance. What I need is a way to tag the files of each data load job so that I can identify and delete them easily from S3; a GUID/UUID value seems like one possible solution.

Open for any other suggestions.

Thank you

Acid Rider

2 Answers


Is this with the new "S3A-specific committer"? If so, it means they're using Netflix's trick of putting a GUID in each file name written, so as to avoid eventual-consistency problems. That doesn't help much here, though.

  1. Consider offering a patch to Spark that lets you add a specific prefix to the file name.
  2. Or, for Apache Hadoop and Spark (i.e. not EMR), an option for the S3A committers to insert that prefix when they generate temporary filenames.

Short term: well, you can always list the before-and-after state of the directory tree (tip: use FileSystem.listFiles(path, recursive) for speed), and either remember the new files or rename them (renaming will be slow; remembering the new filenames is better).
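The before-and-after listing idea can be sketched like this. This is a minimal illustration, not a real implementation: the hard-coded listings stand in for the output of a real listing call such as Hadoop's FileSystem.listFiles or boto3's list_objects_v2.

```python
# Sketch of the "before-and-after listing" approach: snapshot the object
# keys before the write job, snapshot again after, and the set difference
# is exactly the files this job created.

def new_files(before, after):
    """Return the files that appeared between the two listings."""
    return sorted(set(after) - set(before))

# Hypothetical listings taken before and after a Spark write job.
before = ["data/part-00000-aaaa.snappy.parquet"]
after = before + [
    "data/part-00001-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet",
]

created = new_files(before, after)
# 'created' holds only the files this job wrote; log it for later rollback.
```

Logging the resulting list to a manifest at the end of each load job gives you exactly the per-job file inventory the question asks for, without renaming anything.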

stevel
  • No S3A, just EMR Spark using S3. Looks like there is no good solution; a major pity, that. – Acid Rider Jan 31 '19 at 22:27
  • Guess not. The new S3A committer in the S3A codebase actually saves the list of generated files in the _SUCCESS file. It was done just for testing, but it turns out to be useful all round once you start doing live updates of existing tables. – stevel Feb 01 '19 at 13:56
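Per the comment above, the S3A committers write _SUCCESS as a small JSON manifest rather than an empty marker file. The sketch below shows how such a manifest could be parsed; the sample document and its "filenames" field are assumptions modelled on recent Hadoop output, so check the actual _SUCCESS file your cluster produces before relying on the schema.

```python
import json

# Hypothetical _SUCCESS content, standing in for what an S3A committer
# writes; the field names here are assumptions, not a guaranteed schema.
sample_success = json.dumps({
    "name": "org.apache.hadoop.fs.s3a.commit.files.SuccessData/1",
    "filenames": [
        "/out/part-00000-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet"
    ],
})

def committed_files(success_json):
    """Extract the list of files the committer reported writing."""
    data = json.loads(success_json)
    return data.get("filenames", [])

files = committed_files(sample_success)
# 'files' is the committer's own record of what this job wrote.
```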
  1. Spark already writes files with a UUID in their names. Instead of creating too many partitions, you can set up custom file naming (e.g. add some id). Maybe this is a solution for you - https://stackoverflow.com/a/43377574/1251549

  2. Not tried yet (but planning to) - https://github.com/awslabs/amazon-s3-tagging-spark-util - in theory, you can tag the files with a job id (or whatever) and then run something that deletes them by tag.

Both solutions lead to performing multiple S3 list-objects API requests, checking the tags/filenames, and deleting the files one by one.
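The cleanup step both answers imply can be sketched as follows: list the object keys, keep those whose names contain the logged job UUID, and delete that subset. The hard-coded key list is a stand-in for a real S3 listing (e.g. boto3's list_objects_v2), and the delete step is left as a comment since it depends on your S3 client.

```python
# Select the object keys belonging to one load job by the UUID embedded
# in the part-file names, so they can be batch-deleted for rollback.

def keys_for_job(keys, job_uuid):
    """Select object keys whose file names carry the given UUID tag."""
    return [k for k in keys if job_uuid in k]

# Hypothetical listing of a table's S3 prefix, mixing two load jobs.
keys = [
    "out/part-00000-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet",
    "out/part-00001-99999999-1111-2222-3333-444444444444.c000.snappy.parquet",
]

to_delete = keys_for_job(keys, "4fb6c57e-d43b-42bd-afe5-3970b3ae941c")
# 'to_delete' now contains only the bad job's files; pass it to the S3
# delete call of your choice (e.g. delete_objects in batches of 1000).
```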

Cherry