
I'm trying to read a lot of avro files into a spark dataframe. They all share the same s3 filepath prefix, so initially I was running something like:

path = "s3a://bucketname/data-files"
df = spark.read.format("avro").load(path)

which was successfully identifying all the files.

The individual files are something like:

"s3a://bucketname/data-files/timestamp=20201007123000/id=update_account/0324345431234.avro"

Upon attempting to manipulate the data, the code kept erroring out with a message that one of the files was not an Avro data file. The actual error message received is: org.apache.spark.SparkException: Job aborted due to stage failure: Task 62476 in stage 44102.0 failed 4 times, most recent failure: Lost task 62476.3 in stage 44102.0 (TID 267428, 10.96.134.227, executor 9): java.io.IOException: Not an Avro data file.

To circumvent the problem, I was able to get the explicit filepaths of the avro files I'm interested in. After putting them in a list (file_list), I was successfully able to run spark.read.format("avro").load(file_list).

The issue now is this - I'm interested in adding a number of fields to the dataframe that are part of the filepath (i.e. the timestamp and the id from the example above).

While using just the bucket and prefix filepath to find the files (approach #1), these fields were automatically appended to the resulting dataframe. With the explicit filepaths, I don't get that advantage.

I'm wondering if there's a way to include these columns while using spark to read the files.

Sequentially processing the files would look something like:

from pyspark.sql.functions import lit

dfs = []
for file in file_list:
    df = spark.read.format("avro").load(file)
    id, timestamp = parse_filename(file)
    dfs.append(df.withColumn("id", lit(id))
                 .withColumn("timestamp", lit(timestamp)))
# ...then union the per-file dataframes together

but there are over 500k files and this would take an eternity.

I'm new to Spark, so any help would be much appreciated, thanks!

femi

2 Answers


Why don't you try reading the files first using the wholeTextFiles method and adding the path name into the data itself at the beginning? Then you can filter out the file names from the data and add them as a column while creating the dataframe. I agree it's a two-step process, but it should work. To get a timestamp for a file you will need a FileSystem object, which is not serializable, i.e. it can't be used in Spark's parallelized operations, so you will have to create a local collection of files and timestamps and join it somehow with the RDD you created with wholeTextFiles.
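Very roughly, that two-step idea could look like the sketch below. Treat it as an untested outline, not working code: the bucket path is just the one from the question, and the Py4J calls into the Hadoop FileSystem are only one possible way to collect file timestamps on the driver.

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

base = "s3a://bucketname/data-files/"

# Step 1: wholeTextFiles yields (path, contents) pairs, so each record carries
# its own file path. (Contents come back as text; binary Avro would still need
# to be parsed separately.)
files_rdd = sc.wholeTextFiles(base + "*/*/*.avro")

# Step 2: the Hadoop FileSystem object is not serializable, so build a small
# local collection of (path, modification time) on the driver...
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = Path(base).getFileSystem(sc._jsc.hadoopConfiguration())
it = fs.listFiles(Path(base), True)  # recursive listing
rows = []
while it.hasNext():
    status = it.next()
    rows.append(Row(path=str(status.getPath()),
                    mod_time=status.getModificationTime()))
times_df = spark.createDataFrame(rows)

# ...and join it back against the paths that came out of step 1.
paths_df = files_rdd.map(lambda kv: Row(path=kv[0])).toDF()
joined = paths_df.join(times_df, on="path", how="left")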


Two separate things to tackle here:

Specifying Files

Spark has built-in handling for reading all files of a particular type in a given path. As @Sri_Karthik suggested, try supplying a path like "s3a://bucketname/data-files/*.avro" (if that doesn't work, maybe try "s3a://bucketname/data-files/**/*.avro"... I can't remember the exact pattern-matching syntax Spark uses). That should grab only the Avro files and get rid of the error you're hitting on non-Avro files in those paths. In my opinion this is more elegant than manually fetching the file paths and explicitly specifying them.
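For example, something along these lines (the depth of the wildcards is an assumption based on the example path in the question, one level per partition directory):

# one wildcard for timestamp=..., one for id=..., then the avro files themselves
df = spark.read.format("avro").load("s3a://bucketname/data-files/*/*/*.avro")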

As an aside, the reason you are seeing that error is likely because folders typically get marked with metadata files like _SUCCESS or .COMPLETED to indicate they are ready for consumption.

Extracting metadata from filepaths

If you check out this stackoverflow question, it shows how you can add the filename as a new column (both for Scala and PySpark). You could then use the regexp_extract function to parse out the desired elements from that filename string. I've never used Scala in Spark so can't help you there, but it should be similar to the PySpark version.
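In PySpark that could look something like the sketch below; the glob path and the regex patterns are assumptions based on the example filepath in the question, so adjust them to your real layout.

from pyspark.sql.functions import input_file_name, regexp_extract

df = (spark.read.format("avro")
      .load("s3a://bucketname/data-files/*/*/*.avro")
      # input_file_name() records the full source path for each row
      .withColumn("filename", input_file_name())
      # pull the partition values back out of the path string
      .withColumn("timestamp", regexp_extract("filename", r"timestamp=([^/]+)", 1))
      .withColumn("id", regexp_extract("filename", r"id=([^/]+)", 1)))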

ian-whitestone
  • Only additional comment is that I found I had to supply the path as: `"s3a://bucketname/data-files/*/*/*.avro"`, but I was successfully able to load in all the files and append the filename as a column. The regexp_extract function does seem to be pretty similar; looks like it's smooth sailing from here. Thanks. – femi Oct 10 '20 at 01:57