
I have a Spark job that processes some data into several individual dataframes. I store these dataframes in a list, i.e. dataframes[]. Eventually, I'd like to combine these dataframes into a hierarchical format and write the output in avro. The avro schema is something like this:

{
    "name": "mydata",
    "type": "record",
    "fields": [
        {"name": "data", "type": {
            "type": "array", "items": {
                "name": "actualData", "type": "record", "fields": [
                    {"name": "metadata1", "type": "int"},
                    {"name": "metadata2", "type": "string"},
                    {"name": "dataframe", "type": {
                        "type": "array", "items": {
                            "name": "dataframeRecord", "type": "record", "fields": [
                                {"name": "field1", "type": "int"},
                                {"name": "field2", "type": "int"},
                                {"name": "field3", "type": ["string", "null"]}
                            ]
                        }
                    }}
                ]
            }
        }}
    ]
}

As can be inferred, each dataframe has three fields, field1, field2, and field3, which I'd like to write as an array in the avro file. There's also some metadata associated with each dataframe.

My current approach is to write the dataframes to S3 once they are processed, then use a separate program to pull the data from S3, write an avro file with the avro library, and upload it to S3 again.

However, as the amount of data grows, this is becoming very slow. I've looked into the Databricks spark-avro library to write avro files directly, but I don't know how I can combine the dataframes in memory, or how that library could determine the schema I'm using.

Is there an idiomatic way to do this in Spark?

P.S. I'm using EMR with Spark 2.0.0 in Python.

James

2 Answers


If the schema is the same and you just want to put all the records into the same DataFrame, you can use the DataFrame unionAll method.

http://spark.apache.org/docs/1.6.3/api/python/pyspark.sql.html#pyspark.sql.DataFrame.unionAll

This function takes one dataframe and appends it to another. The catch is that it assumes the columns are in the same order in both dataframes, so you may need to do some work to line them up and create empty columns for any that are missing. Here is a Python function I use to safely union multiple dataframes:

from functools import reduce
from pyspark.sql import functions as F


def union_multiple_dataframes(iterable_list_df):
    input_dfs = list(iterable_list_df)

    # First figure out all the field names and their types
    field_types = {}
    for df in input_dfs:
        for field in df.schema.fields:
            # Check for type mismatch across dataframes
            if field.name in field_types:
                if field.dataType != field_types[field.name]:
                    raise ValueError("Mismatched data types when unioning dataframes for field: {}".format(field.name))
            else:
                field_types[field.name] = field.dataType

    # Next add in empty (null) columns so all df's have the same schema,
    # casting so the union keeps the right types
    fields = set(field_types.keys())
    for i, df in enumerate(input_dfs):
        missing = fields - set(df.schema.names)
        for field in missing:
            df = df.withColumn(field, F.lit(None).cast(field_types[field]))

        input_dfs[i] = df

    # Finally put all the df's columns in the same order, and do the actual union
    # (in Spark 2.0+ unionAll is a deprecated alias of union)
    sorted_dfs = [df.select(*sorted(fields)) for df in input_dfs]
    return reduce(lambda x, y: x.unionAll(y), sorted_dfs)

Example usage would be something like this:

input_dfs = [do_something(..) for x in y]
combined_df = union_multiple_dataframes(input_dfs)
combined_df.write.format("com.databricks.spark.avro").save("s3://my-bucket/path")
Ryan Widmaier
  • Thanks for taking the time to answer. My dataframes do follow the same format, but I need to wrap each dataframe with some metadata before combining them. In my question, I've listed the schema I'm using. In that schema, the data contained in my dataframes are only in the scope of the "dataframeRecord" section. How can I add the metadata fields to my dataframes before combining them? – James Jun 02 '17 at 22:08
  • Is the metadata the same for every row of an input DF? Can you just attach the metadata to each of the DFs before doing the union? – Ryan Widmaier Jun 05 '17 at 15:17
  • They're not the same, but with a bit of code I can certainly process the data before the union. However, how can I attach the metadata to a dataframe? – James Jun 06 '17 at 15:43
  • Can you elaborate on how you are generating your metadata? If it's based on your row data you can just use withColumn on your input DFs. Otherwise maybe a dataframe join would suit your needs. Hard to know without a better understanding of what you are trying to do. – Ryan Widmaier Jun 06 '17 at 19:40
  • I've come up with a solution, albeit a bit hacky. I'll post it below. – James Jun 12 '17 at 13:58
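For reference, here is a rough sketch of the withColumn idea discussed in the comments above, together with one hypothetical way to rebuild the nested "dataframe" array from the question's schema using collect_list and struct. The metadata_list pairing is an assumption for illustration, and spark-avro infers the output schema from the DataFrame, so it won't necessarily match the hand-written schema name for name:

from pyspark.sql import functions as F

# Hypothetical: metadata_list holds one (metadata1, metadata2) tuple per dataframe
tagged_dfs = [
    df.withColumn("metadata1", F.lit(meta1))
      .withColumn("metadata2", F.lit(meta2))
    for df, (meta1, meta2) in zip(dataframes, metadata_list)
]

combined_df = union_multiple_dataframes(tagged_dfs)

# Collapse each original dataframe's rows back into an array of structs
nested_df = (combined_df
             .groupBy("metadata1", "metadata2")
             .agg(F.collect_list(F.struct("field1", "field2", "field3")).alias("dataframe")))

nested_df.write.format("com.databricks.spark.avro").save("s3://my-bucket/nested-path")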

I've figured out a solution, specific to PySpark:

With each data frame, I used .collect() to get a list of Rows. For each of the Row objects, I called asDict() to get a dictionary. From there, I was able to construct a list of dictionaries with a simple loop. Once I had this list of dictionaries, the data left Spark and entered pure Python territory, where it is "easier" to handle (but less efficient).
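
A minimal sketch of that approach, assuming a hypothetical metadata_list paired with the dataframes, an Avro schema file named mydata.avsc, and the Python 2 avro package (avro.schema.parse is avro.schema.Parse in avro-python3):

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

records = []
for (meta1, meta2), df in zip(metadata_list, dataframes):
    # collect() pulls every row back to the driver; asDict() turns each Row into a plain dict
    rows = [row.asDict() for row in df.collect()]
    records.append({"metadata1": meta1, "metadata2": meta2, "dataframe": rows})

# From here on it's pure Python, so the standard avro library can write the nested file
schema = avro.schema.parse(open("mydata.avsc").read())
writer = DataFileWriter(open("mydata.avro", "wb"), DatumWriter(), schema)
writer.append({"data": records})
writer.close()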

Alternatively, had I chosen Scala over Python, I might have been able to convert the data frames into Datasets, which seem to provide a handful of methods to perform the operations that I need, but that's another story altogether.

James