
I am trying to read an Avro file using PySpark and sort one of its columns by key. One of the columns in my Avro file contains MapType data which I need to sort on its keys. The test Avro file contains only one row, with the entities column holding the MapType data. My intention is to write the output back to an Avro file, but with the map keys in sorted order. Unfortunately I am unable to achieve this, and I am not sure whether it is possible in Avro at all: the file is written back with the keys in the same order in which they appear in the input. Here is my code (I created a notebook to test it):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, from_json
from pyspark import StorageLevel
import shutil

spark = SparkSession \
    .builder \
    .appName("AvroTest") \
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.0") \
    .getOrCreate()

df = spark.read.format("avro").load("part-r-00000.avro")
schema = df.select('entities').schema
sch = schema.fields[0].dataType
print(df.schema)

@udf  # no return type given, so the UDF returns a StringType string
def udf_func(line):
    # 'line' is the outer map: entkey -> struct holding the inner 'entities' map
    for entkey, subdict in line.items():
        subdictnew = subdict.asDict(True)
        # sort the inner map numerically by key
        sorteddict = dict(sorted(subdictnew['entities'].items(), key=lambda a: int(a[0])))
        subdictnew['entities'] = sorteddict
        line[entkey] = subdictnew
    return str(line)

dfnew = df.withColumn('entities', from_json(udf_func(df['entities']), sch)).persist(StorageLevel.MEMORY_ONLY_SER)
#dfnew.show()
d = dfnew.dtypes
newschema = dfnew.schema

try:
    shutil.rmtree('testavro/sortedData')
except FileNotFoundError:
    print('folder already removed')
dfnew.write.format('avro').save('testavro/sortedData')
dfnew.show(1, False)

The above code writes the Avro file back, but with the keys unsorted. The last line prints the dataframe's entities column with the keys sorted:

|37321431529|37321431529|1561020714|[trade -> [trade, [59489777 -> [TRADE_ASSOC_TO_DB_DT -> 2011-09-30, FCBA_IN -> N, ACCT_BALANCE_AM -> 0, CII_BKRPT_CD ->   , CREDIT_AM_EXCP_CD -> 6, FRAUD_IN -> N, ACCT_REPORTED_DT -> 2019-04-01, DATA_USAGE_EXCL_IN -> N, CII_REAFF_CD ->   , DEDUP_RANK_CD -> 0, NY_DISPLAY_RULE_IN -> N, ACCT_HIGH_BALANCE_AM_EXCP_CD -> 6, ACCT_PAYMENT_AM -> 13, EXCLUSION_CD -> 0, KOB_CD -> BB, PAYMENT_GRID_2 -> 0000000-0-0000-00-00000..............

Please note that here I am printing the dataframe output, which is already sorted. But when I read the saved Avro file back into a new dataframe and call show(), the keys are unsorted again. Note the first key under trade -> [trade: it should be 59489777, yet it is something else, 51237292611. Incidentally, that key appeared first when I read the input Avro originally; I am not sure why it comes first again after sorting and writing back:

dffresh = spark.read.format("avro").load("testavro/sortedData")
schema = dffresh.schema
print(schema)
dffresh.show(1, False)

Output:

|37321431529|37321431529|1561020714|[trade -> [trade, [51237292611 -> [TRADE_ASSOC_TO_DB_DT -> 2014-09-20, FCBA_IN -> N, ACCT_BALANCE_AM -> 0, CII_BKRPT_CD ->   , CREDIT_AM_EXCP_CD -> 6, FRAUD_IN -> N, ACCT_REPORTED_DT -> 2019-05-01, DATA_USAGE_EXCL_IN -> N, CII_REAFF_CD ->   , DEDUP_RANK_CD -> 0, NY_DISPLAY_RULE_IN -> N, ACCT_HIGH_BALANCE_AM_EXCP_CD -> 6, ACCT_PAYMENT_AM -> 0, EXCLUSION_CD -> 0, KOB_CD -> BC, PAYMENT_GRID_2 -> 000000C0000000..................................
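For reference, here is a more direct way to check the key order than reading it off the show() output (a minimal sketch; it assumes the outer key is trade and that the inner map sits in a struct field also named entities, as in my UDF above):

from pyspark.sql.functions import map_keys, col

# Print the inner map's keys in the order Spark holds them after the re-read
dffresh.select(map_keys(col("entities")["trade"]["entities"]).alias("inner_keys")).show(1, False)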

Could anyone please help me out? I have tried numerous approaches and searched across multiple SO questions, and I could not find a clue on how to achieve this.

ArinCool

1 Answer


If your source data is in Avro format, it is general best practice to write your processed output in the Parquet file format: you get the benefit of predicate pushdown and can always read just the columns you need.

But if writing to Avro again is part of your process, note that the order of keys is not guaranteed, since the underlying data structure is a Map. You can mitigate this by using a select and reading the columns in the order of your choice.
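If the order has to survive the write itself, one option is to store the inner map as a sorted array of key/value structs, since Avro arrays preserve element order while Avro maps do not. A minimal sketch, untested against your exact schema and assuming Spark 2.4+ (for map_entries, transform and sort_array) plus the field names shown in your question:

from pyspark.sql.functions import expr

# Sketch: replace the inner map with an array of (key, value) structs sorted
# numerically by key; arrays keep their order through an Avro round trip.
df_sorted = df.withColumn(
    "entities_sorted",
    expr("""
        sort_array(
            transform(
                map_entries(entities['trade'].entities),
                e -> named_struct('key', CAST(e.key AS LONG), 'value', e.value)
            )
        )
    """)
)
df_sorted.write.format("avro").save("testavro/sortedData")

Note that the schema changes from a map to an array of structs, so downstream readers would address entries by position or by the key field rather than by map lookup.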

Yayati Sule
  • Thanks @Yayati. My problem is that one of the columns contains json data and within that there are multiple attributes which I need to sort. My question is if I use a `select` function, will the attributes within the column data appear sorted? – ArinCool Jun 17 '20 at 14:00
  • Yeah they will appear sorted – Yayati Sule Jun 17 '20 at 14:14
  • No, it doesn't; I tried it out. The problem is that when the newly created Avro is saved and then read back, the data does not appear sorted. This code saves the new Avro and prints the data from the dataframe, which appears sorted: `dfnew.write.format('avro').save('testavro/sortedData') dfnew.show(1, False)`. While this does not appear sorted: `dffresh = spark.read.format("avro").load("testavro/sortedData") dffresh.select('entities').show(1, False)`. The two outputs appear just like I provided in my question. Maybe during the write operation the data gets unsorted again? – ArinCool Jun 17 '20 at 14:56
  • No, what I meant is that you have to explicitly pass the column names in the schema. Since entities is a Map/Struct, you need to explicitly explode the column values and select them, e.g. `explode(col(entities)) as e` and then `df.select($"e.column_name")` (see the sketch after these comments) – Yayati Sule Jun 17 '20 at 15:58
  • It is still not working. Could you please provide a working code snippet that reads the nested Map type column in sorted order? – ArinCool Jun 17 '20 at 17:21
  • I’ll try to get you one in sometime – Yayati Sule Jun 17 '20 at 17:40
  • Hi, you can follow this example to get a feel for the [explode function](https://sparkbyexamples.com/pyspark/pyspark-explode-array-and-map-columns-to-rows/) – Yayati Sule Jun 17 '20 at 17:48
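A short sketch of the explode approach from the comments above (column names are assumptions based on the schema in the question; explode on a MapType column yields one row per key/value pair):

from pyspark.sql.functions import explode, col

# One row per (key, value) pair of the inner map, with an explicit ordering
exploded = df.select(explode(col("entities")["trade"]["entities"]).alias("key", "value"))
exploded.orderBy(col("key").cast("long")).show(truncate=False)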