I am trying to read an Avro file using PySpark and sort one of the columns based on certain keys. One of the columns in my Avro file contains MapType data which I need to sort by its keys. The test Avro file contains only one row, with the entities column holding the MapType data. My intention is to write the output back to an Avro file, but with the keys ordered. Unfortunately I am unable to achieve this, and I am not sure it is even possible in Avro: the file is written back in the same order in which the input appears.
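To illustrate what I mean by ordering, here is a toy example in plain Python (made-up keys; I need them compared as integers, not strings):
# Made-up keys: sort the map entries numerically by key
unsorted_entities = {'59489777': 'a', '123': 'b', '51237292611': 'c'}
sorted_entities = dict(sorted(unsorted_entities.items(), key=lambda kv: int(kv[0])))
print(sorted_entities)  # {'123': 'b', '59489777': 'a', '51237292611': 'c'}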
Here is my code (I have created a notebook to test it):
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, lit, to_json, create_map, from_json
from pyspark.sql import Row
from pyspark import StorageLevel
import json
import shutil
from pyspark.sql.types import MapType, ArrayType, StringType, StructType, StructField
spark = SparkSession \
    .builder \
    .appName("AvroTest") \
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.0") \
    .getOrCreate()
df = spark.read.format("avro").load("part-r-00000.avro")
schema = df.select('entities').schema
sch = schema.fields[0].dataType
print(df.schema)
@udf(returnType=StringType())
def udf_func(line):
    # For each top-level map entry, sort the nested 'entities' map numerically by key
    for entkey, subdict in line.items():
        subdictnew = subdict.asDict(True)
        sorteddict = dict(sorted(subdictnew['entities'].items(), key=lambda a: int(a[0])))
        subdictnew['entities'] = sorteddict
        line[entkey] = subdictnew
    # Stringify so from_json can parse it back (Spark's JSON parser accepts
    # single-quoted keys/values by default via allowSingleQuotes)
    return str(line)
dfnew = df.withColumn('entities', from_json(udf_func(df['entities']), sch)).persist(StorageLevel.MEMORY_ONLY_SER)
#dfnew.show()
d = dfnew.dtypes
newschema = dfnew.schema
try:
    shutil.rmtree('testavro/sortedData')
except OSError:
    print('folder already removed')
dfnew.write.format('avro').save('testavro/sortedData')
dfnew.show(1, False)
The above code writes the Avro back, but unsorted. The last line prints the dataframe's entities column in sorted order:
|37321431529|37321431529|1561020714|[trade -> [trade, [59489777 -> [TRADE_ASSOC_TO_DB_DT -> 2011-09-30, FCBA_IN -> N, ACCT_BALANCE_AM -> 0, CII_BKRPT_CD -> , CREDIT_AM_EXCP_CD -> 6, FRAUD_IN -> N, ACCT_REPORTED_DT -> 2019-04-01, DATA_USAGE_EXCL_IN -> N, CII_REAFF_CD -> , DEDUP_RANK_CD -> 0, NY_DISPLAY_RULE_IN -> N, ACCT_HIGH_BALANCE_AM_EXCP_CD -> 6, ACCT_PAYMENT_AM -> 13, EXCLUSION_CD -> 0, KOB_CD -> BB, PAYMENT_GRID_2 -> 0000000-0-0000-00-00000..............
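For what it's worth, the key order can also be checked programmatically instead of eyeballing show(); a small sketch (map_keys has been available since Spark 2.3, and the nested field names are taken from the output above, so adjust if your schema differs):
from pyspark.sql.functions import map_keys

# Keys of the nested map under entities -> trade -> entities, in storage order
dfnew.select(map_keys(dfnew['entities']['trade']['entities'])).show(1, False)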
Please note, here I am printing the dataframe output, which was already sorted. But when I read the saved Avro file back into a new dataframe and do a show(), the keys are unsorted again (there is also a small self-contained repro after the output below). Note the first key under trade -> [trade: it should have been 59489777, whereas it is something else - 51237292611. By the way, this key was the first one appearing when I read the input Avro originally; I am not sure why, after sorting and writing back, the same key is printed first:
dffresh = spark.read.format("avro").load("testavro/sortedData")
schema = dffresh.schema
print(schema)
dffresh.show(1, False)
Output:
|37321431529|37321431529|1561020714|[trade -> [trade, [51237292611 -> [TRADE_ASSOC_TO_DB_DT -> 2014-09-20, FCBA_IN -> N, ACCT_BALANCE_AM -> 0, CII_BKRPT_CD -> , CREDIT_AM_EXCP_CD -> 6, FRAUD_IN -> N, ACCT_REPORTED_DT -> 2019-05-01, DATA_USAGE_EXCL_IN -> N, CII_REAFF_CD -> , DEDUP_RANK_CD -> 0, NY_DISPLAY_RULE_IN -> N, ACCT_HIGH_BALANCE_AM_EXCP_CD -> 6, ACCT_PAYMENT_AM -> 0, EXCLUSION_CD -> 0, KOB_CD -> BC, PAYMENT_GRID_2 -> 000000C0000000..................................
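Here is the small self-contained repro mentioned above: a tiny MapType column round-tripped through Avro (the data and path are made up), to check whether the round trip preserves map key order at all:
from pyspark.sql import Row

# Build a one-row dataframe with a map column in a known insertion order
tiny = spark.createDataFrame([Row(m={'2': 'b', '1': 'a', '10': 'c'})])
tiny.show(1, False)  # order as built

# Round-trip through Avro and compare the printed key order
tiny.write.format('avro').mode('overwrite').save('testavro/tinyMap')
spark.read.format('avro').load('testavro/tinyMap').show(1, False)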
Could anyone please help me out? I have tried numerous approaches and searched across multiple SO questions, but could not find a clue on how to achieve this.