I am running some data quality checks using PySpark, and I want to write all of the results to a txt file. The basic code logic is as follows:
def data_quality_check(df):
    output = ''
    output += func1(df)  # func1 and func2 return check results as strings
    output += func2(df)
    return output
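For context, the returned string would then be written to the txt file roughly like this. This is only a minimal sketch: `write_report` and the file name are placeholders I made up for illustration, not part of my actual code.

```python
def write_report(output, path):
    """Write the accumulated check results to a plain-text file."""
    # 'w' overwrites any existing file; use 'a' to append instead.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(output)

# Intended usage (df is the PySpark DataFrame being checked):
#   write_report(data_quality_check(df), 'data_quality_report.txt')
```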
The challenge I encountered is how to output aggregation results from a PySpark DataFrame. For example, I want to output a groupBy/count result using the following code:
output += 'Counts group by device type is : ' + str(df.groupBy('DEVICE_TYPE').count().show()) + '\n'
The output below is not what I expected:
Counts group by device type is : None
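As far as I can tell, `show()` prints the table to stdout and returns `None`, which would explain the output above. One workaround I am considering is to collect the rows and format them myself. Below is a minimal sketch of that idea: `format_counts` is a hypothetical helper, and the rows are made-up stand-ins for what `collect()` would return, so it runs without a Spark session.

```python
def format_counts(label, rows):
    """Render (value, count) pairs as text, one pair per line."""
    lines = [label]
    for value, count in rows:
        lines.append('  {}: {}'.format(value, count))
    return '\n'.join(lines) + '\n'

# With PySpark, the pairs would come from collect(), which returns Row
# objects to the driver (unlike show(), which only prints):
#   rows = [(r['DEVICE_TYPE'], r['count'])
#           for r in df.groupBy('DEVICE_TYPE').count().collect()]
rows = [('phone', 3), ('tablet', 1)]  # made-up stand-in data

output = format_counts('Counts group by device type is :', rows)
```

Since `collect()` pulls every row to the driver, I assume this is only reasonable when the grouped result is small, as it should be for a handful of device types.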
Thanks for any suggestions in advance!