
I am running some data quality checks using PySpark, and I want to write all of the results to a single txt file. The basic code logic is as follows:

def data_quality_check(df):
    output = ''
    output += func1(df) # func1 and func2 return check results as strings
    output += func2(df)
    return output 

The challenge is how to output aggregation results from a PySpark DataFrame. For example, I want to output a groupBy/count result using the following code:

output += 'Counts group by device type is : ' + str(df.groupBy('DEVICE_TYPE').count().show()) + '\n'

The output below is not what I expected:

Counts group by device type is : None

Thanks for any suggestions in advance!

CathyQian
  • Another way to ask this question: how can I save the results of xx.show() to a text file? Thanks! – CathyQian Mar 26 '21 at 22:01
  • df.groupBy('DEVICE_TYPE').count().write.format('csv').save('test', mode="overwrite") should write into a file in csv format. Are you facing any issue with that? – Hussain Bohra Mar 26 '21 at 22:07
  • @HussainBohra Yes, I can do that. Is there any way to write multiple such strings into the same csv or txt file as the code runs? Thanks again! – CathyQian Mar 26 '21 at 23:00
  • Can you provide an example of your input data and output file you are looking for? – Hussain Bohra Mar 27 '21 at 01:04
  • Does this answer your question? [Saving result of DataFrame show() to string in pyspark](https://stackoverflow.com/questions/55653609/saving-result-of-dataframe-show-to-string-in-pyspark) – mck Mar 27 '21 at 07:01
  • @mck Yes, that answered my question. Thank you all! – CathyQian Mar 30 '21 at 17:20
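
For readers who land here: the `None` in the output appears because `show()` prints the table to the console and returns nothing, so `str(...)` wraps `None`. Below is a minimal sketch of one possible workaround, which captures the printed text instead. It assumes `show()` writes through Python's `print` (which appears to be the case in current PySpark releases, but is worth verifying on your version). `capture_printed`, `fake_show`, and the file name `data_quality_report.txt` are hypothetical names introduced only for this illustration; `fake_show` stands in for `show()` so the sketch runs without a Spark session, and its table contents are made-up sample values.

```python
import io
import os
import tempfile
from contextlib import redirect_stdout


def capture_printed(func, *args, **kwargs):
    """Call func and return everything it printed to stdout as one string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


def fake_show():
    # Hypothetical stand-in for DataFrame.show(): it prints a table and
    # returns None, which is the behaviour that produced "None" above.
    print("+-----------+-----+")
    print("|DEVICE_TYPE|count|")
    print("+-----------+-----+")
    print("|      phone|    3|")
    print("|     tablet|    1|")
    print("+-----------+-----+")


# With a real DataFrame this would be (assumption: show() uses print):
#   capture_printed(df.groupBy('DEVICE_TYPE').count().show)
output = "Counts group by device type is :\n" + capture_printed(fake_show)

# Write the accumulated report string to a text file in the temp directory.
report_path = os.path.join(tempfile.gettempdir(), "data_quality_report.txt")
with open(report_path, "w") as f:
    f.write(output)
```

Because the helper returns a plain string, each check function can keep appending to `output` as in the original `data_quality_check`, and the file is written once at the end.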

0 Answers