I'm using pyarrow to read Parquet data from S3, and I'd like to parse the schema and convert it into a format suitable for running an MLeap-serialized model outside of Spark.
If I had a PySpark DataFrame, I could do this:
    test_df = spark.read.parquet(test_data_path)
    schema = [{"name": field.simpleString().split(":")[0],
               "type": field.simpleString().split(":")[1]}
              for field in test_df.schema]
How can I achieve the same if I read the data using pyarrow instead? Also, from the Spark DataFrame I can obtain the rows in a format suitable for model evaluation by doing the following:
    rows = [[field for field in row] for row in test_df.collect()]
How can I achieve something similar using pyarrow?
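For reference, here is roughly what I'm imagining on the pyarrow side, but it's only a guess: I'm not sure whether the Arrow type names ('int64', 'double', etc.) line up with the Spark type names MLeap expects, or whether Table.to_pylist() (available in newer pyarrow versions) is the sensible way to pull out the rows:

    import pyarrow.parquet as pq

    # Read the Parquet data into an Arrow Table. I'm assuming the S3 access
    # side is already handled (an s3:// URI or a filesystem object passed in).
    table = pq.read_table(test_data_path)

    # Each Arrow field exposes .name and .type; str(field.type) yields names
    # like 'int64' or 'double', which may not match Spark's simpleString names.
    schema = [{"name": field.name, "type": str(field.type)} for field in table.schema]

    # Table.to_pylist() returns one dict per row; older pyarrow versions would
    # need to_pydict() or to_pandas() instead.
    rows = [list(row.values()) for row in table.to_pylist()]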
Thanks in advance for your help.