
I'm using pyarrow to read Parquet data from S3, and I'd like to parse the schema and convert it to a format suitable for running an MLeap-serialized model outside of Spark.

This requires parsing the schema.

If I had a PySpark DataFrame, I could do this:

test_df = spark.read.parquet(test_data_path)
schema = [ { "name" : field.simpleString().split(":")[0], "type" : field.simpleString().split(":")[1] }
for field in test_df.schema ]

How can I achieve the same if I read the data using pyarrow instead? Also, for the Spark DataFrame I can obtain the rows in a format suitable for model evaluation by doing the following:

rows = [[field for field in row] for row in test_df.collect()]

How can I achieve a similar thing using pyarrow?

Thanks in advance for your help.

femibyte

1 Answer


If you want to get the schema, you can do the following with pyarrow.parquet:

import pyarrow.parquet as pq
# Read the Parquet data into a pyarrow Table
table = pq.ParquetDataset(<path to file>).read_pandas()
schema = table.schema
# Map each column name to its pyarrow data type
schemaDict = {name: dtype for name, dtype in zip(schema.names, schema.types)}

This will give you a dictionary mapping column names to pyarrow data types.
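
If you also want the schema in the same list-of-dicts shape as your PySpark snippet, and the rows as plain Python lists, here is a minimal sketch building on the table above. Note that str() renders pyarrow's own type names (e.g. "int64", "string"), which won't always match Spark's simpleString() names, so you may need a small mapping step before feeding them to MLeap:

# Schema as [{"name": ..., "type": ...}], mirroring the PySpark example;
# str(dtype) gives pyarrow's type names, not Spark's.
schema_list = [{"name": name, "type": str(dtype)}
               for name, dtype in zip(schema.names, schema.types)]

# Rows as plain lists, analogous to test_df.collect();
# going through pandas is one straightforward route.
rows = [list(row) for row in table.to_pandas().itertuples(index=False, name=None)]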

Douglas Daly