I need to convert analytics data from JSON to Parquet, in two steps. For the large amount of existing data I am writing a PySpark job that does:
df.repartition(*partitionby) \
  .write.partitionBy(partitionby) \
  .mode("append") \
  .parquet(output, compression=codec)
However, for incremental data I plan to use AWS Lambda. PySpark would probably be overkill for that, so I plan to use PyArrow instead (I am aware that it unnecessarily involves Pandas, but I couldn't find a better alternative). So, basically:
import pyarrow.parquet as pq

pq.write_table(table, outputPath, compression='snappy',
               use_deprecated_int96_timestamps=True)
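
In the Lambda function I imagine the handler looking roughly like the sketch below; the event shape, the /tmp path, and the pandas conversion step are my assumptions, not a settled design:

import json
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def handler(event, context):
    # assume the incremental records arrive as JSON strings in the event
    records = [json.loads(r) for r in event["records"]]
    df = pd.DataFrame(records)

    # convert via pandas and keep int96 timestamps to match what Spark writes
    table = pa.Table.from_pandas(df)
    pq.write_table(table, "/tmp/part-0000.snappy.parquet", compression='snappy',
                   use_deprecated_int96_timestamps=True)
    # the file would then be uploaded to the partitioned S3 prefix (e.g. with boto3)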
What I want to know is: will the Parquet files written by PySpark and by PyArrow be compatible with each other, as far as Athena is concerned?