I have a parquet file with 11 columns. I tried the following ways of reading the file in PySpark, but it still assigns header names like Prop_0, Prop_1, Prop_2 instead of using the original header row:
1. spark.read.parquet("/FileStore/tables/Order.parquet").show()
2. dfpq_new = spark.read.format("parquet").load("/FileStore/tables/Order-1.parquet")
3. dfpq_new = spark.read.format("parquet").option("header", True).option("inferSchema", True).load("/FileStore/tables/Order-1.parquet")
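From what I understand, parquet is a self-describing format, so the header and inferSchema options apply to sources like CSV and are ignored by the parquet reader; the column names come from the schema stored in the file itself. As a quick check (a minimal sketch, reusing the path from attempt 2), the schema embedded in the file can be printed:

# Print the schema embedded in the parquet file itself;
# the names come from the file's footer, not from a header row.
df = spark.read.parquet("/FileStore/tables/Order-1.parquet")
df.printSchema()
print(df.columns)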
However, when I create a DataFrame myself, save it as a parquet file, and then read it back, the column names come through correctly:
data1 = (("Bob", "IT", 4500), \
("Maria", "IT", 4600), \
("James", "IT", 3850), \
("Maria", "HR", 4500), \
("James", "IT", 4500), \
("Sam", "HR", 3300), \
("Jen", "HR", 3900), \
("Jeff", "Marketing", 4500), \
("Anand", "Marketing", 2000),\
("Shaid", "IT", 3850) \
)
col = ["Name", "MBA_Stream", "SEM_MARKS"]
marks_pq_df = spark.createDataFrame(data1, col)
marks_pq_df.write.parquet("/FileStore/table/markspq.parquet", mode='overwrite')
spark.read.format("parquet").load("/FileStore/table/markspq.parquet").show()
I am using Databricks Community Edition.