
I have a parquet file with 11 columns. I tried the ways below to read the file in PySpark, but it still assigns generated column names like Prop_0, Prop_1, Prop_2 instead of using the actual column names from the file as the header.

1.

spark.read.parquet("/FileStore/tables/Order.parquet").show()
dfpq_new=spark.read.format("parquet").load("/FileStore/tables/Order-1.parquet")
dfpq_new=spark.read.format("parquet").option("header", True).option("inferSchema", True).load("/FileStore/tables/Order-1.parquet")

The columns still come back as Prop_0, Prop_1, … instead of the column names from the parquet file.

However, when I create a DataFrame, save it as a parquet file, and then read it back:

data1 = (
    ("Bob", "IT", 4500),
    ("Maria", "IT", 4600),
    ("James", "IT", 3850),
    ("Maria", "HR", 4500),
    ("James", "IT", 4500),
    ("Sam", "HR", 3300),
    ("Jen", "HR", 3900),
    ("Jeff", "Marketing", 4500),
    ("Anand", "Marketing", 2000),
    ("Shaid", "IT", 3850),
)
col = ["Name", "MBA_Stream", "SEM_MARKS"]
marks_pq_df = spark.createDataFrame(data1, col)
marks_pq_df.write.parquet("/FileStore/table/markspq.parquet", mode='overwrite')

spark.read.format("parquet").load("/FileStore/table/markspq.parquet").show()

it reads the headers from the parquet file correctly.

I am using Databricks Community Edition.

  • There is most likely an issue in how the first file was created. However, without a reproducible example we cannot help much further. – ScootCork May 02 '23 at 20:14
  • As @ScootCork said, it's probably an error in the file. As a second idea, maybe try opening it with `.schema(schema)`. – Memristor May 03 '23 at 15:37
  • You mean this, @Memristor? It did not work: `from pyspark.sql.types import *` `orderSchema = StructType([StructField("Region", StringType(), True), StructField("Country", StringType(), True), StructField("ItemType", StringType(), True), StructField("SalesChannel", StringType(), True), ...])` `df = spark.read.parquet("/FileStore/tables/Order-1.parquet", schema=orderSchema)` – moonchild May 13 '23 at 20:59

0 Answers