
I have a parquet file with 11 columns. I tried the ways below to read the file in PySpark, but it still assigns generated column names like Prop_0, Prop_1, Prop_2 instead of using the actual column names from the file as the header.

1.

spark.read.parquet("/FileStore/tables/Order.parquet").show()
dfpq_new=spark.read.format("parquet").load("/FileStore/tables/Order-1.parquet")
dfpq_new=spark.read.format("parquet").option("header", True).option("inferSchema", True).load("/FileStore/tables/Order-1.parquet")

The columns still come back as Prop_0, Prop_1, … instead of the column names from the parquet file.

However, when I create a DataFrame, save it as a parquet file, and then read it back:

data1 = (
    ("Bob", "IT", 4500),
    ("Maria", "IT", 4600),
    ("James", "IT", 3850),
    ("Maria", "HR", 4500),
    ("James", "IT", 4500),
    ("Sam", "HR", 3300),
    ("Jen", "HR", 3900),
    ("Jeff", "Marketing", 4500),
    ("Anand", "Marketing", 2000),
    ("Shaid", "IT", 3850),
)
col = ["Name", "MBA_Stream", "SEM_MARKS"]
marks_pq_df = spark.createDataFrame(data1, col)
marks_pq_df.write.parquet("/FileStore/table/markspq.parquet", mode='overwrite')

spark.read.format("parquet").load("/FileStore/table/markspq.parquet").show()

it reads the headers from the parquet file correctly.

I am using Databricks Community Edition.

  • There is most likely an issue in how the first file was created. However, without a reproducible example we cannot help much further. – ScootCork May 02 '23 at 20:14
  • As @ScootCork said, it's probably an error in the file. As a second idea, maybe try opening it with `.schema(schema)`. – Memristor May 03 '23 at 15:37
  • You mean this, @Memristor? It did not work: `from pyspark.sql.types import *` `orderSchema = StructType([StructField("Region", StringType(), True), StructField("Country", StringType(), True), StructField("ItemType", StringType(), True), StructField("SalesChannel", StringType(), True), ...])` `df = spark.read.parquet("/FileStore/tables/Order-1.parquet", schema=orderSchema)` – moonchild May 13 '23 at 20:59

0 Answers