
I have the following code:

df.show(3)          # show() prints the rows itself and returns None, so no print() is needed
print(df.columns)   # confirms the column names exist on the DataFrame

df.select('port', 'key', 'return_b', 'return_a', 'return_c', 'return_d', 'return_g') \
    .write.format("parquet").save("qwe.parquet")

For some reason this doesn't write the DataFrame to the parquet file with its headers. The print statement above shows that those columns exist, but the parquet file doesn't appear to have those headers.

I have also tried:

df.write.option("header", "true").mode("overwrite").parquet(write_folder)
– qwerty
1 Answer


You may find df.to_parquet(...) more convenient.

If you wish to project down to selected columns, do that first, and then write to parquet.
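
As a rough sketch of what that could look like in pandas (the paths below are placeholders, and to_parquet needs pyarrow or fastparquet installed):

import pandas as pd

cols = ['port', 'key', 'return_b', 'return_a', 'return_c', 'return_d', 'return_g']

df = pd.read_csv('input.csv')        # assumed source; any pandas DataFrame works
df[cols].to_parquet('qwe.parquet')   # project down to the columns first, then write

# reading it back shows the column names are stored in the parquet file itself
print(pd.read_parquet('qwe.parquet').columns.tolist())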

– J_H
  • I am working with pyspark, not pandas. Unless you're recommending converting the pyspark DataFrame to pandas and then doing this, that won't work – qwerty Oct 18 '22 at 00:56
  • Wow! The pyspark API sounds inconvenient. Whenever I've written parquet format and round-tripped it, I have found it to always Just Work. Ok, you have my sympathies. In the docs ( https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.parquet.html ) I only got as far as the call to df._jdf.write(). I assume that one can "append" many times. Maybe your initial call should specify "overwrite" mode, which emits both column def headers _and_ data rows? It's worth a try. – J_H Oct 18 '22 at 01:27
  • Alright, apparently the problem was converting the parquet to CSV (I was doing it with some online tool), but when I read it using pyspark or pandas the columns do come through. It's weird, maybe this is just how parquet files are supposed to work lol – qwerty Oct 18 '22 at 17:42
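
That resolution matches how parquet behaves: the schema (column names and types) is stored inside the file, so it always round-trips on read, and the "header" option only applies to text formats like CSV. A minimal PySpark round-trip sketch (the paths and sample rows here are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# tiny illustrative DataFrame; replace with your own
df = spark.createDataFrame([(8080, 'a'), (9090, 'b')], ['port', 'key'])
df.write.mode('overwrite').parquet('qwe.parquet')

# the column names and types come back without any header option
spark.read.parquet('qwe.parquet').printSchema()
print(spark.read.parquet('qwe.parquet').columns)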