
First question here, so I apologise if something isn't clear. I am new to PySpark, and using Databricks I was trying to read in an Excel file saved as a CSV with the following code:

df = spark.read.csv('/FileStore/tables/file.csv', sep=";", inferSchema="true", header="true")

This works fine, except that some of the observations get null values, while in the Excel file there are no null values; the actual values can be found in other rows. It is perhaps better explained with an example: if the Excel file has the row A B C D, then in the table it becomes (for some rows):

A B null null
C D null null

My question is: how could I fix this? Thanks in advance.
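
One way to narrow this down (a minimal diagnostic sketch, not part of the original question; it reuses the path from the code above) is to read the file as plain text and check whether single logical rows are being split across physical lines, for example by an embedded line break inside a field:

# Read the raw lines without CSV parsing to see how records are physically split.
raw = spark.read.text('/FileStore/tables/file.csv')
raw.show(10, truncate=False)

# If quoted fields contain embedded newlines (an assumption, not confirmed here),
# Spark's multiLine option may parse them back into single records:
df = spark.read.csv('/FileStore/tables/file.csv', sep=";",
                    inferSchema="true", header="true", multiLine=True)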

Ndw
  • How many columns are there? If you want to replace the `null` values, you can find examples here: https://stackoverflow.com/questions/42312042/how-to-replace-all-null-values-of-a-dataframe-in-pyspark – Dyno Fu Nov 26 '19 at 20:53
  • Most likely the Excel file actually has values in those cells that look empty, probably a space. When you then convert to CSV, the space is preserved, so you'll see `A;B;;` in your CSV (or `A;B; ; `). Reading that with Spark will correctly create null values from the empty fields (demonstrated in the sketch below); see e.g. [SPARK-17916](https://issues.apache.org/jira/browse/SPARK-17916). – Oliver W. Nov 27 '19 at 00:07
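
The behaviour described in the last comment can be checked directly. A minimal sketch, using hypothetical sample rows (`spark.read.csv` also accepts an RDD of strings, so no file is needed):

# Two hypothetical CSV rows: one with truly empty fields, one with spaces.
rows = spark.sparkContext.parallelize(["A;B;;", "A;B; ; "])
df = spark.read.csv(rows, sep=";")
df.show()
# The empty fields in the first row come back as null, while the fields
# containing a single space come back as " " strings.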

1 Answer


Right now you are setting your delimiter to a `;`; however, in a CSV file the delimiter is usually a `,` (Comma-Separated Values). If you use the Spark CSV reader without specifying a separator, the delimiter is automatically set to a comma:

df = (spark.read.format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/FileStore/tables/file.csv"))
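
If the file genuinely uses semicolons, as the code in the question suggests, the separator can instead be set explicitly (a sketch reusing the question's path, not a claim about what the file actually contains):

df = (spark.read.format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("sep", ";")
          .load("/FileStore/tables/file.csv"))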
Molly
  • You’re assuming Ndw’s choice of separator is wrong, which seems unlikely, as Ndw says “the reading works fine”. If you need details like these, ask in the comments. – Oliver W. Dec 01 '19 at 01:17