
I have a problem that I hope you can help me with.
I have a text file that looks like this:

Report Name : 
column1,column2,column3
this is row 1,this is row 2, this is row 3

I am leveraging Synapse Notebooks to try to read this file into a dataframe. If I read the file using spark.read.csv(), Spark thinks that the column name is "Report Name : ", which is obviously incorrect. I know that the Pandas csv reader has a skiprows parameter, but unfortunately I cannot read the file directly with Pandas, as I am getting some strange networking errors. I can, however, convert a PySpark dataframe to a Pandas dataframe via df.toPandas(). I'd like to be able to solve this with straight PySpark dataframes.
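For reference, the pandas option I mean is the skiprows parameter of read_csv. A minimal sketch against an in-memory copy of the sample file (this is what I would do if the networking issue were resolved):

```python
import io
import pandas as pd

# In-memory copy of the sample file from the question.
raw = (
    "Report Name : \n"
    "column1,column2,column3\n"
    "this is row 1,this is row 2, this is row 3\n"
)

# skiprows=1 drops the "Report Name : " banner, so the next line
# ("column1,column2,column3") is inferred as the header.
df = pd.read_csv(io.StringIO(raw), skiprows=1)
print(list(df.columns))  # ['column1', 'column2', 'column3']
```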

Surely someone else has encountered this issue! Help!

I have tried every variation of reading the file, dropping rows, etc., but the schema was already defined when the first dataframe was created, with one column (Report Name : ). Not sure what to do now.

  • There are existing solutions: https://stackoverflow.com/questions/44077404/how-to-skip-lines-while-reading-a-csv-file-as-a-dataframe-using-pyspark – Raid Jan 19 '23 at 22:55

2 Answers


Copied answer from similar question: How to skip lines while reading a CSV file as a dataFrame using PySpark?

import csv

# Parse each partition's lines with the csv module, then drop both the
# "Report Name : " banner (it has fewer than 2 fields) and the header
# row, keeping only the data rows.
df = sc.textFile("test.csv") \
       .mapPartitions(lambda lines: csv.reader(lines, delimiter=',', quotechar='"')) \
       .filter(lambda line: len(line) >= 2 and line[0] != 'column1') \
       .toDF(['column1', 'column2', 'column3'])
Raid
  • Thanks for the link! Pretty sure I saw that code earlier. The issue is that I'm trying to use this for a generic file reader and would like to keep from hard coding column names. The column names are actually in the file, just below the file header "Report Name :" row. – data_engineer_eric Jan 20 '23 at 00:20
  • I found this article: https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/tutorial-use-pandas-spark-pool I might have to reach out to Microsoft support and see what the deal is. According to this tutorial, the code I have should work. Maybe a missing private endpoint or something. No idea at this point. – data_engineer_eric Jan 20 '23 at 00:21

Microsoft got back to me with an answer that worked! When you use the pandas csv reader, the path to the source file requires an endpoint to blob storage (not ADLS Gen2). I only had an endpoint that used dfs in the URI, not blob. After I added the endpoint to blob storage, the pandas reader worked great! Thanks for looking at my thread.
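For anyone hitting the same wall, the two endpoint styles look like this (account and container names are placeholders, not my real values):

```
https://<account>.dfs.core.windows.net/<container>/file.csv    <- ADLS Gen2 (dfs) endpoint
https://<account>.blob.core.windows.net/<container>/file.csv   <- blob endpoint (what pandas needed here)
```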