11

I have a CSV file that is structured this way:

Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"

I have two problems in reading this file.

  1. I want to ignore the header and the blank row
  2. The commas within the values are not separators

Here is what I tried:

df = (sc.textFile("myFile.csv")
        .map(lambda line: line.split(","))     # split by comma
        .filter(lambda line: len(line) == 2)   # this was meant to ignore the first two rows
        .collect())

However, this did not work, because the commas within the values were being read as separators and len(line) was returning 4 instead of 2.

I tried an alternate approach:

data = sc.textFile("myFile.csv")
headers = data.take(2) #First two rows to be skipped

The idea was to then use filter and not read the headers. But when I tried to print the headers, I got encoded values:

[\x00A\x00Y\x00 \x00J\x00u\x00l\x00y\x00 \x002\x000\x001\x006\x00]

What is the correct way to read a CSV file and skip the first two rows?

– Rakesh Adhikesavan

6 Answers

8

Try using csv.reader with the quotechar parameter. It will split the lines correctly. After that you can add whatever filters you like.

import csv

df = sc.textFile("test2.csv")\
       .mapPartitions(lambda lines: csv.reader(lines, delimiter=',', quotechar='"'))\
       .filter(lambda line: len(line) >= 2 and line[0] != 'Col1')\
       .toDF(['Col1', 'Col2'])
– zlidime
7

For your first problem, just zip the lines in the RDD with zipWithIndex and filter out the lines you don't want. For the second problem, you could strip the leading and trailing double-quote characters from each line and then split it on '","'.

rdd = sc.textFile("myfile.csv")

df = rdd.zipWithIndex()\
        .filter(lambda x: x[1] > 2)\
        .map(lambda x: x[0])\
        .map(lambda x: x.strip('"').split('","'))\
        .toDF(["Col1", "Col2"])

Although, if you're looking for a standard way to deal with CSV files in Spark, it's better to use the spark-csv package from Databricks.
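A minimal PySpark sketch of that route, assuming a Spark 1.x setup with the spark-csv package on the classpath (as the comment below notes, the same options are built into spark.read.csv in Spark 2+); the header, blank row, and column-name row would still need to be filtered out afterwards:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes an existing SparkContext named sc

df = (sqlContext.read
      .format("com.databricks.spark.csv")  # requires the spark-csv package
      .option("header", "false")           # the real column names are on line 3, not line 1
      .option("quote", "\"")               # keeps quoted values like "1,200" in one field
      .load("myFile.csv"))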

– zenofsahil
  • Upvoted for your "although", with the addition that that package shouldn't be used with Spark 2, since it's been integrated into Spark, which makes the "although" all the more important. I would strongly recommend doing that kind of filtering in a separate job outside your other Spark logic, since this is classic data normalization/regularization, which should not be part of an analytics pipeline. Doing this outside Spark allows you to use custom tools for that job, and then have a proper file format that everybody can work with. – Rick Moritz May 23 '17 at 08:46
4

zlidime's answer had the right idea. The working solution is this:

import csv
from pyspark.sql.types import StructType, StructField, StringType

customSchema = StructType([
    StructField("Col1", StringType(), True),
    StructField("Col2", StringType(), True)])

df = sc.textFile("file.csv")\
       .mapPartitions(lambda partition: csv.reader(
           [line.replace('\0', '') for line in partition],
           delimiter=',', quotechar='"'))\
       .filter(lambda line: len(line) >= 2 and line[0] != 'Col1')\
       .toDF(customSchema)
– Rakesh Adhikesavan
2

If the CSV file structure always has two columns, this can be implemented in Scala as follows:

import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val struct = StructType(
  StructField("firstCol", StringType, nullable = true) ::
  StructField("secondCol", StringType, nullable = true) :: Nil)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .option("delimiter", ",")
  .option("quote", "\"")
  .schema(struct)
  .load("myFile.csv")

df.show(false)

val indexed = df.withColumn("index", monotonicallyIncreasingId())
val filtered = indexed.filter(col("index") > 2).drop("index")

filtered.show(false)

Result is:

+---------+---------+
|firstCol |secondCol|
+---------+---------+
|Header   |null     |
|Blank Row|null     |
|Col1     |Col2     |
|1,200    |1,456    |
|2,000    |3,450    |
+---------+---------+

+--------+---------+
|firstCol|secondCol|
+--------+---------+
|1,200   |1,456    |
|2,000   |3,450    |
+--------+---------+
– pasha701
  • PySpark allows you to do the same. This would work if it wasn't for the header. Only the header gets read in and the other rows get skipped. – Rakesh Adhikesavan May 24 '17 at 16:23
  • Monotonically increasing id doesn't guarantee a consecutive id, so if you use the > 2 condition you're not assured that this will remove the first 2 lines. – Federico Ponzi Feb 11 '19 at 16:02
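A sketch of the deterministic alternative in PySpark, along the lines of zenofsahil's answer above (zipWithIndex assigns consecutive 0-based positions, so a fixed cut-off is exact; the > 2 condition and column names here just mirror that answer):

rdd = sc.textFile("myFile.csv")

df = (rdd.zipWithIndex()                               # (line, 0-based position) pairs
         .filter(lambda x: x[1] > 2)                   # positions 0-2: header, blank row, column-name row
         .map(lambda x: x[0].strip('"').split('","'))  # unquote and split the remaining data lines
         .toDF(["Col1", "Col2"]))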
0

For anyone still wondering how to do this, the simple answer is to use the option parameter while reading the file:

spark.read.option("skipRows", "2").csv("file.csv")
-4

Why don't you just try the DataFrameReader API from pyspark.sql? It is pretty easy. For this problem, I guess this single line would be good enough.

df = spark.read.csv("myFile.csv") # By default, quote char is " and separator is ','

With this API, you can also play around with a few other parameters, like header lines and ignoring leading and trailing whitespace. Here is the link: DataFrameReader API
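For instance, a minimal sketch with a few of those options (the option names are from the CSV options of DataFrameReader; the header and blank row at the top of this particular file would still have to be dropped separately):

df = (spark.read
      .option("header", "false")                   # the real column names are on line 3
      .option("ignoreLeadingWhiteSpace", "true")   # trim leading whitespace inside fields
      .option("ignoreTrailingWhiteSpace", "true")  # trim trailing whitespace inside fields
      .csv("myFile.csv"))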

– Sivaprasanna Sethuraman