6

I have a .csv file with a few columns, and I wish to skip 4 (or 'n' in general) lines when importing this file into a dataframe using the spark.read.csv() function. I have a .csv file like this -

ID;Name;Revenue
Identifier;Customer Name;Euros
cust_ID;cust_name;€
ID132;XYZ Ltd;2825
ID150;ABC Ltd;1849

In plain Python with pandas, the read_csv() function makes this simple via the skiprows=n option like -

import pandas as pd
df=pd.read_csv('filename.csv',sep=';',skiprows=3) # Since we wish to skip top 3 lines

With PySpark, I am importing this .csv file as follows -

df=spark.read.csv("filename.csv",sep=';')

This imports the file as -

ID          |Name         |Revenue
Identifier  |Customer Name|Euros
cust_ID     |cust_name    |€
ID132       |XYZ Ltd      |2825
ID150       |ABC Ltd      |1849

This is not correct, because I wish to ignore the first three lines. I can't use the option 'header=True' because it will only exclude the first line. One could use the 'comment=' option, but for that the lines to be skipped need to start with a particular character, and that is not the case with my file. I could not find anything in the documentation. Is there any way this can be accomplished?
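For illustration, header=True only promotes the very first physical line to column names, so the two description lines still come through as ordinary data rows - a minimal sketch against the sample file above:

df = spark.read.csv("filename.csv", sep=';', header=True)
# columns become ID, Name, Revenue, but the rows
# 'Identifier;Customer Name;Euros' and 'cust_ID;cust_name;€' remain in the data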

cph_sto
  • 7,189
  • 12
  • 42
  • 78
  • one option is to filter the rows which you don't want to hold, check [here](https://stackoverflow.com/a/35882046/1025328) – Prasad Khode Oct 04 '18 at 10:49
  • 1
    Thanks Prasad for the answer! Well, that way I have to manually code the filter statement and if the header is written differently the next time, then the code will fail because we are basically checking strings. The code written this way will not be robust and must be avoided. – cph_sto Oct 04 '18 at 10:59
  • 2
    That there is no simple answer on how do to this really shows that Spark has a long way to go... – Thomas Dec 10 '18 at 12:28
  • Let me make 2 points - 1. After spending a lot of time investigating this issue, I think the answer provided by Mayank Agrawal below is the best option. Import the file via schema using `RDD` and then convert the `RDD` into a `DF`. Still, with his answer, I need to investigate whether the order of rows read from the `.csv` file is respected or not. 2. Apropos of your comment that Spark has a long way to go, well, maybe we have such an arrangement by design. In a `DF` we can't really be sure which partition a row lands in. That's why we do not have `ix/.loc` in Spark, as opposed to Pandas. I am no expert – cph_sto Dec 10 '18 at 12:42

3 Answers

5

I couldn't find a simple solution for your problem, but the following will work no matter how the header is written:

n = 3  # number of leading lines to skip

# zipWithIndex pairs every row with its index, the filter drops the first n rows,
# and map strips the index again before rebuilding the DataFrame
df = spark.read.csv("filename.csv",sep=';')\
          .rdd.zipWithIndex()\
          .filter(lambda x: x[1] >= n)\
          .map(lambda x: x[0]).toDF()
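Since the real header is among the skipped lines, the resulting DataFrame keeps Spark's default _c0/_c1/... column names. A small follow-up sketch, assuming the column names from the question's file, renames them afterwards:

df = df.toDF("ID", "Name", "Revenue")  # rename _c0, _c1, _c2 to meaningful names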
mayank agrawal
  • 2,495
  • 2
  • 13
  • 32
  • 1
    After investigating this issue for quite some time, I think that your procedure is the best option. But the question remains - will `.zipWithIndex()` respect the order of rows? If the order is not respected, then this procedure comes to naught - i.e. header line 3, which is being read, ends up as some other line. That's my only concern. – cph_sto Dec 10 '18 at 12:50
  • did you get a better answer for this which ensures that order of the rows is also respected? @cph_sto? – Manish Mehra Apr 28 '20 at 11:28
  • You can't maintain the order in spark, as the data is distributed. That was my learning at least. – cph_sto Apr 28 '20 at 13:26

1

I've been trying to find a solution to this problem for the past couple of days as well. The solution I implemented is probably not the fastest, but it works:

with open("filename.csv", 'r') as fin:
    data = fin.read().splitlines(True)
with open("filename.csv", 'w') as fout:
    fout.writelines(data[3:]) # Select the number of rows you want to skip, in this case we skip the first 3

df = spark.read.format("csv") \
          .option("header", "true") \
          .option("sep", ";") \
          .load("filename.csv")
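Note that this approach requires the .csv file to live on the driver's local filesystem (it uses plain Python file I/O) and it overwrites the original file in place. If that is undesirable, a variant of the same idea (a sketch only, using the hypothetical output path filename_trimmed.csv) writes the trimmed copy to a new file and loads that instead:

with open("filename.csv", 'r') as fin:
    data = fin.read().splitlines(True)
with open("filename_trimmed.csv", 'w') as fout:  # hypothetical path for the trimmed copy
    fout.writelines(data[3:])  # drop the first 3 lines

df = spark.read.csv("filename_trimmed.csv", header=True, sep=";")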
Max
  • 21
  • 7

0

Filter out the first lines of the csv file (the code below drops the rows with index 0, 1 and 2), then let Spark parse the remaining lines as CSV:

# Read the file as plain text, pair each line with its index,
# drop the unwanted leading lines and keep only the raw string of each line
_rdd = (
    spark.read
        .text(csv_file_path_with_bad_header)
        .rdd
        .zipWithIndex()
        .filter(lambda x: x[1] > 2)   # keep rows with index 3 and above
        .map(lambda x: x[0][0])       # Row(value=...) -> plain string
)

# spark.read.csv also accepts an RDD of CSV strings (Spark 2.2+),
# so the remaining lines can be parsed with the usual options
df = (
  spark.read.csv(
    _rdd,
    header=True,
    sep="\t",
    inferSchema=False,
    quote="\\",
    ignoreLeadingWhiteSpace=True,
    ignoreTrailingWhiteSpace=True
  )
)
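Applied to the sample file from the question, the same pattern could be adapted roughly as below (a sketch only: sep changed to ';', and the filter keeps line 0 so that the real header 'ID;Name;Revenue' is still available for header=True; as discussed in the comments above, this relies on zipWithIndex preserving the original line order):

lines = (
    spark.read
        .text("filename.csv")
        .rdd
        .zipWithIndex()
        .filter(lambda x: x[1] == 0 or x[1] > 2)  # keep the header line and the data lines
        .map(lambda x: x[0][0])
)

df = spark.read.csv(lines, header=True, sep=";")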