
In Pandas with Python I could use:

df = pd.read_csv(csv_file, header=1)

In Spark, though, the header option only seems to accept true/false:

df = spark.read.format("csv").option("header", "true").load('myfile.csv')

How can I start reading from the second row in Spark? The suggested duplicate post refers to an outdated version of Spark; I am using the latest, 2.4.3.

crystyxn
    Possible duplicate of [How to skip lines while reading a CSV file as a dataFrame using PySpark?](https://stackoverflow.com/questions/44077404/how-to-skip-lines-while-reading-a-csv-file-as-a-dataframe-using-pyspark) - refer to [this answer](https://stackoverflow.com/a/44080537/5858851). – pault Jul 19 '19 at 18:12

1 Answer


Looks like there's no option in Spark's CSV reader to specify how many lines to skip. Here are some alternatives you can try:

  1. Read with option("header", "true"), so the first line is consumed as the header, then rename the resulting columns using withColumnRenamed.
  2. Read with option("header", "false"), then drop the first row, e.g. by tagging each row with an index and filtering it out.
  3. If the first character of line 1 differs from the first character of every other line, you can use the comment option to skip it. For example, if line 1 starts with D, set comment='D'. Just be careful: comment will skip every line that starts with D, not only the first one.

Hope this helps.

niuer