
What is the best way to read a .tsv file with a header in PySpark and store it in a Spark DataFrame?

I have tried the "spark.read.options" and "spark.read.csv" commands, but with no luck.

Thanks.

Regards, Jit

Jitu
  • Hi JKD, welcome to SO! Please read up on [asking questions](https://stackoverflow.com/help/how-to-ask) before writing your next one. Happy coding! – Diggy. May 14 '20 at 14:13
  • Does this answer your question: https://stackoverflow.com/questions/43508054/spark-sql-how-to-read-a-tsv-or-csv-file-into-dataframe-and-apply-a-custom-sche?rq=1 – Ehsan May 14 '20 at 14:19
  • @Ehsan Do we always have to create a schema (since that involves opening the file on the local machine)? Can't we use the header row as the column names? – Jitu May 14 '20 at 14:49
  • To my knowledge, I think a schema is needed; see the schema sketch after these comments. – Ehsan May 14 '20 at 14:53
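
For anyone comparing the two approaches, a minimal sketch of supplying an explicit schema is below. The column names, types, and file path are hypothetical placeholders, not taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("tsv-with-schema").getOrCreate()

# Hypothetical schema; the real column names and types depend on your file.
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", IntegerType(), True),
])

# header=True skips the header row; an explicit schema avoids the extra
# pass over the data that inferSchema would otherwise trigger.
df = spark.read.csv("path/to/file.tsv", sep="\t", header=True, schema=schema)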

1 Answer


Well, you can read the TSV file directly without providing an external schema, as long as a header is available:

df = spark.read.csv(path, sep=r'\t', header=True).select('col1','col2')

Since Spark is lazily evaluated, it will read only the selected columns. Hope it helps.
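
For completeness, a minimal end-to-end sketch is below. The file path, app name, and the .options() variant are illustrative assumptions, not something given in the original question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-tsv").getOrCreate()

# Read a tab-separated file: header=True uses the first row as column names,
# and inferSchema=True lets Spark guess column types (at the cost of an extra pass).
df = spark.read.csv("data/sample.tsv", sep="\t", header=True, inferSchema=True)

# Equivalent form using .options(), which the question mentions:
df2 = spark.read.options(sep="\t", header=True, inferSchema=True).csv("data/sample.tsv")

df.printSchema()
df.show(5)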

Shubham Jain