
How can I convert this CSV file to a DataFrame?

CSV values:

country,2015,2016,2017,2018,2019
Norway,4.141,4.152,4.157,4.166,4.168
Australia,4.077,4.086,4.093,4.110,4.115
Switzerland,4.009,4.036,4.032,4.041,4.046
Netherlands,3.977,3.994,4.043,4.045,4.045
UnitedStates,4.017,4.027,4.039,4.045,4.050
Germany,3.988,3.999,4.017,4.026,4.028
NewZealand,3.982,3.997,3.993,3.999,4.018

I want the DataFrame/table in a format like:

 +----------+-----+-----+-----+-----+-----+
 |   Country| 2015| 2016| 2017| 2018| 2019|
 +----------+-----+-----+-----+-----+-----+
 |    Norway|4.141|4.152|4.157|4.166|4.168|
 | Australia|4.077 ...
 ......
 ......
 |NewZealand|.......................|4.018|
 +----------+-----+-----+-----+-----+-----+
cph_sto
  • Possible duplicate of [Load CSV file with Spark](https://stackoverflow.com/questions/28782940/load-csv-file-with-spark). Specifically [this answer](https://stackoverflow.com/a/34528938/5858851). – pault Feb 13 '19 at 15:19

1 Answer


Read the documentation here. Let's say your file filename.csv is stored at path; then you can import it with a very basic configuration as follows.

# Specify a schema
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField('country', StringType()),
    StructField('2015', StringType()),
    StructField('2016', StringType()),
    StructField('2017', StringType()),
    StructField('2018', StringType()),
    StructField('2019', StringType()),
])

# Start the import
df = spark.read.schema(schema)\
               .format("csv")\
               .option("header","true")\
               .option("sep",",")\
               .load("path/filename.csv")

Note that the numbers have to be imported as strings, because PySpark does not recognise the dot . as a thousands separator. You then have to convert them to numerics, as shown below:

# Convert them to numerics
from pyspark.sql.functions import regexp_replace, col

cols_with_thousands_separator = ['2015', '2016', '2017', '2018', '2019']
for c in cols_with_thousands_separator:
    df = df.withColumn(c, regexp_replace(col(c), '\\.', ''))\
           .withColumn(c, col(c).cast("int"))
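To see what that replacement does to a single value, here is the same pattern applied with Python's re module, a sketch mirroring the regexp_replace call outside of Spark:

```python
import re

value = "4.141"
# '\\.' matches a literal dot; removing all dots leaves only the digits
stripped = re.sub(r'\.', '', value)
# casting to int then mirrors .cast("int")
as_int = int(stripped)
```

So "4.141" becomes the integer 4141, exactly as the loop above produces for each year column.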
cph_sto