SL No   Customer   Month       Amount
1       A1         12-Jan-04   495414.75
2       A1         3-Jan-04    245899.02
3       A1         15-Jan-04   259490.06

My DataFrame is shown above.

Code

import findspark
findspark.init('/home/mak/spark-3.0.0-preview2-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('mak').getOrCreate()
import numpy as np
import pandas as pd

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf3 = pd.read_csv('Repayment.csv')
df_repay = spark.createDataFrame(pdf3)

Only loading df_repay has this issue; the other data frames load successfully. When I switched from the code above to the code below, it worked:

df4 = (spark.read.format("csv").options(header="true").load("Repayment.csv"))
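(Editor's note: spark.read.csv with only header="true" loads every column as a string, so it never has to reconcile mixed types, which is why this path succeeds. A quick way to see what pandas inferred for the failing file is to inspect its dtypes; a minimal pandas-only sketch, using hypothetical inline data in place of Repayment.csv:)

```python
import io
import pandas as pd

# hypothetical stand-in for Repayment.csv: one stray non-numeric value
# forces pandas to read Amount as dtype "object" (mixed Python objects),
# which Spark's type inference then cannot merge into one Spark type
csv_text = "SL No,Customer,Month,Amount\n1,A1,12-Jan-04,495414.75\n2,A1,3-Jan-04,abc"
pdf = pd.read_csv(io.StringIO(csv_text))
print(pdf.dtypes)
```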

  • Why is df_repay not loaded with spark.createDataFrame(pdf3), while similar data frames load successfully?
  • Does this answer your question? [pyspark type error on reading a pandas dataframe](https://stackoverflow.com/questions/39888188/pyspark-type-error-on-reading-a-pandas-dataframe) – user10938362 Feb 02 '20 at 16:51

1 Answer


pdf3 is a pandas DataFrame, and spark.createDataFrame(pdf3) has to infer a Spark type for each of its columns. That inference most likely fails here because at least one column comes out of pandas with an ambiguous or mixed type (the linked question above shows the typical "can not merge type" error), whereas spark.read.csv with header="true" simply loads every column as a string. If you want to stick with your code, define an explicit schema and pass it when converting the pandas DataFrame to a Spark DataFrame:

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DoubleType, DateType)

pdf3 = pd.read_csv('Repayment.csv')
# DateType expects Python date objects, so parse the Month strings first
pdf3['Month'] = pd.to_datetime(pdf3['Month']).dt.date

# create a schema covering every column in the CSV;
# Amount holds decimal values, so use DoubleType rather than IntegerType
schema = StructType([StructField("SL No", IntegerType(), True),
                     StructField("Customer", StringType(), True),
                     StructField("Month", DateType(), True),
                     StructField("Amount", DoubleType(), True)])

# create the Spark dataframe using the explicit schema
df_repay = spark.createDataFrame(pdf3, schema=schema)
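(Editor's note: DateType will not accept the raw "12-Jan-04" strings that read_csv produces; it expects Python datetime.date objects. A pandas-only sketch of that conversion, using a hypothetical sample of the Month column from the question:)

```python
import datetime
import pandas as pd

# hypothetical sample of the Month column; parse day-month-year strings
# into datetime.date objects, the Python type Spark's DateType expects
months = pd.Series(["12-Jan-04", "3-Jan-04", "15-Jan-04"])
dates = pd.to_datetime(months, format="%d-%b-%y").dt.date
print(dates.iloc[0])  # 2004-01-12
```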
Jay Kakadiya