
I am new to Spark and am getting an error while converting a .csv file to a DataFrame. I am using the pyspark_csv module for the conversion, but it fails with the error "module 'pyspark_csv' has no attribute 'csvToDataframe'".

Here is my code:

import findspark
findspark.init()
findspark.find()
import pyspark
sc=pyspark.SparkContext(appName="myAppName")
sqlCtx = pyspark.SQLContext

#csv to dataframe

sc.addPyFile('/usr/spark-1.5.0/python/pyspark_csv.py')
sc.addPyFile('https://raw.githubusercontent.com/seahboonsiew/pyspark-csv/master/pyspark_csv.py')
import pyspark_csv as pycsv

#skipping the header
def skip_header(idx, iterator):
    if(idx == 0):
        next(iterator)
    return iterator
#loading the dataset  
data=sc.textFile('gdeltdata/20160427.CSV')

data_header = data.first()

data_body = data.mapPartitionsWithIndex(skip_header)

data_df = pycsv.csvToDataframe(sqlctx, data_body, sep=",", columns=data_header.split('\t'))


AttributeError                            Traceback (most recent call last)
<ipython-input-10-8e47cd9759e6> in <module>()
----> 1 data_df = pycsv.csvToDataframe(sqlctx, data_body, sep=",", columns=data_header.split('\t'))

AttributeError: module 'pyspark_csv' has no attribute 'csvToDataframe'
Shafaat Hussain

1 Answer

As documented at https://github.com/seahboonsiew/pyspark-csv, the function name is spelled with a capital "F". Python attribute names are case-sensitive, so please try calling:

csvToDataFrame

instead of csvToDataframe.
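
To illustrate why the lookup fails, here is a minimal sketch that does not need Spark. It uses a stand-in object in place of the real pyspark_csv module (an assumption made only so the snippet runs anywhere), and a hypothetical helper, suggest_attribute, that searches the names an object actually exposes for a close match to the misspelled one.

```python
import difflib
import types

# Stand-in for the real pyspark_csv module (assumption: only the
# function *name* matters for this demonstration).
pycsv = types.SimpleNamespace(csvToDataFrame=lambda *args, **kwargs: None)


def suggest_attribute(module, wrong_name):
    """Hypothetical helper: list public attributes of `module` that
    closely match `wrong_name`, ignoring differences in case."""
    public = [name for name in dir(module) if not name.startswith("_")]
    # Map lowercased names back to their real spelling.
    lowered = {}
    for name in public:
        lowered.setdefault(name.lower(), name)
    matches = difflib.get_close_matches(
        wrong_name.lower(), list(lowered), n=3, cutoff=0.8
    )
    return [lowered[m] for m in matches]


# Attribute lookup is case-sensitive, so only the exact spelling exists.
has_wrong = hasattr(pycsv, "csvToDataframe")
has_right = hasattr(pycsv, "csvToDataFrame")
suggestions = suggest_attribute(pycsv, "csvToDataframe")

print(has_wrong, has_right, suggestions)
```

With the real module imported as pycsv, running dir(pycsv) in the notebook is a quick way to see the exact names it exports.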

Yaron
  • Thank you for the reply. I corrected it, but it is giving a different error now: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 1 times, most recent failure: Lost task 0.0 in stage 10.0 (TID 20, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): – Shafaat Hussain May 02 '16 at 14:05
  • I'm happy that I solved the previous problem, and sorry to hear that you have a new one. If I answered your question, you can mark the answer as having solved the question. – Yaron May 02 '16 at 14:11
  • I'd suggest running the sample code in https://github.com/seahboonsiew/pyspark-csv/blob/master/README.md first and, once it works for you, modifying it to fit your specific case – Yaron May 02 '16 at 14:16
  • Another way to parse a CSV file into a DataFrame is described here: http://stackoverflow.com/questions/36966550/exceptions-when-reading-tutorial-csv-file-in-the-cloudera-vm/36980808#36980808 – Yaron May 02 '16 at 14:17
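
On the follow-up error in the comments: the traceback there is cut off, so its cause cannot be confirmed from this thread. One thing worth checking, though, is that the question's call passes sep="," while splitting the header on '\t'. If the file is in fact tab-separated, each row would parse as a single field and would not line up with the header. The sketch below shows that mismatch with Python's standard csv module; the sample lines are hypothetical and are not taken from the GDELT file.

```python
import csv

# Hypothetical two-column, tab-separated sample (not real GDELT data).
lines = ["id\tdate", "1\t20160427", "2\t20160427"]

# As in the question: the header is split on tabs.
header = lines[0].split("\t")
body = lines[1:]


def parse(rows, sep):
    """Parse an iterable of text lines using `sep` as the delimiter."""
    return list(csv.reader(rows, delimiter=sep))


rows_comma = parse(body, ",")   # delimiter from the question's call
rows_tab = parse(body, "\t")    # delimiter matching the header split

# Compare the header width with the width of each parsed row.
print(len(header), [len(r) for r in rows_comma])
print(len(header), [len(r) for r in rows_tab])
```

Two other details in the question's code look worth fixing regardless: sqlCtx = pyspark.SQLContext assigns the class rather than an instance (pyspark.SQLContext(sc) would create one), and the later call refers to sqlctx with a lowercase "c", which is a different name.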