
I have a CSV file which looks something like this:

A B C
1 2 
2 4
3 2 5
1 2 3
4 5 6

When I read this data into Spark, it treats column C as a "string" because of the blanks in the first few rows.

Could anybody please tell me how to load this file into a SQL DataFrame so that column C remains integer (or float)?

I'm using "sc.textFile" to read the data into Spark, and then converting it into a SQL DataFrame.

I read this link and this link, but they didn't help me much.

Here is my code. I'm getting the error on the last line.

from pyspark.sql.types import StructType, StructField, StringType, FloatType

myFile = sc.textFile("myData.csv")

header = myFile.first()
fields = [StructField(field_name, StringType(), True) for field_name in header.split(',')]
fields[0].dataType = FloatType()
fields[1].dataType = FloatType()
fields[2].dataType = FloatType()

schema = StructType(fields)

# Fails: float('') raises ValueError on rows where column C is blank
myFileCh = myFile.map(lambda k: k.split(",")).map(lambda p: (float(p[0]), float(p[1]), float(p[2])))

Thanks!

  • You would need to use pattern matching and cast to the desired type according to the content in c – z-star May 24 '16 at 10:50
  • @z-star: Thanks for your comment! But I didn't get what you are saying. I'm following this (http://www.nodalpoint.com/spark-data-frames-from-csv-files-handling-headers-column-types/) method to convert my data into SQL dataframe. The issue is coming when I'm trying to create "taxi_temp" part. In my dataset the last column is blank and I mentioned datatype as "float". So, it's saying can't convert "string" into "float". – Beta May 24 '16 at 11:31
  • Oh OK. Can you please publish your code? – z-star May 24 '16 at 11:45
  • I've updated the code snippet in the main question. – Beta May 24 '16 at 12:06
  • You split the data on commas, but there are no commas in what you posted as your data – OneCricketeer May 24 '16 at 12:16
  • The data I'm using is csv file. I just put that data structure just as an example. – Beta May 24 '16 at 12:27

1 Answer


So the issue is with this unsafe casting. You could implement a short function that performs a "safe" cast and returns a default value in case the cast to float fails.

def safe_cast(val, to_type, default=None):
    try:
        return to_type(val)
    except ValueError:
        return default

safe_cast('tst', float) # will return None
safe_cast('tst', float, 0.0) # will return 0.0

myFileCh = myFile.map(lambda k: k.split(",")).map(lambda p: (safe_cast(p[0], float),safe_cast(p[1], float),safe_cast(p[2], float)))
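To see how this handles the blank cells from the question, here is a minimal sketch outside Spark (the "1,2," line is a made-up example of a row where column C is empty):

```python
def safe_cast(val, to_type, default=None):
    try:
        return to_type(val)
    except ValueError:
        return default

line = "1,2,"  # a row where column C is blank
# float('') raises ValueError, so safe_cast falls back to the default (None)
parsed = tuple(safe_cast(v, float) for v in line.split(","))
print(parsed)  # (1.0, 2.0, None)
```

After the mapping step, the RDD can be turned into a DataFrame with something like sqlContext.createDataFrame(myFileCh, schema), assuming the header row has been filtered out of myFileCh first.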
z-star