I have a DataFrame with many columns, created from a CSV file with an explicit schema. The only column I'm interested in is one called "point", which holds a magellan Point(long, lat). What I need to do now is create an RDD[Point] from that DataFrame.
Below is the code I have tried, but it does not work because rdd is an RDD[Row] rather than an RDD[Point].
import org.apache.spark.sql.types._

// Explicit schema for the NYC taxi CSV files
val schema = StructType(Array(
  StructField("vendorId", StringType, false),
  StructField("lpep_pickup_datetime", StringType, false),
  StructField("Lpep_dropoff_datetime", StringType, false),
  StructField("Store_and_fwd_flag", StringType, false),
  StructField("RateCodeID", IntegerType, false),
  StructField("Pickup_longitude", DoubleType, false),
  StructField("Pickup_latitude", DoubleType, false),
  StructField("Dropoff_longitude", DoubleType, false),
  StructField("Dropoff_latitude", DoubleType, false),
  StructField("Passenger_count", IntegerType, false),
  StructField("Trip_distance", DoubleType, false),
  StructField("Fare_amount", StringType, false),
  StructField("Extra", StringType, false),
  StructField("MTA_tax", StringType, false),
  StructField("Tip_amount", StringType, false),
  StructField("Tolls_amount", StringType, false),
  StructField("Ehail_fee", StringType, false),
  StructField("improvement_surcharge", StringType, false),
  StructField("Total_amount", DoubleType, false),
  StructField("Payment_type", IntegerType, false),
  StructField("Trip_type", IntegerType, false)))
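import spark.implicits._
// point(...) comes from magellan's DSL (import path as I understand it from the magellan README)
import org.apache.spark.sql.magellan.dsl.expressions._

// Read the CSV files, drop malformed rows, and add a magellan point column
val points = spark.read.option("mode", "DROPMALFORMED")
  .schema(schema)
  .csv("/home/riccardo/Scrivania/Progetto/Materiale/NYC-taxi/")
  .withColumn("point", point($"Pickup_longitude", $"Pickup_latitude"))
  .limit(2000)

// This is an RDD[Row], not the RDD[Point] I need
val rdd = points.select("point").rdd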
How can I obtain an RDD[Point] instead of an RDD[Row] from the DataFrame? If that is not possible, what would you suggest instead? I need an RDD[Point] because a library I have been given takes an RDD[Point] as input.
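To make the goal concrete, here is a minimal sketch of the kind of conversion I am after, written for spark-shell. It assumes that magellan registers a user-defined type for Point so that Row.getAs[Point] can read the column back as a magellan.Point, and that org.apache.spark.sql.magellan.dsl.expressions._ is the right import for point(...); I have not confirmed either against my magellan version. The two sample coordinates and the names df, rowRdd and pointRdd are made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.magellan.dsl.expressions._ // assumed source of point(...)
import magellan.Point

// Local session for the sketch; in spark-shell this returns the existing one
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("row-to-point")
  .getOrCreate()
import spark.implicits._

// Two made-up pickup coordinates standing in for the taxi data
val df = Seq((-73.98, 40.75), (-73.95, 40.78))
  .toDF("Pickup_longitude", "Pickup_latitude")
  .withColumn("point", point($"Pickup_longitude", $"Pickup_latitude"))

// select(...).rdd always yields RDD[Row], whatever the column type is
val rowRdd: RDD[Row] = df.select("point").rdd

// Pull the typed value out of each Row (assumes the column deserializes to magellan.Point)
val pointRdd: RDD[Point] = rowRdd.map(row => row.getAs[Point]("point"))
```

If that assumption holds, pointRdd would be what I pass to the library; if getAs[Point] fails at runtime, I would fall back to selecting the two coordinate columns as doubles and constructing the Point inside the map.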