
I am trying to use KMeans on geospatial data stored in a MongoDB database using Apache Spark. The data has the following format:

DataFrame[decimalLatitude: double, decimalLongitude: double, features: vector]

The code is as follows, where `inputdf` is the DataFrame:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

vecAssembler = VectorAssembler(
    inputCols=["decimalLatitude", "decimalLongitude"],
    outputCol="features")
inputdf = vecAssembler.transform(inputdf)
kmeans = KMeans(k=10, seed=123)
model = kmeans.fit(inputdf.select("features"))

There seem to be some empty strings in the dataset, as I get the following error:

com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value=''})

I tried to find such rows using:

issuedf = inputdf.where(inputdf.decimalLatitude == '')
issuedf.show()

But I get the same type conversion error as above. I also tried `df.replace`, but got the same error. How do I remove all rows where such a value is present?

  • Please include the code which causes the error in your question. – cronoik Nov 13 '19 at 15:33
  • @cronoik, I made the edit. The error must be caused by a string in the `features` attribute. However, as it is created from `decimalLatitude` and `decimalLongitude`, I believe the issue must be in one or both of them. Also, the code works fine for a different subset of the same parent dataset. – Registered User Nov 13 '19 at 15:46
  • You have a dataframe whose schema is probably [int, int], but some rows have string values. I believe you have to filter out the string values (even converting them to None is fine) before the dataframe is created, otherwise you won't be able to work with it. Have a look at the accepted [answer](https://stackoverflow.com/questions/35990117/spark-dataframe-not-respecting-schema-and-considering-everything-as-string). – LizardKing Nov 13 '19 at 16:19
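
A minimal sketch of the filtering approach LizardKing suggests above, assuming the collection can be read with an all-string schema so that the mixed-type values survive loading and can then be cast away (the `my_spark` session and the `mongo` format name are taken from the question):

from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType

# Read both columns as strings so mixed-type values load without error.
string_schema = StructType([
    StructField("decimalLatitude", StringType(), True),
    StructField("decimalLongitude", StringType(), True)])
rawdf = my_spark.read.format("mongo").load(schema=string_schema)

# Casting a non-numeric string (including '') to double yields null,
# so the offending rows can simply be dropped.
cleandf = (rawdf
    .withColumn("decimalLatitude", col("decimalLatitude").cast("double"))
    .withColumn("decimalLongitude", col("decimalLongitude").cast("double"))
    .dropna(subset=["decimalLatitude", "decimalLongitude"]))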

1 Answer


This issue can be solved by providing the data types when loading the data, as follows:

from pyspark.sql.types import StructType, StructField, DoubleType

inputdf = my_spark.read.format("mongo").load(schema=StructType([
    StructField("decimalLatitude", DoubleType(), True),
    StructField("decimalLongitude", DoubleType(), True)]))

This ensures that all values are loaded as DoubleType, so the empty values come through as nulls. They can then be removed using `inputdf.dropna()`.
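
For reference, here is a minimal end-to-end sketch combining the explicit schema, the null drop, and the clustering step from the question. The `mongo` format name and the connection settings (URI, database, collection) are assumed to be configured for your environment:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("decimalLatitude", DoubleType(), True),
    StructField("decimalLongitude", DoubleType(), True)])
inputdf = spark.read.format("mongo").load(schema=schema)

# Values that could not be read as doubles arrive as null; drop those rows
# before assembling the feature vector.
cleandf = inputdf.dropna(subset=["decimalLatitude", "decimalLongitude"])

vecAssembler = VectorAssembler(
    inputCols=["decimalLatitude", "decimalLongitude"],
    outputCol="features")
model = KMeans(k=10, seed=123).fit(vecAssembler.transform(cleandf).select("features"))

Passing the schema up front also skips the connector's schema inference, which appears to be what was tripping over the mixed types in the first place.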
