
I am trying to use KMeans on geospatial data stored in a MongoDB database using Apache Spark. The data has the following format:

DataFrame[decimalLatitude: double, decimalLongitude: double, features: vector]

The code is as follows, where `inputdf` is the DataFrame:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

vecAssembler = VectorAssembler(
    inputCols=["decimalLatitude", "decimalLongitude"],
    outputCol="features")
inputdf = vecAssembler.transform(inputdf)
kmeans = KMeans(k=10, seed=123)
model = kmeans.fit(inputdf.select("features"))

There seem to be some empty strings in the dataset, as I get the following error:

com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value=''})

I tried to find such rows using:

issuedf = inputdf.where(inputdf.decimalLatitude == '')
issuedf.show()

But I get the same type conversion error as above. I also tried `df.replace`, but got the same error. How do I remove all rows where such a value is present?

  • Please include the code which causes the error in your question. – cronoik Nov 13 '19 at 15:33
  • @cronoik, I made the edit. The error must be caused by a string in the `features` attribute. However, as it is created from `decimalLatitude` and `decimalLongitude`, I believe the issue must be in one or both of them. Also, the code works fine for a different subset of the same parent dataset. – Registered User Nov 13 '19 at 15:46
  • You have a dataframe whose schema is probably [int, int], but some rows have string values. I believe you have to filter out the string values (even converting them to None is fine) before the dataframe is created, otherwise you won't be able to work with it. Have a look at the accepted [answer](https://stackoverflow.com/questions/35990117/spark-dataframe-not-respecting-schema-and-considering-everything-as-string). – LizardKing Nov 13 '19 at 16:19
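
A minimal sketch of the filtering approach LizardKing suggests above, assuming the collection can be read with an all-string schema so that the mixed-type values survive loading and can then be cast away (the `my_spark` session and the `mongo` format name are taken from the question):

from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType

# Read both columns as strings so mixed-type values load without error.
string_schema = StructType([
    StructField("decimalLatitude", StringType(), True),
    StructField("decimalLongitude", StringType(), True)])
rawdf = my_spark.read.format("mongo").load(schema=string_schema)

# Casting a non-numeric string (including '') to double yields null,
# so the offending rows can simply be dropped.
cleandf = (rawdf
    .withColumn("decimalLatitude", col("decimalLatitude").cast("double"))
    .withColumn("decimalLongitude", col("decimalLongitude").cast("double"))
    .dropna(subset=["decimalLatitude", "decimalLongitude"]))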

1 Answer


This issue can be solved by providing the data types when loading the data, as follows:

from pyspark.sql.types import StructType, StructField, DoubleType

inputdf = my_spark.read.format("mongo").load(schema=StructType([
    StructField("decimalLatitude", DoubleType(), True),
    StructField("decimalLongitude", DoubleType(), True)]))

This ensures that all values are loaded as DoubleType, so the empty values come through as nulls. They can then be removed using `inputdf.dropna()`.
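
For reference, here is a minimal end-to-end sketch combining the explicit schema, the null drop, and the clustering step from the question. The `mongo` format name and the connection settings (URI, database, collection) are assumed to be configured for your environment:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("decimalLatitude", DoubleType(), True),
    StructField("decimalLongitude", DoubleType(), True)])
inputdf = spark.read.format("mongo").load(schema=schema)

# Values that could not be read as doubles arrive as null; drop those rows
# before assembling the feature vector.
cleandf = inputdf.dropna(subset=["decimalLatitude", "decimalLongitude"])

vecAssembler = VectorAssembler(
    inputCols=["decimalLatitude", "decimalLongitude"],
    outputCol="features")
model = KMeans(k=10, seed=123).fit(vecAssembler.transform(cleandf).select("features"))

Passing the schema up front also skips the connector's schema inference, which appears to be what was tripping over the mixed types in the first place.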
