How to handle nulls in SparkSQL Dataframes

Question

This is the code that I am following:

val ebayds = sc.textFile("/user/spark/xbox.csv")

case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String, bidderrate: Int, openbid: Float, price: Float)

val ebay = ebayds.map(a=>a.split(",")).map(p=>Auction(p(0),p(1).toFloat,p(2).toFloat,p(3),p(4).toInt,p(5).toFloat,p(6).toFloat)).toDF()

ebay.select("auctionid").distinct.count

The error that I am getting is:

 For input string: ""
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)

Possible duplicate of [Replace null values in Spark DataFrame](https://stackoverflow.com/questions/33376571/replace-null-values-in-spark-dataframe) — eliasah, Jun 02 '17 at 08:53
It looks like you have an empty String `""`, not `null`. No? — Jasper-M, Jun 06 '17 at 13:45

Ram Ghadiyaram · Answer 1 · 2016-12-01T03:19:24.793

Use DataFrameNaFunctions

DataFrame fill(double value) Returns a new DataFrame that replaces null values in numeric columns with value.

DataFrame fill(double value, scala.collection.Seq cols) (Scala-specific) Returns a new DataFrame that replaces null values in specified numeric columns.

Example Usage :

df.na.fill(0.0,Seq("your columnname"))

In that column, null values are replaced with 0.0 (or whatever default value you pass).

replace is also useful for replacing empty strings with default values.

public DataFrame replace(String col, java.util.Map replacement)

Replaces values matching keys in the replacement map with the corresponding values. Key and value of the replacement map must have the same type, and can only be doubles or strings. If col is "*", then the replacement is applied on all string columns or numeric columns.

import com.google.common.collect.ImmutableMap;

// Replaces all occurrences of 1.0 with 2.0 in column "height".
df.na().replace("height", ImmutableMap.of(1.0, 2.0));

// Replaces all occurrences of "UNKNOWN" with "unnamed" in column "name".
df.na().replace("name", ImmutableMap.of("UNKNOWN", "unnamed"));

// Replaces all occurrences of "UNKNOWN" with "unnamed" in all string columns.
df.na().replace("*", ImmutableMap.of("UNKNOWN", "unnamed"));

Parameters:
col - name of the column to apply the value replacement
replacement - value replacement map, as explained above
Returns: (undocumented)
Since: 1.3.1

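For example, in Scala (the key and value must have the same type, so an empty string can only be replaced by another string):

df.na.replace("your column", Map("" -> "0.0"))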
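Note that the stack trace in the question comes from calling toFloat on an empty string while the RDD is being parsed, which happens before any DataFrame exists, so df.na cannot catch it. A minimal sketch of one way to parse defensively is below. It assumes the same seven-column layout as the question; the object name AuctionParsing and the helpers toFloatOpt, toIntOpt, and parseAuction are hypothetical names introduced here for illustration.

```scala
import scala.util.Try

object AuctionParsing {
  // Numeric fields that may be empty are modelled as Option,
  // which Spark turns into nullable columns when toDF() is called.
  case class Auction(auctionid: String, bid: Option[Float], bidtime: Option[Float],
                     bidder: String, bidderrate: Option[Int],
                     openbid: Option[Float], price: Option[Float])

  // Hypothetical helpers: return None instead of throwing on "" or malformed input.
  def toFloatOpt(s: String): Option[Float] = Try(s.trim.toFloat).toOption
  def toIntOpt(s: String): Option[Int] = Try(s.trim.toInt).toOption

  def parseAuction(line: String): Auction = {
    // A limit of -1 keeps trailing empty fields, which split(",") would drop.
    val p = line.split(",", -1)
    // Treat a missing column the same way as an empty one.
    def field(i: Int): String = if (i < p.length) p(i) else ""
    Auction(field(0), toFloatOpt(field(1)), toFloatOpt(field(2)), field(3),
            toIntOpt(field(4)), toFloatOpt(field(5)), toFloatOpt(field(6)))
  }
}
```

Assuming the same ebayds RDD as in the question, ebayds.map(AuctionParsing.parseAuction).toDF() should then produce a DataFrame whose empty fields are null, at which point ebay.na.fill(0.0) can replace them in the numeric columns.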

score 0 · Answer 2 · edited Jun 02 '17 at 08:44


This worked for me. It returns a new DataFrame. Here A and B are column names, and "unknown" and 1.0 are the values that replace the nulls in those columns.

df.na.fill(Map("A" -> "unknown","B" -> 1.0))

answered Jun 02 '17 at 07:14 by Prem · edited Jun 02 '17 at 08:44 by peter.hrasko.sk
