I am trying to replace a full stop in my raw data with the value 0 in PySpark.

  1. I tried to use a .when and .otherwise statement.
  2. I tried to use regexp_replace to change the '.' to 0.

Code tried:

from pyspark.sql import functions as F

#For #1 above:

dataframe2 = dataframe1.withColumn("test_col", F.when(((F.col("test_col") == F.lit(".")), 0).otherwise(F.col("test_col")))

#For #2 above:

dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '.', 0))

The column should be rewritten with numbers only: non-full-stop rows already contain a number, and rows holding a full stop should have it replaced with 0.

  • The above should work fine if `test_col` equals exactly a dot ("."). Is this what you want, or do you want it to replace "." wherever it appears inside `test_col`? – Grzegorz Skibinski Sep 01 '19 at 17:50
  • Both solutions above? Yes, I want it to replace the ".". I checked the data; it seems to be numeric or a dot, though perhaps there are some with spaces, so maybe I can try logic for whenever it's inside test_col? Otherwise there is only a full number, e.g. 425 with no decimal. – Dino Abraham Sep 01 '19 at 18:43
  • I get a syntax error with both snippets above, btw. – Dino Abraham Sep 01 '19 at 19:04
  • The column used with withColumn should be the dataframe index, such as id or timestamp. Check this out: https://stackoverflow.com/questions/44773758/how-to-conditionally-replace-value-in-a-column-based-on-evaluation-of-expression – oetzi Sep 01 '19 at 20:18
  • Can you provide a sample dataset via createDataFrame? Are you looking at replacing all dots in the dataframe, or just in a specific column? – Yuva Sep 02 '19 at 04:33
  • Hm, your code looks alright then. Could you share the sample dataset? I suppose it has to be something other than just "." then. – Grzegorz Skibinski Sep 02 '19 at 06:55
  • I solved it... seems I had too many brackets. dataframe2 = dataframe1.withColumn("test_col", F.when((F.col("test_col") == F.lit(".")), 0).when((F.col("test_col") == F.lit("N/A")), 0).otherwise(F.col("test_col"))) – Dino Abraham Sep 03 '19 at 14:36
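
For reference, here is the fix from the final comment formatted as runnable code (a sketch assuming test_col is a string column that may also contain "N/A"):

from pyspark.sql import functions as F

# Replace a lone "." or "N/A" with 0; keep all other values unchanged
dataframe2 = dataframe1.withColumn(
    "test_col",
    F.when(F.col("test_col") == F.lit("."), 0)
     .when(F.col("test_col") == F.lit("N/A"), 0)
     .otherwise(F.col("test_col")))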

3 Answers


PySpark version:

from pyspark.sql import SparkSession
from pyspark.sql import functions
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

column_schema = StructType([
    StructField("num", IntegerType()),
    StructField("text", StringType())
])

data = [[3, 'r1'], [9, 'r2.'], [27, '.']]

# Resource settings must go on the builder; setting them via spark.conf.set
# after the session exists has no effect on executor configuration.
spark = (SparkSession.builder
         .master("local")
         .config("spark.executor.memory", "1g")
         .config("spark.executor.cores", "1")
         .config("spark.cores.max", "2")
         .config("spark.driver.memory", "1g")
         .getOrCreate())

data_frame = spark.createDataFrame(data, schema=column_schema)
data_frame.show()

# Conditional replacement on an integer column: 3 becomes -3
filtered_data_frame = data_frame.withColumn(
    'num', functions.when(data_frame['num'] == 3, -3).otherwise(data_frame['num']))
filtered_data_frame.show()

# Conditional replacement on a string column: a lone '.' becomes '0'
filtered_data_frame = data_frame.withColumn(
    'text', functions.when(data_frame['text'] == '.', '0').otherwise(data_frame['text']))
filtered_data_frame.show()

output

+---+----+
|num|text|
+---+----+
|  3|  r1|
|  9| r2.|
| 27|   .|
+---+----+

+---+----+
|num|text|
+---+----+
| -3|  r1|
|  9| r2.|
| 27|   .|
+---+----+

+---+----+
|num|text|
+---+----+
|  3|  r1|
|  9| r2.|
| 27|   0|
+---+----+
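
Note that "r2." keeps its trailing dot above: the when comparison is an exact equality test. If dots inside values should also be replaced (as discussed in the comments), regexp_replace with an escaped dot is one option (a sketch reusing data_frame and the functions import from the code above):

# Replace every literal '.' inside the string, not just exact-match cells
dotless_frame = data_frame.withColumn(
    'text', functions.regexp_replace(data_frame['text'], r'\.', '0'))
dotless_frame.show()  # 'r2.' becomes 'r20', '.' becomes '0'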
oetzi

Sample code that does the replacement properly (Scala):

package otz.scalaspark

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ValueReplacement {
  def main(args: Array[String]) {
    val sparkConfig = new SparkConf()
      .setAppName("Value-Replacement")
      .setMaster("local[*]")
      .set("spark.executor.memory", "1g")
    val sparkContext = new SparkContext(sparkConfig)

    val someData = Seq(
      Row(3, "r1"),
      Row(9, "r2"),
      Row(27, "r3"),
      Row(81, "r4")
    )

    val someSchema = List(
      StructField("number", IntegerType, true),
      StructField("word", StringType, true)
    )

    val sqlContext = new SQLContext(sparkContext)

    val dataFrame = sqlContext.createDataFrame(
      sparkContext.parallelize(someData),
      StructType(someSchema)
    )

    // Conditional replacement: when number == 3, substitute -3
    val filteredDataFrame = dataFrame.withColumn("number",
      when(col("number") === 3, -3).otherwise(col("number")))
    filteredDataFrame.show()
  }
}

output

+------+----+
|number|word|
+------+----+
|    -3|  r1|
|     9|  r2|
|    27|  r3|
|    81|  r4|
+------+----+
oetzi

Your attempt #2 was almost correct. If you have a dataframe1 like:

+--------+
|test_col|
+--------+
|     1.0|
|     2.0|
|       2|
+--------+

Your attempt would then yield the following (note that the replacement must be the string '0'; regexp_replace requires string arguments, and passing the integer 0 raises an error):

dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '.', '0'))
dataframe2.show()

+--------+
|test_col|
+--------+
|     000|
|     000|
|       0|
+--------+

Here the unescaped . is a regular-expression metacharacter that matches any character, so every character is replaced, not just the literal dot. However, if you add an escape (\.) before the dot, things work as expected:

dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, r'\.', '0'))
dataframe2.show()

+--------+
|test_col|
+--------+
|     100|
|     200|
|       2|
+--------+
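
Since the goal is a column with numbers only, the replacement can be followed by a cast (a sketch, assuming integers are wanted; in this sample '1.0' has already become '100', while the asker's data holds whole numbers or a lone dot, so the cast is clean there):

dataframe3 = dataframe2.withColumn('test_col', F.col('test_col').cast('int'))
dataframe3.printSchema()  # test_col is now an integer column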
Sunny Shukla