
The file generated by the API contains data like this:

col1,col2,col3
503004,(d$üíõ$F|'.h*Ë!øì=(.î;      ,.¡|®!®3-2-704

When I read it in Spark, it appears as shown below. I read the file into an RDD of a case class and then convert it to a DataFrame using `.toDF`.

503004,������������,������������������������3-2-704

But I am trying to get a value like this, where only the alphanumeric characters are retained:

503004,dFh,3-2-704

I am using Spark 1.6 and Scala.

Please share your suggestions.

Sophie Dinka
  • try `df.select(regexp_replace('col2,"[^a-zA-Z]",""))` – chlebek Sep 20 '19 at 16:32
  • @AndrzejS: Many thanks. But instead of an empty string I am still getting values like {"col1":"CM","col2":"�����������������������������"}. I am generating the DataFrame from an RDD using `.toDF`, and I am not sure why this is happening. Please help. I am using Spark 1.6, and I am not sure whether a 'utf-8' option can be applied here. – Sophie Dinka Sep 20 '19 at 17:03
  • @AndrzejS: Kindly suggest any other way. – Sophie Dinka Sep 20 '19 at 17:34
  • @AndrzejS: I even tried `val res = df.map(x => x(1).decode('utf-8'))`, but it throws the error "cannot resolve symbol decode", even after importing `org.apache.spark.sql.functions._`. – Sophie Dinka Sep 20 '19 at 17:37
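
A note on the `decode` attempt above: Scala strings have no `decode` method. Decoding is done when the bytes are turned into a `String`. The sketch below uses only the standard library to show one likely source of the � characters: bytes written in a single-byte encoding being decoded as UTF-8. The encoding (ISO-8859-1) is an assumption, since the question does not state it, and `DecodeDemo`, `sampleBytes`, `decodeAs` and `keepAlphaNumeric` are hypothetical names.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object DecodeDemo {
  // The bytes of "d$üíõ$F|'.h" as they would be written in ISO-8859-1
  // (assumed encoding; the real file may use a different one).
  val sampleBytes: Array[Byte] =
    Array(0x64, 0x24, 0xFC, 0xED, 0xF5, 0x24, 0x46, 0x7C, 0x27, 0x2E, 0x68).map(_.toByte)

  // Decode raw bytes with an explicit charset.
  def decodeAs(bytes: Array[Byte], charset: Charset): String =
    new String(bytes, charset)

  // Keep only ASCII letters and digits.
  def keepAlphaNumeric(s: String): String =
    s.replaceAll("[^a-zA-Z0-9]", "")

  def main(args: Array[String]): Unit = {
    // Bytes that are not valid UTF-8 are replaced by U+FFFD when decoded as UTF-8.
    val wrong = decodeAs(sampleBytes, StandardCharsets.UTF_8)
    // Decoding with the charset the data was written in keeps the original characters.
    val right = decodeAs(sampleBytes, StandardCharsets.ISO_8859_1)

    println(wrong.contains('\uFFFD'))
    println(keepAlphaNumeric(right))
  }
}
```

The same character class can be passed to `regexp_replace` on a DataFrame column. Add `0-9` to it if digits in the column should be kept.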

1 Answer

This can be achieved using `regexp_replace`:
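    import org.apache.spark.sql.functions.regexp_replace
    import spark.implicits._  // needed for toDF and the $"..." column syntax
    val df = spark.sparkContext.parallelize(List(("503004","d$üíõ$F|'.h*Ë!øì=(.î;      ,.¡|®!®","3-2-704"))).toDF("col1","col2","col3")
    df.withColumn("col2_new", regexp_replace($"col2", "[^a-zA-Z]", "")).show()  // drops everything except ASCII letters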
Output:
+------+--------------------+-------+--------+
|  col1|                col2|   col3|col2_new|
+------+--------------------+-------+--------+
|503004|d$üíõ$F|'.h*Ë!øì=...|3-2-704|     dFh|
+------+--------------------+-------+--------+
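
The regex only cleans the column after the file has been read. If the � characters come from the file being decoded as UTF-8 when it is actually in another encoding, one option is to read the raw bytes through the Hadoop input format and decode them explicitly. This is a minimal sketch, assuming the file uses a single-byte encoding such as ISO-8859-1 (the question does not state the encoding). `ReadWithCharset`, `decode` and `readLines` are hypothetical names.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object ReadWithCharset {
  // Decode the valid part of a record's byte buffer with the given charset.
  // Hadoop's Text reuses its backing array, so only the first `length` bytes count.
  def decode(bytes: Array[Byte], length: Int, charset: String): String =
    new String(bytes, 0, length, charset)

  // Read a text file line by line and decode each line with an explicit charset.
  // TextInputFormat splits on newline bytes, which suits single-byte encodings.
  def readLines(sc: SparkContext, path: String, charset: String): RDD[String] =
    sc.hadoopFile[LongWritable, Text, TextInputFormat](path)
      .map { case (_, line) => decode(line.getBytes, line.getLength, charset) }
}
```

With a `SparkContext` named `sc`, `ReadWithCharset.readLines(sc, path, "ISO-8859-1")` would return an `RDD[String]` that can be split, mapped to the case class and converted with `.toDF` as before.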