
How do I remove all the special characters from a CSV file in a Spark DataFrame using Java Spark? For example, below is a CSV row containing spaces and special characters:

"UNITED STATES CELLULAR CORP. - OKLAHOMA",WIRELESS,"US Cellular"

Output I need:

UNITEDSTATESCELLULARCORPOKLAHOMA|WIRELESS|US Cellular (in lower case)

Thanks in Advance

Lakhwinder Singh

1 Answer


You can use the String.replaceAll method (with a regex) to replace every character that is not alphanumeric with an empty string. Wrap this in a UDF and apply it to every column of the DataFrame.

The Java code should look like this:

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

import java.util.Arrays;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;

// UDF that strips every character that is not a letter or digit;
// the cast to UDF1 disambiguates among the many udf(...) overloads
UserDefinedFunction cleanUDF = udf(
    (UDF1<String, String>) strVal ->
        strVal == null ? null : strVal.replaceAll("[^a-zA-Z0-9]", ""),
    DataTypes.StringType
);

// Apply the UDF to every column, keeping the original column names
Column[] newCols = Arrays.stream(df.columns())
    .map(c -> cleanUDF.apply(col(c)).alias(c))
    .toArray(Column[]::new);

Dataset<Row> newDf = df.select(newCols);

Reference: How do I call a UDF on a Spark DataFrame using JAVA?
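If all you need is to strip non-alphanumeric characters, Spark's built-in `regexp_replace` (in `org.apache.spark.sql.functions`) can do the same thing without a UDF, which usually performs better since it avoids serializing values out to Java code. A sketch, assuming the same `df` as above:

```java
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import java.util.Arrays;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_replace;

// Replace every non-alphanumeric character with the empty string, column by column
Column[] cleaned = Arrays.stream(df.columns())
    .map(c -> regexp_replace(col(c), "[^a-zA-Z0-9]", "").alias(c))
    .toArray(Column[]::new);

Dataset<Row> cleanedDf = df.select(cleaned);
```

The regex `[^a-zA-Z0-9]` is the same one used in the UDF, so on the question's sample row the first column becomes `UNITEDSTATESCELLULARCORPOKLAHOMA`. Note that this also removes spaces, so `US Cellular` would become `USCellular`; if spaces should be kept, add a space inside the character class (`[^a-zA-Z0-9 ]`).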

shanmuga
  • @pragadeeshwaranvenkatachalam I have added a java code. Sorry I couldn't test it, it may not work ex – shanmuga Jan 07 '19 at 13:16