Recently I started learning Spark from the book "Learning Spark". In theory everything was clear, but in practice I found that I first need to preprocess the text, and there were no practical tips on this topic.

The first thing I took into account is that it is now preferable to use DataFrames instead of RDDs, so my preprocessing attempt was made on DataFrames.

Required operations:

  1. Removing punctuation (regexp_replace)
  2. Tokenization (Tokenizer)
  3. Removing stop words (StopWordsRemover)
  4. Stemming (SnowballStemmer)
  5. Filtering out short words (udf)

My code is:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, lower, regexp_replace
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml.feature import Tokenizer, StopWordsRemover
from nltk.stem.snowball import SnowballStemmer

spark = SparkSession.builder \
    .config("spark.executor.memory", "3g") \
    .config("spark.driver.cores", "4") \
    .getOrCreate()
df = spark.read.json('datasets/entitiesFull/full').select('id', 'text')

# Clean text
df_clean = df.select('id', (lower(regexp_replace('text', "[^a-zA-Z\\s]", "")).alias('text')))

# Tokenize text
tokenizer = Tokenizer(inputCol='text', outputCol='words_token')
df_words_token = tokenizer.transform(df_clean).select('id', 'words_token')

# Remove stop words
remover = StopWordsRemover(inputCol='words_token', outputCol='words_clean')
df_words_no_stopw = remover.transform(df_words_token).select('id', 'words_clean')

# Stem text
stemmer = SnowballStemmer(language='english')
stemmer_udf = udf(lambda tokens: [stemmer.stem(token) for token in tokens], ArrayType(StringType()))
df_stemmed = df_words_no_stopw.withColumn("words_stemmed", stemmer_udf("words_clean")).select('id', 'words_stemmed')

# Keep only words with at least 3 characters
filter_length_udf = udf(lambda row: [x for x in row if len(x) >= 3], ArrayType(StringType()))
df_final_words = df_stemmed.withColumn('words', filter_length_udf(col('words_stemmed')))

Processing takes a very long time; the whole dataset is 60 GB. Does it make sense to use RDDs instead? Will caching help? How can I optimize the preprocessing?

First I tested the implementation on my local machine; next I will try it on a cluster. The local machine runs Ubuntu with 6 GB RAM and 4 CPUs. Any alternative solution is also welcome. Thanks!

vndywarhol
  • 60 GB is still an extremely large file these days, and the type of processing you do does not lend itself to numerical optimization. There are probably ways to speed it up a little, but expecting results in seconds sounds hopelessly optimistic to me. – Jongware Dec 02 '18 at 10:55
  • Prefer the SQL DataFrame API over RDDs. – pissall Nov 08 '19 at 07:42
  • Is there a Spark version of SnowballStemmer available anywhere? If there is, you could use it to speed things up. I would also try to use native PySpark functions instead of UDFs, since UDFs are slower than native ones. Here you should be able to use functions like length and when to get the same results; see the sketch after these comments. https://medium.com/@fqaiser94/udfs-vs-map-vs-custom-spark-native-functions-91ab2c154b44 – uxke Apr 23 '20 at 06:43
  • With a 60 GB file Spark can't parallelize the read of the object, so you have a bottleneck there. After reading, you can repartition the dataframe to get better performance. – Shadowtrooper Feb 17 '21 at 12:37
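
As a sketch of uxke's suggestion: the length filter from the question does not need a UDF at all. Spark's built-in filter higher-order function (usable via a SQL expression since Spark 2.4) does the same work without crossing into Python. Column names follow the question's code:

from pyspark.sql.functions import expr

# Native higher-order function: the lambda runs inside the JVM,
# so no per-row Python serialization is involved.
df_final_words = df_stemmed.withColumn(
    'words', expr("filter(words_stemmed, w -> length(w) >= 3)"))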

1 Answer


JSON is typically the worst file format for Spark analysis, especially if it's a single 60GB JSON file. Spark works well with 1GB Parquet files. A little pre-processing will help a lot:

temp_df = spark.read.json('datasets/entitiesFull/full').select('id', 'text').repartition(60)
temp_df.write.parquet('some/other/path')
df = spark.read.parquet('some/other/path')
# ... continue the rest of the analysis
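
With a 60 GB input, repartition(60) targets roughly 1 GB per output file, matching the Parquet file size mentioned above.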

Wrapping the SnowballStemmer in a UDF isn't ideal from a performance perspective, but it's the most realistic option unless you're comfortable implementing the algorithm as a native Spark function in low-level JVM code. I created a Porter stemming algorithm in ceja using a UDF as well.
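
If the plain Python UDF becomes the bottleneck, a vectorized pandas UDF is often a cheap improvement: it processes a whole Arrow batch per call instead of one row at a time. A minimal sketch, assuming Spark 3.0+ with pyarrow installed, reusing the column names from the question:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, StringType
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')

# Each call receives a pandas Series whose elements are token lists.
@pandas_udf(ArrayType(StringType()))
def stem_tokens(tokens: pd.Series) -> pd.Series:
    return tokens.apply(lambda ts: [stemmer.stem(t) for t in ts])

df_stemmed = df_words_no_stopw.withColumn("words_stemmed", stem_tokens("words_clean"))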

Here's an example of a native implementation of a Spark function. It's possible to implement, but not easy.

Powers