I am trying to use Spark's StringIndexer feature transformer on a column with about 15,000,000 unique string values. Regardless of how many resources I throw at it, Spark always dies on me with some sort of Out Of Memory exception.
from pyspark.ml.feature import StringIndexer
data = spark.read.parquet("s3://example/data-raw").select("user", "count")
user_indexer = StringIndexer(inputCol="user", outputCol="user_idx")
indexer_model = user_indexer.fit(data) # This never finishes
indexer_model \
    .transform(data) \
    .write.parquet("s3://example/data-indexed")
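For context, "throwing resources at it" means cranking the driver and executor memory well past what the data itself should need. The settings below are illustrative rather than an exact record of any single run:

from pyspark.sql import SparkSession

# Illustrative resource settings (exact values varied between attempts).
# Note that spark.driver.memory only takes effect if set before the driver
# JVM starts, e.g. via spark-submit --driver-memory or spark-defaults.conf.
spark = SparkSession.builder \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "12g") \
    .config("spark.driver.maxResultSize", "8g") \
    .getOrCreate()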
An error file is produced on the driver, with the beginning of it looking like this:
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 268435456 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Use 64 bit Java on a 64 bit OS
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2657)
Now, if I try to manually index the values and store them in a dataframe, everything works like a charm, all on a couple of Amazon c3.2xlarge workers.
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
data = spark.read.parquet("s3://example/data-raw").select("user", "count")
uid_map = data \
    .select("user") \
    .distinct() \
    .select("user", row_number().over(Window.orderBy("user")).alias("user_idx"))

data.join(uid_map, "user", "inner").write.parquet("s3://example/data-indexed")
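Note that the window has no partitionBy, so Spark pulls all rows into a single partition to compute row_number; that is tolerable here only because it happens after the distinct() on the user column, not over the full dataset. A quick sanity check along these lines (using the data and uid_map frames from above) can be used to confirm that each user gets exactly one index:

# Sanity check on the manual mapping: every distinct user should end up
# with exactly one distinct index.
assert uid_map.count() == data.select("user").distinct().count()
assert uid_map.select("user_idx").distinct().count() == uid_map.count()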
I would really like to use the formal transformers provided by Spark, but at the moment that doesn't seem possible. Any ideas on how I can make this work?