
I have a large set of data records, formatted as in the following sample:

// +---+------+------+
// |cid|itemId|bought|
// +---+------+------+
// |abc|   123|  true|
// |abc|   345|  true|
// |abc|   567|  true|
// |def|   123|  true|
// |def|   345|  true|
// |def|   567|  true|
// |def|   789| false|
// +---+------+------+

cid and itemId are strings.

There are 965,964,223 records.

I am trying to convert cid to an integer using StringIndexer as follows:

import org.apache.spark.ml.feature.StringIndexer

dataset.repartition(50)
val cidIndexer = new StringIndexer().setInputCol("cid").setOutputCol("cidIndex")
val cidIndexedMatrix = cidIndexer.fit(dataset).transform(dataset)

But these lines of code are very slow (they take around 30 minutes). The problem is that the data is so huge that I cannot do anything further after that.

I am using an Amazon EMR cluster with 2 R4 2XLarge nodes (61 GB of memory each).

Is there anything further I can do to improve performance? Any help will be much appreciated.

Rengasami Ramanujam
  • which version of Spark are you using? – Machiel Aug 12 '19 at 09:11
  • I have the same problem. In my case it's only 100,000,000 rows x 150 column. On 4 nodes with a total of 256 cores and 1,024GB memory it doesn't even finish a single task after 15 minutes. Background: My string column is not entirely unique but as good as. Also, it is rather long with ~ 40 characters on average. Is there a better way to add an index column? I've tried creating a lookup table with `zipWithIndex` but that also didn't scale well. – Juergen Sep 05 '19 at 17:30
  • I've already tried the alternative solution using zipWithIndex described here https://blogs.msdn.microsoft.com/azuredatalake/2016/06/09/appending-an-index-column-to-distributed-dataframe-based-on-another-column-with-non-unique-entries/. I.e. creating lookup table of IDs and joining. This didn't scale well either though since the lookup table was too large for a broadcast join. – Juergen Sep 09 '19 at 09:31
  • looks like a very old question and it may be related to this bug report https://issues.apache.org/jira/browse/SPARK-20392 what is your current spark version? – geo Sep 11 '19 at 03:13
  • I'm on Spark 2.4.3 so that shouldn't be an issue – Juergen Sep 11 '19 at 17:03

2 Answers


This is expected behavior if the cardinality of the column is high. As part of the fitting process, StringIndexer collects all the labels in order to build a label-to-index mapping (using Spark's o.a.s.util.collection.OpenHashMap).

This process requires O(N) memory in the worst case scenario, and is both computationally and memory intensive.
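
For intuition, here is a rough sketch of what the fit step effectively has to do (a simplification for illustration only, not StringIndexer's actual implementation; it reuses the dataset from the question):

// Rough illustration only: every distinct label ends up on the driver,
// sorted by frequency, before the label -> index map can be built.
val labelToIndex: Map[String, Double] = dataset
  .groupBy("cid").count()
  .collect()                                   // all distinct cids pulled to the driver
  .sortBy(-_.getAs[Long]("count"))             // most frequent label gets index 0.0
  .map(_.getString(0))
  .zipWithIndex
  .map { case (label, idx) => label -> idx.toDouble }
  .toMap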

In cases where the cardinality of the column is high and its content is going to be used as a feature, it is better to apply FeatureHasher (Spark 2.3 or later).

import org.apache.spark.ml.feature.FeatureHasher

val hasher = new FeatureHasher()
  .setInputCols("cid")
  .setOutputCol("cid_hash_vec")  // FeatureHasher has a single, vector-valued output column

val hashed = hasher.transform(dataset)

It doesn't guarantee uniqueness and it is not reversible, but it is good enough for many applications and doesn't require a fitting process.
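
If collisions become a concern, the number of hash buckets can be increased from the default of 2^18 = 262,144 (a sketch reusing the dataset from the question; the larger bucket count is just an illustrative value):

val widerHasher = new FeatureHasher()
  .setInputCols("cid")
  .setOutputCol("cid_hash_vec")
  .setNumFeatures(1 << 22)   // 4,194,304 buckets instead of the default 262,144

widerHasher.transform(dataset).select("cid", "cid_hash_vec").show(3, truncate = false)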

For a column that won't be used as a feature, you can also use the built-in hash function:

import org.apache.spark.sql.functions.hash

dataset.withColumn("cid_hash", hash($"cid"))
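
hash is Murmur3-based and produces a 32-bit integer column. As a quick sanity check (a sketch against the question's dataset), you can count how many distinct cid values actually collide:

val withHash = dataset.withColumn("cid_hash", hash($"cid"))

// Number of colliding cids = distinct cids minus distinct hash values.
val distinctCids   = withHash.select("cid").distinct().count()
val distinctHashes = withHash.select("cid_hash").distinct().count()
println(s"hash collisions: ${distinctCids - distinctHashes}")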
10465355
  • You are right that the cardinality of my columns is extremely high. I've read the documentation for both functions and have questions regarding how they are different from StringIndexer. I don't mind if the hash columns are not reversible if I can create them as additional columns. However, if they're non-unique, does it mean that hash collisions are likely? Also, why wouldn't you use `hash` for features? I already see a problem that `hash` returns an Int column but in the future I will have more unique values in my column than the Integer limit allows. – Juergen Sep 16 '19 at 11:29
  • _if they're non-unique, does it mean that hash collisions are likely_ - In general, for fixed number of buckets, hash collision is proportional to the number of unique values. So if number of unique values is close to `Integer.MAX_VALUE` collisions are pretty much guaranteed. _why wouldn't you use hash for features_ - because hash alone is not suitable as a feature. At minimum (used as categorical for tree models) it would require a metadata (another bottleneck), and cannot be used directly for linear model (which means that expansion is necessary). – 10465355 Sep 17 '19 at 13:22
  • *So if number of unique values is close to Integer.MAX_VALUE collisions are pretty much guaranteed* - right, that's what I was afraid of. Since my number of unique IDs will over time approach `Integer.MAX_VALUE` neither solution will work. Even with > 10,000,000 unique IDs I guess collisions will happen *a lot* because of the birthday problem. – Juergen Sep 17 '19 at 14:54

Assuming that:

  • You plan to use the cid as a feature (after StringIndexer + OneHotEncoderEstimator)
  • Your data sits in S3

A few questions would need answering first, but without knowing much more, my first guess is that you should not worry about memory for now and check your degree of parallelism instead. You only have 2 R4 2XLarge instances, which will give you:

  • 8 vCPUs each
  • 61 GB of memory each

Personally, I would try to either:

  • Get more instances
  • Swap the R4 2XLarge instances for instance types with more CPUs

Unfortunately, with the current EMR offering this can only be achieved by throwing money at the problem.

Finally, what's the need to repartition(50)? That might just introduce further delays...
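
To judge whether repartition(50) (and the shuffle settings) helps or hurts, a quick look at the current degree of parallelism can be useful (a sketch; spark is the SparkSession and dataset is the DataFrame from the question):

// Parallelism is capped by min(number of partitions, total executor cores).
println(s"input partitions:    ${dataset.rdd.getNumPartitions}")
println(s"default parallelism: ${spark.sparkContext.defaultParallelism}")
println(s"shuffle partitions:  ${spark.conf.get("spark.sql.shuffle.partitions")}")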

Marsellus Wallace