I have a large dataset whose records are formatted like the following sample:
// +---+------+------+
// |cid|itemId|bought|
// +---+------+------+
// |abc|   123|  true|
// |abc|   345|  true|
// |abc|   567|  true|
// |def|   123|  true|
// |def|   345|  true|
// |def|   567|  true|
// |def|   789| false|
// +---+------+------+
cid and itemId are strings.
There are 965,964,223 records.
I am trying to convert cid to an integer using StringIndexer as follows:
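import org.apache.spark.ml.feature.StringIndexer

// Note: repartition returns a new Dataset; the result is not assigned here
dataset.repartition(50)
val cidIndexer = new StringIndexer().setInputCol("cid").setOutputCol("cidIndex")
val cidIndexedMatrix = cidIndexer.fit(dataset).transform(dataset)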
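To make the goal concrete, here is a minimal plain-Scala sketch (no Spark) of the mapping I expect for the cid column of the sample. It assumes StringIndexer's default ordering, where the most frequent label gets index 0.0. The names StringIndexerSketch, fitIndex, transform and sampleCids are made up for illustration and are not part of Spark.

```scala
// Plain-Scala illustration only; not how Spark implements StringIndexer.
object StringIndexerSketch {
  // Assumption: labels are ordered by descending frequency (the default),
  // so the most frequent label maps to 0.0. Ties are broken alphabetically
  // here only to keep the result deterministic.
  def fitIndex(labels: Seq[String]): Map[String, Double] =
    labels
      .groupBy(identity)
      .map { case (label, occurrences) => (label, occurrences.size) }
      .toSeq
      .sortBy { case (label, count) => (-count, label) }
      .zipWithIndex
      .map { case ((label, _), idx) => label -> idx.toDouble }
      .toMap

  // Replace each label by its index, like the cidIndex output column.
  def transform(labels: Seq[String], index: Map[String, Double]): Seq[Double] =
    labels.map(index)

  // The cid column of the seven sample rows above.
  val sampleCids: Seq[String] =
    Seq("abc", "abc", "abc", "def", "def", "def", "def")
}
```

On the sample, def occurs four times and abc three times, so fitIndex(sampleCids) maps def to 0.0 and abc to 1.0.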
However, the StringIndexer code above is very slow (it takes around 30 minutes). The data is so large that I cannot do anything further with it after that step.
I am using an Amazon EMR cluster with 2 R4 2XLarge nodes (61 GB of memory).
Are there any performance improvements I can make? Any help would be much appreciated.