
In Spark ML there is a StringIndexer that can do label encoding for a given column. However, it cannot directly handle a column that holds a variable-length (multi-value) feature. For example,

+-------+--------------------+--------------------+--------------------+--------------------+
|  jobid|        country_list|     typeofwork_list|             publish|              expire|
+-------+--------------------+--------------------+--------------------+--------------------+
|1636051|USA;Spain;China;A...|1441;1442;1443      |27/03/2017 2:00:0...|3/04/2017 1:59:59 PM|
|1636052|USA;Spain;Japan;A...|1441;1442           |27/03/2017 2:00:0...|3/04/2017 1:59:59 PM|
|1636053|UK;Spain;China;A....|1442;1443           |27/03/2017 2:00:0...|3/04/2017 1:59:59 PM|
|1636054|USA;Spain;China;A...|1443                |27/03/2017 2:00:0...|3/04/2017 1:59:59 PM|

The country_list and typeofwork_list columns are variable-length features: each cell can hold more than one value, and the number of values varies from row to row. When I want to do label encoding on them, I cannot apply StringIndexer directly to these columns.

Taking the country_list column for example, something like the following is the result I need:

+--------------------+
|        country_list|
+--------------------+
|0;1;2;3...          |
|0;1;4;3...          |
|5;1;2;3...          |
|0;1;2;3...          |

What is the best way to do label encoding on such columns in Spark?

One way I am thinking of: first explode the country_list column into a single-column DataFrame, then do label encoding (StringIndexer) on this interim DataFrame. After that, dropDuplicates and collect it, so that I have the mapping. Then I can broadcast the mapping to all worker machines, and the original DataFrame can use a UDF that wraps the mapping to transform the country_list column. Is there an easier way to do this?
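To make the idea above concrete, here is a minimal pure-Python sketch of that pipeline (plain lists and dicts stand in for the DataFrame, the broadcast variable, and the UDF, so it runs without a Spark cluster; the country values are made up). Note one difference from StringIndexer: StringIndexer assigns indices by descending label frequency, while this sketch uses first-seen order.

```python
# Input column: each row is a multi-value cell, values separated by ";"
rows = [
    "USA;Spain;China;Australia",
    "USA;Spain;Japan;Australia",
    "UK;Spain;China;Australia",
]

# "Explode" + "dropDuplicates": flatten every cell into individual values
# and build a value -> index mapping (the role StringIndexer would play).
mapping = {}
for row in rows:
    for country in row.split(";"):
        if country not in mapping:
            mapping[country] = len(mapping)

# "Broadcast" the mapping and apply it per row, mimicking the UDF step
# that rewrites each multi-value cell into its encoded form.
encoded = [";".join(str(mapping[c]) for c in row.split(";")) for row in rows]
print(encoded)  # ['0;1;2;3', '0;1;4;3', '5;1;2;3']
```

In Spark terms, the `mapping` dict is what you would collect from the indexed interim DataFrame and wrap in a broadcast variable before using it inside the UDF.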

Thank you.

CyberPlayerOne
