You can use the hashing trick to convert categorical features into numbers, and then convert the DataFrame to an RDD in order to map a conversion function over each Row.
The following toy example uses PySpark.
Say the DataFrame to convert is df:
>> df.show(5)
+------+----------------+-------+-------+
|gender| city|country| os|
+------+----------------+-------+-------+
| M| chennai| IN|ANDROID|
| F| hyderabad| IN|ANDROID|
| M|leighton buzzard| GB|ANDROID|
| M| kanpur| IN|ANDROID|
| F| lafayette| US| IOS|
+------+----------------+-------+-------+
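In case you want to reproduce this toy DataFrame, something like the following works (a sketch, assuming you already have a SparkSession named spark; the rows are just the five shown above):

example_rows = [
    ("M", "chennai", "IN", "ANDROID"),
    ("F", "hyderabad", "IN", "ANDROID"),
    ("M", "leighton buzzard", "GB", "ANDROID"),
    ("M", "kanpur", "IN", "ANDROID"),
    ("F", "lafayette", "US", "IOS"),
]
df = spark.createDataFrame(example_rows, ["gender", "city", "country", "os"])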
I want to use the features city, country and os to predict gender.
import hashlib

from pyspark.sql import SparkSession, Row
from pyspark.ml.linalg import SparseVector

spark = SparkSession \
    .builder \
    .appName("Spark-app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()  # create the Spark session

# Total number of hash bins. It should be a big number if each feature has
# many distinct values and you have a lot of categorical features (fewer
# collisions), but keep the memory footprint in mind.
NR_BINS = 100000

def hashnum(value):
    # md5 wants bytes, so encode the string; the result is an index in [0, NR_BINS)
    return int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16) % NR_BINS

def libsvm_converter(row):
    target = "gender"
    features = ['city', 'country', 'os']
    if row[target] == "M":
        lab = 1
    elif row[target] == "F":
        lab = 0
    else:
        return None  # unknown label, drop this row later
    sparse_vector = []
    for f in features:
        # prefix with the feature name so the same value in different
        # columns hashes to different indices
        v = '{}-{}'.format(f, row[f])
        hashv = hashnum(v)  # the index in the sparse vector
        sparse_vector.append((hashv, 1))  # the value is always 1 for a categorical feature
    sparse_vector = list(set(sparse_vector))  # in case there are clashes (NR_BINS not big enough)
    return Row(label=lab, features=SparseVector(NR_BINS, sparse_vector))

libsvm = df.rdd.map(libsvm_converter).filter(lambda r: r is not None)  # drop unlabeled rows
data = spark.createDataFrame(libsvm)
If you check the data, it looks like this:
>> data.show()
+--------------------+-----+
| features|label|
+--------------------+-----+
|(100000,[12626,68...| 1|
|(100000,[59866,68...| 0|
|(100000,[66386,68...| 1|
|(100000,[53746,68...| 1|
|(100000,[6966,373...| 0|
+--------------------+-----+
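From here you can feed data straight into a Spark ML classifier. A minimal sketch with logistic regression (the split ratio, seed and maxIter are just illustrative choices, not part of the original example):

from pyspark.ml.classification import LogisticRegression

# train/test split on the hashed features
train, test = data.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
model = lr.fit(train)
model.transform(test).select("label", "prediction").show(5)

One thing to keep in mind with hashed features: you can no longer map a coefficient back to the original category value, and a too small NR_BINS will silently merge unrelated categories through collisions.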