
I am using Spark MLlib to build machine learning models, and I need to provide input files in libsvm format when the data contains categorical variables.

I tried converting the CSV file to libsvm format using:

1. convert.c, as suggested on the LIBSVM site
2. Csvtolibsvm.py from the phraug GitHub repo

But neither of these scripts seems to convert the categorical data. I also installed Weka and tried saving to libsvm format, but couldn't find that option in the Weka Explorer.

Please suggest another way of converting a CSV file with categorical data to libsvm format, or let me know if I am missing something here.

Thanks in advance for the help.

zero323
Sirisha

2 Answers


I guess you want to train an SVM. It needs an RDD[LabeledPoint] as input.

https://spark.apache.org/docs/1.4.1/api/scala/#org.apache.spark.mllib.classification.SVMWithSGD
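
For reference, training then looks roughly like this in PySpark (a minimal sketch with made-up feature values, assuming an existing SparkContext named sc):

from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.regression import LabeledPoint

# a toy RDD[LabeledPoint]: label first, then an already-numeric feature vector
points = sc.parallelize([
    LabeledPoint(1.0, [0.0, 1.0, 1.0]),
    LabeledPoint(0.0, [1.0, 0.0, 0.0]),
])
model = SVMWithSGD.train(points, iterations=100)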

I suggest you treat your categorical columns as in the second answer here:

How to transform a categorical variable in Spark into a set of columns coded as {0,1}?

where the LogisticRegression case is very similar to the SVM one.
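
In PySpark, that approach boils down to StringIndexer followed by OneHotEncoder, roughly like this (a sketch; the column names category/categoryIndex/categoryVec are placeholders, and note that in Spark 3.x OneHotEncoder needs a fit() step first):

from pyspark.ml.feature import StringIndexer, OneHotEncoder

# index the string column, then expand the index into a {0,1} vector
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)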

Dr VComas

You can use the hashing trick to convert categorical features into numbers, then convert the DataFrame to an RDD and map a conversion function over each Row. The following toy example uses PySpark.

For example, suppose the DataFrame to convert is df:

>> df.show(5)

+------+----------------+-------+-------+
|gender|            city|country|     os|
+------+----------------+-------+-------+
|     M|         chennai|     IN|ANDROID|
|     F|       hyderabad|     IN|ANDROID|
|     M|leighton buzzard|     GB|ANDROID|
|     M|          kanpur|     IN|ANDROID|
|     F|       lafayette|     US|    IOS|
+------+----------------+-------+-------+

I want to use the features city, country, and os to predict gender.
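
(To follow along, a DataFrame like the one shown above can be built from hard-coded rows; this assumes the SparkSession spark created at the top of the snippet below:)

df = spark.createDataFrame(
    [('M', 'chennai', 'IN', 'ANDROID'),
     ('F', 'hyderabad', 'IN', 'ANDROID'),
     ('M', 'leighton buzzard', 'GB', 'ANDROID'),
     ('M', 'kanpur', 'IN', 'ANDROID'),
     ('F', 'lafayette', 'US', 'IOS')],
    ['gender', 'city', 'country', 'os'])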

import hashlib
from pyspark.sql import SparkSession, Row
from pyspark.ml.linalg import SparseVector

spark = SparkSession \
    .builder \
    .appName("Spark-app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()  # create the Spark session

# Total number of hash bins. It should be large if you have many distinct
# categories per feature and many categorical features, but keep memory in mind.
NR_BINS = 100000

def hashnum(s):
    # hash a string to a feature index in [0, NR_BINS);
    # SparseVector indices are 0-based, so don't add 1 here
    return int(hashlib.md5(s.encode('utf-8')).hexdigest(), 16) % NR_BINS

def libsvm_converter(row):
    target = 'gender'
    features = ['city', 'country', 'os']
    if row[target] == 'M':
        lab = 1
    elif row[target] == 'F':
        lab = 0
    else:
        return None  # drop rows with an unrecognized label
    sparse_vector = []
    for f in features:
        # prefix the value with the column name, so equal values in
        # different columns hash to different indices
        v = '{}-{}'.format(f, row[f])
        hashv = hashnum(v)  # the feature index
        sparse_vector.append((hashv, 1.0))  # the value is always 1 for a categorical feature
    sparse_vector = sorted(set(sparse_vector))  # in case there are clashes (NR_BINS not big enough)
    return Row(label=lab, features=SparseVector(NR_BINS, sparse_vector))

libsvm = df.rdd.map(libsvm_converter).filter(lambda r: r is not None)
data = spark.createDataFrame(libsvm)

If you check the data, it looks like this:

>> data.show()
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(100000,[12626,68...|    1|
|(100000,[59866,68...|    0|
|(100000,[66386,68...|    1|
|(100000,[53746,68...|    1|
|(100000,[6966,373...|    0|
+--------------------+-----+
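
If you also need an actual libsvm text file on disk (which is what the question asks for), one way is to rebuild the rows as mllib LabeledPoints and use MLUtils.saveAsLibSVMFile. A sketch, reusing hashnum and NR_BINS from above and writing to a placeholder path:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.util import MLUtils

def to_labeled_point(row):
    # same hashing as above, but MLUtils expects the older mllib vector types
    label = 1.0 if row['gender'] == 'M' else 0.0
    idx = sorted({hashnum('{}-{}'.format(f, row[f])) for f in ['city', 'country', 'os']})
    return LabeledPoint(label, Vectors.sparse(NR_BINS, idx, [1.0] * len(idx)))

points = df.rdd.filter(lambda r: r['gender'] in ('M', 'F')).map(to_labeled_point)
MLUtils.saveAsLibSVMFile(points, '/tmp/libsvm-output')  # placeholder output path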