
I have a text file with hundreds of columns, but the columns don't have names.

The first column is the label and the others are features. The examples I've read all specify column names for the training data, but that is quite troublesome here since there are so many columns.

How can I deal with this situation?

April

1 Answer


You can use VectorAssembler in combination with a list comprehension to structure your data for model training, without typing out every column name. Consider this example data with two feature columns (x1 and x2) and a response variable y.

df = sc.parallelize([(5, 1, 6),
                     (6, 9, 4),
                     (5, 3, 3),
                     (4, 4, 2),
                     (4, 5, 1),
                     (2, 2, 2),
                     (1, 7, 3)]).toDF(["y", "x1", "x2"])

First, we create a list of column names that are not "y":

colsList = [x for x in df.columns if x != 'y']
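This same pattern scales to your case: when Spark reads a headerless file, it assigns default names like _c0, _c1, ..., so you never have to write them out by hand. A minimal pure-Python illustration (the column names here are the ones Spark generates by default; treating _c0 as the label is an assumption matching your "first column is the label" setup):

    # Spark names headerless columns _c0, _c1, ... automatically.
    all_cols = ["_c%d" % i for i in range(5)]      # ['_c0', '_c1', '_c2', '_c3', '_c4']

    # Assume the first column (_c0) is the label; keep the rest as features.
    feature_names = [c for c in all_cols if c != "_c0"]
    print(feature_names)                           # ['_c1', '_c2', '_c3', '_c4']
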

Now, we can use VectorAssembler:

from pyspark.ml.feature import VectorAssembler

vectorizer = VectorAssembler()
vectorizer.setInputCols(colsList)
vectorizer.setOutputCol("features")

output = vectorizer.transform(df)
output.select("features", "y").show()
+---------+---+
| features|  y|
+---------+---+
|[1.0,6.0]|  5|
|[9.0,4.0]|  6|
|[3.0,3.0]|  5|
|[4.0,2.0]|  4|
|[5.0,1.0]|  4|
|[2.0,2.0]|  2|
|[7.0,3.0]|  1|
+---------+---+
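Putting it together for a headerless text file: a sketch of the end-to-end flow, assuming an existing SparkSession named spark, a comma-separated file at "data.txt", and that the first column is the label (the file path and the helper name feature_cols are hypothetical, not from your setup):

    def feature_cols(all_cols, label_col):
        """Return every column name except the label column."""
        return [c for c in all_cols if c != label_col]

    # Assuming a SparkSession `spark` and a comma-separated file "data.txt":
    # df = spark.read.csv("data.txt", inferSchema=True)   # columns: _c0, _c1, ...
    # df = df.withColumnRenamed("_c0", "label")
    #
    # from pyspark.ml.feature import VectorAssembler
    # assembler = VectorAssembler(inputCols=feature_cols(df.columns, "label"),
    #                             outputCol="features")
    # training = assembler.transform(df).select("label", "features")

The resulting DataFrame, with a label column and a single vector-valued features column, is the shape most pyspark.ml estimators expect.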
mtoto