How does a Spark model treat a vector column?

Question

How do Spark ML methods treat a column produced by VectorAssembler? For example, if I have longitude and latitude columns, is it better to assemble them with VectorAssembler first and then feed the result into my model, or does it make no difference if I pass them in directly (separately)?

Example 1:

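from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')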
vector_assembler = VectorAssembler(inputCols=['loc', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[loc_assembler, vector_assembler, lr])

Example 2:

vector_assembler = VectorAssembler(inputCols=['long', 'lat', 'feature1', 'feature2'], outputCol='features')
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages=[vector_assembler, lr])

What is the difference? Which one is better?

I have added the example @dennlinger.. Could you help me now? — Gregorius Edwadr, Sep 17 '18 at 07:49

desertnaut · Accepted Answer · 2018-09-17T09:14:29.753

There will not be any difference, simply because in both of your examples the final form of the features column is the same: in your 1st example, the loc vector is broken back into its individual components when it is assembled into features.

Here is a short demonstration with dummy data (leaving the linear regression part aside, as it is unnecessary for this discussion):

spark.version
#  u'2.3.1'

# dummy data:
df = spark.createDataFrame([[0, 33.3, -17.5, 10., 0.2],
                              [1, 40.4, -20.5, 12., 2.2],
                              [2, 28., -23.9, -2., -1.7],
                              [3, 29.5, -19.0, -0.5, -0.2],
                              [4, 32.8, -18.84, 1.5, 1.8]
                             ],
                              ["id","lat", "long", "other", "label"])

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.pipeline import Pipeline

loc_assembler = VectorAssembler(inputCols=['long', 'lat'], outputCol='loc')
vector_assembler = VectorAssembler(inputCols=['loc', 'other'], outputCol='features')
pipeline = Pipeline(stages=[loc_assembler, vector_assembler])

model = pipeline.fit(df)
model.transform(df).show()

The result is:

+---+----+------+-----+-----+-------------+-----------------+
| id| lat|  long|other|label|          loc|         features|
+---+----+------+-----+-----+-------------+-----------------+
|  0|33.3| -17.5| 10.0|  0.2| [-17.5,33.3]|[-17.5,33.3,10.0]|
|  1|40.4| -20.5| 12.0|  2.2| [-20.5,40.4]|[-20.5,40.4,12.0]|
|  2|28.0| -23.9| -2.0| -1.7| [-23.9,28.0]|[-23.9,28.0,-2.0]|
|  3|29.5| -19.0| -0.5| -0.2| [-19.0,29.5]|[-19.0,29.5,-0.5]|
|  4|32.8|-18.84|  1.5|  1.8|[-18.84,32.8]|[-18.84,32.8,1.5]| 
+---+----+------+-----+-----+-------------+-----------------+

i.e. the features column is identical to what your 2nd example (not shown here) would produce, where you do not use the intermediate assembled feature loc...
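To make the flattening argument concrete without needing a Spark session, here is a minimal plain-Python sketch. It is a toy model, not Spark's actual implementation: the helper names `assemble`, `two_stage` and `one_stage` are hypothetical, and the only assumption is the behaviour visible in the output above, namely that numeric columns are appended and vector columns are spliced in element by element.

```python
# Toy model of vector assembly in plain Python (NOT Spark's implementation).
# Assumption: numeric inputs are appended, vector inputs are flattened in place.

def assemble(row, input_cols):
    """Concatenate the named columns of `row` into one flat list of floats."""
    out = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            out.extend(float(v) for v in value)  # vector column: flatten
        else:
            out.append(float(value))             # numeric column: append
    return out

# First three rows of the dummy data used above
rows = [
    {"lat": 33.3, "long": -17.5, "other": 10.0},
    {"lat": 40.4, "long": -20.5, "other": 12.0},
    {"lat": 28.0, "long": -23.9, "other": -2.0},
]

def two_stage(row):
    # Mirrors Example 1: build loc first, then assemble loc with the rest
    with_loc = dict(row, loc=assemble(row, ["long", "lat"]))
    return assemble(with_loc, ["loc", "other"])

def one_stage(row):
    # Mirrors Example 2: assemble all raw columns in a single step
    return assemble(row, ["long", "lat", "other"])

nested = [two_stage(r) for r in rows]
flat = [one_stage(r) for r in rows]

print(nested[0])       # [-17.5, 33.3, 10.0]
print(nested == flat)  # True
```

Under this model the two routes can only differ if the column order differs, so keeping long and lat in the same relative position in both setups is what makes the resulting features vectors match.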
