
I have a dataset like the example below and I am trying to group all rows by symbol and apply standard scaling to each group, so that in the end all my data is scaled per group. How can I do that with MLlib and PySpark? I could not find a single solution on the internet for it. Can anyone help here?

+------+------------------+------------------+------------------+------------------+
|symbol|              open|              high|               low|             close|
+------+------------------+------------------+------------------+------------------+
|   AVT|             4.115|             4.115|            4.0736|            4.0736|
|   ZEC| 365.6924715181936| 371.9164684545918| 364.8854025324053| 369.5950712239761|
|   ETH|  647.220769018717| 654.6370842160561| 644.8942258095359| 652.1231757197687|
|   XRP|0.3856343600456335|0.4042970302356221|0.3662228285447956|0.4016658006619401|
|   XMR|304.97650674864144|304.98649644294267|299.96970554155274| 303.8663243145598|
|   LTC|321.32437862304715| 335.1872636382617| 320.9704201234651| 334.5057757774086|
|   EOS|            5.1171|            5.1548|            5.1075|             5.116|
|   BCH| 1526.839255299505| 1588.106037653013|1526.8392543926366|1554.8447136830328|
|  DASH|      878.00000003|      884.03769206|      869.22000004|      869.22000004|
|   BTC|17042.224796462127| 17278.87984139109|16898.509289685637|17134.611038665582|
|   REP|       32.50162799|         32.501628|       32.41062673|       32.50162799|
|  DASH|      858.98413357|      863.01413927|      851.07145059|      851.17051529|
|   ETH| 633.1390884474979|  650.546984589714| 631.2674221381849| 641.4566047907362|
|   XRP|0.3912300406160967|0.3915937383961073|0.3480682353334925|0.3488616679337076|
|   EOS|              5.11|            5.1675|            5.0995|            5.1674|
|   BCH|1574.9602789966184|1588.6004569127992|            1515.3|            1521.0|
|   BTC|  17238.0199449088| 17324.83886467445|16968.291408828714| 16971.12960974206|
|   LTC| 303.3999614441217| 317.6966006615225|302.40702519057584|  310.971265429805|
|   REP|       32.50162798|       32.50162798|         32.345677|         32.345677|
|   XMR| 304.1618444641083| 306.2720324372592|295.38042671416935|  295.520097663825|
+------+------------------+------------------+------------------+------------------+
  • Hi, I understand you want to group by `symbol`, but what type of scaling are you looking to do? MinMax, StdDev, anything else? – nferreira78 Feb 23 '22 at 11:51
  • I prefer standard scaling but generally speaking any kind, StdDev or MinMax! – Yordan Иванов Feb 23 '22 at 11:58
  • sorry, forgot to ask: are you using `rdd`, `pyspark.DataFrame` or anything else? – nferreira78 Feb 23 '22 at 12:12
  • also, are you able to do the `groupBy`, or are you also having trouble with it? – nferreira78 Feb 23 '22 at 12:13
  • I can do groupBy, but the aggregate functions that groupBy provides do not do standard scaling at all. The data is in a Spark DataFrame. I can convert it into an RDD, but I still don't know how to group the RDD by symbol and scale the data that way. I may have to use the groupByKey() and map() functions, but I am new to PySpark and don't know how to go about it! – Yordan Иванов Feb 23 '22 at 12:21
  • try this: https://stackoverflow.com/questions/47995188/how-to-calculate-mean-and-standard-deviation-given-a-pyspark-dataframe – nferreira78 Feb 23 '22 at 12:44
  • This is a related question but it does not address the groupBy problem. Thanks anyway for your help! – Yordan Иванов Feb 23 '22 at 13:07
  • Sorry, I understood you were OK with the `groupBy` but not so much with the scaling. Please see my answer below, where I explain how it's done in principle. – nferreira78 Feb 24 '22 at 14:40

1 Answer


I suggest you import the following:

import pyspark.sql.functions as f

then you can do it like this (not fully tested code):

# per-symbol mean and standard deviation, shown here for the "open" column
stats_df = df.groupBy('symbol').agg(
    f.mean('open').alias('open_mean'),
    f.stddev('open').alias('open_stddev'))

This is the principle of how you would do it (you could instead use the min and max functions for MinMax scaling); then you just have to apply the standard-scaling formula using the statistics in stats_df:

x' = (x - μ) / σ
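To make that concrete, here is a minimal sketch (assuming a DataFrame df with the columns shown in the question) that computes the per-symbol statistics for every price column, joins them back onto the original rows, and applies the formula; the <col>_mean / <col>_stddev names are just the aliases chosen here:

import pyspark.sql.functions as f

cols = ['open', 'high', 'low', 'close']

# one row per symbol holding the mean and stddev of each price column
stats_df = df.groupBy('symbol').agg(
    *[f.mean(c).alias(c + '_mean') for c in cols],
    *[f.stddev(c).alias(c + '_stddev') for c in cols])

# join the group statistics back and apply x' = (x - mean) / stddev
scaled_df = df.join(stats_df, on='symbol')
for c in cols:
    scaled_df = scaled_df.withColumn(
        c, (f.col(c) - f.col(c + '_mean')) / f.col(c + '_stddev'))

# drop the helper statistic columns
scaled_df = scaled_df.drop(
    *[c + '_mean' for c in cols], *[c + '_stddev' for c in cols])

An equivalent, join-free variant computes the same mean/stddev expressions over a pyspark.sql.Window partitioned by symbol.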

nferreira78