I am using PySpark to read data from a Cassandra database.

Imports:

from pyspark.ml.feature import SQLTransformer
from transform.Base import Transform
I have loaded the data, and it looks like this:
+----+--------------+-----+---+
|time|MEM UTI PERC %|devId|Lid|
+----+--------------+-----+---+
| 482|   8.661052632|    6| 20|
| 654|   9.162190612|    6| 20|
| 364|   8.219230769|    6| 20|
+----+--------------+-----+---+
When I apply the SQLTransformer with the following SQL statement:
self.sqlstatement = "SELECT Time,MEM UTI PERC % FROM __THIS__ WHERE "
sqltrans = SQLTransformer()
sqltrans.setStatement(self.sqlstatement)
new_df = sqltrans.transform(sparkdf)
it throws this error:
mismatched input 'UTI' expecting {<EOF>, ';'}(line 1, pos 19)
So I modified the SQL statement to wrap the column name containing spaces in double quotes/single quotes, like below:
SELECT Time,"MEM UTI PERC %" FROM __THIS__ WHERE
This time the transformer doesn't throw an exception, but instead it replaces every value of that column with the column name itself, like below:
+----+--------------+
|Time|MEM UTI PERC %|
+----+--------------+
| 212|MEM UTI PERC %|
|  26|MEM UTI PERC %|
+----+--------------+
I want to get the data properly, like:
+----+--------------+
|Time|MEM UTI PERC %|
+----+--------------+
| 212|          20.7|
|  26|          40.0|
+----+--------------+