
If I use this Spark SQL statement:

df = spark.sql('SELECT col_name FROM table_name')

it returns a Spark DataFrame object. How can I convert this to an RDD? Is there a way to read a table directly using SQL but produce an RDD instead of a DataFrame?

Thanks in advance

Miguel 2488
  • df.rdd should give you the RDD – sramalingam24 Nov 11 '18 at 16:28
  • I tried that, but no. Instead I get the following error: `PicklingError: Could not serialize object: Py4JError: An error occurred while calling o60.__getstate__. Trace: py4j.Py4JException: Method __getstate__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:274) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79)` – Miguel 2488 Nov 11 '18 at 16:32
  • I have visited a good number of posts here talking about more or less the same thing, but I get this error instead – Miguel 2488 Nov 11 '18 at 16:33
  • https://stackoverflow.com/questions/29000514/how-to-convert-a-dataframe-back-to-normal-rdd-in-pyspark Try one of the alternatives suggested here – sramalingam24 Nov 11 '18 at 16:40
  • Possible duplicate of [Pyspark - Why i can't convert a sql dataframe to an rdd?](https://stackoverflow.com/questions/53250152/pyspark-why-i-cant-convert-a-sql-dataframe-to-an-rdd) – 10465355 Nov 11 '18 at 16:40
  • @sramalingam24 Thank you, but none of those will work; I tried that already. Basically, I get the error when calling `df.rdd`. I just wanted to know if there's any other way of achieving the same result, or some kind of workaround for this situation – Miguel 2488 Nov 11 '18 at 16:45
  • Where is your data coming from? Can you do df.show? – sramalingam24 Nov 11 '18 at 17:09
  • The data is coming from a database I have in the cloud. If I do df.show() it works well; I see my data printed as expected, returning a one-column df – Miguel 2488 Nov 11 '18 at 17:23

1 Answer

df = spark.sql('SELECT col_name FROM table_name')

df.rdd # you can save it, perform transformations etc.

df.rdd returns the content as a pyspark.RDD of Row objects.
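For example, assuming the single col_name column from the question, a quick sketch of how you could inspect the result:

rdd = df.rdd  # an RDD of Row objects

rdd.take(2)  # e.g. [Row(col_name=...), Row(col_name=...)]

rdd.map(lambda row: row.col_name).take(2)  # plain column values instead of Rows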

You can then map over that RDD of Row, transforming every Row into a numpy vector. I can't be more specific about the transformation since I don't know what your vector represents, given the information provided.
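As a minimal sketch, assuming the selected columns are all numeric (Row is a subclass of tuple, so list(row) yields its column values):

import numpy as np

# each Row behaves like a tuple; this assumes the selected columns are numeric
vectors = df.rdd.map(lambda row: np.array(list(row), dtype=float))

vectors.take(1)  # e.g. [array([42.])] for a one-column df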

Note 1: df is the variable that holds our DataFrame.

Note 2: df.rdd has been available since Spark 1.3.

Ali AzG
  • Did this solve the `PicklingError: Could not serialize object: Py4JError: An error occurred while calling o60.__getstate__. Trace: py4j.Py4JException: Method __getstate__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:274) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79)` error? – pvy4917 Nov 12 '18 at 16:13