
If I use this Spark SQL statement:

df = spark.sql('SELECT col_name FROM table_name')

it returns a Spark DataFrame object. How can I convert this to an RDD? Is there a way to read a table directly using SQL but produce an RDD instead of a DataFrame?

Thanks in advance

Miguel 2488
  • df.rdd should give you the RDD – sramalingam24 Nov 11 '18 at 16:28
  • I tried that, but no. Instead I get the following error: `PicklingError: Could not serialize object: Py4JError: An error occurred while calling o60.__getstate__. Trace: py4j.Py4JException: Method __getstate__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:274) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79)` – Miguel 2488 Nov 11 '18 at 16:32
  • I have visited a good number of posts here talking about more or less the same thing, but I get this error instead – Miguel 2488 Nov 11 '18 at 16:33
  • https://stackoverflow.com/questions/29000514/how-to-convert-a-dataframe-back-to-normal-rdd-in-pyspark Try one of the alternatives suggested here – sramalingam24 Nov 11 '18 at 16:40
  • Possible duplicate of [Pyspark - Why i can't convert a sql dataframe to an rdd?](https://stackoverflow.com/questions/53250152/pyspark-why-i-cant-convert-a-sql-dataframe-to-an-rdd) – 10465355 Nov 11 '18 at 16:40
  • @sramalingam24 Thank you, but none of those will work; I tried that already. Basically, I get the error when calling `df.rdd`. I just wanted to know if there's any other way of achieving the same result, or some kind of workaround for this situation – Miguel 2488 Nov 11 '18 at 16:45
  • Where is your data coming from? Can you do df.show? – sramalingam24 Nov 11 '18 at 17:09
  • The data is coming from a database I have in the cloud. If I do df.show() it works well; I see my data printed as expected, returning a one-column df – Miguel 2488 Nov 11 '18 at 17:23

1 Answer

df = spark.sql('SELECT col_name FROM table_name')

df.rdd # you can save it, perform transformations etc.

df.rdd returns the content as a pyspark.RDD of Row objects.
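For example, assuming the single col_name column from the question, a quick sketch of how you could inspect the result:

rdd = df.rdd  # an RDD of Row objects

rdd.take(2)  # e.g. [Row(col_name=...), Row(col_name=...)]

rdd.map(lambda row: row.col_name).take(2)  # plain column values instead of Rows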

You can then map over that RDD of Row, transforming every Row into a numpy vector. I can't be more specific about the transformation since I don't know what your vector represents, given the information provided.
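As a minimal sketch, assuming the selected columns are all numeric (Row is a subclass of tuple, so list(row) yields its column values):

import numpy as np

# each Row behaves like a tuple; this assumes the selected columns are numeric
vectors = df.rdd.map(lambda row: np.array(list(row), dtype=float))

vectors.take(1)  # e.g. [array([42.])] for a one-column df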

Note 1: df is the variable that holds our DataFrame.

Note 2: df.rdd has been available since Spark 1.3.

Ali AzG
  • Did this solve the `PicklingError: Could not serialize object: Py4JError: An error occurred while calling o60.__getstate__. Trace: py4j.Py4JException: Method __getstate__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:274) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79)` error? – pvy4917 Nov 12 '18 at 16:13