
How to send a pandas dataframe to a hive table?

I know if I have a spark dataframe, I can register it to a temporary table using

df.registerTempTable("table_name")
sqlContext.sql("create table table_name2 as select * from table_name")

but when I try to use the pandas dataFrame to registerTempTable, I get the below error:

AttributeError: 'DataFrame' object has no attribute 'registerTempTable'

Is there a way for me to use a pandas dataFrame to register a temp table, or to convert it to a spark dataFrame and then use that to register a temp table, so that I can send it back to Hive?

thenakulchawla
  • You have to convert; the error is clear, the attribute doesn't exist for a pandas dataframe – EdChum Apr 28 '16 at 16:03
  • yes, but I don't know how to convert a pandas dataframe to a spark dataframe. I tried using `sqlContext.createDataFrame(df)` but it only takes the schema and not the data inside the dataFrame. – thenakulchawla Apr 28 '16 at 16:04

4 Answers


I guess you are trying to use a pandas DF instead of Spark's DF.

Pandas DataFrame has no such method as registerTempTable.

You may try to create a Spark DF from your pandas DF.

UPDATE:

I've tested it under Cloudera (with installed Anaconda parcel, which includes Pandas module).

Make sure that you have set PYSPARK_PYTHON to your Anaconda Python installation (or another one containing the pandas module) on all your Spark workers (usually in: spark-conf/spark-env.sh)

Here is result of my test:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
>>> sdf = sqlContext.createDataFrame(df)
>>> sdf.show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
| 98| 33| 75|
| 91| 57| 80|
| 20| 87| 85|
| 20| 61| 37|
| 96| 64| 60|
| 79| 45| 82|
| 82| 16| 22|
| 77| 34| 65|
| 74| 18| 17|
| 71| 57| 60|
+---+---+---+

>>> sdf.printSchema()
root
 |-- A: long (nullable = true)
 |-- B: long (nullable = true)
 |-- C: long (nullable = true)
MaxU - stand with Ukraine
  • There is no way to convert pandas to spark DF? I can try writing all my functionality in spark df but it doesn't work as flexibly as pandas. – thenakulchawla Apr 28 '16 at 17:02
  • yes, that only creates the schema but does not bring data. Once you try let me know how it works for you. – thenakulchawla Apr 28 '16 at 17:10
  • or if you know another way to send the data back to hive without using a temp table, share that. I am not keen on having a temp table as long as I get the data back in hive. – thenakulchawla Apr 28 '16 at 17:15
  • did you get to try it at home? – thenakulchawla May 04 '16 at 09:54
  • ok well let me know if you try at some point later. I am still trying to figure this out, if I am unable to do this then I will try Scala for my work. – thenakulchawla May 04 '16 at 10:06
  • I will try this now and see if I am able to get it to work. – thenakulchawla May 09 '16 at 04:51
  • I was able to work this out. Although when I use the temp table to create a table in hive, it only creates the schema and does not push data. With a simple insert statement I was able to get that done (using sqlContext) – thenakulchawla May 09 '16 at 06:17
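The workaround in the last comment above (loading the rows with a plain insert statement) can be sketched by generating the SQL directly from the pandas DataFrame. This is an illustrative sketch, not production code: the quoting rule is naive and the table name is just an example.

```python
import pandas as pd

def insert_statement(table, df):
    """Build a HiveQL INSERT ... VALUES statement from a pandas DataFrame.
    Strings get single quotes (with naive escaping); everything else
    is rendered with str()."""
    def render(value):
        if isinstance(value, str):
            return "'" + value.replace("'", "\\'") + "'"
        return str(value)
    rows = ", ".join(
        "(" + ", ".join(render(v) for v in row) + ")"
        for row in df.itertuples(index=False)
    )
    return "insert into table {} values {}".format(table, rows)

df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
sql = insert_statement("table_name2", df)
# sql == "insert into table table_name2 values (1, 'x'), (2, 'y')"
```

The resulting string can then be run with sqlContext.sql(sql); note that INSERT ... VALUES requires Hive 0.14 or later.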

First you need to convert the pandas dataframe to a spark dataframe:

from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
df = hive_context.createDataFrame(pd_df)

Then you can create a temp table, which lives in memory:

df.registerTempTable('tmp')

Now you can use HiveQL to save the data into Hive:

hive_context.sql("""insert overwrite table target partition(p='p') select a, b from tmp""")

Note: you must keep using the same hive_context throughout!

B.Mr.W.
Ming.Xu

I converted my pandas df to a temp table by

1) Converting the pandas dataframe to spark dataframe:

spark_df=sqlContext.createDataFrame(Pandas_df)

2) Make sure that the data migrated properly:

spark_df.select("*").show()

3) Convert the spark dataframe to a temp table for querying.

spark_df.registerTempTable("table_name")

Cheers.

Abhi

By Following all the other answers here, I was able to convert a pandas dataframe to a permanent Hive table as follows:

# sc is a spark context created with enableHiveSupport()
from pyspark.sql import HiveContext
hc=HiveContext(sc)

# df is my pandas dataframe
hc.createDataFrame(df).registerTempTable('tmp')

# sch is the hive schema, and tabname is my new hive table name
hc.sql("create table sch.tabname as select * from tmp") 
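If you also want control over the column types of the resulting Hive table, one option is to generate the DDL from the pandas dtypes first and then load the data. A minimal sketch, assuming a hand-rolled dtype-to-Hive-type mapping that covers only the common numeric/string cases:

```python
import pandas as pd

# Illustrative mapping only; extend for dates, decimals, etc.
HIVE_TYPES = {"int64": "bigint", "int32": "int",
              "float64": "double", "object": "string", "bool": "boolean"}

def create_table_ddl(table, df):
    """Build a CREATE TABLE statement matching the DataFrame's columns.
    Unknown dtypes fall back to string."""
    cols = ", ".join(
        "{} {}".format(name, HIVE_TYPES.get(str(dtype), "string"))
        for name, dtype in df.dtypes.items()
    )
    return "create table {} ({})".format(table, cols)

df = pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5], "C": ["x", "y"]})
ddl = create_table_ddl("sch.tabname", df)
# → "create table sch.tabname (A bigint, B double, C string)"
```

The generated DDL can be run through hc.sql() before inserting from the temp table, instead of relying on the types CTAS infers.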

Jim Bander