
I was looking for help here (and in many other place):

  1. How to save Pandas dataframe to a hive table?
  2. Pandas dataframe in pyspark to hive
  3. How to insert a pandas dataframe into an existing Hive external table using Python (without PySpark)?

But I don't think I completely understood the proposals presented, because I failed with all of them.

What I am trying to do is:

  1. Extract data from a Hive table in schema1 into a Python dataframe.
  2. Do some operations on the columns and save the result as a pandas dataframe.
  3. Export the pandas dataframe to a Hive table in schema2.

I implemented steps 1 and 2 as follows:

  1. Extract data from the Hive table into a Python dataframe.
import pandas as pd
import puretransport
import sqlalchemy as db

transport = puretransport.transport_factory(host='my_host_name',
                                            port=10000,
                                            username='my_username',
                                            password='my_password',
                                            use_ssl=True)

engine = db.create_engine("hive://my_username@/schema1",
                          connect_args={'thrift_transport': transport})

print("Selecting data from table", end=" ")
tab1 = []
for chunk in pd.read_sql_query(
        """select * from schema1.my_table""", con=engine, chunksize=5):
    tab1.append(chunk)
df = pd.concat(tab1)
print("DONE")

  2. Do some operations on the columns and save the result as a pandas dataframe.
my_code_returning_dataframe...
  3. Export the pandas dataframe to a Hive table in schema2.
what_should_i_do_there?
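One approach that may work for step 3, assuming the same SQLAlchemy/PyHive engine setup is valid for schema2, is `DataFrame.to_sql`, which writes a frame through any SQLAlchemy engine. The sketch below uses an in-memory SQLite engine purely as a stand-in to demonstrate the `to_sql` mechanics; against a real Hive server you would pass the `hive://` engine instead, and the row-by-row INSERTs that `to_sql` issues can be slow for large frames. The function and table names here are hypothetical.

```python
import pandas as pd
import sqlalchemy as db

def export_df(df, engine, table_name, schema=None):
    # to_sql issues INSERT statements through the engine.
    # if_exists='append' adds rows to an existing table;
    # index=False skips writing the dataframe index as a column.
    df.to_sql(table_name, con=engine, schema=schema,
              if_exists='append', index=False)

# Demonstration with SQLite (stand-in for the Hive engine above):
engine = db.create_engine("sqlite://")
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
export_df(df, engine, "my_table")
print(pd.read_sql("select count(*) from my_table", engine).iloc[0, 0])
```

For a Hive target you would call `export_df(df, hive_engine, 'my_table', schema='schema2')`; whether the PyHive SQLAlchemy dialect supports table creation for your Hive version is something to verify, so pre-creating the target table and appending may be the safer path.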

Thank you in advance for any help.
