I was looking for help here (and in many other places):
- How to save Pandas dataframe to a hive table?
- Pandas dataframe in pyspark to hive
- How to insert a pandas dataframe into an existing Hive external table using Python (without PySpark)?
But I don't think I completely understand the proposals presented, because I failed with all of them.
What I am trying to do is:
- Extract data from a Hive table in schema1 into a pandas dataframe.
- Do some operations on the columns and save the result as a pandas dataframe.
- Export the pandas dataframe to a Hive table in schema2.
I did steps 1-2 as follows:
- Extract data from the Hive table into a pandas dataframe.
import pandas as pd
import puretransport
import sqlalchemy as db

# Thrift transport with SASL authentication for HiveServer2
transport = puretransport.transport_factory(host='my_host_name',
                                            port=10000,
                                            username='my_username',
                                            password='my_password',
                                            use_ssl=True)

# SQLAlchemy engine using the PyHive dialect over that transport
engine = db.create_engine("hive://my_username@/schema1",
                          connect_args={'thrift_transport': transport})

print("Selecting data from table", end=" ")
# read the table in chunks and concatenate into one dataframe
# (chunksize=5 is very small; a larger value means fewer round trips)
tab1 = []
for chunk in pd.read_sql_query(
        """select * from schema1.my_table""", con=engine, chunksize=5):
    tab1.append(chunk)
df = pd.concat(tab1)
print("DONE")
- Do some operations on the columns and save the result as a pandas dataframe.
my_code_returning_dataframe...
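(For illustration only, the operations are along these lines; the column names below are made up and my real code differs:)

df['amount_eur'] = df['amount'] * df['fx_rate']           # derive a new column
df['category'] = df['category'].str.strip().str.lower()   # normalize text values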
- Export the pandas dataframe to a Hive table in schema2.
what_should_i_do_there?
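The closest thing I could come up with is the untested sketch below, which reuses the same PyHive/SQLAlchemy setup from step 1 (pointed at schema2) together with pandas' DataFrame.to_sql; the table name and connection details are placeholders:

import pandas as pd
import puretransport
import sqlalchemy as db

# same transport/engine pattern as in step 1, but for schema2
transport2 = puretransport.transport_factory(host='my_host_name',
                                             port=10000,
                                             username='my_username',
                                             password='my_password',
                                             use_ssl=True)
engine2 = db.create_engine("hive://my_username@/schema2",
                           connect_args={'thrift_transport': transport2})

# pandas issues INSERT statements through the engine;
# method='multi' batches many rows into one INSERT ... VALUES
df.to_sql(name='my_target_table',      # placeholder table name
          con=engine2,
          schema='schema2',
          if_exists='append',          # don't try to recreate an existing table
          index=False,
          method='multi',
          chunksize=1000)

But I don't know whether to_sql actually works with the Hive dialect, or whether I should instead write the dataframe to a file and LOAD DATA it into the table.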
Thank you in advance for any help.