7

I have a Hive table with 351,837 records (110 MB), and I am reading this table using Python and writing it into SQL Server.

In this process, reading the data from Hive into a pandas DataFrame is what takes a long time: loading all 351k records takes 90 minutes.

To improve this, I switched to reading 10k rows at a time from Hive and writing them into SQL Server. But just reading 10k rows from Hive and assigning them to a DataFrame takes 4-5 minutes.

import datetime
import urllib

import pandas
import pyodbc
import sqlalchemy

def execute_hadoop_export():
    """
    This will run the steps required for a Hadoop export.
    The return value is a boolean for success/failure.
    """
    try:
        hql = 'select * from db.table'
        # Open the Hive ODBC connection
        src_conn = pyodbc.connect("DSN=****", autocommit=True)
        cursor = src_conn.cursor()

        # Use SQLAlchemy so DataFrame.to_sql can write to SQL Server
        sql_conn_url = urllib.quote_plus('DRIVER={ODBC Driver 13 for SQL Server};SERVER=Xyz;DATABASE=Db2;UID=ee;PWD=*****')
        sql_conn_str = "mssql+pyodbc:///?odbc_connect={0}".format(sql_conn_url)
        engine = sqlalchemy.create_engine(sql_conn_str)

        # Read the source table in 10k-row chunks and append each chunk
        # to the target table
        vstart = datetime.datetime.now()
        for df in pandas.read_sql(hql, src_conn, chunksize=10000):
            vfinish = datetime.datetime.now()
            print 'Finished 10k rows reading from hive and it took', (vfinish - vstart).seconds / 60.0, 'minutes'
            df.to_sql(name='table', schema='dbo', con=engine, chunksize=10000, if_exists="append", index=False)
            print 'Finished 10k rows writing into sql server and it took', (datetime.datetime.now() - vfinish).seconds / 60.0, 'minutes'
            vstart = datetime.datetime.now()
        cursor.close()
    except Exception, e:
        print str(e)

Output: [screenshot of the per-chunk timing results, not reproduced here]

What is the fastest way to read Hive table data in Python?

Update: Hive table structure

CREATE TABLE `table1`(
  `policynumber` varchar(15), 
  `unitidentifier` int, 
  `unitvin` varchar(150), 
  `unitdescription` varchar(100), 
  `unitmodelyear` varchar(4), 
  `unitpremium` decimal(18,2), 
  `garagelocation` varchar(150), 
  `garagestate` varchar(50), 
  `bodilyinjuryoccurrence` decimal(18,2), 
  `bodilyinjuryaggregate` decimal(18,2), 
  `bodilyinjurypremium` decimal(18,2), 
  `propertydamagelimits` decimal(18,2), 
  `propertydamagepremium` decimal(18,2), 
  `medicallimits` decimal(18,2), 
  `medicalpremium` decimal(18,2), 
  `uninsuredmotoristoccurrence` decimal(18,2), 
  `uninsuredmotoristaggregate` decimal(18,2), 
  `uninsuredmotoristpremium` decimal(18,2), 
  `underinsuredmotoristoccurrence` decimal(18,2), 
  `underinsuredmotoristaggregate` decimal(18,2), 
  `underinsuredmotoristpremium` decimal(18,2), 
  `umpdoccurrence` decimal(18,2), 
  `umpddeductible` decimal(18,2), 
  `umpdpremium` decimal(18,2), 
  `comprehensivedeductible` decimal(18,2), 
  `comprehensivepremium` decimal(18,2), 
  `collisiondeductible` decimal(18,2), 
  `collisionpremium` decimal(18,2), 
  `emergencyroadservicepremium` decimal(18,2), 
  `autohomecredit` tinyint, 
  `lossfreecredit` tinyint, 
  `multipleautopoliciescredit` tinyint, 
  `hybridcredit` tinyint, 
  `goodstudentcredit` tinyint, 
  `multipleautocredit` tinyint, 
  `fortyfivepluscredit` tinyint, 
  `passiverestraintcredit` tinyint, 
  `defensivedrivercredit` tinyint, 
  `antitheftcredit` tinyint, 
  `antilockbrakescredit` tinyint, 
  `perkcredit` tinyint, 
  `plantype` varchar(100), 
  `costnew` decimal(18,2), 
  `isnocontinuousinsurancesurcharge` tinyint)
CLUSTERED BY ( 
  policynumber, 
  unitidentifier) 
INTO 50 BUCKETS

Note: I have also tried the Sqoop export option, but my Hive table is already bucketed, which Sqoop export does not seem to support.

James Z
Tharunkumar Reddy
  • Interesting. Could you please post the structure of your Hive table as well? – vikrant rana Jul 08 '19 at 13:12
  • @vikrantrana Thanks for the response :). Updated the question with the Hive table DDL. – Tharunkumar Reddy Jul 08 '19 at 13:26
  • Have you tried pyhive instead of pyodbc for reading from Hive? (A sketch follows these comments.) – nacho Jul 08 '19 at 13:38
  • Chunksize may not paginate the data the way you expect. Judging from the consistently long query time on each run, I believe your query makes Hadoop select all the data into memory and then fetch chunks from it each time. So try using limit + offset in your query instead of relying on chunksize. Another option is to try streaming the results together with chunksize, if Hive supports it. – altunyurt Jul 15 '19 at 00:08
  • Is your data partitioned properly? – gold_cy Jul 16 '19 at 01:10
  • @aws_apprentice Yes it is. – Tharunkumar Reddy Jul 16 '19 at 05:20
  • @nacho Thanks for the suggestion. Does pyhive have any specific advantages for reading? – Tharunkumar Reddy Jul 16 '19 at 05:21
  • Have you tried using Sqoop? My suspicion is that pandas is mainly what is slow here. – Andy Hayden Jul 16 '19 at 06:31
  • Hi @AndyHayden, I tried Sqoop and it completed in 90 seconds. But it seems Sqoop doesn't support bucketed Hive tables (the production table is bucketed). Any help on that? – Tharunkumar Reddy Jul 16 '19 at 07:29
  • How about loading the bucketed table's data into an unbucketed Hive text-format table and then using sqoop export to load the SQL Server table? – vikrant rana Jul 18 '19 at 04:50
  • @vikrantrana Thank you for the response. But when the Hive table is huge, copying it to a new unbucketed table is also time-consuming and an extra step. Any suggestions? – Tharunkumar Reddy Jul 18 '19 at 05:48
  • @vikrantrana I would be happy if someone could show me how to copy the bucketed Hive table directly to SQL Server using Sqoop. :) – Tharunkumar Reddy Jul 18 '19 at 05:55
  • Yeah, it would be additional overhead to copy the large table, but I was just guessing whether it takes less time than Python. :-) Did you also check the performance of pyhive instead of pyodbc? – vikrant rana Jul 18 '19 at 06:33
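For reference, a minimal sketch of the PyHive approach suggested in the comments, reading the same table without the ODBC DSN. The host, port, username and database below are placeholders, and it assumes HiveServer2 is reachable directly; whether it is actually faster than pyodbc would need to be measured on the cluster:

import pandas as pd
from pyhive import hive  # assumes PyHive is installed and HiveServer2 is enabled

# Placeholder connection details, not values from the question
conn = hive.Connection(host='hive-host', port=10000,
                       username='user', database='db')

# Same chunked read as the pyodbc version in the question
for chunk in pd.read_sql('select * from db.table', conn, chunksize=10000):
    print len(chunk), 'rows fetched'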

2 Answers

5

I tried multiprocessing and was able to reduce the run time from 2 hours to 8-10 minutes. Please find the scripts below.

from multiprocessing import Pool
import datetime

import pandas as pd

from query import hivetable
from write_tosql import write_to_sql

# We have 351k rows, so generate the chunk start offsets passed to hivetable
lst = range(1, 360000, 10000)

print 'started reading ', datetime.datetime.now()
# The cluster has 40 cores, so use a pool of 37 workers
p = Pool(37)
s = p.map(hivetable, lst)
s_df = pd.concat(s)
print 'finished reading ', datetime.datetime.now()
print 'Started writing to sql server ', datetime.datetime.now()
write_to_sql(s_df)
print 'Finished writing to sql server ', datetime.datetime.now()

---------query.py file-------

import pandas as pd
import pyodbc

# Created at import time, so each forked Pool worker inherits a copy
conn = pyodbc.connect("DSN=******", autocommit=True)

def hivetable(row):
    # Fetch one 10k-row window using row_number() pagination
    query = ('select * from (select row_number() over (order by policynumber) as rownum, * '
             'from dbg.tble) tbl1 where rownum between {0} and {1};'.format(row, row + 9999))
    result = pd.read_sql(query, conn)
    return result
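Note that the module-level connection above is opened once and then shared by every forked worker, which ODBC drivers do not always tolerate. A sketch (not the author's code) of giving each worker its own connection via a Pool initializer, using the same placeholder DSN:

from multiprocessing import Pool
import pandas as pd
import pyodbc

conn = None  # will hold one connection per worker process

def init_worker():
    # Runs once in each worker after the fork, so every process
    # gets its own ODBC connection instead of a shared one
    global conn
    conn = pyodbc.connect("DSN=******", autocommit=True)

def hivetable(row):
    query = ('select * from (select row_number() over (order by policynumber) as rownum, * '
             'from dbg.tble) tbl1 where rownum between {0} and {1};'.format(row, row + 9999))
    return pd.read_sql(query, conn)

p = Pool(37, initializer=init_worker)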

---------Write_tosql.py file---------

import urllib

import sqlalchemy

def write_to_sql(s_df):
    sql_conn_url = urllib.quote_plus('DRIVER={ODBC Driver 13 for SQL Server};SERVER=ser;DATABASE=db;UID=sqoop;PWD=#####;')
    sql_conn_str = "mssql+pyodbc:///?odbc_connect={0}".format(sql_conn_url)
    engine = sqlalchemy.create_engine(sql_conn_str)
    # Hive returns columns prefixed with the table alias (e.g. tbl1.policynumber), so strip it
    s_df.rename(columns=lambda x: remove_table_alias(x), inplace=True)
    s_df.to_sql(name='tbl2', schema='dbo', con=engine, chunksize=10000, if_exists="append", index=False)

def remove_table_alias(columnName):
    try:
        if columnName.find(".") != -1:
            return columnName.split(".")[1]
        return columnName
    except Exception, e:
        print "ERROR in remove_table_alias ", str(e)

Any other solution that reduces the time further would be welcome.

Tharunkumar Reddy
5

What is the best way to read the output from disk with Pandas after using cmd.get_results (e.g. from a Hive command)? For example, consider the following:

out_file = 'results.csv'
delimiter = chr(1)
....

Qubole.configure(qubole_key)
hc_params = ['--query', query]
hive_args = HiveCommand.parse(hc_params)
cmd = HiveCommand.run(**hive_args)
if (HiveCommand.is_success(cmd.status)):
    with open(out_file, 'wt') as writer:
        cmd.get_results(writer, delim=delimiter, inline=False)

If, after successfully running the query, I then inspect the first few bytes of results.csv, I see the following:

$ head -c 300 results.csv
b'flight_uid\twinning_price\tbid_price\timpressions_source_timestamp\n'b'0FY6ZsrnMy\x012000\x012270.0\x011427243278000\n0FamrXG9AW\x01710\x01747.0\x011427243733000\n0FY6ZsrnMy\x012000\x012270.0\x011427245266000\n0FY6ZsrnMy\x012000\x012270.0\x011427245088000\n0FamrXG9AW\x01330\x01747.0\x011427243407000\n0FamrXG9AW\x01710\x01747.0\x011427243981000\n0FamrXG9AW\x01490\x01747.0\x011427245289000\n

When I try to open this in Pandas:

df = pd.read_csv('results.csv')

it obviously doesn't work (I get an empty DataFrame), since it isn't properly formatted as a CSV file. While I could open results.csv and post-process it (removing the b' markers, etc.) before loading it into Pandas, that would be quite a hacky way to do it. Am I using the interface correctly? This is with the very latest version of qds_sdk, 1.4.2, from three hours ago.
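For what it's worth, once the stray b'...' byte-literal wrappers are stripped, the data rows are plain Ctrl-A (\x01) delimited and pandas can read them directly. A minimal sketch, assuming the cleaned copy is saved as results_clean.csv (a hypothetical name); note that in the dump above the header row is tab-separated while the data rows are \x01-separated, so the column names are passed explicitly and the header line skipped:

import pandas as pd

# Column names taken from the header row in the dump above
cols = ['flight_uid', 'winning_price', 'bid_price', 'impressions_source_timestamp']
# skiprows=1 skips the tab-separated header; sep=chr(1) is the Ctrl-A delimiter
df = pd.read_csv('results_clean.csv', sep=chr(1), names=cols, skiprows=1)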

Song