
I use pandas.to_sql() to insert a DataFrame into PostgreSQL. When I run the code on localhost and the database is also on localhost, it takes about 5 minutes to finish. But when the code and the database are not on the same computer, it takes about 40 minutes to finish.

Both computers have the same hardware and software configuration, and they are on the same local area network, connected by a switch. Please help me!

Versions:
pandas: 0.21.0
psycopg2: 2.7.3.2
PostgreSQL: 9.5

The code is shown below:

# Test 1: code and database on the same machine (about 5 minutes)
import datetime as dt
import time

import numpy as np
import sqlalchemy
from sqlalchemy import create_engine

HOST = '192.168.1.18'
USER = 'test'
PASSWORD = '123'
DATABASE = 'test'
engine = create_engine("postgresql://%s:%s@%s:5432/%s" % (USER, PASSWORD, HOST, DATABASE))

st = time.time()
sourceData.to_sql('test', con=engine, if_exists='append', index=True, chunksize=10000,
                  dtype={'DATETIME': sqlalchemy.types.TIMESTAMP,
                         'CODES': sqlalchemy.types.VARCHAR(255)})
et = time.time()
print((et - st) / 60)  # elapsed time in minutes


# Test 2: same code, database on the other machine on the LAN (about 40 minutes)
HOST = '192.168.1.19'
USER = 'postgres'
PASSWORD = '123'
DATABASE = 'postgres'
engine = create_engine("postgresql://%s:%s@%s:5432/%s" % (USER, PASSWORD, HOST, DATABASE))

st = time.time()
sourceData.to_sql('test', con=engine, if_exists='append', index=True, chunksize=10000,
                  dtype={'DATETIME': sqlalchemy.types.TIMESTAMP,
                         'CODES': sqlalchemy.types.VARCHAR(255)})
et = time.time()
print((et - st) / 60)  # elapsed time in minutes
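
To see where the time goes, one variant (a sketch only, reusing the engine and sourceData from above) is to write each 10,000-row slice with its own to_sql call and time the chunks separately:

import time
import sqlalchemy

# Same table, engine and dtypes as above; each 10,000-row slice gets its own
# to_sql call so the time per chunk is visible.
chunk = 10000
for start in range(0, len(sourceData), chunk):
    part = sourceData.iloc[start:start + chunk]
    st = time.time()
    part.to_sql('test', con=engine, if_exists='append', index=True,
                dtype={'DATETIME': sqlalchemy.types.TIMESTAMP,
                       'CODES': sqlalchemy.types.VARCHAR(255)})
    print(start, round(time.time() - st, 2))  # seconds for this chunk

If every chunk is slower by roughly the same amount on the remote host, the overhead is per statement rather than in a few stalled batches.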

sourceData is a pandas.DataFrame with 3,000,000 rows; a sample is shown below:

                        Q_C81
DATETIME   CODES             
2013-01-04 000001.SZ   0.1828
           000002.SZ   0.1150
           000004.SZ   0.0000
           000005.SZ   0.0000
           000006.SZ  -1.5936
           000007.SZ  -1.9031
           000008.SZ   0.0000
           000009.SZ -74.5152
           000010.SZ   0.0000
           000011.SZ   0.0000
           000012.SZ  -7.0324
           000014.SZ   0.0000
           000016.SZ   2.5925
           000017.SZ   0.0000
           000018.SZ   0.0000
           000019.SZ   0.0000
           000020.SZ   0.0000
           000021.SZ -82.1918
           000022.SZ  -2.3582
           000023.SZ   0.0000
           000024.SZ  -0.2810
           000025.SZ   0.0000
           000026.SZ  -5.3294
           000027.SZ  -1.7320
           000028.SZ   1.2884
           000029.SZ   0.0000
           000030.SZ   0.0000
           000031.SZ   0.6957
           000032.SZ   0.0000
           000033.SZ   0.0000
                       ...
2013-03-01 002621.SZ -10.9103
           002622.SZ -50.2930
           002623.SZ -35.4200
           002624.SZ -34.8826
           002625.SZ -36.2222
           002626.SZ  -0.2656
           002627.SZ -13.5603
           002628.SZ   9.0788
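
For anyone who wants to reproduce the shape of the data, here is a rough sketch that builds a frame with the same structure and row count (the dates, the code list and the values below are synthetic; only the layout matches):

import numpy as np
import pandas as pd

# 1200 dates x 2500 codes = 3,000,000 rows with a (DATETIME, CODES) MultiIndex
dates = pd.date_range('2013-01-04', periods=1200, freq='D')
codes = ['%06d.SZ' % i for i in range(1, 2501)]
idx = pd.MultiIndex.from_product([dates, codes], names=['DATETIME', 'CODES'])
sourceData = pd.DataFrame({'Q_C81': np.random.randn(len(idx))}, index=idx)
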
  • I guess it's a network problem. Have you tried compressing the dataframe and then copying it? – rpanai Dec 01 '17 at 11:06
  • If the one and only difference is where the database is located then obviously the problem is between your localhost and the database server. Network will always have some overhead but a factor 8 is a bit over the top indeed... – bruno desthuilliers Dec 01 '17 at 12:42
  • You should profile the problem. Measure how much time is spent where, that will point to the place where the time is lost. – Laurenz Albe Dec 01 '17 at 18:11
  • I think the network is fine, because the speed can reach 10 MB per second when I read the dataframe from another computer. – Cavscoding Dec 04 '17 at 02:09
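
On the network point raised in the comments, a quick check (a sketch using the hosts and credentials from the question) is to time plain SELECT 1 round trips against both servers; many small statements are sensitive to per-query latency rather than to raw throughput:

import time
from sqlalchemy import create_engine, text

# Hosts, users and databases as in the question; both passwords are '123'.
for host, user, db in [('192.168.1.18', 'test', 'test'),
                       ('192.168.1.19', 'postgres', 'postgres')]:
    engine = create_engine("postgresql://%s:123@%s:5432/%s" % (user, host, db))
    with engine.connect() as conn:
        st = time.time()
        for _ in range(100):
            conn.execute(text("SELECT 1"))
        # average round-trip time per trivial statement, in seconds
        print(host, (time.time() - st) / 100)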
