I am a newbie to web development. My task is to download a large amount of data from a web api, and to upload them into a postgres database on google cloud.
Because the amount of data is very large, I have a for-loop that scrapes the data part-by-part, insert each part into the cloud table, and finally commits everything. I use sqlalchemy and pg8000 to do the job.
Here is the basic structure of my code:
engine = create_engine("postgresql+pg8000://connection/info")
session = scoped_session(sessionmaker(autocommit=False,autoflush=False,bind=engine))
Base = declarative_base()
class MyTableClass(Base):
some columns
Base.metadata.create_all(engine)
for part in scraping_data():
engine.execute(MyTableClass.__table__.insert(), part)
session.commit()
session.close()
engine.dispose()
The for loop ran successfully for 12 hours. Then I received a network error. Here is the first part. The remainder of the error message is pretty long and is just a string of errors caused by the first part.
ERROR:sqlalchemy.pool.impl.QueuePool:Exception closing connection <pg8000.legacy.Connection object at 0x7fa2ba514160>
Traceback (most recent call last):
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/pg8000/core.py", line 760, in handle_messages
code, data_len = ci_unpack(self._read(5))
struct.error: unpack_from requires a buffer of at least 5 bytes for unpacking 5 bytes at offset 0 (actual buffer size is 0)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1782, in _execute_context
self.dialect.do_executemany(
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 716, in do_executemany
cursor.executemany(statement, parameters)
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/pg8000/legacy.py", line 304, in executemany
self.execute(operation, parameters)
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/pg8000/legacy.py", line 252, in execute
self._context = self._c.execute_unnamed(
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/pg8000/core.py", line 649, in execute_unnamed
self.handle_messages(context)
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/pg8000/core.py", line 762, in handle_messages
raise InterfaceError("network error on read") from e
pg8000.exceptions.InterfaceError: network error on read
Could anyone help me shed light on the error?
It seems possible that my connection has just timed out. Could anyone help me better structure the code in order to avoid losing a large of downloaded data when time-out happens?
Thanks!