
Summary

I have an application with a PostgreSQL + PostGIS database setup, and I am trying to load a significant number of rows into one of its tables. After some research, binary COPY seems to be the best approach, but after numerous attempts and endless hours of debugging, I have not been able to load the data.

Moreover, I noticed that there is considerably little reference material available for this specific approach, so I thought that opening the question could help future developers.

Background

Technologies

The application backend is written in Python 3. The target database is a locally hosted PostgreSQL 11 database with the PostGIS 2.5.3 extension. The psycopg2 adapter connects the backend to the database. The ppygis3 port of PPyGIS is used only for a minor alternative attempt, as indicated later in the code.

Data

Although I can't share a copy of the data due to confidentiality, I can say that each row consists of a geolocation (in lat-long format), a timestamp and a value. The volume of this data is in the order of millions of rows.

Approach

I've researched potential approaches, and given our performance concerns, using the Postgres COPY command with binary format seems to be the most promising one.

References:

While looking for references, I found the following material:

- This very useful answer from Mike T, which unfortunately only covers numeric types

- The PostgreSQL documentation on COPY with binary format

- AlexImmer's port of PPyGIS and "example" on binary copy with WKB

- NBSoftSolutions' dissection of binary copy and useful example on copying strings

Database Table

A simplified schema of the table would be something close to the following:

CREATE TABLE testtable (
  point GEOGRAPHY(Point),
  timestamp TIMESTAMP WITHOUT TIME ZONE,
  value REAL
);

Code

These are two simplified versions of my attempts at loading the data.

Copy Attempt with BytesIO

The following is the approach I am mainly interested in.

import psycopg2
from io import BytesIO
from struct import pack
from ppygis3 import Point  # PPyGIS port, used only for the WKB attempt

con = psycopg2.connect(user=username, password=password, host="localhost", database="test")
cur = con.cursor()

# BytesIO buffer: write the binary COPY header
# (signature, flags field, header extension length)
buffer = BytesIO()
buffer.write(pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))

# Dummy values
lat = 40.0
lon = 10.0
date = "2019-01-01 00:00:00"
pointStr = "POINT({} {})".format(lon, lat)
pointWKB = Point(lon, lat).write_ewkb()
value = 555.555

# Number of columns (3)
buffer.write(pack('>h', 3))

# Geographic point
# Attempt with direct encoding of the string format
buffer.write(pack('>i', len(pointStr)))
buffer.write(pointStr.encode())

# Geographic point
# Attempt using the Well-Known Binary encoding hinted by AlexImmer
#buffer.write(pack('>i', len(pointWKB)))
#buffer.write(pointWKB)

# Timestamp
buffer.write(pack('>i', len(date)))
buffer.write(date.encode())

# Value (4-byte float, matching the REAL column)
buffer.write(pack('>i', 4))
buffer.write(pack('>f', value))

# Write final file trailer
buffer.write(pack('>h', -1))

buffer.seek(0)

# Perform a bulk upload of data to the table
cur.copy_expert('COPY testtable FROM STDIN WITH BINARY', buffer)

# Also attempted 
#cur.copy_from(buffer, 'testtable', null='None')

con.commit()

cur.close()
con.close()
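For reference, part of the debugging has been inspecting the exact bytes being produced; a quick way to dump the buffer (assumes Python 3.5+ for bytes.hex()):

print(buffer.getvalue().hex())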

I am guessing the issue has to do either with my ignorance, or with the database not recognizing/supporting the insertion of geographic fields in this way. Yet surprisingly, attempting the same with the StringIO approach shown below works flawlessly.

Copy Attempt with StringIO

I have also attempted the StringIO approach, which I've managed to make work, but its performance is not satisfactory. Note that in this case, the PostGIS geographic field is passed to the DB in its simplified string form (i.e. 'POINT(lon lat)', matching the code below).

import psycopg2
from io import StringIO

con = psycopg2.connect(user=username, password=password, host="localhost", database="test")
cur = con.cursor()

# StringIO buffer type
buffer = StringIO()

# Dummy values
lat = 40.0
lon = 10.0
date = "2019-01-01 00:00:00"
point = "POINT({} {})".format(lon, lat)
value = 555.555

buffer.write(point)
buffer.write('\t')
# Timestamp of reading
buffer.write(date)
buffer.write('\t')
# Parameter reading
buffer.write(str(value))
buffer.write('\n')

# Reset offset to byte 0
buffer.seek(0)

cur.copy_from(buffer, "testtable", null='None')

con.commit()

cur.close()
con.close()
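In the real load, the buffer is of course filled in a loop before a single copy_from call; a sketch of that batching (the rows iterable of (lon, lat, date, value) tuples is hypothetical):

buffer = StringIO()
# rows is a hypothetical iterable of (lon, lat, date, value) tuples
for lon, lat, date, value in rows:
    buffer.write("POINT({} {})\t{}\t{}\n".format(lon, lat, date, value))
buffer.seek(0)

cur.copy_from(buffer, "testtable", null='None')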

Attempting the binary copy with either of the two approaches (encoding the string representation of the point, or the WKB representation) yields the following error:

psycopg2.errors.InternalError_: Invalid endian flag value encountered. CONTEXT: COPY testtable, line 1, column geography
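If I read the WKB specification correctly, the error at least makes sense for the string attempt: the first byte of a WKB value is an endian flag that must be 0 or 1, while the first byte the server sees here is the ASCII 'P' of 'POINT':

# First byte of an (E)WKB value must be the endian flag: 0 (big) or 1 (little)
print("POINT(10.0 40.0)".encode()[0])  # 80, i.e. ASCII 'P', not a valid flag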

The expected/ideal result would naturally be a successful load of the millions of rows into the database.

Any input and/or guidance will be appreciated!

Santiago

1 Answer


If you use binary mode, you need to supply the internal binary representation. That is the format in which the values are stored in memory and on disk.

Data is converted between the internal format and the external format(s) with the type input and output functions.
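For example, the internal binary representation of a TIMESTAMP is not its text form, but an 8-byte signed integer counting microseconds since 2000-01-01 00:00:00. A sketch of what the timestamp field from the question would have to look like (assuming PostgreSQL's default integer datetimes):

from datetime import datetime
from struct import pack

POSTGRES_EPOCH = datetime(2000, 1, 1)

def pack_timestamp(dt):
    # 4-byte field length (8), then the microsecond count since 2000-01-01
    delta = dt - POSTGRES_EPOCH
    microseconds = (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds
    return pack('>iq', 8, microseconds)

# e.g. buffer.write(pack_timestamp(datetime(2019, 1, 1)))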

Now you expect, and I cannot blame you, that the internal binary format for geometry is EWKB (both have "binary" in their name). But this is not the case: EWKB is a text representation of the data. This is the reason for your problems.

If you want to use the internal binary format, you'd have to read the PostGIS source. I suggest that you do not use binary copy; I think that's a premature optimization. What makes you think that your code is more efficient than PostGIS' type input function? Apart from that, you expose yourself to dangers if the client architecture differs from the server architecture: can you be certain that they represent 8-byte floating point values in the same way?

Laurenz Albe
  • I see - I will dig into the PostGIS documentation/source to try to find the internal binary representation of the Geography field and give it a couple more shots. On the matter of code vs. input function efficiency, it is not that I doubt the performance of the input function, but rather that, given the number of inserts, the references I found suggest a bulk-copy attempt is the better choice. Thank you for the insights! – Santiago Sep 10 '19 at 07:55
  • You can bulk `COPY` using the text format. There is no need to mess with the internal binary representation for that. – Laurenz Albe Sep 10 '19 at 07:56