10

I hope that my question is not ridiculous; surprisingly, it does not seem to have been asked yet (to the best of my knowledge) on the popular websites.

The situation is that I have several CSV files containing more than one million observations in total. Each observation contains, among other fields, a postal address. I am planning to read all files into a single GeoDataFrame, geocode the addresses, perform a spatial join against a shapefile, and save some information from the matching polygon for each row. Quite standard, I suppose. This is part of a one-time data cleaning process.
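For reference, the spatial-join step of that pipeline can be sketched as follows. The data here is synthetic: in practice the points would come from reading the CSVs and geocoding the addresses, and the polygon layer from the shapefile.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Synthetic stand-ins for the geocoded observations
df = pd.DataFrame({'id': [1, 2], 'address': ['A', 'B']})
points = [Point(0.5, 0.5), Point(5, 5)]
gdf = gpd.GeoDataFrame(df, geometry=points, crs='EPSG:4326')

# A single polygon standing in for the shapefile layer
poly = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
regions = gpd.GeoDataFrame({'region': ['r1']}, geometry=[poly], crs='EPSG:4326')

# Spatial join: attach the polygon's attributes to each point that falls inside it
joined = gpd.sjoin(gdf, regions, how='left', predicate='within')
```

Here the first point lands inside the polygon and receives `region='r1'`, while the second gets `NaN`; `how='left'` keeps every observation regardless of a match.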

My goal is to set up a database with this final dataset. This is because it allows me to share and search the data quite easily as well as e.g. plot some observations on a website. Also, it makes it quite easy to select observations based on some criteria and then run some analyses.

My problem is that the feature of inserting a GeoDataFrame into a database does not seem to be implemented yet - apparently because GeoPandas is intended as a substitute for databases ("GeoPandas enables you to easily do operations in python that would otherwise require a spatial database such as PostGIS").

Of course, I could iterate through each line and insert each data point "manually", but I am looking for the best solution here. With any workaround I would also be afraid that the datatypes may conflict with those of the database. Is there a "best way" to take here?

Thanks for your help.

Jhonny

4 Answers

14

As mentioned in the comments, @Kartik's answer works only for a single call; when appending data it raises a DataError, since the geom column then expects the geometry to have an SRID. You can use GeoAlchemy2 to handle all the cases:

# Imports
from geoalchemy2 import Geometry, WKTElement
from sqlalchemy import create_engine

# Use GeoAlchemy2's WKTElement to create a geom with SRID
def create_wkt_element(geom):
    return WKTElement(geom.wkt, srid=<your_SRID>)

geodataframe['geom'] = geodataframe['geom'].apply(create_wkt_element)

db_url = 'postgresql://username:password@host:socket/database'
engine = create_engine(db_url, echo=False)

# Use 'dtype' to specify the column's type
# For the geom column, we use GeoAlchemy2's type 'Geometry'
geodataframe.to_sql(table_name, engine, if_exists='append', index=False,
                    dtype={'geom': Geometry('POINT', srid=<your_SRID>)})
wfgeo
Hamri Said
7

So, I just implemented this for a PostGIS database, and I can paste my method here. For MySQL, you'll have to adapt the code.

The first step was to convert the geocoded columns into WKB hex strings, because I use SQLAlchemy with an engine based on psycopg2, and neither of those packages understands geo-types natively. The next step is to write that data into a SQL DB as usual (note that all geometry columns will be stored as text columns holding the WKB hex string), and finally to change the type of those columns to Geometry by executing a query. Refer to the following pseudocode:

# Imports
import sqlalchemy as sal
import geopandas as gpd

# Function to generate WKB hex
def wkb_hexer(line):
    return line.wkb_hex

# Convert the `geom` column in GeoDataFrame `gdf` to hex
# Note that following this step, the GeoDataFrame is just a regular DataFrame
# because it does not have a geometry column anymore. Also note that
# it is assumed the `geom` column is correctly datatyped.
gdf['geom'] = gdf['geom'].apply(wkb_hexer)

# Create SQL connection engine
engine = sal.create_engine('postgresql://username:password@host:socket/database')

# Connect to database using a context manager
with engine.connect() as conn, conn.begin():
    # Note the use of the regular Pandas `to_sql()` method
    gdf.to_sql(table_name, con=conn, schema=schema_name,
               if_exists='append', index=False)
    # Convert the `geom` column back from text to the Geometry datatype
    sql = """ALTER TABLE schema_name.table_name
               ALTER COLUMN geom TYPE Geometry(LINESTRING, <SRID>)
                 USING ST_SetSRID(geom::Geometry, <SRID>)"""
    conn.execute(sal.text(sql))
Kartik
  • Cool! This works on first call when the table does not exist. But consecutive calls give me a `DataError: (psycopg2.DataError) Geometry SRID (0) does not match column SRID (4326)`. – j08lue Dec 12 '16 at 12:58
  • `psycopg-postgis` might have some inspiration: https://github.com/yohanboniface/psycopg-postgis. However, they implement their own types rather than those in `shapely` / `geopandas`. – j08lue Dec 12 '16 at 13:05
  • This **can** only work for a single call, not for appending data, since the `geom` column then expects the geometry to have an SRID. – j08lue Dec 12 '16 at 13:46
  • @j08lue, Yes, you are correct. However, if you are expecting multiple inserts, then you simply avoid running the last `alter table` query till all inserts are finished. If it is an online system where inserts and queries may be interleaved, you may use a temporary table to insert your data from python, then copy it over to your working table while converting the columns appropriately. Or you may use string interpolation to generate a SQL insert statement, with appropriate type casting applied to the columns, and execute that. – Kartik Dec 12 '16 at 18:00
  • @j08lue, Ultimately, it depends on your specific needs. I'd use string interpolation if only a few rows need to be appended at a time, and a temporary table or csv intermediate for bulk inserts. This will likely remain the only solution unless an automatic converter is built into the packages. Also for PostGIS, WKB hex is the preferred method of loading and reading geo-typed data. – Kartik Dec 12 '16 at 18:06
3

A version of Hamri Said's answer, but using a lambda, which in my opinion is a bit nicer because it is such a short function:

# Imports
from geoalchemy2 import Geometry, WKTElement
from sqlalchemy import create_engine

geodataframe['geom'] = geodataframe['geom'].apply(
    lambda geom: WKTElement(geom.wkt, srid=<your_SRID>))

db_url = 'postgresql://username:password@host:socket/database'
engine = create_engine(db_url, echo=False)

# Use 'dtype' to specify the column's type
# For the geom column, we use GeoAlchemy2's type 'Geometry'
geodataframe.to_sql(table_name, engine, if_exists='append', index=False,
                    dtype={'geom': Geometry('POINT', srid=<your_SRID>)})
wfgeo
  • Dear all, what is an SRID? How should one generate that? – Philipe Riskalla Leal Jun 03 '22 at 17:50
  • @PhilipeRiskallaLeal, this is the spatial reference system ID your geodataframe is in. Assuming you've imported geopandas as gpd, convert your geodataframe's crs to its official ID: `geom_srid_num=gpd.tools.crs.epsg_from_crs(gdf.crs)` – Liquidgenius Feb 10 '23 at 21:34
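Note that `epsg_from_crs` is deprecated in recent GeoPandas versions; with a pyproj-backed `.crs` attribute, the EPSG code (the SRID) can be read directly. A minimal sketch:

```python
import geopandas as gpd
from shapely.geometry import Point

# A GeoDataFrame with a known CRS (WGS 84)
gdf = gpd.GeoDataFrame(geometry=[Point(0, 0)], crs='EPSG:4326')

# pyproj CRS objects expose the EPSG code directly
srid = gdf.crs.to_epsg()
```

Here `srid` is `4326`, which is what you would pass to `WKTElement` or `Geometry` in the answers above.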
2

I'm coming back to this to give a better answer. A geopandas.GeoDataFrame object now has a .to_postgis() method, which handles a lot of the mucky business of dealing with geometry types (it requires GeoAlchemy2 and a PostGIS-enabled database).
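A minimal sketch of that approach, where the connection string and table name are placeholders to adapt; it is wrapped in a function here because it needs a live PostGIS database to actually run:

```python
import geopandas as gpd
from sqlalchemy import create_engine

def write_to_postgis(gdf: gpd.GeoDataFrame) -> None:
    # Placeholder connection string -- adjust for your setup
    engine = create_engine('postgresql://username:password@host:5432/database')
    # to_postgis() converts the geometry column for PostGIS automatically;
    # no manual WKT/WKB conversion is needed
    gdf.to_postgis('table_name', engine, if_exists='append', index=False)
```

This replaces the manual `WKTElement` / WKB-hex conversions from the earlier answers with a single call.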

wfgeo