2

I have one BigQuery table with addresses including Lat/Lng and other BQ tables with working geom definitions imported from Census shapefiles. For each row in addresses table I am attempting to lookup which geom row contains it.

The following query were I look up an INDIVIDUAL lat/lng works fine:

SELECT SLDLST FROM `geographies.tl_2018_sldl_*` sldl WHERE ST_CONTAINS(sldl.geom, ST_GEOGPOINT(-95.221080, 38.974500));

But when I attempt abstract into a join like

SELECT 
  address_id,
  SLDLST
FROM `launchpad-239920.address_standardization.temp_delete_geo_match_sample` ssgolden
LEFT JOIN `geographies.tl_2018_sldl_*` sldl ON ST_CONTAINS(sldl.geom, ST_GEOGPOINT(ssgolden.longitude, ssgolden.latitude));

I get an error: "LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join."

How do I restructure my join query to be able to extract the matching geography for each address?

Jake Lowen
  • 899
  • 1
  • 11
  • 21
  • why not to simply use `JOIN`? unless you need preserve not matching addresses in the result. please clarify/confirm – Mikhail Berlyant May 31 '19 at 17:18
  • oh.. nice. I probably do want to preserve not matching address, but I can totally do it with a simple join and then left join back in against my original data. I'm happy to consider marking it as accepted if you want to convert that comment into an answer. Otherwise, Thanks! – Jake Lowen May 31 '19 at 17:23
  • just added my answer – Mikhail Berlyant May 31 '19 at 17:26

2 Answers2

7

Below is for BigQuery Standard SQL

If you want to preserve not matching addresses in output - you can use below

#standardSQL
WITH matched_addresses AS (
  SELECT 
    address_id,
    SLDLST
  FROM `launchpad-239920.address_standardization.temp_delete_geo_match_sample` ssgolden
  JOIN `geographies.tl_2018_sldl_X` sldl 
  ON ST_CONTAINS(sldl.geom, ST_GEOGPOINT(ssgolden.longitude, ssgolden.latitude)) 
)
SELECT * FROM matched_addresses UNION ALL 
SELECT address_id, NULL 
FROM `launchpad-239920.address_standardization.temp_delete_geo_match_sample`
WHERE NOT address_id IN (SELECT address_id FROM matched_addresses)   

but if you interested in only matched - use below one

#standardSQL
WITH matched_addresses AS (
  SELECT 
    address_id,
    SLDLST
  FROM `launchpad-239920.address_standardization.temp_delete_geo_match_sample` ssgolden
  JOIN `geographies.tl_2018_sldl_X` sldl 
  ON ST_CONTAINS(sldl.geom, ST_GEOGPOINT(ssgolden.longitude, ssgolden.latitude)) 
)
SELECT * FROM matched_addresses  
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
1

A solution that automatically takes care of not-matched addresses, without needing the UNION_ALL that Mikhail suggests (so performance could be better):

#standardSQL
WITH addresses AS (
  SELECT *, GENERATE_UUID() uuid
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2015`  ssgolden
  WHERE DATE(ssgolden.pickup_datetime) = '2015-10-07'
), matched_addresses AS (
  SELECT ARRAY_AGG(
      IF(
        ST_CONTAINS(sldl.zone_geom, SAFE.ST_GEOGPOINT(ssgolden.pickup_longitude, ssgolden.pickup_latitude))
        , sldl.zone_name, null)
      IGNORE NULLs LIMIT 1)[OFFSET(0)] zone_name
  FROM addresses  ssgolden
  CROSS JOIN `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`  sldl 
  GROUP BY uuid
)

SELECT zone_name, COUNT(*) c
FROM matched_addresses 
GROUP BY 1
ORDER BY c DESC

enter image description here

Now, let's test performance against a larger set of geometries (74,133 - the whole US and more - in response to Michael's comment):

#standardSQL
WITH addresses AS (
  SELECT *, GENERATE_UUID() uuid
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2015`  ssgolden
  WHERE DATE(ssgolden.pickup_datetime) = '2015-10-07'
), matched_addresses AS (
  SELECT ARRAY_AGG(
      IF(
        ST_CONTAINS(sldl.tract_geom, SAFE.ST_GEOGPOINT(ssgolden.pickup_longitude, ssgolden.pickup_latitude))
        , FORMAT('%s %s', sldl._table_suffix,sldl.lsad_name), null)
      IGNORE NULLs LIMIT 1)[OFFSET(0)] zone_name
  FROM addresses  ssgolden
  CROSS JOIN `bigquery-public-data.geo_census_tracts.census_tracts_*`   sldl 
  GROUP BY uuid
)

SELECT zone_name, COUNT(*) c
FROM matched_addresses 
GROUP BY 1
ORDER BY c DESC
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • 1
    That version is OK if the number of zones is low. But it uses regular CROSS JOIN with ST_CONTAINS computed per each pair. Mikhail's version uses spatially optimized JOIN, so it will scale better when the number of zones is larger. – Michael Entin Jun 05 '19 at 02:10
  • Are you Michael Entin? Because Michael Entin taught me to do it this way in a similar use case (closest point). https://stackoverflow.com/questions/53678306/reverse-geocoding-how-to-determine-the-city-closest-to-a-lat-lon-with-bigque – Felipe Hoffa Jun 05 '19 at 03:57
  • (I added a query joining with the 74k US census tracts. .. Mikhail's performs better) – Felipe Hoffa Jun 05 '19 at 04:46