-1

I am using the ST_Intersects function of geospark to make the intersection between points and polygons.

queryOverlap = """
        SELECT p.ID, z.COUNTYNS as zone, p.date, timestamp, p.point
        FROM gpsPingTable as p, zoneShapes as z
        WHERE ST_Intersects(p.point, z.geometry)
    """
pingsDay = spark.sql(queryOverlap)
pingsDay.show()

Why does return for each row a duplicate?

+--------------------+--------+----------+-------------------+--------------------+
|                  ID|    zone|      date|          timestamp|               point|
+--------------------+--------+----------+-------------------+--------------------+
|45cdaabc-a804-46b...|01529224|2020-03-17|2020-03-17 12:29:24|POINT (-122.38825...|
|45cdaabc-a804-46b...|01529224|2020-03-17|2020-03-17 12:29:24|POINT (-122.38825...|
|45cdaabc-a804-46b...|01529224|2020-03-18|2020-03-18 11:21:27|POINT (-122.38851...|
|45cdaabc-a804-46b...|01529224|2020-03-18|2020-03-18 11:21:27|POINT (-122.38851...|
|aae0bb4e-4899-4ce...|01531402|2020-03-18|2020-03-18 06:58:03|POINT (-122.23097...|
|aae0bb4e-4899-4ce...|01531402|2020-03-18|2020-03-18 06:58:03|POINT (-122.23097...|
|f9b58c70-0665-4f5...|01531928|2020-03-17|2020-03-17 17:32:46|POINT (-119.43811...|
|f9b58c70-0665-4f5...|01531928|2020-03-17|2020-03-17 17:32:46|POINT (-119.43811...|
|f9b58c70-0665-4f5...|01531928|2020-03-18|2020-03-18 08:21:34|POINT (-119.41080...|
|f9b58c70-0665-4f5...|01531928|2020-03-18|2020-03-18 08:21:34|POINT (-119.41080...|
|f9b58c70-0665-4f5...|01531928|2020-03-19|2020-03-19 00:26:43|POINT (-119.43623...|
|f9b58c70-0665-4f5...|01531928|2020-03-19|2020-03-19 00:26:43|POINT (-119.43623...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 06:30:43|POINT (-122.22106...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 06:30:43|POINT (-122.22106...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 07:57:47|POINT (-122.22102...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 07:57:47|POINT (-122.22102...|
|a32f727d-566b-4ad...|01529224|2020-03-18|2020-03-18 14:38:13|POINT (-122.59499...|
|a32f727d-566b-4ad...|01529224|2020-03-18|2020-03-18 14:38:13|POINT (-122.59499...|
|ad7e4d7e-f8e5-45b...|01529224|2020-03-18|2020-03-18 07:58:51|POINT (-122.14959...|
|ad7e4d7e-f8e5-45b...|01529224|2020-03-18|2020-03-18 07:58:51|POINT (-122.14959...|
+--------------------+--------+----------+-------------------+--------------------+
emax
  • 6,965
  • 19
  • 74
  • 141
  • Did you tried adding a simple ```DISTINCT``` statement? https://gis.stackexchange.com/questions/286683/removing-duplicate-rows-created-by-st-intersects-in-postgis – Manuel Carrero Mar 30 '20 at 13:19
  • @ManuelCarrero thanks I tried but I am not sure where to add the `DISTINCT` statement in my query. Can you help? – emax Mar 30 '20 at 13:36
  • ```SELECT DISTINCT p.ID, z.COUNTYNS as zone, p.date, timestamp, p.point FROM gpsPingTable as p, zoneShapes as z WHERE ST_Intersects(p.point, z.geometry)``` This should works – Manuel Carrero Mar 30 '20 at 14:30

1 Answers1

0

The most obvious reason would be if the points or zones in source tables are not unique. If there are duplicate points or zones, you obviously get duplicates

Check source tables for uniqueness:

SELECT p.ID, p.date count(*) c
FROM gpsPingTable as p 
GROUP BY ID, data HAVING c > 1

This will report duplicate points. And this will report duplicate zones:

SELECT z.COUNTYNS as zone, COUNT(*) c
FROM zoneShapes as z
GROUP BY zone HAVING c > 1
Michael Entin
  • 7,189
  • 3
  • 21
  • 26