4

I have an ESRI shape file containing a library of about 150 contiguous geographic areas (polygons) that together comprise a geographic region. I also have a csv file containing 60,000 events which each have a unique set of x,y point coordinates. Each of the events in the csv file occurred within one (and only one) of the 150 polygons in the shape file, but I do not know which one of the polygons to associate with each record. Therefore, I need to write an algorithm that will find out the identity of the polygon within which each of the 60,000 events occured. The output from the algorithm that I need to write will enable me to subsequently compute statistics, such as the likelihood of specific types of events occurring within specific polygons(geographic areas).

(Also, of course, the shape file is not just one file. It is a directory containing 8 files with file extensions including .dbf, .prj, .qix, .sbn, .sbx, .shp, .xml, and .shx.)

I do not have an ArcGIS license. I have found the File Geodatabase API at http://resources.arcgis.com/content/geodatabases/10.0/file-gdb-api but I am not sure that it is the right toolset, and I am also having trouble finding sample code.

Can anyone show me some code for finding which polygon (from a shape file) each of a large number of points (from an external data source such as a csv file) falls within?

Also, I need to specify that I need to be able to add a specific code for the identity of the relevant polygon to the csv record for each event. So it is not enough for me to simply plot the points on the map to visualize which polygons contain the events. I do not need to visualize this data at all. Instead, what I need is to be able to tag a polygon id to each event record in the csv file so that I can do subsequent numerical analysis that is not visual in nature.

Links to articles, tutorials, and other resources on this topic are also appreciated. I imagine that this is a problem that people solve every day, so there must be an established code base out there, if a person knows how to look for it. I code in Java every day, so Java solutions are preferred. However, I can port something from another language if you have a good code sample written in a different language.


*EDIT: *
I tried the following R code based on Spacedman's suggestion, and I got the following error message:

> myCSV <- read.csv(file="myCSVFile.csv",head=TRUE,sep=",")  
> pts = SpatialPoints(myCSV)  
> ZipCodes = readShapeSpatial("path/myshapefile.shp")  
> overlay(myCSV,ZipCodes)  
Error in function (classes, fdef, mtable)  : unable to find an inherited method for function "overlay", for signature "data.frame", "SpatialPolygonsDataFrame"  
> 

See my other comments below.


SECOND EDIT:
The R code that I ended up using is:

myCSV <- read.csv(file="myData.csv",head=TRUE,sep=",")  
pts = SpatialPoints(myCSV)  
ZipCodes = readShapeSpatial("myPath/ZipCodes.shp")  
write.csv(ZipCodes$ZIPCODE[overlay(pts,ZipCodes)], "ZipMatches.csv", quote=FALSE, row.names=FALSE)

Note: I had to use:

summary(ZipCodes)  

to find the name of the field that contained the encodings for the zip codes. Before I ran summary(ZipCodes), the script was just putting out the index of each zip code and not the zip code itself.

CodeMed
  • 9,527
  • 70
  • 212
  • 364
  • 2
    I don't think that's too hard to code with R. Is R an acceptable solution? – Roman Luštrik Jan 08 '12 at 10:12
  • @RomanLuštrik, R can work. I would be interested to see sample code. +1 for trying to help me. – CodeMed Jan 08 '12 at 20:13
  • You are overlaying myCSV on the ZipCodes, not the 'pts' SpatialPoints object! The error message is saying "I don't know how to overlay a data frame onto a SpatialPolygons data frame". overlay(pts,ZipCodes) should work. – Spacedman Jan 09 '12 at 22:44
  • @Spacedman, Thank you. I posted working code that I ended up using above. But I have a question. The output of that code was a csv file of the exact length of my input csv file, and contained NA at each record where x,y coordinates in the input were zero. Therefore, I felt OK just manually inserting the output as a new column in the original spreadsheet that was missing this data. Do you think that I should do anything more to validate the correspondence of each output code with each row in my original data? If so, what else can I do? I just assumed that output and input were in same order. – CodeMed Jan 10 '12 at 07:21
  • Yes, the output of overlay is in the same order as 'pts' in your input, which is the same order as your myData.csv. – Spacedman Jan 10 '12 at 09:02

2 Answers2

9

The java library for doing point/poly operations is the JTS:

http://www.vividsolutions.com/jts/JTSHome.htm

You probably need something else to read the shapefile, maybe this:

http://openmap.bbn.com/doc/api/com/bbn/openmap/layer/shape/ShapeFile.html

or maybe java bindings to OGR and GDAL:

http://gdal.org/java/

HOWEVER.. you might just be able to do this with an open source GIS package: Quantum GIS is my fave, but if you want a Java-based one there's a few - OpenJump, gvSIG (www.osgeo.org for everything on those things).

In R, if you have read your point coordinates into a matrix or data frame:

> xy
            [,1]       [,2]
 [1,]  15.224275  -0.840066
 [2,]  -1.207046   7.959644
 [3,]   9.397658  17.426323
 [4,]  28.242840 -29.581008
 [5,]  18.664603  15.361146
 [6,]   0.975846   8.534910
 [7,] -10.941825  10.438541
 [8,] -10.331097  20.260005
 [9,]   8.105570   9.595169
[10,] -14.468177  14.366980

Then, using the maptools package and its dependencies:

> require(maptools)

First make a SpatialPoints object from your coordinates:

> pts = SpatialPoints(xy)

Then read in your shapefile:

> africa=readShapeSpatial("/path/to/africa.shp")

Now overlay:

> overlay(pts,africa)
 [1] 12 20 39 27 10 55 22 33 40 45

that's a vector of row numbers in the shapefile. You can lookup attributes in the shapefile easy enough:

> africa$CNTRY_NAME[overlay(pts,africa)]
 [1] Congo      Ghana      Niger      Lesotho    Chad       Togo      
 [7] Guinea     Mauritania Nigeria    Senegal   
61 Levels: Algeria Angola Benin Botswana Burkina Faso Burundi ... Zimbabwe

Then if you want to write a vector to a CSV file,

> write.csv(africa$CNTRY_NAME[overlay(pts,africa)],file="out.csv")

produces:

"","x"
"1","Congo"
"2","Ghana"
"3","Niger"
"4","Lesotho"
"5","Chad"
"6","Togo"
"7","Guinea"
"8","Mauritania"
"9","Nigeria"
"10","Senegal"

Comma separated, quoted, with a header called 'x'. These things are tweakable with other options to write.csv.

If any of your points fall outside the polygons, the return overlay vector will be an 'NA' value, you may want to test for this:

> if(any(is.na(overlay(pts,africa)))){stop("splash!")}
Error: splash!

Nuff?

Spacedman
  • 92,590
  • 12
  • 140
  • 224
  • Thank you. +1 for helping me figure this out. I have coded in R before. I could download it again and study it, if you think it will make coding that easy. I know how to get .csv data into R, and I could probably get your 'match' to output into .csv easily enough. But are you willing to post a more complete version of what you think my R script would need to be in order to accomplish the entire task that I outlined in my original posting above? Thank you. – CodeMed Jan 08 '12 at 20:12
  • Thank you. I am currently downloading with intent to try these methods. +1 for helping. – CodeMed Jan 09 '12 at 20:13
  • I downloaded R and maptools. I imported my csv data. But my csv data file has several fields, of which x and y are only two fields. Even if I strip out all extraneous fields, I would probably still need to at least keep the recordID field in addition to the x and y fields. But before I start writing a bunch of complicated string-splitting code, is there some way that you can show me to most easily handle a csv file with 10 columns in each of 60,000 rows, when it seems that SpatialPoints() only wants two column xy? I would like to use the most efficient code possible. – CodeMed Jan 09 '12 at 21:45
  • I am researching the commands that you use, and I have not been able to find any documentation for spatialpoints() and overlay(). I have googled those terms and I have also done key word searches in various R manuals in PDF. I cannot run these commands yet because my csv data has extra fields (see other question), but I am trying to at least understand the syntax. Are you aware of any good explanations of that syntax on the web? Thank you again. – CodeMed Jan 09 '12 at 22:01
  • 1
    SpatialPoints and overlay come from the "sp" package, which you probably will have got if you got the maptools package installed. If you can do library(maptools) and library(sp) then you can probably do help(SpatialPoints) - capitalisation is important. For your CSV, you can read it all into a data frame with read.table or read.csv and then just extract the x and y columns. Any intro to R will tell you how to do this. Sections 6 and 7 here http://cran.r-project.org/doc/manuals/R-intro.html for example, or ask new questions here. But search first! – Spacedman Jan 09 '12 at 22:15
  • Thank you again. I will look into this when I log in next. I need to go into a meeting now. For the moment, I also posted some code at the end of my original posting above, which identifies an error message I encountered when attempting to run the code you suggested. It does not seem to like overlay(). – CodeMed Jan 09 '12 at 22:27
8

For a GUI solution without programming you could have a look at QGIS software.

It will load your polygon shape file without any problems and can also handle creation of a point layer from your CSV file.

enter image description here

60k points shouldn't be an issue here - I have worked with larger datasets on my laptop.

Getting the attributes of the polygon would be then as easy as performing a spatial join on your two datasets.

enter image description here

enter image description here

Result of this operation will be point layer equivalent to your input data plus the attributed of joined polygons.

If too much clicking is involved or if you need to repeat such analysis frequently you could think about scripting of this process using PyQGIS.

For other solutions have a look at Fastest way to spatially join a point CSV with a polygon Shapefile question on GIS SE site.

Community
  • 1
  • 1
radek
  • 7,240
  • 8
  • 58
  • 83