I thought R would be good for this... but am a complete novice at it. I have a set of UK Postcodes (e.g. 'CB2 8UR') and a separate table that maps each postcode to an OS grid coordinate. Both start as CSV:

file1:
  "pcd"
  "CB2 8UR"
  "TE3 5LJ"

file2:
  "pcd","col2","col3","oseast1m","osnrth1m","col6",...
  ...
  "CB2 8UR","?","?",9823,2034,"?"
  ...

The real file1 has a thousand or so entries, and the real file2 has several hundred thousand (and about 20 columns). The only point of file2 here is to convert the postcode to a UK OS grid coordinate. At the moment, I think I can treat the coords as being on a 2d plane.

The task is to get a map with the 'centre of mass' of each postcode marked together with a heatmap representation of the postcodes.

I did manage to plot file2 data (i.e. all uk postcodes) as bins using qplot() + stat_bin2d():

m <- qplot(xlab="Longitude",ylab="Latitude",main="Postcode heatmap",geom="blank",x=pcd$oseast1m,y=pcd$osnrth1m,data=pcd)  + stat_bin2d(bins =200,aes(fill = log1p(..count..))) 

where pcd is a data.frame read from file2.

So:

  • How can I merge file1 and file2 to map just the codes in file1, but using the coords in file2?

  • How can I calculate and add a marker for the centre of mass?

  • If I wanted to mark some postcodes 'special' so their 'mass' was higher than normal, would that be simple to do?

Many thanks for your help.

rivimey

1 Answer

Here is some code that may help you make progress. First, using toy data frames, the dplyr package merges the two tables on the pcd variable.

The rest is beyond my familiarity, but I offer some code for finding the centroids of your data and plotting them.

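library(dplyr)

# Toy versions of file1 and file2. Coordinates are stored as numbers so they can be averaged and plotted.
post.codes <- data.frame(id = c(1, 2), pcd = c("CB2 8UR", "TE3 5LJ"), stringsAsFactors = FALSE)

coords <- data.frame(pcd = c("CB2 8UR", "TE3 5LJ"), coord1 = c(9823, 5555), coord2 = c(2034, 1234),
                     othervar = c("XYZ", "ABC"), stringsAsFactors = FALSE)

# Keep every row of post.codes and attach the matching coordinates from coords.
merged <- left_join(post.codes, coords, by = "pcd")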
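
With the real files, the same join could be done on the column names shown in the question. The following is only a sketch under assumptions: the two CSVs are replaced by small inline snippets, the second coordinate pair is the toy value used above, "ZZ9 9ZZ" is a made-up extra row standing in for the rest of file2, and the file names in the comment are hypothetical. It uses base R's merge() rather than dplyr.

```r
# Inline stand-ins for file1 and file2. With real data this would be
# something like read.csv("file1.csv") (hypothetical file names).
file1 <- read.csv(text = '"pcd"
"CB2 8UR"
"TE3 5LJ"', stringsAsFactors = FALSE)

file2 <- read.csv(text = '"pcd","oseast1m","osnrth1m"
"CB2 8UR",9823,2034
"TE3 5LJ",5555,1234
"ZZ9 9ZZ",1111,2222', stringsAsFactors = FALSE)

# Keep only the columns needed from the large lookup table.
lookup <- file2[, c("pcd", "oseast1m", "osnrth1m")]

# all.x = TRUE keeps every postcode from file1, matched or not.
mapped <- merge(file1, lookup, by = "pcd", all.x = TRUE)

# Postcodes with no match in file2 end up with NA coordinates.
unmatched <- mapped$pcd[is.na(mapped$oseast1m)]
```

Checking unmatched before plotting is worthwhile, since differences in spacing or case between the two files would otherwise drop postcodes silently.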

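Next, use kmeans from the built-in stats package to find and add centroids. I hope this is more than pseudocode, but treat it as a pointer rather than a finished solution.

# kmeans needs numeric input; as.numeric() changes nothing if the columns are already numeric.
merged$coord1 <- as.numeric(merged$coord1)
merged$coord2 <- as.numeric(merged$coord2)

# centers = 1 gives one overall centre; use more (fewer than the number of rows) for local centres.
km <- kmeans(merged[, c("coord1", "coord2")], centers = 1)
merged$centroid <- km$cluster

# One row per cluster, holding the mean position of its postcodes.
centroids <- merged %>% group_by(centroid) %>% summarise(coord1 = mean(coord1), coord2 = mean(coord2))
library(ggplot2)
ggplot(merged, aes(coord1, coord2, color = factor(centroid))) +
  geom_point(size = 3) + geom_point(data = centroids, size = 5, shape = 4)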
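
If only one overall centre of mass is needed, clustering may be unnecessary: with equal weights it is simply the mean easting and mean northing. Below is a minimal sketch, assuming a merged data frame with numeric columns named as in file2 (the rows are the toy values from above) and a ggplot2 version recent enough to provide after_stat(). It overlays the centre on the binned heatmap from the question.

```r
library(ggplot2)

# Toy stand-in for the merged data (same toy values as above).
mapped <- data.frame(pcd = c("CB2 8UR", "TE3 5LJ"),
                     oseast1m = c(9823, 5555),
                     osnrth1m = c(2034, 1234))

# Unweighted centre of mass: the mean of each coordinate.
centre <- data.frame(oseast1m = mean(mapped$oseast1m),
                     osnrth1m = mean(mapped$osnrth1m))

# Binned heatmap of the postcodes with the centre marked as a red cross.
p <- ggplot(mapped, aes(oseast1m, osnrth1m)) +
  stat_bin2d(aes(fill = log1p(after_stat(count))), bins = 200) +
  geom_point(data = centre, colour = "red", shape = 4, size = 5) +
  coord_equal() +
  labs(x = "Easting (m)", y = "Northing (m)", title = "Postcode heatmap")
```

Printing p draws the plot. coord_equal() keeps one metre east the same length on screen as one metre north, which suits grid coordinates treated as a flat plane.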

Third, if you want to mark or highlight certain codes (centroids?), the general approach is to create a new logical or factor variable that is TRUE for the codes to be highlighted and FALSE for the others. Then in ggplot you map an aesthetic to that variable, such as fill = highlight (or colour = highlight for points, which by default have no fill). All the TRUE rows then get one colour and the rest get the other. Use scale_fill_manual(values = c("yourdesiredcolor", "yourseconddesiredcolor")) to pick colours other than the defaults.
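
The flag approach, together with the "higher mass" reading of the question, might be sketched as follows. Everything specific here is an assumption for illustration: the column names special and weight, the choice of "CB2 8UR" as the special postcode, and the weight of 3. The weighted centre uses weighted.mean() from the stats package.

```r
library(ggplot2)

# Toy stand-in for the merged data (same toy values as above).
mapped <- data.frame(pcd = c("CB2 8UR", "TE3 5LJ"),
                     oseast1m = c(9823, 5555),
                     osnrth1m = c(2034, 1234))

# Hypothetical list of 'special' postcodes and a hypothetical weight of 3.
special_codes <- c("CB2 8UR")
mapped$special <- mapped$pcd %in% special_codes
mapped$weight  <- ifelse(mapped$special, 3, 1)

# Weighted centre of mass: special postcodes pull the centre towards them.
centre_w <- data.frame(
  oseast1m = weighted.mean(mapped$oseast1m, mapped$weight),
  osnrth1m = weighted.mean(mapped$osnrth1m, mapped$weight))

# Colour points by the flag and mark the weighted centre with a cross.
p <- ggplot(mapped, aes(oseast1m, osnrth1m)) +
  geom_point(aes(colour = special), size = 3) +
  scale_colour_manual(values = c("FALSE" = "grey50", "TRUE" = "red")) +
  geom_point(data = centre_w, shape = 4, size = 5)
```

For the heatmap itself, stat_bin2d() also accepts a weight aesthetic, so mapping weight = weight there should make the special postcodes count for more in each bin.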

lawyeR