
I wrote this short snippet to automate the geocoding of IP addresses using freegeoip.net (15,000 queries per hour by default; an excellent service!):

> library(RCurl)
Loading required package: bitops
> ip.lst = c("193.198.38.10","91.93.52.105","134.76.194.180","46.183.103.8")
> q = do.call(rbind, lapply(ip.lst, function(x){ 
  try( data.frame(t(strsplit(getURI(paste0("freegeoip.net/csv/", x)), ",")[[1]]), stringsAsFactors = FALSE) ) 
}))
> names(q) = c("ip","country_code","country_name","region_code","region_name","city","zip_code","time_zone","latitude","longitude","metro_code")
> str(q)
'data.frame':   4 obs. of  11 variables:
$ ip          : chr  "193.198.38.10" "91.93.52.105" "134.76.194.180" "46.183.103.8"
$ country_code: chr  "HR" "TR" "DE" "DE"
$ country_name: chr  "Croatia" "Turkey" "Germany" "Germany"
$ region_code : chr  "" "06" "NI" ""
$ region_name : chr  "" "Ankara" "Lower Saxony" ""
$ city        : chr  "" "Ankara" "Gottingen" ""
$ zip_code    : chr  "" "06450" "37079" ""
$ time_zone   : chr  "Europe/Zagreb" "Europe/Istanbul" "Europe/Berlin" ""
$ latitude    : chr  "45.1667" "39.9230" "51.5333" "51.2993"
$ longitude   : chr  "15.5000" "32.8378" "9.9333" "9.4910"
$ metro_code  : chr  "0\r\n" "0\r\n" "0\r\n" "0\r\n"

With three lines of code you get coordinates for all IPs, including city and country codes. I wonder whether this could be parallelized so it runs even faster? Geocoding >10,000 IPs can otherwise take hours.

Tom Hengl
  • Don't do this. There's no real need to use an external API service when [`rgeolocate`](https://cran.r-project.org/web/packages/rgeolocate/index.html) can process 1m IPs locally in 5s. – hrbrmstr Aug 14 '17 at 13:45
  • I was not aware of that package. Thanks for the link. I assume the API service should be much faster than reading the csv files? – Tom Hengl Aug 14 '17 at 14:07
  • Parallelizing API calls from one machine rarely results in a speed-up - the limiting factor is often the server response, not how fast your CPU can handle them. Hammer the server with multiple requests and you'll slow down the response, and get more drop-outs, and then get banned :) – Spacedman Aug 14 '17 at 14:35
  • @T.Hengl well, this is local C++ code in an R package reading from a local database. It quite literally geocodes 1m IP addresses in 5s (or less, on some systems). – hrbrmstr Aug 14 '17 at 16:50

2 Answers

library(rgeolocate)

ip_lst = c("193.198.38.10", "91.93.52.105", "134.76.194.180", "46.183.103.8")

maxmind(ip_lst, "~/Data/GeoLite2-City.mmdb", 
        fields=c("country_code", "country_name", "region_name", "city_name", 
                 "timezone", "latitude", "longitude"))

##   country_code country_name            region_name  city_name        timezone latitude longitude
## 1           HR      Croatia                   <NA>       <NA>   Europe/Zagreb  45.1667   15.5000
## 2           TR       Turkey               Istanbul   Istanbul Europe/Istanbul  41.0186   28.9647
## 3           DE      Germany           Lower Saxony Bilshausen   Europe/Berlin  51.6167   10.1667
## 4           DE      Germany North Rhine-Westphalia     Aachen   Europe/Berlin  50.7787    6.1085

There are instructions in the package for obtaining the necessary data files. Some of the fields you're pulling are woefully inaccurate (more so than any geoip vendor would like to admit). If you do need ones that aren't available, file an issue and we'll add them.
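If you also want the looked-up IPs carried along in the output (as in the RCurl version in the question), you can bind them back on yourself; a minimal sketch, assuming the GeoLite2 database sits at the same path used above:

library(rgeolocate)

ip_lst = c("193.198.38.10", "91.93.52.105", "134.76.194.180", "46.183.103.8")

# maxmind() returns only the requested fields, one row per input IP in
# order, so attach the inputs back on as a column
res = maxmind(ip_lst, "~/Data/GeoLite2-City.mmdb",
              fields = c("country_code", "city_name", "latitude", "longitude"))
res$ip = ip_lst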

hrbrmstr

I've found multidplyr is a great package for making parallel server calls. This is the best guide I've found, and I highly recommend reading the whole thing to better understand how the package works: http://www.business-science.io/code-tools/2016/12/18/multidplyr.html

library("devtools")
devtools::install_github("hadley/multidplyr")
library(parallel)
library(multidplyr)
library(RCurl)
library(tidyverse)

# Convert your example into a function
get_ip <- function(ip) {
  do.call(rbind, lapply(ip, function(x) {
    try(data.frame(t(strsplit(getURI(
      paste0("freegeoip.net/csv/", x)
    ), ",")[[1]]), stringsAsFactors = FALSE))
  })) %>% nest(X1:X11)
}

# Make ip.lst a tibble so it works better with dplyr
ip.lst =
  tibble(
    ip = c(
      "193.198.38.10",
      "91.93.52.105",
      "134.76.194.180",
      "46.183.103.8",
      "193.198.38.10",
      "91.93.52.105",
      "134.76.194.180",
      "46.183.103.8"
    )
  )

# Create a cluster based on how many cores your machine has
cl <- detectCores()
cluster <- create_cluster(cores = cl)

# Create a partitioned tibble
by_group  <- partition(ip.lst, cluster = cluster)

# Load the needed packages on every worker and ship get_ip() to them
by_group %>%
  cluster_library("tidyverse") %>%
  cluster_library("RCurl") %>%
  cluster_assign_value("get_ip", get_ip)

# Send parallel requests to the website and parse the results
q <- by_group %>%
  do(get_ip(.$ip)) %>% 
  collect() %>% 
  unnest() %>% 
  tbl_df() %>% 
  select(-PARTITION_ID)

# Set names of the results
names(q) = c(
  "ip",
  "country_code",
  "country_name",
  "region_code",
  "region_name",
  "city",
  "zip_code",
  "time_zone",
  "latitude",
  "longitude",
  "metro_code"
)
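
For comparison, the same fan-out can be done with base R alone; a minimal sketch using parallel::mclapply, which forks and therefore only works on Unix-alikes (the worker count here is just whatever detectCores() reports):

library(parallel)
library(RCurl)

ips = c("193.198.38.10", "91.93.52.105", "134.76.194.180", "46.183.103.8")

# Fetch each CSV record in a forked worker, then stack the rows
rows = mclapply(ips, function(x) {
  strsplit(getURI(paste0("freegeoip.net/csv/", x)), ",")[[1]]
}, mc.cores = detectCores())
q = data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)

As Spacedman notes in the comments, though, the server's rate limit rather than your CPU is usually the bottleneck, so keep the worker count modest.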
Andrew Brēza