split string and extract according to a pattern to form data frame

Question

I am trying to slice the following strings as 3 separated columns (Country, City, Count) in R

Country    City     Count    
Japan      Tokyo    361

The data:

"country=Japan&city=Tokyo","361"
"country=Spain&city=Barcelona","359"
"country=United Kingdom&city=London","333"
"country=Japan&city=Fukuoka","259"
"country=United States of America&city=New York City","223"

I've tried this:

library(data.table)
library(stringr)

df <- read.table(file.choose(), header = FALSE, sep = ",", colClasses = c('character', 'character'), na.strings = 'null')

df.1 <- data.table(str = as.character(df$V1))

df.2 <- df.1[grepl("country=.+&city=\\w+", str),
             country := str_extract(str,"(?<=country=)(.+)"),
             city := str_extract(str, "(?<=city=)(.+)")]

But from this, while the city format comes as I'd like to view, the column country would return as follows:

Japan&city=Tokyo

I would like to eliminate the &city=Tokyo bit to make nice format.

Then, I'd merge the df and df.2 together so I have the number values aligned. However, I think there must be a smarter way to do this.

Please share me your knowledge. I appreciate your assistance.

akrun · Accepted Answer · 2017-04-19T03:08:55.547

We can use base R strsplit to split the 'V1' column by = and & into a list, loop through the list, extract the alternative elements (x[c(FALSE, TRUE)]) while naming it with the remaining elements, rbind the list elements and then cbind with the second column of original dataset

res <- do.call(rbind, lapply(strsplit(as.character(df$V1), "[=&]"), 
             function(x) setNames(x[c(FALSE, TRUE)], x[c(TRUE, FALSE)])))
res1 <- cbind(res, setNames(df[-1], 'Count'))
res1
#                   country          city Count
#1                    Japan         Tokyo   361
#2                    Spain     Barcelona   359
#3           United Kingdom        London   333
#4                    Japan       Fukuoka   259
#5 United States of America New York City   223

We can also do this with tidyverse. Create a row index column (rownames_to_column from tibble), then split the 'V1' with the delimiter '&' (separate_rows) to reshape into 'long' format, split the 'V1' into new columns ('new1' and 'new2') by specifying the 'sep' as =, reshape the dataset back into 'wide' (spread) and reorder the columns (select)

library(tidyverse)
rownames_to_column(df, 'rn') %>%
      separate_rows(V1, sep='[&]') %>% 
      separate(V1, into= c("new1", "new2"), sep="=")  %>% 
      spread(new1, new2) %>% 
      select(country, city, Count=V2) 
#                   country          city Count
#1                    Japan         Tokyo   361
#2                    Spain     Barcelona   359
#3           United Kingdom        London   333
#4                    Japan       Fukuoka   259
#5 United States of America New York City   223

data

df <- structure(list(V1 = structure(c(2L, 3L, 4L, 1L, 5L), 
.Label = c("country=Japan&city=Fukuoka", 
 "country=Japan&city=Tokyo", "country=Spain&city=Barcelona", 
  "country=United Kingdom&city=London", 
"country=United States of America&city=New York City"), class = "factor"), 
V2 = c(361L, 359L, 333L, 259L, 223L)), .Names = c("V1", "V2"
), row.names = c(NA, -5L), class = "data.frame")

alistaire · Answer 2 · 2017-04-19T04:42:11.800

What you have are effectively URL encoded queries, so you can use httr::parse_url to decode them. Two complications:

parse_url looks for a ? in front of the query to recognize it as such, so you'll have to paste0 it on, and
parse_url is not vectorized, so it has to be applied to each query via lapply or purrr::map.

Mostly, though, it works quite nicely:

library(tidyverse)

df <- read_csv('"country=Japan&city=Tokyo","361"
"country=Spain&city=Barcelona","359"
"country=United Kingdom&city=London","333"
"country=Japan&city=Fukuoka","259"
"country=United States of America&city=New York City","223"', 
               col_names = c('query', 'count'))

df %>% transmute(count, 
                 query = map(paste0('?', query), 
                             ~as_data_frame(httr::parse_url(.x)$query))) %>% 
    unnest()

#> # A tibble: 5 × 3
#>   count                  country          city
#>   <int>                    <chr>         <chr>
#> 1   361                    Japan         Tokyo
#> 2   359                    Spain     Barcelona
#> 3   333           United Kingdom        London
#> 4   259                    Japan       Fukuoka
#> 5   223 United States of America New York City

or even just

df %>% do(data.frame(count = .$count, 
                     query = map_df(paste0('?', .$query), 
                                    ~httr::parse_url(.x)$query)))

or use curlconverter::parse_query or shiny::parseQueryString, which don't require the extra ?:

df %>% bind_cols(map_df(.$query, curlconverter::parse_query)) %>% select(-query)

All return the same thing.

This is cool! I didn't know until now that there's parse_url function. Thank you so much for sharing this, I can apply this for other datasets I work on. — tmhs, Apr 19 '17 at 04:20

split string and extract according to a pattern to form data frame

2 Answers2

data