
I'm working with a column in which each entry is a single string of URLs separated by commas:

column_with_urls

["url.a, url.b, url.c"]

["url.d, url.e, url.f"]

I would like to use the tidytext::unnest_tokens() R function to separate these out into one URL per row (although I'm open to other, preferably R-based, solutions). I've read the docs here, but I can't tell whether it's possible or advisable to pass a single character to split on.

My thought is something like unnest_tokens(url, column_with_urls, by = ','). Is there a way to specify that kind of argument and/or a better way to solve this problem?

My desired output is a data frame with one URL per row, with all other columns from the original row copied to each new row, like this:

url

url.a

url.b

url.c

...

Thanks in advance.

Josh

1 Answer


The unnest_tokens() function can split on a regex pattern if you set token = 'regex' and supply the pattern argument. Below is example syntax for splitting on a comma with this option (you could also use it for more complex patterns).

Note that this will convert your input data to a tibble.

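# Example data: one comma-separated string of URLs per row
my_df <- data.frame(id = 1:2,
                    urls = c("url.a, url.b, url.c",
                             "url.d, url.e, url.f"),
                    stringsAsFactors = FALSE)  # keep urls as character, not factor
# Split urls on commas; each piece becomes one row in the new column `out`
tidytext::unnest_tokens(my_df, out, urls, token = "regex", pattern = ",")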
# # A tibble: 6 × 2
#     id    out
#   <int>  <chr>
# 1     1  url.a
# 2     1  url.b
# 3     1  url.c
# 4     2  url.d
# 5     2  url.e
# 6     2  url.f
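Two details may matter for real URLs. unnest_tokens() lowercases its output by default (to_lower = TRUE), and a bare "," pattern does not consume the space that follows each comma. The following is a minimal sketch, assuming the same example data, that uses the pattern ",\\s*" and to_lower = FALSE. It also shows tidyr::separate_rows() as an alternative that does not need tidytext. The names res_tokens and res_rows are illustrative only.

```r
library(tidytext)
library(tidyr)

# Same example data as above; stringsAsFactors = FALSE keeps urls as character
my_df <- data.frame(id = 1:2,
                    urls = c("url.a, url.b, url.c",
                             "url.d, url.e, url.f"),
                    stringsAsFactors = FALSE)

# ",\\s*" consumes the comma plus any whitespace after it;
# to_lower = FALSE keeps the original case of each URL
res_tokens <- unnest_tokens(my_df, out, urls,
                            token = "regex", pattern = ",\\s*",
                            to_lower = FALSE)

# Alternative without tidytext: split the column in place, one URL per row
res_rows <- separate_rows(my_df, urls, sep = ",\\s*")
```

In both results the id column is repeated for every URL that came from the same original row.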
Adam Spannbauer
  • Perfect. Thank you! – Josh Dec 05 '17 at 18:43
  • in my use case, the text column I wish to split concatenates posts on a message board, where in between each post is "|||" and that seems to really confuse the regex... any suggestions? – Andrew McCartney Mar 10 '22 at 23:48
  • `|` means "or" in a regex, so it has to be escaped to match a literal pipe. In an R string the escaping backslash is itself written as `\\`, so to split on `|||` with a regex in R you'd use `\\|\\|\\|`. – Adam Spannbauer Mar 11 '22 at 12:14
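The escaping described in the last comment can be sketched as follows. This is an illustration under assumptions: the data frame posts_df and its columns thread, text, and post are hypothetical stand-ins for the message-board data, which is not shown in the thread.

```r
library(tidytext)

# Hypothetical data: one row per thread, posts joined by "|||"
posts_df <- data.frame(thread = 1L,
                       text = "First post|||Second post|||Third post",
                       stringsAsFactors = FALSE)

# Each "|" is escaped so the regex matches three literal pipes
res_posts <- unnest_tokens(posts_df, post, text,
                           token = "regex", pattern = "\\|\\|\\|",
                           to_lower = FALSE)

# Base R cross-check: fixed = TRUE treats "|||" as plain text, so no escaping
base_posts <- strsplit(posts_df$text, "|||", fixed = TRUE)[[1]]
```

An unescaped "|||" pattern is a regex alternation between empty strings rather than a literal separator, which would explain the confusing results.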