I am trying to extract the file extension (if one exists) from URLs like:

> http://www.example.com/index.php?option=com&etc
> http://www.example.com/subpage1/subpage2/file.pdf

With basename(URL) I get the file name. But when I apply sub() to it, I get this:

> sub(".*([.*])", "\\1", basename(URL))
> php?option=com&etc
> .pdf

How can I retrieve only the extension (if it exists)?

I have tried file_ext(basename(URL)). It works for the second example (where there are no parameters), but it returns an empty string for the first:

file_ext(basename(URL))
[1] ""

Is it possible to write a regex that retrieves the string between "." and "?"?

Andrew T.
SalimK

1 Answer

Strip the query string (everything from ? onward), then run file_ext:

tools::file_ext(sub("\\?.+", "", URL))
#[1] "php" "pdf"

Where URL was:

URL <- c(
"http://www.example.com/index.php?option=com&etc",
"http://www.example.com/subpage1/subpage2/file.pdf"
)
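
If the URLs may also end in a bare "?" or carry a "#fragment", the same idea can be wrapped in a small helper. This is only a sketch: url_ext is a hypothetical name, not something from tools, and the last two test URLs are made up for illustration.

```r
# Hypothetical helper (not part of base R or tools): strip the query string
# and any fragment, then let tools::file_ext() pick out the extension.
url_ext <- function(url) {
  # Drop everything from the first "?" or "#" onward.
  # ".*" (rather than ".+") also removes a bare trailing "?".
  path <- sub("[?#].*$", "", url)
  tools::file_ext(path)
}

URL <- c(
  "http://www.example.com/index.php?option=com&etc",
  "http://www.example.com/subpage1/subpage2/file.pdf",
  "http://www.example.com/index.php?",   # assumed extra case: empty query
  "http://www.example.com/subpage1/"     # assumed extra case: no file at all
)

url_ext(URL)
# returns "php" "pdf" "php" ""
```

One caveat: for a URL with no path at all, such as "http://www.example.com", file_ext sees the trailing ".com" and reports "com", so a host-only URL would need a separate check.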
Andrew T.
thelatemail