Extracting particular information of character variables

Question

for a side project, I am currently trying to make a dashboard based on this data (a kind of scatterplot) where there is a date and how the prices change over time. I am currently in one of the hardest states (data cleaning). I am trying to split the variables in order to extract information. For example, in the variable departure_info I want to split into the variable departure_info_time and departure_info_day. Furthermore, in the variable price I want to extract the numbers such as 358, 480, 590, etc.

df1 <- data.frame(depart = c("OSL", "WAW", "VIE", "MUC", "FRA"),
                  destination = c("KEF", "ARN", "RIX", "VCE", "OSL"),
                  departue_info = c("['12:45 am Sa 19 Feb']", "['07:55 am Sa 19 Feb']", "['09:05 am Sa 19 Feb']", "['21:45 am Sa 19 Feb', '15:30 am Sa 19 Feb']", "['10:25 am Sa 19 Feb', '16:10 am Sa 19 Feb', '21:40 am Sa 19 Feb']"),
                  price = c("['358<U+0080>']", "['480<U+0080>']", "['590<U+0080>']", "['354<U+0080>', '418<U+0080>']", "['249<U+0080>', '249<U+0080>', '249<U+0080>', '419<U+0080>']"))

I would appreciate if someone can help me. I tried with str_extract() and gsub() but I could not succeed. I also would thank if someone can give me an advice what I have to do, if in a row I have several prices in another row just one.

I thank you for your help :)

Gregor Thomas · Accepted Answer · 2022-07-20T18:24:17.847

It looks like each item in the departure_info has a fixed number of characters, so I use substr to extract the substring from the 3rd to 10th character for the time and 12th to 20th for the date.

We could take the same approach for price, but it's a simple regex pattern to look for one or more numbers, "\\d+", so I just use that with str_extract, which will extract the first occurrence of the pattern, that is the first set of consecutive numbers in the price string.

library(stringr)
library(dplyr)
df1 %>%
  mutate(
    depart_time = substr(departue_info, start = 3, stop = 10),
    depart_date = substr(departue_info, start = 12, stop = 20),
    price_num = as.numeric(str_extract(price, "\\d+"))
  )
#   depart destination                                                      departue_info
# 1    OSL         KEF                                             ['12:45 am Sa 19 Feb']
# 2    WAW         ARN                                             ['07:55 am Sa 19 Feb']
# 3    VIE         RIX                                             ['09:05 am Sa 19 Feb']
# 4    MUC         VCE                       ['21:45 am Sa 19 Feb', '15:30 am Sa 19 Feb']
# 5    FRA         OSL ['10:25 am Sa 19 Feb', '16:10 am Sa 19 Feb', '21:40 am Sa 19 Feb']
#                                                          price depart_time depart_date price_num
# 1                                              ['358<U+0080>']    12:45 am   Sa 19 Feb       358
# 2                                              ['480<U+0080>']    07:55 am   Sa 19 Feb       480
# 3                                              ['590<U+0080>']    09:05 am   Sa 19 Feb       590
# 4                               ['354<U+0080>', '418<U+0080>']    21:45 am   Sa 19 Feb       354
# 5 ['249<U+0080>', '249<U+0080>', '249<U+0080>', '419<U+0080>']    10:25 am   Sa 19 Feb       249

If you want to get all the price numbers, we'll first get rid of the U0080s and then use str_extract_all instead of str_extract. This will generate a list column, which you can then use in various ways...

df1 %>%
  select(price) %>%
  mutate(
    price_num = price %>% str_replace_all(fixed("U+0080"), "") %>% str_extract_all("\\d+")
  )
#                                                         price          price_num
# 1                                              ['358<U+0080>']                358
# 2                                              ['480<U+0080>']                480
# 3                                              ['590<U+0080>']                590
# 4                               ['354<U+0080>', '418<U+0080>']           354, 418
# 5 ['249<U+0080>', '249<U+0080>', '249<U+0080>', '419<U+0080>'] 249, 249, 249, 419

I thank you very much for your answer. You are the best!!. Sadly I got the following error ``` Error in substr(departue_info, start = 3, stop = 10) : invalid multibyte string, element 289 ``` Do you know what can be the reason for that? Thank you for your help! — Gaaaa, Jul 20 '22 at 18:21
Have a look at your 289th row and see if anything looks different in the `departure_info`. Using `stringr::str_sub` instead of the base `substr` might avoid the error, but if the data is different in that row (and in others) a different approach might be needed. — Gregor Thomas, Jul 20 '22 at 18:26
You also might want to hold off on "accepting" my answer until you can get it working... if you update your question with problematic data someone might come along with a better method. But if you've accepted the answer everyone will assume no more help is needed. — Gregor Thomas, Jul 20 '22 at 18:28
You are right I see that for the month March was written as `M�rz`, I have to figure out a way in which the `ä` can be replaced by a `a` — Gaaaa, Jul 20 '22 at 18:32
This might help with that: https://stackoverflow.com/q/34413332/903061 — Gregor Thomas, Jul 20 '22 at 18:41
It worked!! I Just had to write this one ``` data$departue_info <- gsub("[[:punct:]]", "", data$departue_info) ``` I thank you very much @Gregor Thomas for your help. I am new at Stack Overflow but with the time I will specify my answers and help other people. — Gaaaa, Jul 20 '22 at 18:58

Extracting particular information of character variables

1 Answers1