Find regex matches in the names of factor levels in a df in R

Question

I have a data frame that contains factor columns, each with several levels. I cannot get regex to match the factor levels by their names (labels).

  df <- structure(list(age = structure(1:2, .Label = c("18-25", 
                   ">25"), class = "factor"), `M` = c("13.4", 
                   "12.8"), 'N' = c("73", "76"), `SD` = c("6.8", 
                    "6.6")), row.names = 51:52, class = "data.frame")

My df

     age   M  N  SD
51 18-25 13.4 73 6.8
52   >25 12.8 76 6.6




First try: 

         regexpr(pattern = "18-25", text= df, ignore.case = FALSE, perl = FALSE,  fixed = T)


    [1] -1 -1 -1 -1
    attr(,"match.length")
    [1] -1 -1 -1 -1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

Second Try

     saved_level_name <- structure(list(V1 = structure(1L, .Label = "18-25", class = "factor")), row.names = c(NA, 
     -1L), class = "data.frame") 
     regexpr(pattern = saved_level_name, text= df, ignore.case = FALSE, perl = FALSE,  fixed = T)


    [1]  1  4 -1 -1
    attr(,"match.length")
    [1]  1  1 -1 -1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

Third Try (compare two outputs!)

     saved_name_level_2 <- structure(list(V4 = structure(1L, .Label = ">25", class = "factor")), row.names = c(NA, 
     -1L), class = "data.frame")

     regexpr(pattern = saved_level_name, text= df[1], ignore.case = FALSE, perl = FALSE,  fixed = T)

     regexpr(pattern = saved_name_level_2, text= df[1], ignore.case = FALSE, perl = FALSE,  fixed = T)



    [1] 1
    attr(,"match.length")
    [1] 1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

    [1] 1
    attr(,"match.length")
    [1] 1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

Fourth Try

     regexpr(pattern = as.character( saved_name_level ), text= df, ignore.case = FALSE, perl = FALSE,  fixed = T)

    [1] -1 -1 -1 -1
    attr(,"match.length")
    [1] -1 -1 -1 -1
    attr(,"index.type")
    [1] "chars"
    attr(,"useBytes")
    [1] TRUE

First try: no matches.
Second try: results I cannot interpret (1, 4?).
Third try: identical results for two patterns that look different.
Fourth try: no matches.

Is regexpr perhaps matching the stored internal values of the factors rather than their displayed names?

How can I use regex to search the names of factor levels, and not their internal values?

@Ronak There is a factor, age, with two levels, "18-25" and ">25". I would like to match "18-25" as the level name and not as the factor's internal value. However, all the outputs show that regexpr tries to match the internal value of the factor level, not its name. So my desired output is something that matches the name of the factor level and not its internal value. — Estatistics, Aug 08 '19 at 01:46
Can you simply change the class of any factors to `character`. I.e. `df$age <- as.character(df$age)` — stevec, Aug 08 '19 at 01:58
It is difficult for me to understand the explanation in words. As your input is clear in the form of `dput`, can you show the expected output in the same way how exactly will it look? — Ronak Shah, Aug 08 '19 at 04:24
Can someone explain why that question took -2 points? I have the input, I have the not desired output. I think those that rated the question as "-2" did not bother to be involved to answer the question seriously or they have no idea about regex. Sorry for my harsh comment, but this is the truth. The output will be one that Regex WILL match exact the factor level as name and not as "number". R is storing factor levels as "numbers". Thanks any way. — Estatistics, Aug 08 '19 at 09:05
How could I have provided an expected output for the regex results? I provided the failed regex results. I thought that people who work with R and regex know how regex works and what these outputs mean. My fault. However, thanks for your comments. — Estatistics, Aug 08 '19 at 12:24

r2evans · Accepted Answer · 2019-08-08T04:45:22.660

The reason this is failing can be found with debug:

    debugonce(regexpr)
    regexpr(pattern = "18-25", text= df, ignore.case = FALSE, perl = FALSE,  fixed = T)
    # debugging in: regexpr(pattern = "18-25", text = df, ignore.case = FALSE, perl = FALSE, 
    #     fixed = T)
    # debug: {
    #     if (!is.character(text)) 
    #         text <- as.character(text)
    #     .Internal(regexpr(as.character(pattern), text, ignore.case, 
    #         perl, fixed, useBytes))
    # }
    # debug: if (!is.character(text)) text <- as.character(text)
    # debug: text <- as.character(text)

OK, so step through and let R run that as.character call, which converts "text" (really a data frame) into a character vector. Then inspect the result:

    text
    # [1] "1:2"                   "c(\"13.4\", \"12.8\")" "c(\"73\", \"76\")"    
    # [4] "c(\"6.8\", \"6.6\")"

That output is the clincher. The text argument is meant to be a character vector, so regexpr calls as.character on your data frame, which turns each whole column into a single string. The factor df$age comes out as a character representation of its internal integer codes, "1:2", not of its level labels. (The fact that it generates a :-sequence is interesting to me ... but that's a different point.)
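The same conversion seems to account for the puzzling results of the second and third tries. The following is a minimal sketch, assuming the question's data and an R version that converts lists the way the session above did; `frame_as_text`, `pattern_as_text` and `match_pos` are illustrative names, not part of the original post.

```r
# Rebuild the question's data frame (the factor has levels "18-25" and ">25")
df <- data.frame(
  age = factor(c("18-25", ">25"), levels = c("18-25", ">25")),
  M   = c("13.4", "12.8"),
  N   = c("73", "76"),
  SD  = c("6.8", "6.6"),
  row.names = 51:52,
  stringsAsFactors = FALSE
)

# What regexpr does to a non-character `text`: one string per column
frame_as_text <- as.character(df)
frame_as_text[1]   # "1:2" in the answer's session: the factor's integer codes

# The "saved" pattern from the second try is a one-row data frame whose
# single-level factor has the integer code 1
saved_level_name <- data.frame(V1 = factor("18-25"))
pattern_as_text <- as.character(saved_level_name)

# So the second try was really a search for the converted pattern in the
# converted columns, not a search for "18-25" in the level labels
match_pos <- as.integer(regexpr(pattern_as_text, frame_as_text, fixed = TRUE))
match_pos
```

Under these assumptions the pattern becomes "1", which is found at position 1 of "1:2" and at position 4 of `c("13.4", "12.8")`, matching the `1 4 -1 -1` reported in the question. The frame holding ">25" also has a single level with code 1, which would explain why the third try gives identical results for both patterns.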

Obviously "1:2" is not going to match your "18-25" test. You really should be checking individual vectors/columns. If you have several columns to check, then perhaps

    lapply(df, function(v) regexpr(pattern = "18-25", text=v, ignore.case = FALSE, perl = FALSE,  fixed = T))

Instead of the whole df you can pass df[,1:3] or df[,-5] or whatever you can use to delineate which columns to use or not use. But checking a whole frame at once will not work when it contains factors.
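One way to pick the columns programmatically rather than by position is to test each column's class first. This is only a sketch of that idea, assuming the question's data; `factor_cols` and `hits` are hypothetical names introduced here.

```r
# Rebuild the question's data frame
df <- data.frame(
  age = factor(c("18-25", ">25"), levels = c("18-25", ">25")),
  M   = c("13.4", "12.8"),
  N   = c("73", "76"),
  SD  = c("6.8", "6.6"),
  row.names = 51:52,
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are factors
factor_cols <- vapply(df, is.factor, logical(1))

# Search each factor column separately; as.character() on a single factor
# returns its level labels, so the pattern is compared with "18-25" and ">25"
hits <- lapply(df[factor_cols], function(v) {
  regexpr("18-25", as.character(v), fixed = TRUE)
})

as.integer(hits$age)   # match position per row, -1 where there is no match
```

Calling as.character explicitly is not strictly needed, since regexpr converts a single factor to its labels on its own, but it makes the intent visible.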

If all you want to do is find instances in the factors where the pattern matches (instead of extracting or replacing it), then perhaps grepl is more suited:

    lapply(df, grepl, pattern = "18-25", fixed = TRUE)
    # $age
    # [1]  TRUE FALSE
    # $M
    # [1] FALSE FALSE
    # $N
    # [1] FALSE FALSE
    # $SD
    # [1] FALSE FALSE
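If the goal is specifically to search the level names, as the question asks, another option is to run the pattern over levels() of the factor instead of over the rows. This is a hedged sketch on the question's data, not something taken from the answer above; `level_hit`, `matching_levels` and `matching_rows` are illustrative names.

```r
# Rebuild the question's data frame
df <- data.frame(
  age = factor(c("18-25", ">25"), levels = c("18-25", ">25")),
  M   = c("13.4", "12.8"),
  N   = c("73", "76"),
  SD  = c("6.8", "6.6"),
  row.names = 51:52,
  stringsAsFactors = FALSE
)

# levels() returns the labels as a character vector, one entry per level
level_hit <- grepl("18-25", levels(df$age), fixed = TRUE)

# The labels that matched the pattern
matching_levels <- levels(df$age)[level_hit]

# Rows of the data frame whose age belongs to a matching level
matching_rows <- df[df$age %in% matching_levels, ]
matching_rows
```

Searching the levels checks each distinct label once, which can matter for a long column with few levels. Converting the column with as.character(df$age), as suggested in the comments, is the simpler choice when the matches are needed row by row.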

Thanks. I finally used "as.character". Thanks again! Useful the info about "grepl". — Estatistics, Aug 08 '19 at 09:07
