I have a data.table, dt, with a character column (Description).

I need to perform multiple regex operations on that column, which I have written as:

  dt[, Description := sapply(Description, tolower)][
      , Description := sapply(Description, gsub, pattern = " $", replacement = "")][
        , Description := sapply(Description, gsub, pattern = "  ", replacement = " ")][
          , Description := sapply(Description, gsub, pattern = "ões\\>", replacement = "ão")][
            , Description := sapply(Description, gsub, pattern = "eis\\>", replacement = "el")][
              , Description := sapply(Description, gsub, pattern = "as\\>", replacement = "a")][
                , Description := sapply(Description, gsub, pattern = "ais\\>", replacement = "al")][
                  , Description := sapply(Description, gsub, pattern = "es\\>", replacement = "e")][
                    , Description := sapply(Description, gsub, pattern = "ns\\>", replacement = "m")][
                      , Description := sapply(Description, gsub, pattern = "s\\>", replacement = "")]

These are basically all ways of changing plural to singular in Portuguese.

Is there a more efficient and elegant way of doing this?

Wasabi
  • I think you don't need `sapply` here, as most of the operations are vectorized. Can you show a small example? – akrun May 27 '19 at 20:13
  • Oh, right. Yeah, I should've known better than that one. – Wasabi May 27 '19 at 20:14
  • See https://stackoverflow.com/questions/28034975/r-replacing-multiple-regex-with-sub – Wiktor Stribiżew May 27 '19 at 20:15
  • Similar to @WiktorStribiżew's reference, I suggest you generalize multiple `gsub` pattern replacements into one function, and use that within the `data.table` call. For one, having a single function that does it is much easier to test and verify; for another, it makes your `data.table`-munging code much more readable/maintainable. (And I agree that this need not be `sapply`'ed.) – r2evans May 27 '19 at 20:17
  • For "that one function", try something like `Reduce(function(s, ptn) gsub(ptn[1], ptn[2], s), list(c("ões\\>","ão"), c("eis\\>","el"), .....), init = strings)`. – r2evans May 27 '19 at 20:22
  • Maybe you need to check `iconv` with `dt[, Description := trimws(tolower(Description))]` – akrun May 27 '19 at 20:23
  • @akrun, I don't think Wasabi is trying to remove accents; it's changing from plural to singular, which requires (a form of) translation, not just conversion. – r2evans May 27 '19 at 22:36
  • @Wasabi, does https://stackoverflow.com/a/54443120/ help? It suggests using `SnowballC::wordStem` (CRAN's page for [`SnowballC`](https://cran.r-project.org/web/packages/SnowballC/index.html)) for *"collapsing words to a common root to aid comparison of vocabulary"*, and it supports Portuguese. (I have no experience with it or the process in general; I would probably brute-force it with regex as you are suggesting here.) – r2evans May 27 '19 at 22:59
  • You can do conditional replacement using a data frame, i.e. have a data frame with the pattern in the first column and the replacement in the second column. Collapse this into a single string using non-special characters as separators, where the first element is the key and the second is the value, i.e. `pt = do.call(paste,c(sep=':',collapse=',',dat))` where `dat` is your data frame that contains the patterns and replacements. Now you can use `sub('__.*','',gsub('([^,]+)(?=.*\\1:([^,]+))','\\2',paste(Description,do.call(paste,c(dat,sep=":",collapse=',')),sep="__"),perl=T))`. This should be able to replace everythin – Onyambu May 28 '19 at 00:21
  • If you give me a minimal reproducible example (that is, code to create a data frame), I'll see what I can do. I think `rowr::vectorize()` could do the trick. It's a function that vectorizes functions, so you can feed `gsub` a vector of regexes instead of having to call `gsub` for each regex. (Base R also has this, `Vectorize()`, but it can't vectorize both the x and y arguments.) – emilBeBri Jun 01 '19 at 22:29
  • Wait, too quick there. Seems like `mgsub()`, linked to by Wiktor, is the way to go. Did that solve it? – emilBeBri Jun 01 '19 at 22:40
  • @emilBeBri: Yeah, I ended up using `mgsub`. – Wasabi Jun 02 '19 at 04:37
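
The suggestions in the comments (drop `sapply`, and fold the pattern/replacement pairs into one function with `Reduce`) could be sketched roughly as follows. This is only an illustration in base R, not the `mgsub` solution the asker ended up using; the function name `singularize_pt` and the sample data are hypothetical, and the rules are simply the ones from the question, applied in the same order.

```r
# Hypothetical helper (name not from the original post): applies an ordered
# list of regex replacements to a character vector.
singularize_pt <- function(x) {
  # Each rule is c(pattern, replacement); the order mirrors the question.
  rules <- list(
    c(" $",     ""),
    c("  ",     " "),
    c("ões\\>", "ão"),
    c("eis\\>", "el"),
    c("as\\>",  "a"),
    c("ais\\>", "al"),
    c("es\\>",  "e"),
    c("ns\\>",  "m"),
    c("s\\>",   "")
  )
  # gsub() is already vectorized over its input, so no sapply() is needed.
  # Reduce() threads the vector through every rule, starting from tolower(x).
  Reduce(function(s, rule) gsub(rule[1], rule[2], s), rules, tolower(x))
}

# Usage inside data.table, assuming the package is installed (sample data is
# made up for illustration).
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- data.table(Description = c("Casas ", "Animais", "Homens"))
  dt[, Description := singularize_pt(Description)]
}
```

Keeping the rules in one function means the whole transformation can be tested on a plain character vector, and the `data.table` call shrinks to a single `:=` assignment.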

0 Answers