2

I have a dataframe like below:

    Col1    Col2   Col3
ten: end       5     10
five: nb       7     11
    12:4      12     10
   13:56      15     16

Using the sqldf package in R, I want to do the following:

Replace values in Col1 with: character: space with -. The dash has spaces in the beginning and end.

Replace values in Col1 with: number:number with -. The dash has no spaces in the beginning and end.

Expected Output:

     Col1    Col2   Col3
ten - end       5     10
five - nb       7     11
     12-4      12     10
    13-56      15     16

Here is an example syntax using sqldf:

df <- sqldf("SELECT *, replace([Col1], [character: space], ' - ') [New Col generated] from df")

df <- sqldf("SELECT *, replace([Col1], [number:number], '-') [New Col generated_num] from df")

I tried referencing this doc and still no luck: https://www.rexegg.com/regex-quickstart.html

nak5120
  • 4,089
  • 4
  • 35
  • 94
  • I don't think the `replace` function supports regular expressions, and `sqldf` does not seem to support `regexp_replace` – acylam Apr 02 '19 at 18:51
  • Ok thank you. Is it possible to show the result using regular SQL then? – nak5120 Apr 02 '19 at 18:56
  • That's what I'm trying to figure out. You need at least some kind of regex to be able to tell if a string contains characters or digits and replace accordingly. Looks like `regexp` is being added as an extension to `RSQLite`: https://github.com/r-dbi/RSQLite/pull/266 – acylam Apr 02 '19 at 18:59
  • Ok thank you! I am doing research on my end too. – nak5120 Apr 02 '19 at 20:02
  • h2 which is one of the backends supported by sqldf does support `regexp_replace` – G. Grothendieck Apr 02 '19 at 21:26

1 Answers1

2

1) Assuming that only the forms shown in the question are allowed replace colons with minus signs and then replace minus followed by space with space, minus, space.

library(sqldf)
sqldf("select *, replace(replace([Col1], ':', '-'), '- ', ' - ') as New from df")

giving:

      Col1 Col2 Col3       New
1 ten: end    5   10 ten - end
2 five: nb    7   11 five - nb
3     12:4   12   10      12-4
4    13:56   15   16     13-56

2) If we can assume that the only forms are number:number or character: character and that the second form contains no digits.

sqldf("select *, 
  case when strFilter(Col1, '0123456789') = '' 
         then replace(Col1, ':', ' -')
       else replace(Col1, ':', '-')
       end as New
  from df")

giving:

      Col1 Col2 Col3       New
1 ten: end    5   10 ten - end
2 five: nb    7   11 five - nb
3     12:4   12   10      12-4
4    13:56   15   16     13-56

3) This first checks for numbers:numbers and then checks for characters: characters where characters can only be numbers or lower case letters.

dig <- "0123456789"
diglet <- "0123456789abcdefghijklmnopqrstuvwxyz"

fn$sqldf("select *,
  case when trim(Col1, '$dig') = ':'
         then replace(Col1, ':', '-')
  when trim(Col1, '$diglet') = ': '
          then replace(Col1, ': ', ' - ')
  else Col1 end as New
  from df")

giving:

      Col1 Col2 Col3       New
1 ten: end    5   10 ten - end
2 five: nb    7   11 five - nb
3     12:4   12   10      12-4
4    13:56   15   16     13-56

4) This one extracts the x:y and checks whether x and y are number and if so does the appropriate replacement and if no match it extracts x:yz where y is a space and if x and z are digits or lower case then it performs the appropriate replacement and otherwise returns Col1. dig and diglet are from above.

fn$sqldf("select *, 
  case when trim(substr(Col1, instr(Col1, ':')-1, 3), '$dig') = ':'
         then replace(Col1, ':', '-')
       when trim(substr(Col1, instr(Col1, ':') -1, 4), '$diglet') = ': '
         then replace(Col1, ': ', ' - ')
       else Col1 end as New
  from df")

Note

The input in reproducible form is:

Lines <- "Col1,Col2,Col3
ten: end,5,10
five: nb,7,11
12:4,12,10
13:56,15,16"
df <- read.csv(text = Lines, as.is = TRUE, strip.white = TRUE)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341