
How do I find the position of the first occurrence of a character string in a dataframe column?

This is to convert the following SAS code to R:

AVAL = INPUT(SUBSTR(AVALC,INDEX(AVALC,"- ")+2,1),8.);

This is a sample of the data:

AVALC

7

4

EXTREME DIFFICULTY - 4

STOPPED DOING THIS FOR OTHER REASONS OR NOT INTERESTED IN DOING THIS - 6

A LITTLE DIFFICULTY - 2

MODERATE DIFFICULTY - 3

NO DIFFICULTY AT ALL - 1

MODERATE DIFFICULTY - 3

NO DIFFICULTY AT ALL - 1

SOME OF THE TIME - 3

SOME OF THE TIME - 3

MOSTLY FALSE - 4

DEFINITELY FALSE - 5

GOOD - 3

FAIR - 3

SOME OF THE TIME - 3

NONE - 1

Some values can be converted directly to numeric values (done); what I am trying to do is extract the number from the "- " responses into a numeric field.

Gregor Thomas
  • 3
    Probably a duplicate of https://stackoverflow.com/questions/14249562/find-the-location-of-a-character-in-string - in base R, `regexpr`, or `gregexpr` for multiple matches; or the *stringr* functions `str_locate` and `str_locate_all` – thelatemail Aug 23 '22 at 22:06
  • 3
    Can you provide sample data? While I agree it's likely a dupe, there's nothing else we can do without knowing what you are starting with and what you are expecting as change/output. – r2evans Aug 23 '22 at 22:37
  • Though you wouldn't need to find the index of the preceding character then extract a substring to do this. You could probably accomplish this through regular expression replacement or something like *stringr*'s `str_extract`. – thelatemail Aug 23 '22 at 22:44
  • So you want to attempt to convert the one character (actually one byte) that follows the first occurrence of a space following a hyphen into a number? So you expect to get an integer in the range 0 to 9 ? – Tom Aug 24 '22 at 02:38
  • The variable structure is as follows "XXXXX - 3" as an example. The responses are not all the same length. Therefore I need to know at what position the "- " occurs so that I can extract the numeric value. Note that not all answers contain a number so that stripping off the last character of the column will not work. Then the stripped off value is converted to a number using as.numeric. –  Aug 24 '22 at 13:00
  • str_extract probably won't work as the text I need comes after the detected string not the string itself. Some other functionality is required for that. –  Aug 24 '22 at 14:58
  • Please post the sample data in valid R syntax. I can't tell if this is a data frame, a vector, a list or something else. Is it one big multiline string, or multiple short strings, or a file that hasn't been read in to R yet? If you have the data in R, then you can create a copy/pasteable version of it with `dput(your_data)`, or `dput(your_data[1:5, ])` for the first 5 rows. – Gregor Thomas Aug 24 '22 at 21:24

1 Answer


As a little bit of a frame challenge, you could delete all characters up to and including the - (and following spaces) and convert the rest to numeric with

as.numeric(gsub(".*- +", "", input))

If you wanted to do str_extract you could use it with a "lookbehind expression", but these are a little complicated to explain:

stringr::str_extract(input, "(?<=(- )?)[0-9]+")

(This says "extract a string of digits ([0-9]+) following, but not including ((?<= ... )), zero or one instance of a dash followed by a space ((- )?)". Because the dash and space are optional, bare values such as 7 are matched as well.)
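For comparison, here is a minimal base R sketch of the same lookbehind idea without stringr. It assumes a strict lookbehind, (?<=- ), so only values that actually contain "- " produce a match; the vector x is made-up sample data in the shape of the question's column, not the asker's real data.

```r
# Hypothetical sample values shaped like the question's AVALC column
x <- c("7", "EXTREME DIFFICULTY - 4", "NONE - 1", "NOT DONE")

# perl = TRUE enables lookbehind; (?<=- ) requires "- " right before the digits
m <- regexpr("(?<=- )[0-9]+", x, perl = TRUE)

# regexpr() returns -1 where nothing matched, so leave those positions as NA
digits <- rep(NA_character_, length(x))
digits[m > 0] <- regmatches(x, m)

as.numeric(digits)
```

With a strict lookbehind the bare "7" is not picked up, so values that are already numeric would still need the direct conversion the question mentions.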

input <- "AVALC
7
4
EXTREME DIFFICULTY - 4
STOPPED DOING THIS FOR OTHER REASONS OR NOT INTERESTED IN DOING THIS - 6
A LITTLE DIFFICULTY - 2
MODERATE DIFFICULTY - 3
NO DIFFICULTY AT ALL - 1
MODERATE DIFFICULTY - 3
NO DIFFICULTY AT ALL - 1
SOME OF THE TIME - 3
SOME OF THE TIME - 3
MOSTLY FALSE - 4
DEFINITELY FALSE - 5
GOOD - 3
FAIR - 3
SOME OF THE TIME - 3
NONE - 1"
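## convert from one long string with embedded newlines to a vector of strings,
## dropping the first element (the "AVALC" column header)
input <- unlist(strsplit(input, "\n"))[-1]
## strip everything up to and including "- ", then convert what is left
as.numeric(gsub(".*- +", "", input))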
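If you do want the position of "- ", as SAS INDEX gives you, a more literal translation is possible. The following is only a sketch, assuming base R's regexpr with fixed = TRUE as the stand-in for INDEX and substr for SUBSTR; avalc is made-up sample data, and falling back to a direct conversion for values without "- " is my assumption about what you want.

```r
# Hypothetical sample values shaped like the question's AVALC column
avalc <- c("7", "EXTREME DIFFICULTY - 4", "GOOD - 3", "NONE - 1")

# Position of the first "- " in each string; -1 where it does not occur
# (SAS INDEX would return 0 in that case)
pos <- as.integer(regexpr("- ", avalc, fixed = TRUE))

# Start from a direct conversion, which handles values that are already numeric
aval <- suppressWarnings(as.numeric(avalc))

# Where "- " was found, take the single character two positions past it,
# mirroring SUBSTR(AVALC, INDEX(AVALC, "- ") + 2, 1)
has_dash <- pos > 0
aval[has_dash] <- as.numeric(
  substr(avalc[has_dash], pos[has_dash] + 2, pos[has_dash] + 2)
)

aval
```

Like the SAS original, this reads exactly one character after "- ", so a multi-digit score would be truncated; the gsub approach above does not have that limit.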
Ben Bolker