Count the number of times a string appears in a column

Question

Is there an intuitive way to count how many times the word "space" appears in a certain column? Any other viable solution is welcome too. I basically want to know how many times the space key was pressed. However, some participants pressed other keys by mistake, and those presses should also count as errors. So I was wondering whether I should use the "key_resp.rt" column instead and count the number of response times. If you know how to do both, that would be great, as I may need both.

I used the following code, but the results do not match the data.

    Data %>%
      group_by(Participant, Session) %>%
      summarise(false_start = sum(str_count(key_resp.keys, "space")))

Here is a snippet of my data:

    Participant    RT     Session   key_resp.keys           key_resp.rt
       X        0.431265    1       ["space"]            [2.3173399999941466]
       X        0.217685    1           
       X        0.317435    2       ["space","space"] [0.6671900000001187,2.032510000000002]    2020.1.3    4
       Y        0.252515    1       
       Y        0.05127     2   ["space","space","space","space","space","space","space","space","space"]   [4.917419999999765,6.151149999999689,6.333714999999771,6.638249999999971,6.833514999999338,7.0362499999992,7.217724999999504,7.38576999999988,7.66913999999997]

dput(droplevels(head(Data_PVT)))
structure(list(Interval_stimulus = c(4.157783411, 4.876139922, 
5.67011868, 9.338167417, 9.196342656, 7.62448411), Participant = structure(c(1L, 
1L, 1L, 1L, 1L, 1L), .Label = "ADH80254", class = "factor"), 
    RT = c(431.265, 277.99, 253.515, 310.53, 299.165, 539.46), 
    Session = c(1L, 1L, 1L, 1L, 1L, 1L), date = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = "2020-06-12_11h11.47.141", class = "factor"), 
    key_resp.keys = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "[\"space\"]"), class = "factor"), key_resp.rt = structure(c(2L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("", "[2.3173399999941466]"
    ), class = "factor"), psychopyVersion = structure(c(1L, 1L, 
    1L, 1L, 1L, 1L), .Label = "2020.1.3", class = "factor"), 
    Trials = 0:5, Reciprocal = c(2.31875992719094, 3.59725169970143, 
    3.94453977082224, 3.22030077609249, 3.3426370063343, 1.85370555740926
    )), row.names = c(NA, 6L), class = "data.frame")

Expected output:

Participant  Session  false_start
   x             1       0
   x             2       1
   y             1       2
   y             2       1
   z             1       10
   z             2       3

Well, I am not sure how to reproduce this. The key_resp.keys column is either empty (if no key is pressed), ["space"] if the key is pressed once, or ["space", "space", ...] for however many times it was pressed. key_resp.rt follows the same logic and presents the response times in the same format. Sorry I cannot be more helpful. — CatM, Aug 20 '20 at 20:12
Sorry the formatting is terrible, not sure how to make it better. — CatM, Aug 20 '20 at 20:34
I just put a bit of it, there will be entries for session 1 and those for session 2. — CatM, Aug 20 '20 at 20:42
In the first 6 rows of the dput, only the first row has `["space"]`; the others are blank. What is the expected output? — akrun, Aug 20 '20 at 21:08
the expected output for those would be 1 for the one with space and 0 for the blank ones. One point per word "space" — CatM, Aug 20 '20 at 21:17
In that case, I get the expected with `Data_PVT %>% mutate(false_start = str_count(key_resp.keys, "\\bspace\\b"))` — akrun, Aug 20 '20 at 21:18
Based on your code sample, here's a solution that groups your variables and counts: `df %>% group_by(Participant, Session) %>% summarise(false_start = grep("\\", key_resp.keys))`. — hmhensen, Aug 20 '20 at 21:24
@akrun I checked and it seems like it is giving the double of the amount of words, is that possible? I can divide it by 2, but I am unable to check if this is the case for all of them. — CatM, Aug 20 '20 at 21:51
@hmhensen It gave me the following error: Error: Column `false_start` must be length 1 (a summary value), not 10 — CatM, Aug 20 '20 at 21:51
@CatM Weird, it's working just fine when I run it on the `dput` you provided. — hmhensen, Aug 20 '20 at 21:57
It works with akrun's code, but gives the double amount, do you have any idea why that would be? — CatM, Aug 20 '20 at 22:00
I get 1, 0, 0, 0, . for the example. Not clear how you are getting `double amount` — akrun, Aug 20 '20 at 22:36
@akrun I used the following: PVT_desc <- Data_PVT %>% group_by(Participant, Session) %>% mutate(false_start = str_count(key_resp.keys, "\\bspace\\b")) %>% summarise(lapses = sum(RT >500), mean_PVT = mean(RT), mean_recip = mean(Reciprocal), False = sum(false_start)) — CatM, Aug 20 '20 at 22:42
That's because there are more than 40 rows for participant, so the dput you asked for was just for one participant and hence only one row emerged when you used my code. — CatM, Aug 20 '20 at 22:52
Anything other than "reproducible question please" is encouraging another bad question. — geotheory, Aug 22 '20 at 22:07
@akrun It was giving double the amount as the dataframe contained the same data twice. — CatM, Aug 23 '20 at 12:51
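
Since the doubling reported above came from the data frame containing the same rows twice, here is a minimal sketch of how that could be checked and removed before summarising. It uses base R's `duplicated()`; the toy data and the names `trials`, `doubled`, `n_dupes` and `deduped` are made up for illustration and are not from the original data.

```r
# Toy data (illustrative only): three trials, shaped like the question's snippet
trials <- data.frame(
  Participant   = c("X", "X", "Y"),
  Session       = c(1, 2, 1),
  key_resp.keys = c('["space"]', '["space","space"]', ""),
  stringsAsFactors = FALSE
)

# Simulate the problem: every row was read in twice
doubled <- rbind(trials, trials)

# Number of rows that are exact copies of an earlier row
n_dupes <- sum(duplicated(doubled))

# Keep only the first occurrence of each row.
# Assumption: rows are only identical when they are accidental copies,
# which is plausible when each row carries a trial index or a unique RT.
deduped <- doubled[!duplicated(doubled), ]
```

If `n_dupes` is greater than zero, any `sum()` over the affected rows will be inflated, which would explain a count that is exactly twice the expected value.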

Ronak Shah · Accepted Answer · 2020-08-22T21:42:29.623

We can use str_count to count the "space" entries in each row and sum them for each Participant and Session to get the total. For all_false_start, we count every word in key_resp.keys, so presses of any key are included.

library(dplyr)
library(stringr)

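df %>%
  group_by(Participant, Session) %>%
  # false_start: occurrences of the whole word "space"
  # all_false_start: occurrences of any word, i.e. any key name
  summarise(false_start = sum(str_count(key_resp.keys, '\\bspace\\b')), 
            all_false_start = sum(str_count(key_resp.keys, '\\b\\w+\\b')))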
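
The question also asks about counting the response times in key_resp.rt instead of the key names. A minimal base R sketch of that idea follows, assuming the column holds bracketed, comma-separated numbers as in the question's snippet and an empty string when no key was pressed. `count_rts` is a hypothetical helper name, and the toy data frame (with shortened values) is for illustration only.

```r
# Hypothetical helper: number of response times in strings like "[0.66,2.03]".
# Assumes an empty string (or NA) means no key press.
count_rts <- function(x) {
  x <- as.character(x)   # the dput in the question shows factor columns
  x[is.na(x)] <- ""
  # A token starts with a digit and may continue with digits, ".", "e"/"E" or a sign,
  # so values such as 2.317 or 1e-05 each count once
  lengths(regmatches(x, gregexpr("[0-9][0-9.eE+-]*", x)))
}

# Toy data shaped like the question's snippet (values shortened)
toy <- data.frame(
  Participant = c("X", "X", "X", "Y", "Y"),
  Session     = c(1, 1, 2, 1, 2),
  key_resp.rt = c("[2.317]", "", "[0.667,2.033]", "", "[4.917,6.151,6.334]"),
  stringsAsFactors = FALSE
)

# Key presses per row, then summed per Participant and Session
toy$n_presses <- count_rts(toy$key_resp.rt)
rt_counts <- aggregate(n_presses ~ Participant + Session, data = toy, FUN = sum)
rt_counts
```

Because every key press produces one response time regardless of which key it was, this count should agree with all_false_start whenever key_resp.keys and key_resp.rt were recorded together.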

edited Aug 22 '20 at 21:42

answered Aug 21 '20 at 03:10

Ronak Shah


But I want to know how many times they pressed space. – CatM Aug 21 '20 at 11:05
How do you define "pressing space" ? – Ronak Shah Aug 21 '20 at 11:07
Well, if we are talking about pressing only the key "space" then it would be the number of times the word space appears in each row. If we are talking about all key presses, then it would be the number of response times in the variable key_resp.rt – CatM Aug 21 '20 at 11:10
No, we are talking about what you want to calculate. I am confused about it. The example which you have shared does not have enough observation to match your expected output. You can remove other columns which are not necessary for the question and share `dput` of more rows with expected output so that the question becomes clear. – Ronak Shah Aug 21 '20 at 11:12
That's why I tried to produce some sample of the necessary columns in the "snippet of the data". That's what it looks like but with more rows per participant. – CatM Aug 21 '20 at 11:15
PVT_desc <- Data_PVT %>% mutate(false_start = str_count(key_resp.keys, "\\bspace\\b")) %>% group_by(Participant, Session) %>% summarise(lapses = sum(RT >500), mean_PVT = mean(RT), mean_recip = mean(Reciprocal), False = sum(false_start)/2) – CatM Aug 21 '20 at 11:16
This code works but gives twice the amount of false_starts as it should hence why I divide the sum by two, not sure why that is. – CatM Aug 21 '20 at 11:17
Can you explain what you are trying to calculate and what the logic is to calculate it? We can write the code for you, but what you are trying to calculate should be defined by you. Also, the columns `Interval_stimulus`, `date`, `psychopyVersion`, `Trials` and `Reciprocal` don't seem relevant to the question. – Ronak Shah Aug 21 '20 at 11:21
Yes, I only care about Participant, Session, key_resp.keys and key_resp.rt for this. I want to get the sum value of the number of times a participant pressed a key when he shouldn't, i.e. the number of times the word "space" appears in the key_resp.keys (if we only care about the word space) or any word (if we care that they might have pressed other keys). So I thought I would get the number of times a key is pressed per row and then sum that for all rows for each participant. – CatM Aug 21 '20 at 11:35
@CatM Let's leave `Session` for now. What would be your expected output for this dataframe? `df <- data.frame(Participant = c(1, 1, 1, 1, 1, 2, 2, 2), key_resp.keys = c('[space]', 'a', '[space][space][space]', 'a', '', 'b', '[space][space]', ''))` ? – Ronak Shah Aug 22 '20 at 00:22
It would be the following, where false_start only includes space and all_false_start includes all. df <- data.frame(Participant = c(1, 2), false_start = c(4,2), all_false_start = c(6,3)) – CatM Aug 22 '20 at 13:02
@CatM should `all_false_start` for 1st `ID` be 5? See updated answer. – Ronak Shah Aug 22 '20 at 13:53
all_false_start for the ID 1 would be 6, i.e. pressing the key "space" (4x) and "a" (2x) – CatM Aug 22 '20 at 18:51
@CatM Okay...see updated answer. I hope I get it this time. – Ronak Shah Aug 22 '20 at 21:42
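
As a check on the numbers agreed in the comments above, here is a base R sketch applied to the test data frame from those comments. It is an alternative to the stringr calls in the answer, not the answer's own code, and `count_matches` is a hypothetical helper introduced only for this example. It reuses the two regular expressions from the answer.

```r
# Test data frame from the comments
df <- data.frame(
  Participant   = c(1, 1, 1, 1, 1, 2, 2, 2),
  key_resp.keys = c("[space]", "a", "[space][space][space]", "a", "",
                    "b", "[space][space]", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of regex matches per element (0 when there are none)
count_matches <- function(x, pattern) {
  x <- as.character(x)
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

# Per-row counts: the word "space" only, and any word at all
df$false_start     <- count_matches(df$key_resp.keys, "\\bspace\\b")
df$all_false_start <- count_matches(df$key_resp.keys, "\\b\\w+\\b")

# Sum both counts per participant
res <- aggregate(cbind(false_start, all_false_start) ~ Participant,
                 data = df, FUN = sum)
res
```

For participant 1 this gives false_start = 4 and all_false_start = 6, and for participant 2 it gives 2 and 3, matching the expected output stated in the comments.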
