I'm hoping to get some advice from the community about functions that require a selection of rows and columns. I have a very messy database (real-world data from a central database) and I need to sum subscales for a total score. To make matters more complicated, I have some rows where the total has been provided but no raw data (so no individual data points for each question) and other rows where I have the individual data points and no total. For example:
Q1 Q2 Q3 Q4 Q5 TOTAL
2 3 0 1 NA 3 (individual data points and a total are both provided; the total is the sum of Q2, Q3 and Q5)
NA NA NA NA NA 9 (no raw data points, only the total score is provided)
1 2 4 2 1 NA (raw data points provided, but no total score)
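In case it helps anyone reproduce this, here is a small sketch of the three rows above. The object name `toy` and the column names Q1 to Q5 and TOTAL are made up for illustration; my real data has 1651 rows and the items sit in other column positions.

```r
# Toy version of the three example rows (names are placeholders, not my real columns)
toy <- data.frame(
  Q1 = c(2, NA, 1),
  Q2 = c(3, NA, 2),
  Q3 = c(0, NA, 4),
  Q4 = c(1, NA, 2),
  Q5 = c(NA, NA, 1),
  TOTAL = c(3, 9, NA)
)

# Sum the subscale items (Q2, Q3, Q5) for every row, dropping NAs
item_sum <- rowSums(toy[, c("Q2", "Q3", "Q5")], na.rm = TRUE)

# Row 2 has no raw data at all, yet its sum comes back as 0 rather than NA
item_sum
```

With this toy data, the second row is the problem case: its provided total is 9, but the computed sum is 0.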
If I tell R to ignore the NAs, it treats each NA as 0 and provides a total score. However, that means the total for the 2nd row above is replaced with 0, because all of its individual data points are NA. I've tried various functions such as apply, rowSums and cbind, but I can't find a solution. I basically want to run the following code, or an equivalent, but tell R to ignore certain rows. I've been using the following:
rowSums(dat[, c(7, 10, 13)], na.rm=TRUE)
(where 7, 10 and 13 are the column numbers), but if I try to add row numbers (rowSums(dat[1:30, c(7, 10, 13)], na.rm=TRUE))
it tells me 'the replacement has 30 rows, data has 1651.' I've also tried rowSums(dat[c(1:30, 7, 10, 13)], na.rm=TRUE)
but I get an error 'undefined columns selected.'
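Both messages can be reproduced on a small stand-in data frame. This is only a sketch: the 50-row, 13-column `dat` below is hypothetical, and it assumes the result is being assigned back into a column of `dat` (the column name `total` is a placeholder), which the wording of the first message suggests.

```r
# Hypothetical stand-in for my data: 50 rows, 13 columns, all values 1
dat <- as.data.frame(matrix(1, nrow = 50, ncol = 13))

# Subsetting rows and columns works by itself and returns one sum per selected row
n_sums <- length(rowSums(dat[1:30, c(7, 10, 13)], na.rm = TRUE))

# Assigning those 30 sums to a full column fails, because dat has 50 rows
err1 <- tryCatch(
  {
    dat$total <- rowSums(dat[1:30, c(7, 10, 13)], na.rm = TRUE)
    NULL
  },
  error = function(e) conditionMessage(e)
)

# With a single index and no comma, every number is read as a column number,
# so columns 14 to 30 are requested from a 13-column data frame
err2 <- tryCatch(
  rowSums(dat[c(1:30, 7, 10, 13)], na.rm = TRUE),
  error = function(e) conditionMessage(e)
)

print(err1)
print(err2)
```

So the row and column selection itself seems fine in the first attempt; the mismatch appears only when 30 values are stored in a column that needs one value per row.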
Is there a way of telling R which rows to include and which to ignore when you also have column conditions? I want a database that sums the individual sub-scores and ignores the rows where they are not provided. I’m very new to R, so a response along the lines of ‘R for dummies’ would be appreciated. Thank you