0

I have a column with numerical comma-separated strings, e.g., '0,1,17,200,6,0,1'. I want to create new columns for the sums of those numbers (or substrings) in the strings that are not equal to 0.

I can use something like this to count the sum of non-zero numbers for the whole string:

df$F1 <- sapply(strsplit(df1$a, ","), function(x) length(which(x>0)))
[1] 5

This outputs '5' as the number of substrings in for the example string above, which is correct as the number of substrings in '0,1,17,200,6,0,1' is indeed 5.

The challenge, however, is to be able to restrict the number of substrings. For example, how can I get the the count for only the first 3 or 6 substrings in the string?

Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34
Vjain
  • 13
  • 2
  • Did you try the `strsplit` function? It will give you the split results in list containing a vector for each row. Then you have to use the suitable `apply` family function to calculate all the statistics for each row. – tushaR Mar 04 '20 at 08:53
  • I am using that only, as mentioned in my question, if you read it. My concern is with the length of the string I want to pick while using the strsplit function. – Vjain Mar 04 '20 at 08:56
  • Can you create a small  [reproducible example](http://stackoverflow.com/questions/5963269) and show expected output based on that? – Ronak Shah Mar 04 '20 at 09:05
  • Edited my question for the better understanding of the problem. – Vjain Mar 04 '20 at 09:28

1 Answers1

0

You can use gsub and backreference to cut the string to the desired length before you count how many substrings are > 0:

DATA:

df1 <- data.frame(a = "0,1,17,200,6,0,1")
df1$a <- as.character(df1$a)

SOLUTION:

First cut the string to whatever number of substrings you want--here, I'm cutting it to three numeric characters (the first two of which are followed by a comma)--and store the result in a new vector:

df1$a_3 <- gsub("^(\\d+,\\d+,\\d+)(.*)", "\\1", df1$a)
df1$a_3
[1] "0,1,17"

Now insert the new vector into your sapply statement to count how many substrings are greater than 0:

sapply(strsplit(df1$a_3, ","), function(x) length(which(x>0)))
[1] 2

To vary the number of substrings, vary the number of repetitions of \\d+ in the pattern accordingly. For example, this works for 6 substrings:

df1$a_6 <- gsub("^(\\d+,\\d+,\\d+,\\d+,\\d+,\\d+)(.*)", "\\1", df1$a)
sapply(strsplit(df1$a_6, ","), function(x) length(which(x>0)))
[1] 4

EDIT TO ACCOUNT FOR NEW SET OF QUESTIONS:

To compute the maximum value of substrings > 0, exemplified here for df1$a, the string as a whole (for the restricted strings, just use the relevant vector accordingly, e.g., df1$a_3, df1$a_6 etc.): First split the string using strsplit, then unlist the resulting list using unlist, and finally convert the resulting vector from character to numeric, storing the result in a vector, e.g., string_a:

string_a <- as.numeric(unlist(strsplit(df1$a, ",")))
string_a
[1]   0   1  17 200   6   0   1

On that vector you can perform all sorts of functions, including max for the maximum value, and sum for the sum of the values:

max(string_a)
[1] 200

sum(string_a)
[1] 225

Re the number of values that are equal to 0, adjust your sapply statement by setting x == 0:

sapply(strsplit(df1$a, ","), function(x) length(which(x == 0)))
[1] 2

Hope this helps!

Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34