R split column names with different occurrences of delimiter into strings and assign unique strings/string counts to a new dataframe

Question

I have a large dataframe with column names like this. I am not trying to work with any data yet, just the column names.

strainA_1_batch1	strainA_2_batch2	strainB_1_batch1	strainC_1_batch2	strainC_2_batch2	strainD_a_1_batch1	strainD_b_1_batch1

I am trying to make a few stats tables like these:

number of strains	number of batches
5	2

Batch	number of strains
batch1	4
batch2	2

strain	number of samples
strainA	2
strainB	1
strainC	2
strainD_a	1
strainD_b	1

My first problem is how to handle names like strainD_a and strainD_b: if I split on "_", I break up part of the strain name, and the varying number of pieces makes it harder to access the information. I have handled this in Python by specifying the number of splits and splitting from the right side, but I am not sure what the R equivalent is.

Secondly, maybe my search terms are wrong, but I have only found information on how to split a column into multiple columns. I do not need to split columns; I just want to extract information from the column names, then use the unique occurrences of each part of the name to create new column or row names, with a count of total occurrences for each one. I am not picky about how the stats tables are organized, as long as the information is accurate.

Dave2e · Accepted Answer · 2021-07-09T21:59:09.600

I think splitting on the pattern "underscore, digit, underscore" solves the first problem, because strain names that contain an underscore stay intact. This does drop the digit and whatever information it carries. Does that matter?

names <- c("strainA_1_batch1", "strainA_2_batch2", "strainB_1_batch1", "strainC_1_batch2", "strainC_2_batch2",
           "strainD_a_1_batch1", "strainD_b_1_batch1")

# split at the underscore, digit and underscore
splitList <- strsplit(names, "_\\d_")

# convert to a data frame with one row per column name
df <- data.frame(t(as.data.frame.list(splitList)))

# clean up the data frame
rownames(df) <- NULL
names(df) <- c("Strain", "Batch")
df

# report: number of samples (columns) per strain and per batch
table(df$Strain)
table(df$Batch)
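Note that table(df$Batch) counts samples per batch, not distinct strains per batch. The three tables from the question could be sketched as follows. This is only a sketch under assumptions: the variable names (col_names, overall, strains_per_batch, samples_per_strain) are illustrative, and the pattern uses "\\d+" rather than "\\d" on the assumption that replicate numbers may have more than one digit.

```r
# Column names copied from the example above
col_names <- c("strainA_1_batch1", "strainA_2_batch2", "strainB_1_batch1",
               "strainC_1_batch2", "strainC_2_batch2",
               "strainD_a_1_batch1", "strainD_b_1_batch1")

# Split at "_<digits>_" and bind the pieces into a two-column data frame
parts <- do.call(rbind, strsplit(col_names, "_\\d+_"))
df <- data.frame(Strain = parts[, 1], Batch = parts[, 2])

# Table 1: number of distinct strains and distinct batches
overall <- data.frame(n_strains = length(unique(df$Strain)),
                      n_batches = length(unique(df$Batch)))

# Table 2: distinct strains per batch (a strain repeated in a batch counts once)
strains_per_batch <- tapply(df$Strain, df$Batch, function(x) length(unique(x)))

# Table 3: samples (columns) per strain
samples_per_strain <- table(df$Strain)

overall
strains_per_batch
samples_per_strain
```

On the example names this gives 5 strains and 2 batches, with 4 distinct strains in batch1 and 2 in batch2, which matches the tables in the question.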

Another option is to replace the underscore on either side of the digit with a " " (or other character) and then split on the space.

names <- gsub("_(\\d)_", " \\1 ", names)
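To finish this second option and keep the digit, the split on the space might look like the sketch below. The names col_names, spaced, Replicate and the *_r variables are illustrative, and "\\d+" is an assumption that replicate numbers may have more than one digit. The last part shows one possible stand-in for Python's right-hand split with a fixed number of splits: a greedy regex that peels the last two underscore-separated fields off the end of each name.

```r
# Column names copied from the example above
col_names <- c("strainA_1_batch1", "strainA_2_batch2", "strainB_1_batch1",
               "strainC_1_batch2", "strainC_2_batch2",
               "strainD_a_1_batch1", "strainD_b_1_batch1")

# Replace the underscores around the replicate number with spaces
spaced <- gsub("_(\\d+)_", " \\1 ", col_names)

# Split on the space: each name yields strain, replicate, batch
parts <- do.call(rbind, strsplit(spaced, " ", fixed = TRUE))
df <- data.frame(Strain = parts[, 1], Replicate = parts[, 2], Batch = parts[, 3])
df

# Alternative: take the last two fields from the right.
# The greedy "(.*)" keeps everything before the final two underscores,
# so underscores inside the strain name are left alone.
pattern <- "^(.*)_([^_]+)_([^_]+)$"
strain_r    <- sub(pattern, "\\1", col_names)
replicate_r <- sub(pattern, "\\2", col_names)
batch_r     <- sub(pattern, "\\3", col_names)
```

The regex version does not depend on the middle field being a digit, only on the replicate and batch being the last two fields of every name.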
