I have a large dataframe with column names like this. I am not trying to work with any data yet, just the column names.
strainA_1_batch1 | strainA_2_batch2 | strainB_1_batch1 | strainC_1_batch2 | strainC_2_batch2 | strainD_a_1_batch1 | strainD_b_1_batch1 |
---|---|---|---|---|---|---|
I am trying to make a few stats tables like these:

number of strains | number of batches |
---|---|
5 | 2 |

batch | number of strains |
---|---|
batch1 | 4 |
batch2 | 2 |

strain | number of samples |
---|---|
strainA | 2 |
strainB | 1 |
strainC | 2 |
strainD_a | 1 |
strainD_b | 1 |
My first problem is how to handle names like `strainD_a` and `strainD_b`. If I split on "_", I break up part of the strain name, and the varying number of pieces makes it harder to access the information by position. I have handled this in Python by specifying the number of splits and splitting from the right side, but I am not sure what the R equivalent is.
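For reference, this is roughly the Python approach I mean: a minimal sketch using `str.rsplit` with a maximum of two splits. The list `columns` and the labels `strain`, `sample`, and `batch` are just names chosen for illustration.

```python
# Column names from the example (assumed here to be plain strings).
columns = [
    "strainA_1_batch1", "strainA_2_batch2", "strainB_1_batch1",
    "strainC_1_batch2", "strainC_2_batch2",
    "strainD_a_1_batch1", "strainD_b_1_batch1",
]

# Split at most twice, starting from the right, so any underscores
# inside the strain name (e.g. "strainD_a") stay intact.
parts = [name.rsplit("_", 2) for name in columns]

for strain, sample, batch in parts:
    print(strain, sample, batch)
```

Every name then yields exactly three pieces, no matter how many underscores the strain name itself contains.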
Secondly, maybe my search terms are wrong, but I have only found information on how to split one column into multiple columns. I do not need to split columns; I just want to extract information from the column names, then use the unique values of each part of the name as new column or row names, with a count of total occurrences for each one. I am not picky about how the stats tables are organized, as long as the information is accurate.
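To make the desired result concrete, here is a minimal Python sketch of the counts I am after, again assuming the column names are plain strings. The variable names are only illustrative, and what I am looking for is the R way to do the same thing.

```python
from collections import Counter

# Column names from the example (assumed here to be plain strings).
columns = [
    "strainA_1_batch1", "strainA_2_batch2", "strainB_1_batch1",
    "strainC_1_batch2", "strainC_2_batch2",
    "strainD_a_1_batch1", "strainD_b_1_batch1",
]

# One (strain, sample, batch) tuple per column name, split from the right.
records = [tuple(name.rsplit("_", 2)) for name in columns]

# Unique strains and batches across all column names.
strains = {strain for strain, _, _ in records}
batches = {batch for _, _, batch in records}

# Number of samples (columns) per strain.
samples_per_strain = Counter(strain for strain, _, _ in records)

# Number of distinct strains that appear in each batch.
strains_per_batch = {
    batch: len({s for s, _, b in records if b == batch})
    for batch in sorted(batches)
}

print(len(strains), len(batches))
print(strains_per_batch)
print(dict(samples_per_strain))
```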