filter specific values in dataframe with unique prefix in column name (e.g. 'UniqueID_commonsuffix')

Question

I have a dataframe with more than 300 unique samples. Each sample has 2 columns of related information, and for each sample I'd like to filter one of those columns for 34 specific values. I've included a screenshot of the data to help visualize the problem. I basically want to generate a new dataframe containing only the information for the 34 values that I specify. My apologies if this question is difficult to understand; I hope the screenshot helps to define the problem better.

In this screenshot, each column with "sampleID_r.variant" needs to be filtered for specific values I have in a separate dataframe. There are only 34 I'm interested in. With that, I'd like to store the corresponding value to the left in the column "sampleID_reads" along with it, like a dictionary. If anyone can help with this, I'd greatly appreciate it. Thank you so much.

EDIT: the original dataframe is in the following format:

sampleID_reads	sampleID_r.variant
1	r.79_80ins79+1_79+76
64	r.79_80ins79+10857_79+10938
53	r.79_80ins80-13725_80-13587
72	r.79_80ins80-5488_80-5435
16	r.79_80ins79+2861_79+2900

The 34 values of interest are in the following format:

r_dot
r.646_729del
r.-19_-18ins-19+428_-19+535
r.-25_-20del
r.4186_4188del
r.5333_5406del
...so on and so forth

Please don't post images. Instead paste everything as text. Also, please provide sample input data with expected output. — Mayank Porwal, Jan 05 '21 at 17:09
How are the "34 specific values" stored and what do they look like? — It_is_Chris, Jan 05 '21 at 17:18
@It_is_Chris I am attaching that as well, my apologies for the confusion. Editing my question to better frame the problem. — srajpara, Jan 05 '21 at 17:19
@srajpara are the 34 samples equal to the full string in `sampleID_r.variant` or just a portion of the string? None of the values in `r_dot` match a value in `sampleID_r.variant`. Also, are the "reads" and "variants" for each ID always next to each other, or can your columns be in a random order? — It_is_Chris, Jan 05 '21 at 17:41
@It_is_Chris ah so I happened to copy the top of the list of r.variants that don't match but in the entire column, there are definitely matches to the exact string from the 34 samples. The order of each column is the exact same for each sample (e.g. sample1_reads (col1) sample1_r.variant (col2), sample2_reads (col3), sample2_r.variant (col4)). I hope that clarifies it! — srajpara, Jan 05 '21 at 17:57

score 1 · Accepted Answer · answered Jan 05 '21 at 19:25

Here is some sample data

import pandas as pd

d = {'sample1_reads': [1, 64, 53, 72, 16],
    'sample1_r.variant': ['r.79_80ins79+1_79+76', 'r.79_80ins79+10857_79+10938', 
                         'r.79_80ins80-13725_80-13587', 'r.79_80ins80-5488_80-5435', 'r.79_80ins79+2861_79+2900'], 
    'sample2_reads': [0, 3, 6, 9, 11], 
    'sample2_r.variant': ['r.5333_5406del', 'r.4186_4188del', 'r.5333_54106del', 'r.2345_2345fad', 'r.65456_w56sjfy']}
df = pd.DataFrame(d)
rdot = pd.DataFrame(['r.79_80ins79+1_79+76', 'r.646_729del', 'r.5333_5406del', 'r.79_80ins80-5488_80-5435', 'r.79_80ins79+2861_79+2900'], columns=['r_dot'])

If you just want to filter the first frame based on the values in the second frame, you can do the following:

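# reshape the wide frame into (reads, variant) pairs; this relies on each
# sample's reads column being immediately followed by its variant column
new_df = pd.DataFrame(df.values.reshape((-1, 2)), columns=['reads', 'variant'])
# use boolean indexing to keep only the rows whose variant is in rdot
df_f = new_df[new_df['variant'].isin(rdot['r_dot'])]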

  reads                    variant
0     1       r.79_80ins79+1_79+76
1     0             r.5333_5406del
6    72  r.79_80ins80-5488_80-5435
8    16  r.79_80ins79+2861_79+2900
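
The reshape above drops the sample ID, so the result no longer says which sample a match came from. If that matters, one possible variation (a sketch, not part of the original answer) is to loop over the `*_r.variant` columns and build a dictionary per sample, which is closer to the "like a dictionary" output the question describes. It assumes every variant column is named `<sampleID>_r.variant` and has a matching `<sampleID>_reads` column; the names `wanted` and `per_sample` are made up for illustration.

```python
import pandas as pd

# Same sample data as in the answer above.
d = {'sample1_reads': [1, 64, 53, 72, 16],
     'sample1_r.variant': ['r.79_80ins79+1_79+76', 'r.79_80ins79+10857_79+10938',
                           'r.79_80ins80-13725_80-13587', 'r.79_80ins80-5488_80-5435',
                           'r.79_80ins79+2861_79+2900'],
     'sample2_reads': [0, 3, 6, 9, 11],
     'sample2_r.variant': ['r.5333_5406del', 'r.4186_4188del', 'r.5333_54106del',
                           'r.2345_2345fad', 'r.65456_w56sjfy']}
df = pd.DataFrame(d)
rdot = pd.DataFrame(['r.79_80ins79+1_79+76', 'r.646_729del', 'r.5333_5406del',
                     'r.79_80ins80-5488_80-5435', 'r.79_80ins79+2861_79+2900'],
                    columns=['r_dot'])

suffix = '_r.variant'
wanted = set(rdot['r_dot'])  # the values to keep

# Build {sample_id: {variant: reads}} (hypothetical output layout).
per_sample = {}
for col in df.columns:
    if not col.endswith(suffix):
        continue
    sample_id = col[:-len(suffix)]
    reads_col = sample_id + '_reads'   # assumed naming convention
    mask = df[col].isin(wanted)        # exact string matches only
    per_sample[sample_id] = dict(zip(df.loc[mask, col].tolist(),
                                     df.loc[mask, reads_col].tolist()))

print(per_sample)
```

Because the columns are looked up by name, this version does not depend on the column order. If a variant appears more than once within a sample, the dictionary keeps only the last reads value for it.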

This is excellent!! Thank you so much! I am going to use this and manipulate it slightly to iterate through each file that stores individual sample info. I initially generated a large file with each sampleID's info (reads, variants) in columns like you saw in the screenshot. It seems to make more sense to do it by each individual file and then store the filtered values in a new dataframe as you posted above. THANK YOU so much for your help and patience. I really appreciate it! — srajpara, Jan 05 '21 at 19:39
