Let's say I have the two dataframes below.
In reality, both dataframes will be around a million rows each, so I would like to find the most efficient way to compare:
- each df2["BaseCall"] with each df1["seq"]
- return a dataframe that contains a list of positions on each df1["gene"] where any df2["BaseCall"] was found
The overall goal is to count the number of times each feature_id is found in a gene, and capture the position information for use downstream.
import pandas as pd

# break fasta_df sequences and mutation seqs up into kmers
data = [{"gene":"pik3ca", "start":"179148724", "stop":"179148949","seq":"TTTGCTTTATCTTTTGTTTTTGCTTTAGCTGAAGTATTTTAAAGTCAGTTACAG"},
{"gene":"brca1", "start":"179148724", "stop":"179148949","seq":"CAATATCTACCATTTGTTAACTTTGTTCTATTATCATAACTACCAAAATTAACAGA"},
{"gene":"kras1", "start":"179148724", "stop":"179148949","seq":"AAAACCCAGTAGATTTTCAAATTTTCCCAACTCTTCCACCAATGTCTTTTTACATCT"}]
# test dataframe with input seq
df1 = pd.DataFrame(data)
data2 = [{"FeatureID":"1_1_15", "BaseCall":"TTTGTT"},
{"FeatureID":"1_1_15", "BaseCall":"AATATC"},
{"FeatureID":"1_1_16", "BaseCall":"GTTTTT"},
{"FeatureID":"1_1_16", "BaseCall":"GTTCTA"},
]
df2 = pd.DataFrame(data2)
The output should look something like:
| gene   | feature_id | BaseCall | Position |
|--------|------------|----------|----------|
| pik3ca | 1_1_15     | TTTGTT   | 12       |
| pik3ca | 1_1_16     | GTTTTT   | 15       |
| brca1  | 1_1_16     | GTTCTA   | 24       |
| brca1  | 1_1_15     | AATATC   | 1        |
| brca1  | 1_1_15     | TTTGTT   | 12       |
| brca1  | 1_1_15     | TTTGTT   | 21       |
This ngram function works well when I test a single BaseCall against a single seq, but I'm having trouble figuring out an efficient way to use `apply` when the two arguments come from different dataframes. Or perhaps there is a better way altogether to find matching strings and their positions between two dataframes?
def ngrams(string, target):
    # all overlapping 6-mers of string
    ngrams = zip(*[string[i:] for i in range(6)])
    output = [''.join(ngram) for ngram in ngrams]
    # positions where the 6-mer equals target
    indices = [(i, x) for i, x in enumerate(output) if x == target]
    return indices
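One merge-based alternative I've been experimenting with, shown here as a sketch on the test data above (`kmer_rows` and `kmer_df` are names I made up): pre-compute every overlapping 6-mer of each gene together with its position, then inner-merge that against df2 on BaseCall so pandas does the matching instead of a nested apply.

```python
import pandas as pd

df1 = pd.DataFrame([
    {"gene": "pik3ca", "seq": "TTTGCTTTATCTTTTGTTTTTGCTTTAGCTGAAGTATTTTAAAGTCAGTTACAG"},
    {"gene": "brca1",  "seq": "CAATATCTACCATTTGTTAACTTTGTTCTATTATCATAACTACCAAAATTAACAGA"},
    {"gene": "kras1",  "seq": "AAAACCCAGTAGATTTTCAAATTTTCCCAACTCTTCCACCAATGTCTTTTTACATCT"},
])
df2 = pd.DataFrame([
    {"FeatureID": "1_1_15", "BaseCall": "TTTGTT"},
    {"FeatureID": "1_1_15", "BaseCall": "AATATC"},
    {"FeatureID": "1_1_16", "BaseCall": "GTTTTT"},
    {"FeatureID": "1_1_16", "BaseCall": "GTTCTA"},
])

k = 6
# one row per (gene, position, 6-mer)
kmer_rows = [
    {"gene": row.gene, "Position": i, "BaseCall": row.seq[i:i + k]}
    for row in df1.itertuples(index=False)
    for i in range(len(row.seq) - k + 1)
]
kmer_df = pd.DataFrame(kmer_rows)

# inner merge keeps only k-mers that appear in df2["BaseCall"]
matches = kmer_df.merge(df2, on="BaseCall", how="inner")
```

`matches` then has one row per hit with columns gene, Position, BaseCall, FeatureID, so the per-gene counts fall out of a `groupby`. I don't know how well this scales to a million rows each, since the k-mer frame is roughly `len(seq) - 5` rows per gene before the merge.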