I have a list of genes and need to check whether any gene from the list appears in the 'Article Title'. If one does, I want the start and end positions of that gene in the sentence.

The code below identifies the gene and reports its word index in the sentence. However, I need help finding the character start and end positions of the gene.

import xml.etree.ElementTree as ET

# `tree` is an ElementTree that has already been parsed from the XML file.
doc = tree.getroot()
gene_list = ["ABCD1","ADA","ALDOB","APC","ARSB","ATAD3B","AXIN2","BLM","BMPR1A","BRAF","BRCA1"]
for ArticleTitle in doc.iter('ArticleTitle'):
    file1 = ET.tostring(ArticleTitle, encoding='utf8').decode('utf8')
    # Drop the serialized prefix (XML declaration and opening tag),
    # then keep everything up to the next "<".
    filename = file1[52:]
    Article = filename.split("<")[0]
    # print(Article)
    # print(type(Article))
    title = Article.split()
    for item in title:
        if item in gene_list:
            str_title = ' '.join(title)
            print(str_title)
            print("Gene Found: " + item)
            # Word index of the first occurrence of the gene in the title
            index = title.index(item)
            print("Index of the Gene :" + str(index))

            # Total number of characters in the title, spaces included
            result = len(str_title)
            print(result)

Current output is:

Healthy people 2000: a call to action for ADA members.
Gene Found: ADA
Index of the Gene :8
54

Expected output is:

Healthy people 2000: a call to action for ADA members.
Gene Found: ADA
Index of the Gene :8
Gene start position: 42
Gene End position: 45

The start and end position should count the spaces between the words too.

glibdud
RRg
  • You have to parse the document and then make a list of each word's start point and its index value; then you can do this. – sahasrara62 Jan 04 '19 at 15:45
  • You could use the [index](https://docs.python.org/3/library/stdtypes.html#str.index) method, but if you must match the word completely I suggest you take a look at regex. – Dani Mesejo Jan 04 '19 at 15:46
  • Related: https://stackoverflow.com/questions/21842885/python-find-a-substring-in-a-string-and-returning-the-index-of-the-substring – Dani Mesejo Jan 04 '19 at 15:47
  • @DanielMesejo This helped me! I could get the end and start position! Thanks – RRg Jan 04 '19 at 16:06
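
The first comment's idea (record where each word starts) can be sketched with `re.finditer` over runs of non-whitespace. This is only an illustration: `find_gene_words` is a hypothetical helper name, and it keeps the question's exact-token comparison, so a gene followed directly by punctuation (for example `ADA.`) would not match.

```python
import re

def find_gene_words(title, genes):
    """Return (gene, word_index, start, end) for each word that equals a gene.

    Hypothetical helper, not part of the original question's code.
    """
    gene_set = set(genes)
    hits = []
    # Each match of \S+ is one whitespace-separated word; m.start() and
    # m.end() are character offsets in the title, spaces included.
    for word_index, m in enumerate(re.finditer(r"\S+", title)):
        if m.group() in gene_set:
            hits.append((m.group(), word_index, m.start(), m.end()))
    return hits

title = "Healthy people 2000: a call to action for ADA members."
for gene, word_index, start, end in find_gene_words(title, ["ABCD1", "ADA", "ALDOB"]):
    print("Gene Found: " + gene)
    print("Index of the Gene :" + str(word_index))
    print("Gene start position: " + str(start))
    print("Gene End position: " + str(end))
```

Because the offsets come from the match objects, there is no need to rebuild the title with `' '.join(...)` or to count characters by hand.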

2 Answers

You could use a regex:

import re

genes = ["ABCD1","ADA","ALDOB","APC","ARSB"]
pattern = '|'.join(genes)
test_string = 'Healthy people 2000: a call to action for ADA members.'
pos = 0  # word index of the current token
for word in test_string.split():
    m = re.search(pattern, word)
    if m:
        gene = m.group(0)
        # find() returns the first occurrence of the gene in the string
        start = test_string.find(gene)
        end = start + len(gene)
        print(start, end, gene, pos)
    pos += 1

Output

42 45 ADA 8

A shorter solution, which gives the character positions but not the word index, could be

import re

genes = ["ABCD1","ADA","ALDOB","APC","ARSB"]
pattern = '|'.join(genes)
test_string = 'Healthy people 2000: a call to action for ADA members.'

[(m.start(), m.group(0), m.end()) for m in re.finditer(pattern, test_string)]
mad_
  • The match object (`m`) has `start()` and `end()` methods. – Dani Mesejo Jan 04 '19 at 16:07
  • @DanielMesejo Yup, I know, but I am iterating over the string as a list, so I am matching one word at a time. `m.start()` will always give me 0. My other suggestion is to use `re.finditer`, but again I don't think that will give the exact output which is needed here. – mad_ Jan 04 '19 at 16:11
  • I see, you could use word boundaries in the regex to avoid splitting on space – Dani Mesejo Jan 04 '19 at 16:12
  • Hmm I cannot see how to map key and value with original string. the dictionary might explode for long strings and that too if the gene could not be found. Would you mind posting as an answer? – mad_ Jan 04 '19 at 16:15
  • @mad_ The code works fine for the above test string. However, for the following test string: PAH-alpha-KG countertransport stimulates PAH uptake and net secretion in isolated snake renal tubules. The gene is 'PAH' which is in index 3, however, the code detects the position of 'PAH' as 0. – RRg Jan 04 '19 at 20:44
  • @RRg How is PAH at index 3? I see it at index 0. – mad_ Jan 04 '19 at 20:48
  • @mad_ So PAH-alpha-KG is a different gene as compared to PAH. – RRg Jan 04 '19 at 20:53
  • @RRg Can you post the entire thing which you have tried here – mad_ Jan 04 '19 at 21:03
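
The `PAH` case from the comments can be handled by matching against the whole string with lookarounds instead of splitting on spaces. The following is a sketch under one assumption taken from RRg's comment: a gene name joined to other text by a hyphen (as in `PAH-alpha-KG`) counts as a different name and must not match. `find_standalone_genes` is a hypothetical helper name.

```python
import re

def find_standalone_genes(text, genes):
    """Return (gene, start, end, word_index) for each standalone gene.

    Hypothetical helper; "standalone" here means not preceded or followed
    by a word character or a hyphen (an assumption based on the comments).
    """
    # Escape each name and try longer names first.
    alternation = "|".join(
        re.escape(g) for g in sorted(genes, key=len, reverse=True)
    )
    pattern = re.compile(r"(?<![\w-])(?:" + alternation + r")(?![\w-])")
    results = []
    for m in pattern.finditer(text):
        # The word index is the number of whitespace-separated words
        # that come before the match.
        word_index = len(text[:m.start()].split())
        results.append((m.group(0), m.start(), m.end(), word_index))
    return results

text = ("PAH-alpha-KG countertransport stimulates PAH uptake and net "
        "secretion in isolated snake renal tubules.")
print(find_standalone_genes(text, ["PAH", "ADA"]))
```

A plain `\b` word boundary would not be enough here, because a hyphen is itself a boundary, so `\bPAH\b` would still match inside `PAH-alpha-KG`.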
We can use FlashText as well:

from flashtext import KeywordProcessor

kpo = KeywordProcessor(case_sensitive=True)

gene_list = ["ABCD1","ADA","ALDOB","APC","ARSB","ATAD3B","AXIN2","BLM","BMPR1A","BRAF","BRCA1"] 

for word in gene_list:
    kpo.add_keyword(word)

kpo.extract_keywords("Healthy people 2000: a call to action for ADA members.",span_info=True)
# Output: [('ADA', 42, 45)]
qaiser