How to extract line numbers that match a regular expression in a text file

Question

I'm doing a project on statistical machine translation in which I need to find the lines of a POS-tagged text file that match a regular expression (any non-separated phrasal verb with the particle 'out') and write their line numbers to a file, in Python.

I have this regular expression: '\w*_VB.?\sout_RP' and my POS-tagged text file: 'Corpus.txt'. I would like to get an output file with the line numbers that match the above-mentioned regular expression, and the output file should just have one line number per line (no empty lines), e.g.:

2
5
44

So far all I have in my script is the following:

OutputLineNumbers = open('OutputLineNumbers', 'w')
with open('Corpus.txt', 'r') as textfile:
    phrase='\w*_VB.?\sout_RP'
    for phrase in textfile: 

OutputLineNumbers.close()

Any idea how to solve this problem?

In advance, thanks for your help!

score 6 · Accepted Answer · edited Jun 09 '16 at 14:31

This should solve your problem, presuming you pass the correct regex to re.compile (the '[0-9]+' pattern below is only a placeholder; substitute your own):

import re

# compile regex (placeholder pattern; substitute your own, as a raw string)
regex = re.compile(r'[0-9]+')

# open the files
with open('Corpus.txt','r') as inputFile:
    with open('OutputLineNumbers', 'w') as outputLineNumbers:
        # loop through each line in corpus
        for line_i, line in enumerate(inputFile, 1):
            # check if we have a regex match
            if regex.search( line ):
                # if so, write it the output file
                outputLineNumbers.write( "%d\n" % line_i )
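
As a rough sketch of how the same loop might look with the pattern from the question plugged in, the version below wraps the logic in two small functions. The function names and the sample lines are invented for illustration and are not part of the original answer; the sample assumes a simple word_TAG format.

```python
import re

# Pattern taken from the question: a verb tag directly followed by the particle 'out'.
PHRASAL_OUT = re.compile(r'\w*_VB.?\sout_RP')

def matching_line_numbers(lines, regex=PHRASAL_OUT):
    """Return the 1-based numbers of the lines in which regex matches anywhere."""
    return [line_no for line_no, line in enumerate(lines, 1) if regex.search(line)]

def write_line_numbers(in_path, out_path, regex=PHRASAL_OUT):
    """Write one matching line number per line of out_path (hypothetical helper)."""
    with open(in_path, 'r') as corpus, open(out_path, 'w') as out:
        for line_no in matching_line_numbers(corpus, regex):
            out.write('%d\n' % line_no)

# Invented sample lines, for illustration only.
sample = [
    'He_PRP walked_VBD home_NN\n',
    'She_PRP found_VBD out_RP the_DT truth_NN\n',
    'They_PRP ran_VBD quickly_RB\n',
    'We_PRP will_MD figure_VB out_RP a_DT plan_NN\n',
]
print(matching_line_numbers(sample))
```

With the file names from the question, the call would presumably be write_line_numbers('Corpus.txt', 'OutputLineNumbers').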

edited Jun 09 '16 at 14:31 by themadmax

answered Jun 12 '13 at 22:54 by Kalyan02

Using `for line_i, line in enumerate(inputFile, 1)` would simplify this. – Janne Karila Jun 13 '13 at 06:20
Thanks a lot. Only thing I failed to clarify was that the phrase could be part of a sentence, so I will have to use re.findall instead of re.match, and that works! Thanks again :-) – user2468610 Jun 13 '13 at 10:32
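
To illustrate the point in the comment above, here is a small sketch of how re.match, re.search and re.findall behave when the phrase sits in the middle of a line. The example line is invented; the pattern is the one from the question.

```python
import re

regex = re.compile(r'\w*_VB.?\sout_RP')  # pattern from the question

# Invented example line; the phrase is in the middle of the sentence.
line = 'She_PRP found_VBD out_RP the_DT truth_NN'

# match() only tries at the very start of the string, so it finds nothing here.
print(regex.match(line))

# search() scans the whole line and returns the first match object.
print(regex.search(line))

# findall() returns a list of every matched substring.
print(regex.findall(line))
```

For simply deciding whether a line matches, search() is enough; findall() is mainly useful if the matched phrases themselves are needed.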

score 2 · Answer 2 · answered Jun 13 '13 at 11:38

You can do it directly in bash if your regular expression is grep-friendly. Use the "-n" option to show the line numbers.

For example:

grep -n "[1-9][0-9]" tags.txt

will output the matching lines, each prefixed with its line number:

2569:vote2012
2570:30
2574:118
2576:7248
2578:2293
2580:9594
2582:577
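
Since the question asks for the line numbers alone, the text after the colon would still need to be stripped. A minimal Python sketch of that step follows, assuming the output comes from grep -n run on a single file (so each row has the form number:text, with no file name prefix). The function name is made up for illustration.

```python
def line_numbers_from_grep(grep_output):
    """Extract the leading line number from each 'N:text' row of grep -n output."""
    numbers = []
    for row in grep_output.splitlines():
        if not row:
            continue  # skip empty rows
        # Split on the first colon only; the matched text may contain colons too.
        number, _, _ = row.partition(':')
        numbers.append(int(number))
    return numbers

# First rows of the sample output shown above.
grep_output = """2569:vote2012
2570:30
2574:118
"""
print(line_numbers_from_grep(grep_output))
```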
