1. What is the most robust way to match a blank line in a file?

2. What is the most efficient way to match a blank line in a file?

3. What are the differences between the following ways to match a blank line?

Context:

I'm trying to understand some possible gotchas in the process of detecting blank lines in a file in Python.

I can think of five ways, which I'll define as follows:

  1. BLANKS: use regex to match the blank line ^$
  2. NEWLINES: use regex to match the newline characters ^\r?\n
  3. EMPTIES: match the empty string, ''
  4. PNEWLINES: match the new line character, '\n'
  5. STRIPS: strip the line with strip() and then match an empty string.

In [3]: i = 0
   ...: fin = open('warandpeace.txt', 'rt')
   ...: blanks = []
   ...: empties = []
   ...: newlines = []
   ...: pnewlines = []
   ...: strips = []
   ...: NEWLINE = re.compile(r'^\r?\n')
   ...: BLANK = re.compile(r'^$')
   ...: for line in fin:
   ...:     if re.search(BLANK, line):
   ...:         blanks.append((i, line))
   ...:     if re.search(NEWLINE, line):
   ...:         newlines.append((i, line))
   ...:     if line == '':
   ...:         empties.append((i, line))
   ...:     if line == '\n':
   ...:         pnewlines.append((i, line))
   ...:     if line.strip == '':
   ...:         strips.append((i, line))
   ...:     i += 1

In [4]: print((len(blanks), len(empties), len(newlines), len(pnewlines), len(strips)))
(13892, 0, 13892, 13892, 0)

They do not seem to be equivalent, though three of them (BLANKS, NEWLINES, and PNEWLINES) gave identical counts. My input is a UTF-8 text copy of War and Peace by Leo Tolstoy from Project Gutenberg with, I believe, Windows line endings. I am unsure what else to test it on.

I observe the following:

  • Both BLANKS and NEWLINES should be cross-platform portable, i.e., able to handle Windows line endings (I don't know about Mac, actually).
  • The EMPTIES method fails, clearly, because reading in the lines from the file retains the \n, and it is therefore not an empty string.
  • The PNEWLINES method will fail in the case of Windows line endings.
  • I have no idea why the STRIPS method fails. I thought it stripped leading and trailing white space, so it should work.
  • All of them will fail if there are spaces in the blank line, which is an easy fix in the case of the regex methods (not concerned with that case).

What I am concerned with is the most robust method for matching blank lines. I always used '^$' with sed, but in Python, I'm honestly not even understanding how the lines are split in the first place! It seems strange, and counter-intuitive to me, that the new line is retained despite Python splitting on the newline.

What are the real differences between all of these, besides the superficial ones? For instance, why are '^$' and '^\r?\n' equivalent? Which is the best way? What other ways are there?
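
The newline behaviour described above can be reproduced on a tiny file rather than the whole novel. This is only a minimal sketch: the sample bytes and the file name `sample.txt` are made up for illustration. It writes three lines with Windows (CRLF) endings and reads them back twice, once with the default text mode and once with `newline=''`.

```python
import os
import tempfile

# Hypothetical sample: three CRLF-terminated lines, the middle one blank.
data = b'first\r\n\r\nlast\r\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    with open(path, 'wb') as f:
        f.write(data)

    # Default text mode uses universal newlines: '\r\n' is translated to '\n',
    # and each line keeps its (translated) terminator.
    with open(path, 'rt') as f:
        translated = list(f)

    # newline='' still splits on any line ending but leaves it untranslated.
    with open(path, 'rt', newline='') as f:
        raw = list(f)

print(translated)  # ['first\n', '\n', 'last\n']
print(raw)         # ['first\r\n', '\r\n', 'last\r\n']
```

If the default mode is what produced the counts above, the loop never saw a '\r' at all, which may be why the '\n'-based checks agree.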

mas

1 Answer

This is more of a theory question, but a simple approach is to follow this logic:

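file = 'warandpeace.txt'  # path to the input file; this name is taken from the question

with open(file, 'r') as fin:
    lines = fin.readlines()

for line in lines:
    # strip() removes leading and trailing whitespace, including '\r' and '\n'
    if len(line.strip()) == 0:
        print('Empty')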
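
To make the differences between the five checks from the question concrete, here is a rough sketch that runs each of them over a few sample strings. The helper name `classify` and the sample strings are my own illustrative assumptions. Note that the STRIPS check in the question compares the method object `line.strip` (no parentheses) to `''`, which is never true; the sketch calls `line.strip()` instead.

```python
import re

BLANK = re.compile(r'^$')
NEWLINE = re.compile(r'^\r?\n')

def classify(line):
    """Return the names of the question's checks that match `line`."""
    checks = {
        'BLANKS': BLANK.search(line) is not None,
        'NEWLINES': NEWLINE.search(line) is not None,
        'EMPTIES': line == '',
        'PNEWLINES': line == '\n',
        'STRIPS': line.strip() == '',  # strip() is called here, unlike in the question
    }
    return [name for name, matched in checks.items() if matched]

# Illustrative inputs: LF blank, CRLF blank, empty string,
# whitespace-only line, and a non-blank line.
samples = ['\n', '\r\n', '', '   \n', 'text\n']
for s in samples:
    print(repr(s), classify(s))
```

Without `re.MULTILINE`, `$` matches at the end of the string or just before a single trailing '\n', so `^$` matches '\n' but not a raw '\r\n'. In text mode `open()` translates '\r\n' to '\n' by default, so the loop in the question most likely only ever saw '\n' for blank lines, which would explain the identical counts for BLANKS, NEWLINES, and PNEWLINES. Of the five, only the `strip()` check also catches lines that contain spaces or tabs.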
Chandu