66

I have some code that reads a file of names and creates a list:

names_list = open("names", "r").read().splitlines()

Each name is separated by a newline, like so:

Allman
Atkinson

Behlendorf 

I want to ignore any lines that contain only whitespace. I know I can do this by creating a loop, checking each line as I read it, and adding it to a list if it's not blank.

I was just wondering if there was a more Pythonic way of doing it?
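For reference, a minimal sketch of the loop approach described above (assuming the file is named "names", as in the snippet):

names_list = []
for line in open("names", "r").read().splitlines():
    stripped = line.strip()
    if stripped:  # skip lines that are empty or contain only whitespace
        names_list.append(stripped)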

codeforester
Ambrosio
  • There is one answer here: http://stackoverflow.com/questions/4791080/python-delete-newline-return-carriage-in-file-output – aqua Jan 30 '11 at 09:13

10 Answers

93

I would stack generator expressions:

with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in) # All lines including the blank ones
    lines = (line for line in lines if line) # Non-blank lines

Now, lines yields all of the non-blank lines. This saves you from having to call strip on each line twice. If you want a list of lines, then you can just do:

with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in) 
    lines = list(line for line in lines if line) # Non-blank lines in a list

You can also do it in a one-liner (excluding the with statement), but it's no more efficient and it's harder to read:

with open(filename) as f_in:
    lines = list(line for line in (l.strip() for l in f_in) if line)

Update:

I agree that this is ugly because of the repetition of tokens. You could just write a generator if you prefer:

def nonblank_lines(f):
    for l in f:
        line = l.rstrip()
        if line:
            yield line

Then call it like:

with open(filename) as f_in:
    for line in nonblank_lines(f_in):
        # Stuff

Update 2:

You can also use filter:

with open(filename) as f_in:
    lines = filter(None, (line.rstrip() for line in f_in))

and on CPython (where deterministic reference counting will close the file promptly) you can even drop the with statement:

lines = filter(None, (line.rstrip() for line in open(filename)))

In Python 2 use itertools.ifilter if you want a generator and in Python 3, just pass the whole thing to list if you want a list.
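For instance, a small sketch of the Python 3 form (with filename being the path to the names file, as above); in Python 2 the same shape works with itertools.ifilter in place of filter for the lazy variant:

with open(filename) as f_in:
    lines = list(filter(None, (line.rstrip() for line in f_in)))  # list of non-blank lines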

aaronasterling
  • I think the third line in your first code should read `for line in lines`. – Felix Kling Jan 30 '11 at 09:16
  • "This will save you from having to call strip on the line twice." - yes; and it's certainly neater in that regard; but you still end up repeating yourself, and I have to wonder if it really makes up performance-wise for going through the overhead of chaining generators like that. Anyone want to do some tests? – Karl Knechtel Jan 30 '11 at 09:18
  • @Karl: He uses two generator expressions, but that's not repetition--they're entirely different expressions. – Glenn Maynard Jan 30 '11 at 09:24
  • @Glenn I mean that tokens like 'line' get repeated a bunch. – Karl Knechtel Jan 30 '11 at 09:27
  • @Karl Knechtel, Check my update. Do you think the generator approach is better? – aaronasterling Jan 30 '11 at 09:35
  • Arguably. I don't think anything is perfect here. :) – Karl Knechtel Jan 30 '11 at 09:37
  • @aaronasterling: Thanks, your answers were really interesting and helpful. – Ambrosio Jan 30 '11 at 09:44
  • @Karl, Okay, What do you think about `filter`? – aaronasterling Jan 30 '11 at 09:56
  • The same thing. I mean, you could show me 100 subtly different variations, and I don't think I could choose a clear winner. :) Any of these is fine though really. – Karl Knechtel Jan 30 '11 at 09:57
  • @aaronsterling: Your first oneliner is so hard to read that neither you nor the merry band of up-voters and commenters noticed that it needs `()` after `l.strip` :-) – John Machin Jan 30 '11 at 10:47
  • +1 for `nonblank_lines` function. This should be first. The rest is either code golf or a memory hog because it reads whole files into single lists where (sometimes) the whole file isn't needed. – S.Lott Jan 30 '11 at 13:00
  • @S.Lott What do you mean by "code golf" and "memory hog", please? I am not a native English speaker and I sometimes miss the subtleties of the English language. Besides, I also don't understand the part "it reads whole files into single lists where (sometimes) the whole file isn't needed": yes, if the problem were more complex, the solution would have to be more complex... but what do we do now, given this remark? – eyquem Jan 31 '11 at 13:50
  • @eyquem: Code Golf can be found in Google. Try using a search to see what you can find. Like golf, it's a game of minimizing the amount of code to achieve a given piece of programming. Merely minimizing the code is rarely helpful. No one wins at Code Golf. – S.Lott Jan 31 '11 at 13:56
  • @eyquem: "Memory Hog" means that the first example uses more memory than necessary. It's a "pig" and eats too much memory. There's rarely a need to read whole file into memory at one time in order to apply a simple filter. The `nonblank_lines` generator function accomplishes the required filter without reading the entire file into memory. – S.Lott Jan 31 '11 at 13:57
  • @S.Lott Oh thanks. My googling reflex wasn't activated because I believed that these expressions were of your own creativity, based on the english language pliability. :( Concerning your remark, I still don't understand: as far as I understood the mechanism, a 'for line in ...' reads progressively a file, without a beforehand load of the entire file into RAM. And in the aaronasterling's three solutions, there is 'for line in f_in' instruction. So what ?! – eyquem Jan 31 '11 at 14:31
  • @eyquem: `lines= ...` is the entire file as a single list. Is that not clear? It reads the entire file into a single list. All the file. In memory. At the same time. The generator function does not read the entire file all into memory at the same time; it does not create a single list with the contents of the file. – S.Lott Jan 31 '11 at 14:33
  • @S.Lott nonblank_lines() in solution 2 is a function that reads the file, as does the other generator (line.rstrip() for line in f_in) in solutions 1 and 3 of aaronasterling; and all three do so in the same progressive, on-demand manner. They are not responsible for the way in which they are used, whether by a list() call, an iteration or a filter() call, to obtain a data-storing object. So the size of the resulting list, weighing on the RAM if the file is enormous, can't be a criterion for deciding that one of these three equivalent generators is better than the others. – eyquem Jan 31 '11 at 17:14
  • @eyquem: `lines= ...` reads the entire file as a single list. There's no alternative purpose behind that line of code. Agree? The function can be used to process the file one line at a time, without saving the entire file in memory. Yes, it can also be used to read the entire file, but the function can be used a variety of ways. Agree? And the function can be used to process one line at a time. Agree? The `lines =...` assignment statement cannot be used to process lines one at a time because it **must** read the entire file into a **single** list. – S.Lott Jan 31 '11 at 17:38
  • @S.Lott which lines= ... assignement please ? There are plenty of lines= something in the post of aaronasterling – eyquem Jan 31 '11 at 17:51
  • @eyquem: "which lines= ... assignement please ?" They all do the same thing different ways. They read the entire file into memory. Unlike the function, which can be used to process one line at a time without loading all lines into memory. – S.Lott Jan 31 '11 at 18:09
  • @S.Lott I posted an answer in a plain frame because it's too narrow here. But yes "They all do the same thing different ways.", nonblank_lines() comprised, they are all equivalent generators, nonblank_lines() has nothing special. By the way, do you speak about the ways of READING A FILE or of the different ALGORITHMS using different tools going from file to the resulting storing object ? – eyquem Jan 31 '11 at 18:26
  • @S.Lott "They read the entire file into memory. ". I repeat my answer: "They are not responsible of the way in which they are used" and of the size of the resulting object that stores the extracted data. They do their job - reading on demand a file - , they don't decide of the destination of read data. Hence you can't accuse them of the result's size and of the desire of the developper to record the result in memory. – eyquem Jan 31 '11 at 18:34
  • @eyquem: You point escapes me entirely. The `lines=...` statements -- all of them -- must read the entire file. No choices. No alternatives. The function definition is utterly different. It can be used in a context in which each line is processed separate. It has choices. It has alternatives. – S.Lott Jan 31 '11 at 18:57
  • @aaronasterling: Thank you! Now I see @eyquem's point. And there are only three `lines =` with no list or filter. I understand the point that `lines = (...)` will be a generator. And I finally see the subtlety that I had wrong. I still think you should change the order of your answer. – S.Lott Jan 31 '11 at 19:23
  • @S.Lott For me, (line.rstrip() for line in f_in) and nonblank_lines() AS WRITTEN in aaronsterling's post are both obliged to read the entire file. You are right, though, in the sense that an 'if test: break' instruction can be inserted into the present nonblank_lines() to turn it into a stoppable nonblank_lines() that won't read the entire file. It also allows any amount of processing to be put between the call and the yield. These are the differences that make (line.rstrip() for line in f_in) a generator expression and nonblank_lines() a generator function. Maybe I take the words too strictly – eyquem Jan 31 '11 at 19:39
  • @aaronasterling " I do agree with you that the function is the nicest way to do this though. " And I still wonder why..... – eyquem Jan 31 '11 at 19:42
  • @eyquem: The `nonblank_lines()` function and the `(line.rstrip() for line in f_in)` are **BOTH** generator functions. I was mistaken about some of the `lines=` examples. Some of the `lines=` examples are generator functions that do not create in-memory lists. Some of the `lines=` examples do create giant lists. The `nonblank_lines()` function and the `(line.rstrip() for line in f_in)` are **BOTH** generator functions. – S.Lott Jan 31 '11 at 19:43
  • @S.Lott "I was mistaken about some of the lines= examples. Some of the lines= examples are generator functions that do not create in-memory lists. Some of the lines= examples do create giant lists." Yes, it is badly written, with the same name for different similar objects. As it is also bad to call 'lines' what is a list of names. – eyquem Jan 31 '11 at 19:58
  • @S.Lott But no, the nonblank_lines() function and the (line.rstrip() for line in f_in) are NOT both generator FUNCTIONS. The latter is a generator EXPRESSION. Compare http://docs.python.org/reference/datamodel.html#index-862 and http://docs.python.org/reference/expressions.html#index-948 . A gen. func. and a gen. expr. similarly produce a generator object, but a gen. func. has the keyword yield, can contain a break instruction and is called, all things that a gen. expr. hasn't. – eyquem Jan 31 '11 at 20:12
  • @eyquem: Syntax aside (one's a function, one's an expression) they are both generators. Thanks for the links, but I'm quite aware of the differences. The hair-splitting doesn't seem helpful. If it makes you happy, though, keep on posting. In spite of my confusion over the `lines=` with and without `list`, I have one and only one point. That is, I prefer the function notation over the expression. That's all. I was confused by your point. I am no longer confused. I still have nothing much to say. I have a preference of function over expression. That's all there is. – S.Lott Jan 31 '11 at 20:16
  • @S.Lott Moreover,I think the following doc's extract applies to these 2 types of objects: «The difference between a code object and a function object is that the function object contains an explicit reference to the function’s globals, while a code object contains no context; also the default argument values are stored in the function object, not in the code object (because they represent values calculated at run-time). Unlike function objects, code objects are immutable and contain no references (directly or indirectly) to mutable objects.» http://docs.python.org/reference/datamodel.html – eyquem Jan 31 '11 at 20:18
  • @eyquem: I still prefer the function notation over the expression. That's all. – S.Lott Jan 31 '11 at 20:19
  • @S.Lott It's not syntax,it's meaning. I prefer to believe in the official doc that seems to make a real difference between gen func an gen expr, even if I don't fully understand all that is behind the scene. Yes, I'm interested in behind-the-scene mechanisms, even if it's hard to understand for me. You shouldn't call that hair-splitting. It's not a matter of notation according to me, it's about the underlying implementations of objects. – eyquem Jan 31 '11 at 21:41
  • @S.Lott But it seems we have divergent interests. I'm interested by comparing options while your last assert is a preference argument in which there is no more golf nor hog to justify. It's your right to do so, but not a reason to mock of me about being happy to study Python instead of C++ or Basic. – eyquem Jan 31 '11 at 21:43
  • @S.Lott On my side it's my right to like to study the innards of Python and to continue to think there is no rational reason to qualify nonblank_lines() a better tool than the others. I regret to have entered in a debate where there was less to learn than I believed. I could say more but it's preferable for me to end off. I thank you for your answers. – eyquem Jan 31 '11 at 21:43
  • @eyquem: The rational reason for preferring the function is because it is a function. It's compatible with functional programming. It allows trivial composition as part of building a larger function out of smaller functions. That's the rational reason why I prefer functions. – S.Lott Jan 31 '11 at 21:47
  • @S.Lott Thank you, but I wish to stop. In fact I have the impression that I don't have the same logic and motivations as you. For example, justifying a choice in a particular problem with a reason as general as functional programming, I don't know what to make of that. Hence I get no satisfaction from arguing in a way in which I don't understand the assertions and in which I am supposed to be in the wrong. I don't think there could be an end to such a debate. – eyquem Jan 31 '11 at 22:12
  • @eyquem: "so general reason than functional programming" is the only reason I have. It's rational. It's my reason. What more do you want? Magic? – S.Lott Jan 31 '11 at 22:16
25

You could use a list comprehension:

with open("names", "r") as f:
    names_list = [line.strip() for line in f if line.strip()]

Updated: Removed unnecessary readlines().

To avoid calling line.strip() twice, you can use a nested generator expression (f being the open file from the with block above):

names_list = [l for l in (line.strip() for line in f) if l]
Felix Kling
13

If you want you can just put what you had in a list comprehension:

names_list = [line for line in open("names.txt", "r").read().splitlines() if line]

or

all_lines = open("names.txt", "r").read().splitlines()
names_list = [name for name in all_lines if name]

splitlines() has already removed the line endings.
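A quick illustration (not from the original answer): splitlines() drops the newline characters but keeps blank lines as empty strings, which the if line test then filters out. Note that a line containing only spaces is not empty and would survive the test unless stripped first:

>>> "Allman\nAtkinson\n\n  \nBehlendorf\n".splitlines()
['Allman', 'Atkinson', '', '  ', 'Behlendorf']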

I don't think those are as clear as just looping explicitly though:

names_list = []
with open('names.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if line:
            names_list.append(line)

Edit:

filter looks quite readable and concise, though:

names_list = filter(None, open("names.txt", "r").read().splitlines())
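One caveat not in the original answer: in Python 3, filter returns a lazy iterator rather than a list, so wrap the call in list() if you actually need a list:

names_list = list(filter(None, open("names.txt", "r").read().splitlines()))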

Sean
11

I think there is a simple solution, which I recently used after going through the many answers here.

with open(file_name) as f_in:   
    for line in f_in:
        if len(line.split()) == 0:
            continue
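        # otherwise, process the non-blank line here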

This just does the same job, ignoring all blank lines.

quapka
a_r
5

When text must be processed just to extract data from it, I always think of regexes first, because:

  • as far as I know, regexes were invented for that

  • iterating over lines seems clumsy to me: it essentially consists of searching for the newlines and then searching for the data to extract within each line; that makes two searches instead of a single direct one with a regex

  • bringing regexes into play is easy; only the writing of the regex string to be compiled into a regex object is sometimes hard, but in such a case the treatment with an iteration over lines would be complicated too

For the problem discussed here, a regex solution is fast and easy to write:

import re
names = re.findall(r'\S+', open(filename).read())

I compared the speeds of several solutions:

import re
from time import clock

A,AA,B1,B2,BS,reg = [],[],[],[],[],[]
D,Dsh,C1,C2 = [],[],[],[]
F1,F2,F3  = [],[],[]

def nonblank_lines(f):
    for l in f:
        line = l.rstrip()
        if line:  yield line

def short_nonblank_lines(f):
    for l in f:
        line = l[0:-1]
        if line:  yield line

for essays in xrange(50):

    te = clock()
    with open('raa.txt') as f:
        names_listA = [line.strip() for line in f if line.strip()] # Felix Kling
    A.append(clock()-te)

    te = clock()
    with open('raa.txt') as f:
        names_listAA = [line[0:-1] for line in f if line[0:-1]] # Felix Kling with line[0:-1]
    AA.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        namesB1 = [ name for name in (l.strip() for l in f_in) if name ] # aaronasterling without list()
    B1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        namesB2 = [ name for name in (l[0:-1] for l in f_in) if name ] # aaronasterling without list() and with line[0:-1]
    B2.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        namesBS = [ name for name in f_in.read().splitlines() if name ] # a list comprehension with read().splitlines()
    BS.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f:
        xreg = re.findall('\S+',f.read()) #  eyquem
    reg.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesC1 = list(line for line in (l.strip() for l in f_in) if line) # aaronasterling
    C1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesC2 = list(line for line in (l[0:-1] for l in f_in) if line) # aaronasterling  with line[0:-1]
    C2.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        yD = [ line for line in nonblank_lines(f_in)  ] # aaronasterling  update
    D.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        yDsh = [ name for name in short_nonblank_lines(f_in)  ] # nonblank_lines with line[0:-1]
    Dsh.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesF1 = filter(None, (line.rstrip() for line in f_in)) # aaronasterling update 2
    F1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesF2 = filter(None, (line[0:-1] for line in f_in)) # aaronasterling update 2 with line[0:-1]
    F2.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesF3 =  filter(None, f_in.read().splitlines()) # aaronasterling update 2 with read().splitlines()
    F3.append(clock()-te)


print 'names_listA == names_listAA==namesB1==namesB2==namesBS==xreg\n  is ',\
       names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
print 'names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3\n  is ',\
       names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3,'\n\n\n'


def displ((fr,it,what)):  print fr + str( min(it) )[0:7] + '   ' + what

map(displ,(('* ', A,    '[line.strip() for line in f if line.strip()]               * Felix Kling\n'),

           ('  ', B1,   '    [name for name in (l.strip() for l in f_in) if name ]    aaronasterling without list()'),
           ('* ', C1,   'list(line for line in (l.strip() for l in f_in) if line)   * aaronasterling\n'),          

           ('* ', reg,  're.findall("\S+",f.read())                                 * eyquem\n'),

           ('* ', D,    '[ line for line in       nonblank_lines(f_in)  ]           * aaronasterling  update'),
           ('  ', Dsh,  '[ line for line in short_nonblank_lines(f_in)  ]             nonblank_lines with line[0:-1]\n'),

           ('* ', F1 ,  'filter(None, (line.rstrip() for line in f_in))             * aaronasterling update 2\n'),

           ('  ', B2,   '    [name for name in (l[0:-1]   for l in f_in) if name ]    aaronasterling without list() and with line[0:-1]'),
           ('  ', C2,   'list(line for line in (l[0:-1]   for l in f_in) if line)     aaronasterling  with line[0:-1]\n'),

           ('  ', AA,   '[line[0:-1] for line in f if line[0:-1]  ]                   Felix Kling with line[0:-1]\n'),

           ('  ', BS,   '[name for name in f_in.read().splitlines() if name ]        a list comprehension with read().splitlines()\n'),

           ('  ', F2 ,  'filter(None, (line[0:-1] for line in f_in))                  aaronasterling update 2 with line[0:-1]'),

           ('  ', F3 ,  'filter(None, f_in.read().splitlines()                        aaronasterling update 2 with read().splitlines()'))
    )

The solution with a regex is straightforward and neat, though it isn't among the fastest ones. The solution of aaronasterling with filter() is surprisingly fast for me (I wasn't aware of this particular speed of filter()), and the times of the optimized solutions go down to 27 % of the longest time. I wonder what makes the filter-splitlines combination so fast:

names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
  is  True
names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3
  is  True 



* 0.08266   [line.strip() for line in f if line.strip()]               * Felix Kling

  0.07535       [name for name in (l.strip() for l in f_in) if name ]    aaronasterling without list()
* 0.06912   list(line for line in (l.strip() for l in f_in) if line)   * aaronasterling

* 0.06612   re.findall("\S+",f.read())                                 * eyquem

* 0.06486   [ line for line in       nonblank_lines(f_in)  ]           * aaronasterling  update
  0.05264   [ line for line in short_nonblank_lines(f_in)  ]             nonblank_lines with line[0:-1]

* 0.05451   filter(None, (line.rstrip() for line in f_in))             * aaronasterling update 2

  0.04689       [name for name in (l[0:-1]   for l in f_in) if name ]    aaronasterling without list() and with line[0:-1]
  0.04582   list(line for line in (l[0:-1]   for l in f_in) if line)     aaronasterling  with line[0:-1]

  0.04171   [line[0:-1] for line in f if line[0:-1]  ]                   Felix Kling with line[0:-1]

  0.03265   [name for name in f_in.read().splitlines() if name ]        a list comprehension with read().splitlines()

  0.03638   filter(None, (line[0:-1] for line in f_in))                  aaronasterling update 2 with line[0:-1]
  0.02198   filter(None, f_in.read().splitlines()                        aaronasterling update 2 with read().splitlines()

But this problem is a particular one, the simplest of all: only one name on each line. So the solutions are only games with lines, splittings and [0:-1] cuts.

A regex, on the contrary, doesn't care about lines; it finds the desired data directly. I consider it a more natural way of solving such problems, one that applies from the simplest to the more complex cases, and hence it is often the approach to prefer when processing text.

EDIT

I forgot to say that I use Python 2.7 and that I measured the above times with a file containing 500 copies of the following block of names:

SMITH
JONES
WILLIAMS
TAYLOR
BROWN
DAVIES
EVANS
WILSON
THOMAS
JOHNSON

ROBERTS
ROBINSON
THOMPSON
WRIGHT
WALKER
WHITE
EDWARDS
HUGHES
GREEN
HALL

LEWIS
HARRIS
CLARKE
PATEL
JACKSON
WOOD
TURNER
MARTIN
COOPER
HILL

WARD
MORRIS
MOORE
CLARK
LEE
KING
BAKER
HARRISON
MORGAN
ALLEN

JAMES
SCOTT
PHILLIPS
WATSON
DAVIS
PARKER
PRICE
BENNETT
YOUNG
GRIFFITHS

MITCHELL
KELLY
COOK
CARTER
RICHARDSON
BAILEY
COLLINS
BELL
SHAW
MURPHY

MILLER
COX
RICHARDS
KHAN
MARSHALL
ANDERSON
SIMPSON
ELLIS
ADAMS
SINGH

BEGUM
WILKINSON
FOSTER
CHAPMAN
POWELL
WEBB
ROGERS
GRAY
MASON
ALI

HUNT
HUSSAIN
CAMPBELL
MATTHEWS
OWEN
PALMER
HOLMES
MILLS
BARNES
KNIGHT

LLOYD
BUTLER
RUSSELL
BARKER
FISHER
STEVENS
JENKINS
MURRAY
DIXON
HARVEY
eyquem
  • A couple of points. One would never write `[line for line in generator()]`, one would just write `list(generator())`. Try to work with builtins whenever possible. They're written in C and everybody knows what they do. Also, I was calling `str.rstrip` not `str.split`. I don't know if there will be a performance gain. Finally, `filter(None, ....)` is so fast because it encapsulates all of the logic in C. – aaronasterling Jan 31 '11 at 19:12
3

Why is everyone doing it the hard way?

with open("myfile") as myfile:
    nonempty = filter(str.rstrip, myfile)

Convert nonempty into a list if you have the urge to do so, although I highly suggest keeping nonempty as the lazy iterator it already is in Python 3.x.

In Python 2.x you may use itertools.ifilter to do your bidding instead.
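A side note not in the original answer: str.rstrip is only used as the filtering predicate here, so the surviving lines still keep their trailing newlines. A sketch that also strips them (lazy in Python 3, so consume it inside the with block):

with open("myfile") as myfile:
    nonempty = map(str.rstrip, filter(str.rstrip, myfile))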

Bharel
3

You can use not to skip the blank lines:

for line in lines:
    if not line:
        continue
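A minimal sketch of how this might be used in context (assuming the names file from the question); note that lines read straight from a file still end with a newline, so strip them before testing:

names_list = []
with open("names") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        names_list.append(line)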
Robert Dziubek
2

You can use the walrus operator (Python >= 3.8):

with open('my_file') as fd:
    nonblank = [stripped for line in fd if (stripped := line.strip())]

Read it as: keep stripped for each line if stripped (defined as line.strip()) is truthy.

P i
0

@S.Lott

The following code processes the lines one at a time and produces its result without being memory hungry:

filename = 'english names.txt'

with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in)
    lines = (line for line in lines if line)
    the_strange_sum = 0
    for l in lines:
        the_strange_sum += 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'.find(l[0])

print the_strange_sum

So the generator expression (line.rstrip() for line in f_in) is just as acceptable as the nonblank_lines() function.

eyquem
  • @S.Lott No. If I test with print type(lines), I obtain <type 'generator'>. So lines isn't an object containing data, whether small or enormous. And lines = (line.rstrip() for line in f_in) doesn't put anything in memory by itself; it only offers the possibility of being iterated. My code above doesn't put the entire file into memory, it just records an integer in the the_strange_sum object that lives in memory. It seems we don't understand the words the same way. – eyquem Jan 31 '11 at 19:23
  • Agreed. I could not understand your point until you included the output of print type(lines). – S.Lott Jan 31 '11 at 19:25
0

What about the LineSentence module from gensim? It will ignore such lines. From its documentation:


Simple format: one sentence = one line; words already preprocessed and separated by whitespace.

source can be either a string or a file object. Clip the file to the first limit lines (or not clipped if limit is None, the default).

from gensim.models.word2vec import LineSentence
text = LineSentence('text.txt')
Rocketq