Read Up Until a Point Python

Question

I have a text file full of data that starts with

#Name
#main

then it's followed by lots of numbers and then the file ends with

#extra
!side

So here's a small snippet

#Name
#main
60258960
33031674
72302403
#extra
!side

I want to read only the numbers. But here's the kicker: I want each of them to be its own individual string.

So I know how to read starting after the headers with

read=f.readlines()[3:]

But I'm stumped on everything else. Any suggestions?

Keith John Hutchison · Answer 1 · 2013-04-29T22:59:41.773

Read line by line. Use #main as a flag to start processing. Use #extra as a flag to stop processing.

start = '#main'
end = '#extra'
numbers = []
started = False
with open('read_up_to_a_point.txt') as file_handler:
    for line in file_handler:
        if end in line:
            started = False   # stop collecting at the end marker
        if started:
            numbers.append(line.strip())
        if start in line:
            started = True    # start collecting from the next line on
print(numbers)

Sample output:

python read_up_to_a_point.py
['60258960', '33031674', '72302403']

score 3 · Accepted Answer · answered Apr 11 '13 at 23:31

You're pretty close, as you are. You just need to modify your list slice to chop off the last two lines in the file along with the first two. readlines will naturally return a list where each item is one line from the file. However, it will also have the 'newline' character at the end of each string, so you may need to filter that out.

with open("myfile.txt") as myfile:
    # Get only numbers
    read = myfile.readlines()[2:-2]

# Remove newlines
read = [number.strip() for number in read]
print(read)

Michael0x2a

You could get rid of the newlines at almost the same time with `read = myfile.read().splitlines()[2:-2]`. – martineau Jan 11 '15 at 15:49
Note that `.strip()` will also strip any leading/trailing space or tab. You can use `number.rstrip("\n")` to avoid that. (that’s irrelevant to OP’s question but might be useful for anyone reading that) – bfontaine Sep 27 '16 at 15:41
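
A small sketch of the two suggestions in the comments above. The `sample` string is a made-up stand-in for the file contents, with one line deliberately padded with spaces to show the difference between `strip()` and `rstrip("\n")`:

```python
# Hypothetical in-memory stand-in for the file contents in the question.
sample = "#Name\n#main\n60258960\n  33031674  \n72302403\n#extra\n!side\n"

# splitlines() drops the line endings, so the slice needs no extra stripping.
lines = sample.splitlines()[2:-2]
print(lines)

# strip() removes all surrounding whitespace; rstrip("\n") removes only the newline.
raw = "  33031674  \n"
print(repr(raw.strip()))
print(repr(raw.rstrip("\n")))
```

Which one you want depends on whether surrounding spaces in the data are meaningful.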

score 1 · Answer 3 · answered Apr 11 '13 at 23:35

I would do something like this:

nums = []
for line in f:  # f is the already-open file object from the question
    stripped = line.rstrip('\n')
    if stripped.isdigit():  # true only for non-empty, all-digit strings
        nums.append(stripped)

nums will contain only the lines that are numbers, provided your numbers are well formed, meaning non-negative and not hexadecimal. Matching anything beyond that precisely would take a regular expression.
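
For what the regular-expression route might look like, here is a rough sketch. The pattern is an assumption about what "a number" means here (an optional sign followed by decimal digits or a `0x`-prefixed hex literal), and `numeric_lines` is a hypothetical helper name. It uses `re.fullmatch`, so it needs Python 3.4 or later:

```python
import re

# Assumed pattern: optional sign, then a 0x-prefixed hex literal or decimal digits.
NUMBER_RE = re.compile(r"[+-]?(?:0[xX][0-9a-fA-F]+|[0-9]+)")

def numeric_lines(lines):
    """Return the stripped lines that consist entirely of one number."""
    result = []
    for line in lines:
        stripped = line.strip()
        if NUMBER_RE.fullmatch(stripped):
            result.append(stripped)
    return result

# Made-up sample lines, including a negative, a hex value and a malformed entry.
sample = ["#Name\n", "#main\n", "60258960\n", "-42\n", "0x1F\n", "12ab\n", "#extra\n", "!side\n"]
print(numeric_lines(sample))
```

Adjust the pattern if your data allows other forms, such as decimals or digit separators.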

score 1 · Answer 4 · answered Apr 29 '13 at 23:50

You should only use .readlines() if you know your input files will fit comfortably into memory; it reads all lines at once.

Most of the time you can read one input line at a time, and for that you can just iterate the file handle object.
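
As a rough illustration of iterating the file object directly, the snippet below uses `io.StringIO` as a stand-in for a real file (an assumption made only so the snippet runs on its own). A file object returned by `open()` can be looped over the same way:

```python
import io

# io.StringIO stands in for an open text file here.
f = io.StringIO("#Name\n#main\n60258960\n33031674\n72302403\n#extra\n!side\n")

count = 0
last = None
for line in f:                 # one line per iteration, newline included
    count += 1
    last = line.rstrip("\n")   # only the current line is held at any time

print(count, last)
```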

When you want special, tricky input handling, I recommend encapsulating the handling in a generator function like this:

def do_something_with_point(point):
    print(point)

class BadInputFile(ValueError):
    pass

def read_points_data(f):
    try:
        line = next(f)
        if not line.startswith("#Name"):
            raise BadInputFile("file does not start with #Name")

        line = next(f)
        if not line.startswith("#main"):
            raise BadInputFile("second line does not start with #main")
    except StopIteration:
        raise BadInputFile("truncated input file")

    # use enumerate() to count input lines; start at line number 3
    # since we just handled two lines of header
    for line_num, line in enumerate(f, 3):
        if line.startswith("#extra"):
            break
        else:
            try:
                yield int(line)
            except ValueError:
                raise BadInputFile("illegal line %d: %s" % (line_num, line))
            # if you really do want strings: yield line
    else:
        # this code will run if we never see a "#extra" line
        # if break is executed, this doesn't run.
        raise BadInputFile("#extra not seen")

    try:
        line = next(f)
        if not line.startswith("!side"):
            raise BadInputFile("!side not seen after #extra")
    except StopIteration:
        raise BadInputFile("input file truncated after #extra")

with open("points_input_file.txt") as f:
    for point in read_points_data(f):
        do_something_with_point(point)

Note that this input function thoroughly validates the input, raising an exception when anything is incorrect on the input. But the loop using the input data is simple and clean; code using read_points_data() can be uncluttered.

I made read_points_data() convert the input points to int values. If you really want the points as strings, you can modify the code; I left a comment there to remind you.

Seriously? Looks like a solution in search of a problem...and the one in this question isn't it. — martineau, Jan 11 '15 at 15:57
@martineau I think this is a good answer. Maybe the asker could get away with not checking inputs, but I think it's never wrong to validate input data, and this answer shows how to hide all the validation in its own function. Notice how the `for` loop that uses the data is clean and uncluttered despite the very thorough error checking... generators are one of the things I love about Python. — steveha, Jan 12 '15 at 18:34
Your idea of encapsulating the reading of the input file in a generator function may have some merit if it addresses the OP's problem. However IMHO your sample code would be better if it just illustrated the core concept. I'm not saying input validation and error handling aren't important, but this question isn't about them. You could have just pointed out that your technique lends itself to doing them and left out doing so in all its glory. It's difficult to [_see the forest for the trees_](http://en.wiktionary.org/wiki/see_the_forest_for_the_trees#Verb) in your answer's code. — martineau, Jan 12 '15 at 19:41
I don't understand why you say "*if* it addresses the OP's problem" when the code exactly solves the OP's problem. I disagree that providing tested, working code makes my answer worse. I also disagree that it's hard to generalize from the working code to solve other problems. I guess we are just going to disagree on this point. If you have any interest in discussing this further, we had best take it to chat, as StackOverflow frowns on extended discussions in the comments of an answer. — steveha, Jan 12 '15 at 20:03
No thanks, although you're apparently not getting my point... but that's OK. — martineau, Jan 12 '15 at 20:47
Please feel free to post an answer to this question to show me how you think it should have been done. Make a copy of this answer and simplify it, write your own code from scratch, or whatever. — steveha, Jan 12 '15 at 21:13
Done -- I was already considering whether it would be worth doing it or not. — martineau, Jan 13 '15 at 00:05

martineau · Answer 5 · 2015-01-13T03:54:23.153

It's not always a good idea (or perhaps even a feasible one) to use readlines() without an argument, because it reads in the entire file and can consume a lot of memory. Doing that may not be necessary if you don't need all of the lines at once, depending on exactly what you're doing.

So, one way to do what you want is to use a Python generator function to extract just the lines or values you need from a file. They're very easy to create: essentially you just use yield statements to return values instead of return. From a programming point of view, the main difference is that execution continues with the line following the yield statement the next time a value is requested, rather than from the function's first line as would normally be the case. This means the function's internal state is automatically saved between values, which makes doing complicated processing inside it easier.
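
To make the "state is saved between values" point concrete, here is a tiny sketch that has nothing to do with files. `countdown` is a made-up example function:

```python
def countdown(start):
    """Yield start, start - 1, ..., 1."""
    current = start
    while current > 0:
        yield current   # execution pauses here until the next value is requested
        current -= 1    # ...and resumes here, with `current` still intact

gen = countdown(3)
first = next(gen)    # runs the body up to the first yield
second = next(gen)   # resumes after the yield, loops, and yields again
rest = list(gen)     # drains whatever is left
print(first, second, rest)
```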

Here's a fairly minimal example of using one to get just the data you want out of the file, incrementally, one line at a time, so it doesn't require enough memory to hold the whole file:

def read_data(filename):
    with open(filename, 'rt') as file:
        next(file); next(file)  # ignore first two lines
        value = next(file).rstrip('\n')  # read what should be the first number
        while value != '#extra':  # not end-of-numbers marker
            yield value
            value = next(file).rstrip('\n')

for number in read_data('mydatafile'):
    print(number)  # process each number string produced

Of course you can still gather them all together into a list, if you wish, like this:

numbers = list(read_data('mydatafile'))

As you can see, it's possible to do other useful things in the function, such as validating the format of the file data or preprocessing it in other ways. In the example above I've done a little of that by removing the newline characters that readlines() would leave on each line of the list it returns. It would be trivial to also convert each string value into an integer by using yield int(value) instead of just yield value.
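
A possible variant along those lines, sketched under a few assumptions: `read_numbers` is a hypothetical name, it takes any iterable of lines rather than a filename, and it converts each value with `int()`. It also loops with `for` instead of calling `next()` for the data lines, because since Python 3.7 a `StopIteration` that escapes a generator body is turned into a `RuntimeError`, which would happen if the `#extra` line were missing:

```python
import io

def read_numbers(lines):
    """Yield each number between the two header lines and '#extra' as an int."""
    lines = iter(lines)
    next(lines, None)  # skip '#Name' (the default avoids StopIteration on empty input)
    next(lines, None)  # skip '#main'
    for line in lines:
        value = line.rstrip('\n')
        if value == '#extra':   # end-of-numbers marker
            return
        yield int(value)

# io.StringIO stands in for an open file; a real file object works the same way.
sample = io.StringIO("#Name\n#main\n60258960\n33031674\n72302403\n#extra\n!side\n")
numbers = list(read_numbers(sample))
print(numbers)
```

With this shape, a file that ends before `#extra` simply ends the iteration. Whether that should be an error instead is a design choice.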

Hopefully this will give you enough of an idea of what's possible and the trade-offs involved when deciding on what approach to use to perform the task at hand.

Since you are opening the file in text mode, you should just use `'\n'` to represent end-of-line. You might want to use "universal newline" mode: https://docs.python.org/2/library/functions.html?highlight=open#open And since `file` is a built-in keyword I generally don't use `file` as an identifier. But those are nits; I like the answer. — steveha, Jan 13 '15 at 00:39
@steveha: Thanks -- it is after all, just your own idea presented a little differently. You're right that only `'\n'` is needed, but opening the file in `'rU'` mode isn't because opening the file in "text mode" -- which `'r'` and `'rt'` both do -- implies that platform-dependent newline character handling will be enabled. That means they will be converted to the single character `'\n'` form whether or not universal newline support is enabled in the Python interpreter being used -- only that it will be handled by the OS (which might be faster). — martineau, Jan 13 '15 at 04:13
