
I want to check whether a text file of points (x, y, z, etc.) has a header (True) or not (False). Is there a built-in function in Python for this, or a better method than my own function?

def check_header(filename, parse):
    with open(filename) as f:
        first = f.readline()
        line = first.rstrip().split(parse)
        try:
            float(line[0])
            return False
        except ValueError:
            return True

I wrote this function. For example, with these two files:

a b c d
449628.46 6244026.59 0.47 1
449628.55 6244033.12 0.30 2 
449628.75 6244046.31 0.37 3 
449628.81 6244049.63 0.44 1 
449628.81 6244049.88 0.39 5 
449628.81 6244050.66 0.30 1 
449628.96 6244060.67 0.38 2 
449629.18 6244075.61 0.39 2 
449629.24 6244078.72 0.47 4 
449629.24 6244078.96 0.41 8 
449629.23 6244079.19 0.34 4 

check_header(filename, " ")
True

449628.46 6244026.59 0.47 1
449628.55 6244033.12 0.30 2 
449628.75 6244046.31 0.37 3 
449628.81 6244049.63 0.44 1 
449628.81 6244049.88 0.39 5 
449628.81 6244050.66 0.30 1 
449628.96 6244060.67 0.38 2 
449629.18 6244075.61 0.39 2 
449629.24 6244078.72 0.47 4 
449629.24 6244078.96 0.41 8 
449629.23 6244079.19 0.34 4

check_header(filename, " ")
False 
Gianni Spear
  • Side note: Your format is a CSV dialect, and it's readable and writable with the `csv` module in the stdlib (you just need to pass `delimiter=' '`), which may be a little simpler and a lot more robust than whatever custom code you're doing. And you might want to consider switching to commas as separators instead of spaces (which would, e.g., make it trivial to add column names with spaces in them, without having to handle quoting). – abarnert Mar 27 '13 at 22:58
  • Also, why is this tagged "optimization"? Do you really need this check to go faster, or do you mean something else by that term? – abarnert Mar 27 '13 at 22:58
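The `csv` suggestion in the comment above could look roughly like the following. This is only a sketch: the names `is_number` and `has_header` are made up for illustration, and it assumes that a header is any first non-blank row containing at least one field that does not parse as a float.

```python
import csv


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def has_header(filename, delimiter=" "):
    """Guess whether the first non-blank row of a delimited file is a header.

    Assumption: a header row has at least one non-numeric field.
    """
    with open(filename, newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            # Trailing delimiters (as in the sample data) produce empty fields.
            fields = [field for field in row if field.strip()]
            if fields:
                return not all(is_number(field) for field in fields)
    return False  # empty file: nothing to treat as a header
```

Unlike checking only the first field, this looks at every field of the first row, so a header such as `1 b c d` would still be detected. Note that `float` also accepts strings like `nan` and `inf`, so columns with those names would be misread as data.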

2 Answers


If you can have columns named, e.g., "3.5", your code obviously won't work, so I'll assume you can't.

And that means the whole thing is a bit overcomplicated. Really, all you need to do is see if the first character is a valid starting character for a float:

def check_header(filename):
    with open(filename) as f:
        first = f.read(1)
    return first not in '.-0123456789'

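Note one difference for an empty file: f.read(1) returns '', and '' not in '.-0123456789' is False because the empty string is contained in every string, so this version returns False, whereas your original returns True (float('') raises ValueError). Otherwise, it should work for the same use cases as your original code.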
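If the edge cases matter, a variant of the same first-character idea can make them explicit. The following is a sketch, not a drop-in replacement: the name `check_header_first_char` is hypothetical, and treating an empty or blank first line as "no header" is an assumption you may want to reverse. It also accepts a leading `+` and leading whitespace, both of which `float` tolerates.

```python
def check_header_first_char(filename):
    """Return True if the first line does not start like a number."""
    with open(filename) as f:
        first_line = f.readline()
    stripped = first_line.lstrip()
    if not stripped:
        # Assumption: an empty file or blank first line has no header.
        return False
    # Characters that can begin a plain decimal literal accepted by float().
    return stripped[0] not in "+-.0123456789"
```

Reading the whole first line instead of a single character costs essentially nothing extra, since the file is read in buffered blocks either way.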

I normally wouldn't even mention this, but since you tagged your question "optimization", I guess you care: This code is theoretically faster than yours for reasons that should be pretty obvious, but in real life, it will almost always make no difference. According to %timeit on my machine, the part after the read/readline takes 244ns instead of 2.6us. That's more than 10x as fast, as you'd expect. But the read/readline part takes 13.1us vs. 13.2us for a file that is in the OS disk cache, or 39.7ms vs. 39.7ms for a file on a remote drive. The I/O cost of reading a block from a file into a buffer, even in the best case, swamps the cost of processing it (both the extra processing in readline, and the extra processing in your code).
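To reproduce the processing-only part of that comparison outside IPython, something like the following sketch with the stdlib `timeit` module should do. The helper names `by_float` and `by_first_char` are made up here, and the numbers will differ from machine to machine, so none are shown.

```python
import timeit

line = "449628.46 6244026.59 0.47 1\n"


def by_float(first_line, sep=" "):
    """The question's approach: try to parse the first field as a float."""
    fields = first_line.rstrip().split(sep)
    try:
        float(fields[0])
        return False
    except ValueError:
        return True


def by_first_char(first_line):
    """The answer's approach: look only at the first character."""
    return first_line[:1] not in ".-0123456789"


n = 100_000
t_float = timeit.timeit(lambda: by_float(line), number=n)
t_char = timeit.timeit(lambda: by_first_char(line), number=n)
print(f"float parse: {t_float / n * 1e9:.0f} ns per call")
print(f"first char:  {t_char / n * 1e9:.0f} ns per call")
```

This times only the in-memory work on a line that has already been read; it deliberately leaves out the file I/O, which is the part that dominates in practice.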

abarnert

Plaintext files don't really have headers in the traditional sense. They're just a stream of characters.

If this were a binary format, you could have a strict header, and any reader would have to adhere to that format. I assume this is a custom format that you've created; if that's the case, you've already got a good solution.

If you want to learn more about headers, you should look at the JPEG header specification, which is simple.
http://www.fastgraph.com/help/jpeg_header_format.html

See this post for an example of python code that reads the binary jpeg header.
Python: Check if uploaded file is jpg
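For comparison with the text case, a binary header check can be as small as comparing the first bytes against a known signature. The sketch below only tests for the two-byte Start Of Image marker (FF D8) that opens a JPEG stream; the function name `looks_like_jpeg` is hypothetical, and this is not a full validation of the file.

```python
JPEG_SOI = b"\xff\xd8"  # Start Of Image marker at the beginning of a JPEG stream


def looks_like_jpeg(filename):
    """Return True if the file begins with the JPEG SOI marker."""
    with open(filename, "rb") as f:
        return f.read(2) == JPEG_SOI
```

A file shorter than two bytes simply fails the comparison, so no special case is needed for empty files.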

Krets
  • I think he means header in the sense of CSV headers. – abarnert Mar 27 '13 at 22:54
  • You are probably right, but CSV files are still plaintext files. There is nothing special that separates the header from the content. Even Excel and Google Spreadsheets will ask the user if the first line is the header. There is no magic solution without knowing the underlying dataset. – Krets Mar 27 '13 at 23:00
  • The OP isn't asking for a general-purpose solution to all possible files; he has a (presumably) well-defined format, and he's shown us a sample, along with code that successfully handles it. (And I'm not sure what being plaintext has to do with it: binary files are _also_ a stream of characters, and unless you know the format, you can't separate the header from the content.) – abarnert Mar 27 '13 at 23:16