3

I have a number of data files produced by some rather hackish script used in my lab. The script is quite entertaining in that the number of lines it appends before the header varies from file to file (though they are of the same format and have the same header).

I am writing a batch to process all of these files to dataframes. How can I make pandas identify the correct header if I do not know the position? I know the exact heder text, and the text of the two lines that come directly before it (they are the only consecutive instances of \r\n in the document).

I have tried to define null skipping at the end of the document and select the (thankfully) fixed number of data rows each file contains:

df = pd.read_csv(myfile, skipfooter=0, nrows=267)

That did not work.

Do you have any further ideas?

TheChymera
  • 17,004
  • 14
  • 56
  • 86
  • Not sure if there is a better Pandas way but could you pre-read the csv file, count the number of empty rows then use the skiprows named argument for read_csv? http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html – will-hart Nov 26 '13 at 23:32

1 Answers1

3

You can open file and iterate it until consecutive \r\n are met, and pass result to parser, i.e.

with open(csv_file_name, 'rb') as source:
    consec_empty_lines = 0
    for line in source:
        if line == '\r\n':
            consec_empty_lines += 1
            if consec_empty_lines == 2: 
                break
        else:
            consec_empty_lines = 0
    df = pd.read_csv(source)
alko
  • 46,136
  • 12
  • 94
  • 102
  • hmmm... apparently the if statement won't react to my two blank lines - two blank lines means '\n\n' - right? There are also no tabs or spaces on those lines in my document... :-/ – TheChymera Nov 26 '13 at 23:39
  • @TheChymera Imho, two blank lines are read as two consecutive `\n`, but you can test and see. As you didn't provide any test data, I didn't test it. I hope you get the idea and can elaborate solution serving your specific needs. – alko Nov 26 '13 at 23:43
  • can I print the raw text somehow? – TheChymera Nov 26 '13 at 23:45
  • @TheChymera I don't completely get your question. For testing purposes you can add `print line` in a loop to check for skipped lines, or replace `df = ...` for `print source.read()` to check what's left. Don't forget to replace things back when running your batch. – alko Nov 26 '13 at 23:49
  • found out how - the answer was `print repr(line)` - that lets me see what the "empty" lines actually contain. so both of them have actually have `\r\n` in them. Which makes finding them a it more problematic because I have other lines containing that (also in variable numbers 0.o) - I just have to think of some (~nice) way to check for the only to such lines which follow each other. – TheChymera Nov 26 '13 at 23:54
  • @TheChymera Then your actual task differs from your question :) If you can afford read every file twice, and number of rows at the end is constant, you can read file once, find number of rows to skip, and proceed (f.ex. with `pandas.read_csv` `skiprows` arg) – alko Nov 26 '13 at 23:57
  • I ended up doing it with a counter (I see you also included that in your answer now). Great! – TheChymera Nov 27 '13 at 00:01