
I'm a newbie to Python and pandas. I've spent a lot of time searching, but haven't been able to find an answer to my particular problem.

I have a text file where the first few lines are just comments starting with '#', followed by the usual data with rows and columns. I have hundreds of such text files that I need to read in and manipulate. For example:

'#' blah1

'#' blah2

'#' blah3

Column1 Column2 Column3

a1 b1 c1

a2 b2 c2

etc.

I want to delete all the rows starting with '#'. Can somebody tell me how to do this, preferably in pandas?

Alternatively, I tried to use the following code to read in the text file:

my_input=pd.read_table(filename, comment='#', header=80)

But the problem was that the header row differs for each text file. Is there a way to generalize and tell Python that my header lies below the last line that starts with a '#'?
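For now, the only workaround I can think of is to count the leading '#' lines myself and pass that count to skiprows, along the lines of the rough sketch below (count_comment_lines is just my own helper name, not something from pandas, and filename is the same variable as in the code above):

import pandas as pd

def count_comment_lines(path):
    # count how many consecutive lines at the top of the file start with '#'
    n = 0
    with open(path) as f:
        for line in f:
            if line.startswith('#'):
                n += 1
            else:
                break
    return n

my_input = pd.read_table(filename, skiprows=count_comment_lines(filename))

But that means reading each file twice, so a built-in way would be much nicer.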

AHegde
    I think this may be a bug, I tried to use comment="'" (as your lines start with that?)... [read_csv docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html) for comment seem pretty clear this should work. – Andy Hayden Sep 05 '14 at 01:58
    not merged yet: https://github.com/pydata/pandas/pull/7470 (the comment can't be at the beginning of the line, which I think is fixed in master) – Jeff Sep 05 '14 at 02:03
    What version of pandas are you using? Normally this should work in 0.14.1 (Jeff, we split that PR, the comment part is already in 0.14.1). And following the docstring, the `header` kwarg should ignore fully commented lines. – joris Sep 05 '14 at 06:37
  • @joris the above raises in 0.14.1, docs say: "If found at the beginning of a line, the line will be ignored altogether." and "Also, fully commented lines are ignored by the parameter header". – Andy Hayden Sep 05 '14 at 07:22
  • So following the docs, the above should be possible, no? What does raise? With 0.14.1 this works for me: `df = pd.read_csv(StringIO(s), sep=' ', comment="'")` – joris Sep 05 '14 at 07:50
  • Ah, but I removed the empty lines between each line, that was maybe not the idea. – joris Sep 05 '14 at 07:52
    Ok guys, thanks a lot for your help. It turned out the version of pandas that comes pre-installed with Anaconda is old. I was able to update pandas from the Windows cmd (with the help of answers from: http://stackoverflow.com/questions/22840449/how-to-update-pandas-from-anaconda-and-is-it-possible-to-use-eclipse-with-this-l), and then the same code I showed above worked, and I didn't even need to specify the `header` parameter! :) – AHegde Sep 05 '14 at 22:53

1 Answer


Updating to pandas 0.14.1 or higher allows you to correctly skip commented lines.

Older versions would leave the lines in as NaN, which could be dropped with .dropna(), but that would still leave a broken header.

For older versions of pandas, you could use skiprows, assuming you know how many lines are commented.

In[3]:

s = "# blah1\n# blah2\n# blah3\nCol1 Col2 Col3\na1 b1 c1\na2 b2 c2\n"
pd.read_table(StringIO(s), skiprows=3, sep=' ')

Out[3]:

  Col1 Col2 Col3
0   a1   b1   c1
1   a2   b2   c2
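
For reference, on pandas 0.14.1 or newer the comment kwarg should handle this directly, so no skiprows count is needed (a minimal sketch assuming the same space-separated sample data as above):

from io import StringIO
import pandas as pd

s = "# blah1\n# blah2\n# blah3\nCol1 Col2 Col3\na1 b1 c1\na2 b2 c2\n"
# fully commented lines at the top are ignored, so the header row is found automatically
pd.read_table(StringIO(s), comment='#', sep=' ')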
solbs
  • I could use "skiprows" if I had one or two files, but the problem was that I had 300 files that I needed to extract data from, and for each of them I had to skip a different number of rows. But anyway, like you said correctly, the problem was with the version of Pandas that came installed with Anaconda. In the newer version, the 'comment' argument takes care of it. – AHegde Oct 26 '14 at 20:29