
I have a dataframe like this:

>>> import pandas as pd

>>> pd.read_csv('csv/10_no_headers_with_com.csv')
                  //field  field2
0   //first field is time     NaN
1                 132605     1.0
2                 132750     2.0
3                 132772     3.0
4                 132773     4.0
5                 133065     5.0
6                 133150     6.0

I would like to add another field that says whether the value in the first field starts with the comment marker, //. So far I have something like this:

# may not have a heading value, so use the index not the key
df[0].str.startswith('//')  

What would be the correct way to add on a new column with this value, so that the result is something like:

>>> pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
                       0       1       _starts_with_comment
0                 //field  field2       True
1  //first field is time     NaN       True
2                 132605       1       False
3                 132750       2       False
4                 132772       3       False
David542
  • In case you'd rather optimize the import of named columns when dealing with commented headers, please consider looking at my edit below. – SpghttCd Dec 20 '18 at 09:35

3 Answers


One way is to utilise pd.to_numeric, assuming non-numeric data in the first column must indicate a comment:

df = pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
df['_starts_with_comment'] = pd.to_numeric(df[0], errors='coerce').isnull()

Just note that this kind of type mixing within a series is strongly discouraged. Your first two columns will no longer support vectorised operations, as they will be stored as object dtype series. You lose some of the main benefits of Pandas.
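
If you do go this route, here is a minimal follow-up sketch (assuming the comment rows are simply filtered out afterwards) showing how to get proper numeric dtypes back once the flag has served its purpose:

df = pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
df['_starts_with_comment'] = pd.to_numeric(df[0], errors='coerce').isnull()

# drop the comment rows and convert the remaining columns back to numbers
data = df.loc[~df['_starts_with_comment'], [0, 1]].apply(pd.to_numeric)
print(data.dtypes)  # both columns are numeric again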

A much better idea is to use the csv module to extract those attributes at the top of your file and store them as separate variables. Here's an example of how you can achieve this.
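
For illustration, a minimal sketch of that idea (the file name and the // prefix are assumptions taken from the question):

import csv
import pandas as pd

comments = []
with open('csv/10_no_headers_with_com.csv', newline='') as f:
    for row in csv.reader(f):
        if row and row[0].startswith('//'):
            comments.append(row)   # keep the metadata rows separately
        else:
            break                  # stop at the first real data row

# import only the data, skipping the comment lines entirely
df = pd.read_csv('csv/10_no_headers_with_com.csv', header=None, skiprows=len(comments))

The comment rows then live in `comments` (e.g. for building column names via `df.columns = [...]`), and the dataframe keeps clean numeric dtypes.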

jpp
  • thanks for mentioning this approach. About the answer you linked, what if the header itself has a comment? I've actually seen that quite frequently to designate that the first row of the csv file is a header and not data. – David542 Dec 20 '18 at 02:30
  • @David542, You'll have to write some logic to store the names separately, then *add them later* via `df.columns = [...]`, where `[...]` represents a list of strings. – jpp Dec 20 '18 at 08:49

What is the issue with your command when it is simply assigned to a new column?

df['comment_flag'] = df[0].str.startswith('//')

Or do you indeed have mixed type columns as mentioned by jpp?
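
If the first column does contain non-string values (numbers or NaN), one possible variant is to cast to string first, so the `.str` accessor always has something to work on:

# cast to str so numeric/NaN entries become False instead of NaN in the flag column
df['comment_flag'] = df[0].astype(str).str.startswith('//')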


EDIT:
I'm not quite sure, but from your comments I get the impression that you don't really need an additional column of comment flags. In case you want to load the data without comments into a dataframe, but still use the field names hidden in the commented header as column names, you might want to check this out:
So based on this textfile:

//field  field2
//first field is time     NaN
132605     1.0
132750     2.0
132772     3.0
132773     4.0
133065     5.0
133150     6.0

You could do:

cmt = '//'

header = []
with open(textfilename, 'r') as f:
    for line in f:
        if line.startswith(cmt):
            header.append(line)
        else:                      # leave this out if collecting all comments of the entire file is ok/wanted
            break
print(header)
# ['//field  field2\n', '//first field is time     NaN\n']  

This way you have the header information prepared for being used for e.g. column names.
Getting the names from the first header line and using them for the pandas import would look like this:

nms = header[0][2:].split()
df = pd.read_csv(textfilename, comment=cmt, names=nms, sep=r'\s+', engine='python')

    field  field2                                           
0  132605     1.0                                         
1  132750     2.0                                       
2  132772     3.0                                      
3  132773     4.0                                       
4  133065     5.0                                       
5  133150     6.0                                       
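
If you prefer not to rely on the comment parameter, an equivalent sketch (assuming, as above, that all comment lines sit at the top of the file) simply skips them by count:

# header and nms as collected in the snippets above
df = pd.read_csv(textfilename, skiprows=len(header), names=nms, sep=r'\s+', engine='python')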
SpghttCd

Try this:

import pandas as pd
import numpy as np

df.loc[:,'_starts_with_comment'] = np.where(df[0].str.startswith(r'//'), True, False)
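
A quick usage sketch against the sample file from the question (the expected flags are an assumption based on the data shown there):

import pandas as pd
import numpy as np

df = pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
df.loc[:, '_starts_with_comment'] = np.where(df[0].str.startswith(r'//'), True, False)

print(df['_starts_with_comment'].tolist())
# [True, True, False, False, False, False, False, False]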
Jorge
  • Or just `df[0].str.startswith(r'//')` – `np.where` is not necessary. You also don't need `df.loc[:, '_starts_with_comment']`. – jpp Dec 20 '18 at 01:58
  • @jorge could you please explain the difference between doing `np.where` and just doing it without that? – David542 Dec 20 '18 at 02:28
  • @David542, as @jpp pointed out, in this example there is no difference. If you have other options in column [0] that you want to use in the new column, you can try to add more np.where calls inside the np.where that I wrote. Something like np.where(df[0].str.startswith(r'//'), 'Starts with //', np.where(df[0] == 132750, 'number', 'Something_else')). Just keep track of the parentheses and where you place them. I find np.where very useful in my work. – Jorge Dec 20 '18 at 02:33
  • @Jorge thanks for the explanation. This may be a silly question, but does `pandas` automatically import `numpy` or do I need to import that separately? – David542 Dec 20 '18 at 02:35
  • @Jorge also -- what's the difference between doing `df.loc[:,'_starts_with_comment']` and `df['_starts_with_comment']`? If you could explain that in your answer I'll go ahead and accept it. – David542 Dec 20 '18 at 02:36
  • @David542, no, pandas does NOT import numpy for you. You need to import it separately. As for your second question: both produce the same results. You may get a 'warning' from pandas with `df['_starts_with_comment']`; using `.loc` is for explicit indexing purposes. I found this site that explains some of the differences https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/ – Jorge Dec 20 '18 at 11:08