
I am using the read_table command in pandas/Python to import a tab-delimited text file.

q_data_1 = pd.read_table('data.txt', skiprows=6, dtype={'numbers': np.float64})

...but get

AttributeError: 'NoneType' object has no attribute 'dtype'

Without the dtype parameter, the column is imported as an 'object' dtype.

The 'numbers' column I think has missing data which trips up the import. How do I ignore these values?
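For reference, a minimal way to reproduce the situation and sidestep it is to read the column as-is and then coerce it with `pd.to_numeric` (a helper that postdates the original question; the `'missing'` placeholder token below is made up for illustration):

```python
import io
import pandas as pd

# Simulate a tab-delimited file whose 'numbers' column contains a bad value.
raw = "numbers\tlabel\n1.5\ta\nmissing\tb\n3.0\tc\n"
df = pd.read_csv(io.StringIO(raw), sep='\t')

# The bad token forces an object dtype on import.
# Coercing turns anything unparseable into NaN instead of raising an error.
df['numbers'] = pd.to_numeric(df['numbers'], errors='coerce')
```

After coercion the column is `float64`, with `NaN` where the unparseable value was.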

EDIT (25-May-13): Any idea how to do this with columns that contain (i) times (e.g. '00:03:06'), (ii) dates (e.g. '2002-03-11'), (iii) percentages (e.g. '32.81%'), and (iv) numbers with commas (e.g. '10,982')? All of these are imported as objects. How do I convert them to appropriate dtypes? (I have edited the question to reflect this.)
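One possible approach for all four column types from the edit (a sketch with made-up sample data, not from the original post) is to convert each column after import with pandas' built-in parsers and string methods:

```python
import pandas as pd

df = pd.DataFrame({
    'time': ['00:03:06', '01:15:00'],
    'date': ['2002-03-11', '2002-03-12'],
    'pct':  ['32.81%', '5.00%'],
    'num':  ['10,982', '1,002'],
})

df['time'] = pd.to_timedelta(df['time'])                   # -> timedelta64[ns]
df['date'] = pd.to_datetime(df['date'])                    # -> datetime64[ns]
df['pct']  = df['pct'].str.rstrip('%').astype(float) / 100 # -> float64
df['num']  = df['num'].str.replace(',', '').astype(float)  # -> float64
```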

user7289

1 Answer


After you've read in the DataFrame (without restricting dtype) you can then convert it (using technique from this post) with apply:

import locale
import pandas as pd

# Use locale-aware parsing so thousands separators are handled.
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
df = pd.DataFrame([['1,002.01'], ['300,000,000.1'], ['10']], columns=['numbers'])

In [4]: df['numbers']
Out[4]:
0         1,002.01
1    300,000,000.1
2               10
Name: numbers, dtype: object

In [5]: df['numbers'].apply(locale.atof)
Out[5]:
0    1.002010e+03
1    3.000000e+08
2    1.000000e+01
Name: numbers, dtype: float64

In [6]: df['numbers'] = df['numbers'].apply(locale.atof)
Andy Hayden
  • BTW any idea how to do the same thing with a columns that contain (i) time (e.g. '00:03:06') (ii) date (e.g. '2002-03-11') and percentages ('32.81%')? All of which convert to objects. (I have edited Q to reflect) – user7289 May 25 '13 at 10:59
  • You shouldn't edit the question, but rather ask a new one :). It's essentially the same trick in both cases (just define a function which does it to a single string and then apply it the column). – Andy Hayden May 25 '13 at 11:15
  • Okay will do. Is this the efficient way of doing it? As I am using Pandas because it efficiently handles large data sets using essentially C-libraries. – user7289 May 25 '13 at 18:29
  • 1
    It's a good question, but certainly my advice is use this one and see if it is efficient enough, I think it will be reasonably efficient. There is a converters argument to read_csv which could be worth investigating... – Andy Hayden May 25 '13 at 18:51
  • Thanks for this, appreciate your help. Might be worth asking as a separate question ;) – user7289 May 25 '13 at 18:56
  • Heavy use of the `%timeit` helper is always recommended :) – Andy Hayden May 25 '13 at 18:59
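The `converters` argument mentioned in the comments applies a per-column function at parse time, so the conversion happens during the read itself. A minimal sketch (the column names and sample data here are hypothetical):

```python
import io
import pandas as pd

raw = "numbers\tpct\n1,002.01\t32.81%\n10\t5%\n"

# Each converter receives the raw cell string and returns the parsed value.
df = pd.read_csv(
    io.StringIO(raw),
    sep='\t',
    converters={
        'numbers': lambda s: float(s.replace(',', '')),
        'pct': lambda s: float(s.rstrip('%')) / 100,
    },
)
```

Both columns come out as `float64`, with no second pass over the DataFrame.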