I have several data files that I need to import using genfromtxt. Each file has different content: for example, file 1 may contain only floats, file 2 only strings, and file 3 a combination of floats and strings. The number of columns also varies from file to file, and since there are hundreds of files, I don't know in advance which columns are floats and which are strings. However, all the entries in any given column have the same data type.

Is there a way to set up a converter for genfromtxt that will detect the type of data in each column and convert it to the right data type?

Thanks!

Emmanuel Sunil
  • Can you use pandas? ``pandas.read_csv`` is much more powerful than ``numpy.genfromtxt``, and will do all of this for you automatically. – jakevdp Sep 28 '16 at 13:52
  • The problem is that I need to be able to convert the output to a numpy array for further processing. I've heard that dataframe.as_matrix() returns an object array, so I'd prefer to use genfromtxt. – Emmanuel Sunil Sep 28 '16 at 14:03
  • ``dataframe.to_records()`` will give you a record array, which is probably what you want then. Storing mixed types in a numpy array can only be done via a record array or an object array. – jakevdp Sep 28 '16 at 14:10
  • Yes, that works! Thanks! – Emmanuel Sunil Sep 28 '16 at 14:14
  • OK – I added an answer for completeness – jakevdp Sep 28 '16 at 14:27
  • Have you tried `dtype=None`? It may not be as general as the pandas version, but it does handle this basic dtype inference. – hpaulj Sep 28 '16 at 15:39

1 Answer

If you're able to use the pandas library, pandas.read_csv is much more generally useful than np.genfromtxt, and will automatically handle the kind of per-column type inference described in your question. The result will be a DataFrame, but you can get a numpy array out of it in one of several ways, for example:

import pandas as pd
data = pd.read_csv(filename)

# get a numpy array; this will be an object array if data has mixed/incompatible types
arr = data.values

# get a record array; this is how numpy handles mixed types in a single array
arr = data.to_records()
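
As a self-contained illustration of the snippet above, here is a minimal sketch that reads a small in-memory table instead of a real file. The sample data and column names are made up for the example, and `index=False` is an optional argument I've added so that the DataFrame index does not end up as an extra field in the record array:

```python
import io
import pandas as pd

# Hypothetical sample standing in for one of the data files:
# one string column, one float column, one integer column.
sample = io.StringIO("name,height,count\nalice,1.62,3\nbob,1.80,5\n")

data = pd.read_csv(sample)

# read_csv infers one dtype per column
print(data.dtypes)

# Record array with one named field per column;
# index=False leaves the DataFrame index out of the result.
rec = data.to_records(index=False)
print(rec.dtype.names)

# Each field can be pulled out as an ordinary typed numpy array
print(rec["height"].mean())
```

Fields are accessed by column name, so numeric columns can go straight into further numpy processing without touching the string columns.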

pd.read_csv has dozens of options for various forms of text inputs; see more in the pandas.read_csv documentation.
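
If you would rather stay with numpy alone, the `dtype=None` suggestion from the comments can be sketched roughly as follows. This is only an illustration under assumptions: the sample data is invented, and it presumes comma-delimited files with a header row (hence `delimiter=","` and `names=True`), which may not match your files:

```python
import io
import numpy as np

# Hypothetical sample standing in for one of the data files
sample = io.StringIO("name,height,count\nalice,1.62,3\nbob,1.80,5\n")

# dtype=None asks genfromtxt to infer a type for each column;
# names=True takes the field names from the first line;
# encoding="utf-8" makes string columns come back as str rather than bytes.
arr = np.genfromtxt(sample, delimiter=",", names=True, dtype=None, encoding="utf-8")

# The result is a structured array with one field per column
print(arr.dtype)
print(arr["height"])
```

The inference here is simpler than what pandas does, so it is worth checking `arr.dtype` on a few representative files before relying on it.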

jakevdp