Python 2.7, numpy, create levels in the form of a list of factors.

I have a data file that lists independent variables; the last column indicates the class. For example:

2.34,4.23,0.001, ... ,56.44,2.0,"cloudy with a chance of rain"

Using numpy, I read all the numeric columns into a matrix, and the last column into an array which I call "classes". In fact, I don't know the class names in advance, so I do not want to use a dictionary. I also do not want to use Pandas. Here is an example of the problem:

classes = ['a', 'b', 'c', 'c', 'b', 'a', 'a', 'd']
type(classes)
<type 'list'>
classes = numpy.array(classes)
type(classes)
<type 'numpy.ndarray'>
classes
array(['a', 'b', 'c', 'c', 'b', 'a', 'a', 'd'],
      dtype='|S1')
# requirements call for a list like this:
# [0, 1, 2, 2, 1, 0, 0, 3]

Note that the target class may be very sparse, for example, a 'z', in perhaps 1 out of 100,000 cases. Also note that the classes may be arbitrary strings of text, for example, scientific names.

I'm using Python 2.7 with numpy, and I'm stuck with my environment. Also, the data has been preprocessed, so it's scaled and all values are valid. I do not want to preprocess the data a second time to extract the unique classes and build a dictionary before I process the data. What I'm really looking for is the Python equivalent of the stringsAsFactors parameter in R, which automatically converts a string vector to a factor vector when the script reads the data.

Don't ask me why I'm using Python instead of R - I do what I'm told.

Thanks, CC.

ali_m
ccc31807

1 Answer

You could use np.unique with return_inverse=True to return both the unique class names and an array of integer indices that maps each element to its class name:

import numpy as np

classes = np.array(['a', 'b', 'c', 'c', 'b', 'a', 'a', 'd'])

classnames, indices = np.unique(classes, return_inverse=True)

print(classnames)
# ['a' 'b' 'c' 'd']

print(indices)
# [0 1 2 2 1 0 0 3]

print(classnames[indices])
# ['a' 'b' 'c' 'c' 'b' 'a' 'a' 'd']

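The class names will be sorted in lexical order, so index 0 refers to the lexically smallest name rather than the first one encountered in the data.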
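Since the question asks for a plain list and mentions arbitrary strings such as scientific names, here is a minimal sketch of the same call on longer labels. The species names are made-up sample data, and the `.tolist()` conversion is an addition for illustration, not part of the original answer:

```python
import numpy as np

# Hypothetical sample labels; any strings work, not just single characters.
classes = np.array(['Quercus alba', 'Acer rubrum', 'Quercus alba',
                    'Pinus strobus', 'Acer rubrum'])

classnames, indices = np.unique(classes, return_inverse=True)

# Convert the numpy arrays to plain Python lists.
levels = classnames.tolist()
codes = indices.tolist()

# Codes follow the sorted order of the names, not order of first appearance.
print(levels)
# ['Acer rubrum', 'Pinus strobus', 'Quercus alba']

print(codes)
# [2, 0, 2, 1, 0]
```

No dictionary or second pass over the file is needed: `levels[code]` recovers the original name for any code.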

ali_m
  • Thanks. I felt that this had to be easy, but all the answers I found required creating a dictionary. The last little bit (not asked in the question) is this: "indices.astype('S10')" which converts the integer values to real categories, which I need for the classification routine. Your answer works perfectly. Thank you again. – ccc31807 Jan 08 '16 at 18:16
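The conversion mentioned in the comment can be sketched as follows, assuming the integer codes from the answer above. Note that the 'S10' dtype stores fixed-width byte strings: on Python 2.7 these behave like ordinary strings, while on Python 3 they are bytes objects.

```python
import numpy as np

# Integer codes as produced by np.unique(..., return_inverse=True) above.
indices = np.array([0, 1, 2, 2, 1, 0, 0, 3])

# Cast the integer codes to byte strings of at most 10 characters each,
# so that downstream code treats them as categories rather than numbers.
labels = indices.astype('S10')

print(labels.dtype)
print(labels)
```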