I would like to find out which columns of a dataframe are categorical. This dataframe does have a categorical column, z, but my code fails to detect it and prints an empty list. How should I fix it?

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = [[10, 10, 'a'],
        [15, 15, 'a'],
        [14, 14, 'b'],
        [16, 16, 'b'],
        [19, 19, 'a'],
        [17, 17, 'a'],
        [6, 6, 'c'],
        [5, 5, 'b'],
        [20, 20, 'c'],
        [22, 22, 'c'],
        [21, 21, 'b'],
        [18, 45, 'a']]
df = pd.DataFrame(data, columns=['x','y','z'])
categorical_values=[]
for i in df.columns.values.tolist():
    if (type(df[i].all()))==str:
        categorical_values.append(i)

print(categorical_values, 'CATEGORICAL VALUES')
print(len(categorical_values),'total of categorical variables')
etckml
  • Cannot replicate, prints `['z'] CATEGORICAL VALUES` and `1 total of categorical variables` (pandas 1.2.1, numpy 1.19.1) – dm2 Jul 10 '21 at 10:56
  • Does this answer your question? https://stackoverflow.com/a/65569109/16310106 –  Jul 10 '21 at 11:02
  • Use `dataframe.column.dtype` to get the type of the column, then compare it with the type you are looking for. –  Jul 10 '21 at 11:04

1 Answer

The problem is your test `if (type(df[i].all()))==str`. Let's break down what it does:

  • get column i
  • check if all values of that column are True, see the doc for .all()

    Series.all(axis=0, bool_only=None, skipna=True, level=None, **kwargs)

    Return whether all elements are True, potentially over an axis.

    Returns True unless there is at least one element within a series or along a Dataframe axis that is False or equivalent (e.g. zero or empty).

  • get the return type
  • check if this type is str or not
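
The steps above can be seen in a small sketch. The shortened frame and the variable names here are made up for illustration, and the string column is cast to `object` explicitly so the sketch does not depend on how a given pandas version stores strings. What `.all()` returns for such a column appears to vary between pandas versions (which would explain why the comment above could not reproduce the empty list), so the sketch only prints it.

```python
import pandas as pd

# Shortened, illustrative version of the frame from the question.
df = pd.DataFrame([[10, 10, 'a'], [0, 15, 'b']], columns=['x', 'y', 'z'])

# .all() collapses a whole column into ONE truthiness value;
# it says nothing about what kind of data the column holds.
x_all = df['x'].all()   # column x contains a 0, which counts as False
y_all = df['y'].all()   # every value in y is non-zero
print('x', x_all, type(x_all).__name__)
print('y', y_all, type(y_all).__name__)

# For the string column the result is version dependent (assumption:
# some versions return a bool, some return one of the strings),
# so type(...) == str is not a reliable test either way.
z_all = df['z'].astype(object).all()
print('z', repr(z_all), type(z_all).__name__)
```

For the numeric columns the reduced value is a boolean, never a `str`, so the test in the question can never pick them up, and for the string column it is unreliable at best.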

What you actually want is to check the data types of your columns. For that, use `dtypes`:

>>> df.dtypes
x     int64
y     int64
z    object

You can even select dtypes from the dataframe directly:

>>> df.select_dtypes(include=['object'])
    z
0   a
1   a
2   b
3   b
4   a
5   a
6   c
7   b
8   c
9   c
10  b
11  a
>>> categorical_values = df.select_dtypes(include=['object']).columns.to_list()
>>> categorical_values
['z']
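
If you would rather keep a loop like the one in the question, a minimal sketch could test each column's dtype instead of calling `.all()`. The helper name `non_numeric_columns` is hypothetical, and treating every non-numeric column as categorical is an assumption: datetime columns, for example, would also be reported.

```python
import pandas as pd

# Shortened, illustrative version of the frame from the question.
df = pd.DataFrame(
    [[10, 10, 'a'], [15, 15, 'a'], [14, 14, 'b'], [6, 6, 'c']],
    columns=['x', 'y', 'z'],
)

def non_numeric_columns(frame):
    """Hypothetical helper: names of the columns whose dtype is not numeric."""
    return [col for col in frame.columns
            if not pd.api.types.is_numeric_dtype(frame[col])]

categorical_values = non_numeric_columns(df)
print(categorical_values, 'CATEGORICAL VALUES')
print(len(categorical_values), 'total of categorical variables')
```

Checking for "not numeric" rather than for `object` specifically keeps the sketch independent of how pandas stores strings internally.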
Cimbali