
I have a pandas dataframe of a standard shape:

   A      B      C        D         E        ...    φ
1  Int    NaN    Str Obj  Datetime  NaN      ...    Mixed Obj (like currency)
2  NaN    Float  Str Obj  Datetime  Category ...    NaN
3  Int    Float  NaN      Datetime  Category ...    Mixed Obj
.  .      .      .        .         .        ...    .
.  .      .      .        .         .        ...    .
.  .      .      .        .         .        ...    .
Z  Int    Float  Str Obj  NaN       Category ...    Mixed Obj

In the example above, Z is an arbitrary row index greater than 3, and φ is an arbitrary column name beyond C; it could be the 90th column or the 150th column. My aim is to sift through the columns and replace NaN values according to each column's datatype. My desired outcome is this:

   A      B      C        D          E             ...    φ
1  Int    0.00   Str Obj  Datetime   Uncategorized ...    Mixed Obj
2  0      Float  Str Obj  Datetime   Category      ...    $0.00
3  Int    Float  "None"   Datetime   Category      ...    Mixed Obj
.  .      .      .        .          .             ...    .
.  .      .      .        .          .             ...    .
.  .      .      .        .          .             ...    .
Z  Int    Float  Str Obj  0/00/0000  Category      ...    Mixed Obj

The goal is to have the ability to replace NaN values in specific columns which contain specific datatypes, with their datatype's version of 0. So 0 for integer, 0.00 for float, "None" for string, 0/00/0000 for Datetime (I know this may cause some problems), uncategorized for category, and $0.00 for mixed objects like currency.

To attempt this, I used the pandas `loc` function to check, column by column, whether each value is an integer:

for col in df.columns:
    print(df.loc[:,col].apply(isinstance, args = [int]))

The expected result was:

   A     
1-True   
2-False 
3-True  
.  .    
.  .  
.  .   
Z-True 

However I got:

   A     
1-False   
2-False 
3-False  
.  .    
.  .  
.  .   
Z-False

I don't understand why I couldn't identify the integers inside of this column.

  • You might consider running `df[col].apply(type)` to see what type of data actually is in your frame if they're not int. – Henry Ecker Jul 31 '21 at 22:16
  • @HenryEcker. Congratulations for your 20k :-) – Corralien Jul 31 '21 at 22:19
  • @HenryEcker I see what you mean. It is applying the type check across the frame. My aim was to isolate int values. –  Jul 31 '21 at 22:22
  • I understand but I just wanted to confirm that they were actually of type `int` and not strings that look like ints or any other int-like type. – Henry Ecker Jul 31 '21 at 22:24
  • @HenryEcker this column returns a value of `int64` when I run `df[col].dtype`. So now, I don't believe it's a string-like or int-like object. Though some columns that should be returning as date do appear to be return as string objects. –  Jul 31 '21 at 22:30
  • `isinstance(np.int64(1), int)` returns `False`. Maybe you should be checking against `np.int64` instead of `int`? You might also add an executable DataFrame constructor to your question so we can create exactly the data you're working with. – Henry Ecker Jul 31 '21 at 22:32
  • @HenryEcker my datetime data is returning like string like objects, as well as some columns that have floats. Is there a way to look inside of the cell and check if it is of the format '\d+' or 'f\+'? –  Jul 31 '21 at 22:42
  • @HenryEcker I understand the context of your comment. I have to look for Pandas particular datasets, understood. –  Jul 31 '21 at 22:55
  • Given the answer you've accepted, I've linked to another question which has many more options that may also be helpful depending on the situation. – Henry Ecker Jul 31 '21 at 22:59
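The point raised in the comments can be demonstrated directly: pandas stores an `int64` column's elements as NumPy scalars (e.g. `np.int64`), which are not instances of Python's built-in `int`, so the `isinstance(..., int)` check returns `False` for every row. A minimal sketch of a check that does work, testing against `np.integer` instead:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)                        # int64

# Each element comes back as a NumPy scalar, not a Python int...
print(type(s[0]))                     # <class 'numpy.int64'>
print(isinstance(s[0], int))          # False

# ...so checking against np.integer (the NumPy integer base class) works.
print(isinstance(s[0], np.integer))   # True
```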

1 Answer


You can use select_dtypes to get only the columns in a dataframe that match a specific type. For example, to get just the float columns you'd use:

df.select_dtypes(include='float64')

The include argument takes a string or list so you can specify multiple types if you want.
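Putting that together with `fillna`, here is a hedged sketch of the per-dtype replacement the question asks for. The frame and fill values are illustrative, not the asker's actual data; note that an "int" column containing NaN is stored as `float64`, since NumPy integer dtypes cannot hold NaN:

```python
import numpy as np
import pandas as pd

# Illustrative frame; the real one has many more columns.
df = pd.DataFrame({
    "A": [1, np.nan, 3],          # stored as float64 because of the NaN
    "B": [np.nan, 2.5, 3.5],
    "C": ["x", "y", np.nan],
    "E": pd.Categorical(["cat1", None, "cat2"]),
})

# Float columns get 0.0.
float_cols = df.select_dtypes(include="float64").columns
df[float_cols] = df[float_cols].fillna(0.0)

# String/object columns get "None".
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].fillna("None")

# Category columns need the fill value registered as a category first.
for col in df.select_dtypes(include="category").columns:
    df[col] = df[col].cat.add_categories("Uncategorized").fillna("Uncategorized")
```

The same pattern extends to datetime columns with `select_dtypes(include="datetime64[ns]")` and whatever sentinel date is chosen.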

Bill the Lizard