I am using Python datatable (https://github.com/h2oai/datatable) to read a CSV file that contains only integer values. After that I convert the datatable to a pandas DataFrame. During the conversion, the columns that contain only 0/1 are treated as boolean instead of integer.

Consider the following CSV file (small_csv_file_test.csv):

a1,a2,a3,a4,a5,a6,a7,a8,a9,a10
 1, 1, 1, 1, 1, 1, 1, 0, 1, 1
 2, 2, 2, 2, 2, 2, 2, 1, 0, 1
 3, 3, 3, 3, 3, 3, 3, 0, 0, 1
 4, 4, 4, 4, 4, 4, 4, 1, 0, 0
 5, 5, 5, 5, 5, 5, 5, 0, 0, 0
 6, 6, 6, 6, 6, 6, 6, 0, 0, 0
 7, 7, 7, 7, 7, 7, 7, 1, 1, 0
 8, 8, 8, 8, 8, 8, 8, 1, 1, 1
 9, 9, 9, 9, 9, 9, 9, 1, 1, 1
 0, 0, 0, 0, 0, 0, 0, 1, 0, 1

The source code:

import pandas as pd
import datatable as dt

test_csv_matrix = "small_csv_file_test.csv"

data = dt.fread(test_csv_matrix)
print(data.head(5))

matrix = data.to_pandas()
print(matrix.head())

Result:

   | a1  a2  a3  a4  a5  a6  a7  a8  a9  a10  
-- + --  --  --  --  --  --  --  --  --  ---  
 0 |  1   1   1   1   1   1   1   0   1    1  
 1 |  2   2   2   2   2   2   2   1   0    1  
 2 |  3   3   3   3   3   3   3   0   0    1  
 3 |  4   4   4   4   4   4   4   1   0    0  
 4 |  5   5   5   5   5   5   5   0   0    0  

[5 rows x 10 columns]

   a1  a2  a3  a4  a5  a6  a7     a8     a9    a10  
0   1   1   1   1   1   1   1  False   True   True  
1   2   2   2   2   2   2   2   True  False   True  
2   3   3   3   3   3   3   3  False  False   True  
3   4   4   4   4   4   4   4   True  False  False  
4   5   5   5   5   5   5   5  False  False  False  

Edit 1: The columns a8, a9 and a10 are not correct; I want them as integer values, not boolean.

Thank you for your help.

Patrik_P
ibra
  • Do you want the output of a8, a9 and a10 in boolean format? – Tarequzzaman Khan Jul 20 '20 at 13:18
  • I want them as int values, not boolean. The first columns are correct, but the ones that contain only 1 and 0 are converted to boolean. – ibra Jul 20 '20 at 13:21
  • So your a1 to a7 contain other numbers, so you don't convert them, and a8 to a10 contain only 0 and 1, which is why you converted these columns to boolean. Correct me if I am wrong. – Tarequzzaman Khan Jul 20 '20 at 13:25
  • The whole matrix contains only integer values, from a1 to a10. I don't convert any specific column. The method "to_pandas()", which converts the matrix from datatable to a pandas DataFrame, does not seem to convert the columns that contain only 1 and 0 correctly; it treats them as boolean. So I don't know if there is some specific parameter for "to_pandas()" to say that I want only integer values, not boolean. – ibra Jul 20 '20 at 13:31

4 Answers

You can just coerce every column to int64:

matrix = data.to_pandas().astype('int64')
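
If only the 0/1 columns should change and the remaining columns should keep whatever dtype they already have, a narrower cast is possible. The following is a minimal sketch, not part of the original answer: the small hand-built frame is only a stand-in for the result of `data.to_pandas()`, so the example runs without datatable installed.

```python
import pandas as pd

# Stand-in for the frame returned by data.to_pandas(): a1 is an integer
# column, a8 came out as bool because the CSV column held only 0/1.
matrix = pd.DataFrame({
    "a1": [1, 2, 3, 4, 5],
    "a8": [False, True, False, True, False],
})

# Pick out only the columns that pandas sees as bool.
bool_cols = matrix.select_dtypes(include="bool").columns

# Cast just those columns; every other column keeps its dtype.
matrix = matrix.astype({col: "int64" for col in bool_cols})

print(matrix.dtypes)
```

On the real data, the same two lines (`select_dtypes` followed by `astype` with a dict) would be applied directly to the frame returned by `to_pandas()`.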
Serge Ballesta

You can always force the data type:

df = pd.DataFrame({"a1":[1,2,3,4,5,6,7,8,9,0],"a2":[1,2,3,4,5,6,7,8,9,0],"a3":[1,2,3,4,5,6,7,8,9,0],"a4":[1,2,3,4,5,6,7,8,9,0],"a5":[1,2,3,4,5,6,7,8,9,0],"a6":[1,2,3,4,5,6,7,8,9,0],"a7":[1,2,3,4,5,6,7,8,9,0],"a8":[0,1,0,1,0,0,1,1,1,1],"a9":[1,0,0,0,0,0,1,1,1,0],"a10":[1,1,1,0,0,0,0,1,1,1]})
df = df.astype({c:"int64" for c in df.columns})
df.dtypes


Rob Raymond
  • As I said, I read the matrix from a CSV file (a very big CSV file); for that I use datatable, and after that I convert to a pandas DataFrame. So I can't do it manually. But the idea of specifying the type works. Thank you for your help :) – ibra Jul 20 '20 at 13:49
  • I just took a look at https://datatable.readthedocs.io/en/latest/api/dt/fread.html#datatable.fread and it is missing something I would expect: defining column types at read time. My bet is that `datatable` defines the type and then passes it to pandas with `to_pandas()`. – Rob Raymond Jul 20 '20 at 14:01
  • datatable's fread detects the type automatically, and in my code example it works correctly. My problem was when converting to a pandas DataFrame, and the solution, as suggested, is to use "matrix = data.to_pandas().astype('int32')". And if we think about it, 0 and 1 can be seen as false and true, so when we specify the type the confusion is gone. – ibra Jul 20 '20 at 14:07
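
Because the file is described as very large, the integer width chosen in `astype` may matter for memory. The sketch below is an illustration only, assuming a column that holds nothing but 0/1 values; the one-column frame is a stand-in for a bool column produced by `to_pandas()`. Note that `int8` covers only -128 to 127, so it suits the 0/1 columns but not necessarily the others.

```python
import pandas as pd

# Stand-in for a 0/1 column that arrived in pandas as bool.
flags = pd.DataFrame({"a8": [False, True, False, True, False]})

# The same values stored with two different integer widths.
as_int8 = flags.astype("int8")
as_int64 = flags.astype("int64")

# Bytes used by the column data alone (index excluded).
bytes_int8 = int(as_int8.memory_usage(index=False).sum())
bytes_int64 = int(as_int64.memory_usage(index=False).sum())

print(bytes_int8, bytes_int64)  # 5 40
```

For five rows this prints one byte per value for `int8` and eight bytes per value for `int64`.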

Add this code after your snippet:

matrix = matrix.iloc[:].astype(int)
matrix

Output:

   a1  a2  a3  a4  a5  a6  a7  a8  a9  a10
0   1   1   1   1   1   1   1   0   1    1
1   2   2   2   2   2   2   2   1   0    1
2   3   3   3   3   3   3   3   0   0    1
3   4   4   4   4   4   4   4   1   0    0
4   5   5   5   5   5   5   5   0   0    0
5   6   6   6   6   6   6   6   0   0    0

You could do:

import datatable as dt
x = dt.Frame({"a": ["1", "2", "3"], "b":["20", "30", "40"]})
x.stypes
#(stype.str32, stype.str32)
x[:,:] = dt.int64
x.stypes
#(stype.int64, stype.int64)
Patrik_P
  • Yes, it also solves the problem, thank you very much. In my case (related to the question post), `x.stypes` gives ***(stype.int32, ... , stype.bool8, stype.bool8, stype.bool8)***; after that I use `x[:,:] = dt.int32` and yes, the problem is solved. – ibra Nov 01 '20 at 14:50
  • **An important notice:** the type problem arises when reading values from a CSV file. When you hard-code the values in source code, the way you did, it gives int values, not bool; you can try it with `"c":["0","1","0"]`. – ibra Nov 01 '20 at 14:58
  • `x[:, dt.stype.bool8] = dt.int32` converts only boolean columns to integer. – jung rhew Jul 07 '22 at 16:52