How to convert correctly a datatable of integers (from Python datatable library) to pandas Dataframe

Question

I am using Python datatable (https://github.com/h2oai/datatable) to read a CSV file that contains only integer values. I then convert the datatable Frame to a pandas DataFrame. During the conversion, the columns that contain only 0/1 values are treated as boolean instead of integer.

Consider the following CSV file (small_csv_file_test.csv):

a1,a2,a3,a4,a5,a6,a7,a8,a9,a10
 1, 1, 1, 1, 1, 1, 1, 0, 1, 1
 2, 2, 2, 2, 2, 2, 2, 1, 0, 1
 3, 3, 3, 3, 3, 3, 3, 0, 0, 1
 4, 4, 4, 4, 4, 4, 4, 1, 0, 0
 5, 5, 5, 5, 5, 5, 5, 0, 0, 0
 6, 6, 6, 6, 6, 6, 6, 0, 0, 0
 7, 7, 7, 7, 7, 7, 7, 1, 1, 0
 8, 8, 8, 8, 8, 8, 8, 1, 1, 1
 9, 9, 9, 9, 9, 9, 9, 1, 1, 1
 0, 0, 0, 0, 0, 0, 0, 1, 0, 1

The source code:

import pandas as pd
import datatable as dt

test_csv_matrix = "small_csv_file_test.csv"

data = dt.fread(test_csv_matrix)
print(data.head(5))

matrix = data.to_pandas()
print(matrix.head())

Result:

   | a1  a2  a3  a4  a5  a6  a7  a8  a9  a10  
-- + --  --  --  --  --  --  --  --  --  ---  
 0 |  1   1   1   1   1   1   1   0   1    1  
 1 |  2   2   2   2   2   2   2   1   0    1  
 2 |  3   3   3   3   3   3   3   0   0    1  
 3 |  4   4   4   4   4   4   4   1   0    0  
 4 |  5   5   5   5   5   5   5   0   0    0

[5 rows x 10 columns]

   a1  a2  a3  a4  a5  a6  a7     a8     a9    a10  
0   1   1   1   1   1   1   1  False   True   True  
1   2   2   2   2   2   2   2   True  False   True  
2   3   3   3   3   3   3   3  False  False   True  
3   4   4   4   4   4   4   4   True  False  False  
4   5   5   5   5   5   5   5  False  False  False

Edit 1: The columns a8, a9 and a10 are not correct; I want them as integer values, not boolean.

Thank you for your help.

I want them as int values, not boolean. The first columns are correct, but the ones that contain only 1 and 0 are converted to boolean. — ibra, Jul 20 '20 at 13:21
So a1 to a7 contain other numbers and are left alone, while a8 to a10 contain only 0 and 1, which is why those columns end up as boolean. Correct me if I am wrong. — Tarequzzaman Khan, Jul 20 '20 at 13:25
The whole matrix contains only integer values, from a1 to a10. I don't convert any specific column. The method "to_pandas()", which converts the matrix from datatable to a pandas DataFrame, does not seem to convert the columns that contain only 1 and 0 correctly; it treats them as boolean. So I don't know whether there is some parameter of "to_pandas()" to say that I want only integer values, not boolean. — ibra, Jul 20 '20 at 13:31

score 3 · Accepted Answer · answered Jul 20 '20 at 13:28

You can just coerce every column to int64:

matrix = data.to_pandas().astype('int64')
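
If only the 0/1 columns need fixing, the same cast can be limited to the columns that came back as boolean. The following is a minimal sketch using pandas alone; the small `matrix` frame is a hypothetical stand-in for the result of `data.to_pandas()`, not the asker's real data.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by data.to_pandas():
# a1 is already integer, a8 came back as bool because it held only 0/1.
matrix = pd.DataFrame({
    "a1": [1, 2, 3],
    "a8": [False, True, False],
})

# Pick out the boolean columns and cast only those to int64.
bool_cols = matrix.select_dtypes(include="bool").columns
matrix = matrix.astype({col: "int64" for col in bool_cols})

print(matrix.dtypes)
```

Columns that were already integer keep their original dtype, which may matter for a very large frame where a blanket cast would widen every column.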

Serge Ballesta

Thank you very much. I used .astype('int32') and it works :) – ibra Jul 20 '20 at 13:45

score 1 · Answer 2 · answered Jul 20 '20 at 13:32

You can always specify the data type explicitly:

import pandas as pd

# Rebuild the example data by hand, then force every column to int64.
df = pd.DataFrame({
    "a1": [1,2,3,4,5,6,7,8,9,0], "a2": [1,2,3,4,5,6,7,8,9,0],
    "a3": [1,2,3,4,5,6,7,8,9,0], "a4": [1,2,3,4,5,6,7,8,9,0],
    "a5": [1,2,3,4,5,6,7,8,9,0], "a6": [1,2,3,4,5,6,7,8,9,0],
    "a7": [1,2,3,4,5,6,7,8,9,0], "a8": [0,1,0,1,0,0,1,1,1,1],
    "a9": [1,0,0,0,0,0,1,1,1,0], "a10": [1,1,1,0,0,0,0,1,1,1],
})
df = df.astype({c: "int64" for c in df.columns})
df.dtypes

Rob Raymond

As I said, I read the matrix from a CSV file (a very big one), which is why I use datatable; after that I convert it to a pandas DataFrame. So I can't build it manually. But the idea of specifying the type works. Thank you for your help :) – ibra Jul 20 '20 at 13:49
I just took a look at https://datatable.readthedocs.io/en/latest/api/dt/fread.html#datatable.fread and it seems to be missing something I would expect: defining column types at read time. My bet is that `datatable` determines the type and then passes it to pandas with `to_pandas()`. – Rob Raymond Jul 20 '20 at 14:01
datatable's fread detects the types automatically, and in my code example it works correctly. My problem came when converting to a pandas DataFrame, and the solution, as suggested, is to use "matrix = data.to_pandas().astype('int32')". If we think about it, 0 and 1 can be read as false and true, so specifying the type removes the ambiguity. – ibra Jul 20 '20 at 14:07

score 1 · Answer 3 · answered Jul 20 '20 at 13:40

Add this code to your snippet:

matrix = matrix.iloc[:].astype(int)
matrix

Output:

   a1  a2  a3  a4  a5  a6  a7  a8  a9  a10
0   1   1   1   1   1   1   1   0   1    1
1   2   2   2   2   2   2   2   1   0    1
2   3   3   3   3   3   3   3   0   0    1
3   4   4   4   4   4   4   4   1   0    0
4   5   5   5   5   5   5   5   0   0    0
5   6   6   6   6   6   6   6   0   0    0

Tarequzzaman Khan

Thank you very much for your help and comments :) Yes, your solution works. – ibra Jul 20 '20 at 13:46

score 1 · Answer 4 · answered Oct 30 '20 at 16:10

You could do:

import datatable as dt
x = dt.Frame({"a": ["1", "2", "3"], "b":["20", "30", "40"]})
x.stypes
#(stype.str32, stype.str32)
x[:,:] = dt.int64
x.stypes
#(stype.int64, stype.int64)

Patrik_P

Yes, it also solves the problem, thank you very much. In my case (the one from the question), `x.stypes` gives ***(stype.int32, ... , stype.bool8, stype.bool8, stype.bool8)***; after that I use `x[:,:] = dt.int32` and yes, the problem is solved. – ibra Nov 01 '20 at 14:50
**An important notice:** the type problem appears when reading values from a CSV file. When you hard-code the values in the source code, the way you did, you get int values, not bool; you can try it with `"c": ["0", "1", "0"]`. – ibra Nov 01 '20 at 14:58
`x[:, dt.stype.bool8] = dt.int32` converts only boolean columns to integer. – jung rhew Jul 07 '22 at 16:52
