
I am trying to insert a pandas DataFrame into ClickHouse, but have run into some problems. Here is the table schema:

CREATE TABLE IF NOT EXISTS test_table
(
    name String,
    day DateTime64(3) DEFAULT '2020-07-01 00:00:00'
)
ENGINE = MergeTree
ORDER BY (name, day);

And the data in the pandas DataFrame looks like:

   name   day
0  a      NaT
1  b      NaT
2  c      2019-08-31 00:00:00

The Python insert code is:

from clickhouse_driver import Client
with Client(host="", port="", password="",
            user="", settings={"use_numpy": True}) as client:
    client.insert_dataframe(
        'INSERT INTO test_table VALUES',
        df)

The result in ClickHouse is:

SELECT *
FROM test_table

┌─name─┬─day────────────────────┐
│ a    │ 1970-01-01 00:00:00.000│
│ b    │ 1970-01-01 00:00:00.000│
│ c    │ 2019-08-31 00:00:00.000│
└──────┴────────────────────────┘

But what I really want is for the column default to be used instead, i.e. for '1970-01-01 00:00:00.000' to be replaced by '2020-07-01 00:00:00.000'.

I did some investigating and experimenting; here is what I tried:

  1. I replaced NaT with None or numpy.nan:

df.replace({pd.NaT: None}, inplace=True)

or

df.replace({pd.NaT: np.nan}, inplace=True)

But the results were still the same.

  2. In clickhouse-client, when I used INSERT INTO with NULL or omitted the column, the default did work and the result was what I want:
insert into test_table (name,day) values ('test-null',null);
or
insert into test_table (name) values ('test-sub');


┌─name──────────┬─day────────────────────┐
│ test-null     │ 2020-07-01 00:00:00.000│
│ test-sub      │ 2020-07-01 00:00:00.000│
└───────────────┴────────────────────────┘

  3. In clickhouse-client, when I inserted an empty string, the result was the same as with the pandas DataFrame:
insert into test_table (name,day) values ('test-empty','');

SELECT *
FROM test_table

┌─name──────────┬─day────────────────────┐
│ test-empty    │ 1970-01-01 00:00:00.000│
└───────────────┴────────────────────────┘

So what I do now is split the DataFrame into two parts and insert twice (which is neither Pythonic nor efficient, I think), but it does work:

# select rows where day is not null
mask_not_null = df.day.notna()

# select rows where day is null
mask_null = df.day.isna()

with Client(host="", port="", password="",
            user="", settings={"use_numpy": True}) as client:
    # insert the rows that already have a day value
    client.insert_dataframe(
        'INSERT INTO test_table VALUES',
        df.loc[mask_not_null])
    # insert the remaining rows without the day column,
    # so ClickHouse fills in the column default
    client.insert_dataframe(
        'INSERT INTO test_table (* EXCEPT(day)) VALUES',
        df.loc[mask_null].drop(['day'], axis=1))

In conclusion, there are two things I want to ask:

  1. Is there a better way to achieve my goal: when a cell in the pandas DataFrame is NaT/NaN/None, have it take the column's default value after insertion into ClickHouse, without setting the value through pandas?
  2. Is this a bug in ClickHouse's DateTime data type? When an empty string is inserted into the column, it ignores the column default and uses ClickHouse's own default value.

In my opinion, the second one may be the key to solving this problem, since the clickhouse-driver client may translate NaT/NaN/None into an empty string.

Edit: for question 2, I found that ClickHouse treats an empty string in a DateTime column as 0 (zero) or '0' (zero as a string), which explains why the value of day is 1970-01-01 00:00:00.000.
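
This is easy to check in clickhouse-client: 0 is the Unix epoch, so a zero timestamp renders as 1970-01-01 (output shown here assuming the server timezone is UTC):

SELECT toDateTime64(0, 3) AS zero_ts

┌─────────────────zero_ts─┐
│ 1970-01-01 00:00:00.000 │
└─────────────────────────┘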

So the question becomes: why does DateTime treat the value this way? And I guess that the clickhouse-driver client translates None/NaT/NaN into an empty string and passes that to ClickHouse. Could the driver treat None/NaT/NaN as NULL (even though Python only has NoneType), or simply drop such cells? (Dropping cells would mean processing row by row, but reading the clickhouse-driver code I found that it passes whole columns at a time.)

Zhefu PENG

1 Answer


There's probably a driver-level answer, but you could definitely insert through a materialized view with either coalesce or ifNull and put your default value there.
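
A minimal sketch of that idea (the staging table name test_table_input and the view name test_table_mv are made up for illustration; the staging column is Nullable so that NULLs can reach ifNull):

CREATE TABLE test_table_input
(
    name String,
    day Nullable(DateTime64(3))
)
ENGINE = Null;

CREATE MATERIALIZED VIEW test_table_mv TO test_table AS
SELECT
    name,
    ifNull(day, toDateTime64('2020-07-01 00:00:00', 3)) AS day
FROM test_table_input;

You would then point the insert at test_table_input instead of test_table (mapping NaT to None first, e.g. via df.replace({pd.NaT: None})); the Null engine discards the staged rows, while the view rewrites them into test_table with the default filled in.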

danthegoodman