Pandas: How to insert dataframe into Clickhouse

Question

I am trying to insert a Pandas dataframe into Clickhouse.

This is my code

import pandas
import sqlalchemy as sa

uri = 'clickhouse://default:@localhost/default'
ch_db = sa.create_engine(uri)

pdf = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

pdf.to_sql('test_humans', ch_db, if_exists='append', index=False)

And this is the error that I am receiving. Does this have to do with some missing extra arguments regarding the ENGINE? How can I fix this?

Exception: Code: 62, e.displayText() = DB::Exception: Syntax error: failed at position 65 (line 7, col 2): FORMAT TabSeparatedWithNamesAndTypes. Expected one of: ENGINE, storage definition (version 19.15.2.2 (official build))

Logging

INFO:sqlalchemy.engine.base.Engine:EXISTS TABLE test_humans INFO:sqlalchemy.engine.base.Engine:{} INFO:sqlalchemy.engine.base.Engine: CREATE TABLE test_humans ( first_name TEXT, year BIGINT )

INFO:sqlalchemy.engine.base.Engine:{} INFO:sqlalchemy.engine.base.Engine:ROLLBACK

Are you using [clickhouse](https://github.com/cloudflare/sqlalchemy-clickhouse)-dialect for sqlalchemy? Could you provide the schema description for *test_humans*-table? — vladimir, Oct 17 '19 at 00:31
Yes I am. When using to_sql(), pandas will usually create a schema automatically. I guess it has to do with exactly that. The create schema sql statement that the dialect is creating misses some proprietary command regarding "ENGINE". — Jabb, Oct 17 '19 at 05:23
See my update.. some log information about the issued queries. — Jabb, Oct 17 '19 at 05:25
which driver you use exacly https://github.com/mymarilyn/clickhouse-driver or https://github.com/cloudflare/sqlalchemy-clickhouse ? — Slach, Oct 18 '19 at 06:21

score 6 · Answer 1 · answered Mar 29 '21 at 11:51

6

Starting with version 0.2.0 clickhouse_driver implements method insert_dataframe. See: https://clickhouse-driver.readthedocs.io/en/latest/api.html#clickhouse_driver.Client.insert_dataframe

answered Mar 29 '21 at 11:51

Alex

533
4
12

1

This works very well. It is very easy, and is more efficient than using `client.execute("INSERT INTO your_table VALUES", df.to_dict('records'))` because it will transpose the DataFrame and send the data in columnar format. This doesn't do automatic table generation, but I wouldn't trust that anyway. Clickhouse has too many important decisions of schema design - especially partitioning - to leave this to autogen. – Jonathan Jul 28 '21 at 09:23
1

Somehow I needed use_numpy=True. `Client('localhost', settings={"use_numpy":True})` else – Kenji Noguchi Jul 26 '22 at 01:59

Danila Ganchar · Answer 2 · 2022-12-06T08:58:21.117

You can do it without sqlalchemy.

pip install clickhouse-driver

from clickhouse_driver import Client


client = Client('localhost')
df = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

client.insert_dataframe(
    'INSERT INTO "your_table" (year, first_name) VALUES',
    df,
    settings=dict(use_numpy=True),
)

# or 
client.execute("INSERT INTO your_table VALUES", df.to_dict('records'))

score 3 · Answer 3 · answered Jan 22 '20 at 18:14

You can also use https://github.com/kszucs/pandahouse to insert your dataframe without extra conversions.

pip install pandahouse

import pandahouse as ph

pdf = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

connection = dict(database='default',
                  host='localhost',
                  user='default',
                  password='')

ph.to_clickhouse(pdf, 'test_humans', index=False, chunksize=100000, connection=connection)

vladimir · Answer 4 · 2021-08-19T08:12:37.020

sqlalchemy-clickhouse cannot create table automatically (at least until version 0.1.5.post0 inclusively) and interprets any sql-query of table creation as SELECT-query that always tail by FORMAT TabSeparatedWithNamesAndTypes-clause.

To work around this issue need to create a table manually using the infi.clickhouse_orm-api (this module delivered with sqlalchemy-clickhouse):

import pandas as pd
from infi.clickhouse_orm.engines import Memory
from infi.clickhouse_orm.fields import UInt16Field, StringField
from infi.clickhouse_orm.models import Model
from sqlalchemy import create_engine


# define the ClickHouse table schema
class Test_Humans(Model):
    year = UInt16Field()
    first_name = StringField()
    engine = Memory()


engine = create_engine('clickhouse://default:@localhost/test')

# create table manually
with engine.connect() as conn:
    conn.connection.create_table(Test_Humans) # https://github.com/Infinidat/infi.clickhouse_orm/blob/master/src/infi/clickhouse_orm/database.py#L142

pdf = pd.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
    # ! sqlalchemy-clickhouse ignores the last item so add fake one
    {}
])

pdf.to_sql('test_humans', engine, if_exists='append', index=False)

Take into account that sqlalchemy-clickhouse ignores the last item so add fake one (see source code and related issue 10).

Let's check this table in DB:

SELECT *
FROM test.test_humans

┌─year─┬─first_name─┐
│ 1994 │ Vova       │
│ 1995 │ Anja       │
│ 1996 │ Vasja      │
│ 1997 │ Petja      │
└──────┴────────────┘

4 rows in set. Elapsed: 0.003 sec.
*/

See the answer https://stackoverflow.com/a/68843676/303298 too.

Pandas: How to insert dataframe into Clickhouse

4 Answers4

Linked