
I am trying to use SymmetricDS to configure MySQL-to-PostgreSQL synchronization with transformations. Insert performance during the initial load is very low, with PostgreSQL at 100% CPU. When I looked at the PostgreSQL log, I found that it uses INSERT statements. That is fine for normal operation, but not for initialization, because I have millions of records. I found PostgresBulkDatabaseWriter in the source code, which uses COPY instead of INSERT, and it looks like a good solution (COPY statements perform well for me), but I could not figure out how to enable it.

So my questions:

What is the best way to perform an initial load with SymmetricDS for millions of records?

How can I enable PostgresBulkDatabaseWriter for the initial (reverse initial) load?

Thanks

Update:

Source tables (MySQL):

CREATE TABLE `companies` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `cid` int(11) NOT NULL,
  `universalName` text NOT NULL,
  `name` text NOT NULL,
  `country` text NOT NULL,
  `city` text NOT NULL,
  `street` text NOT NULL,
  `phone` text NOT NULL,
  `foundedYear` text NOT NULL,
  `employeeCountRange` text NOT NULL,
  `specialties` text NOT NULL,
  `websiteUrl` text NOT NULL,
  `twitterId` text NOT NULL,
  `check` tinyint(4) NOT NULL DEFAULT '0',
  `date` datetime NOT NULL,
  PRIMARY KEY (`id`)
);

CREATE TABLE `search_results` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `cid` int(11) NOT NULL,
  `title` text NOT NULL,
  `description` text NOT NULL,
  `link` text NOT NULL,
  `raw` text NOT NULL,
  `date` datetime NOT NULL,
  PRIMARY KEY (`id`)
);

Core tables (PostgreSQL):

CREATE TABLE res_country (
    id integer NOT NULL,
    create_uid integer,
    create_date timestamp without time zone,
    write_date timestamp without time zone,
    write_uid integer,
    address_format text,
    currency_id integer,
    code character varying(2),
    name character varying(64) NOT NULL
);

INSERT INTO res_country VALUES (1, 1, '2013-11-16 06:53:31.030363', '2013-11-16 06:53:31.030363', 1, '%(street)s
%(street2)s
%(city)s %(state_code)s %(zip)s
%(country_name)s', 1, 'AD', 'Andorra, Principality of');

CREATE TABLE res_partner (
    id integer NOT NULL,
    name character varying(128) NOT NULL,
    lang character varying(64),
    company_id integer,
    create_uid integer,
    create_date timestamp without time zone,
    write_date timestamp without time zone,
    write_uid integer,
    comment text,
    ean13 character varying(13),
    color integer,
    image bytea,
    use_parent_address boolean,
    active boolean,
    street character varying(128),
    supplier boolean,
    city character varying(128),
    user_id integer,
    zip character varying(24),
    title integer,
    function character varying(128),
    country_id integer,
    parent_id integer,
    employee boolean,
    type character varying,
    email character varying(240),
    vat character varying(32),
    website character varying(64),
    fax character varying(64),
    street2 character varying(128),
    phone character varying(64),
    credit_limit double precision,
    date date,
    tz character varying(64),
    customer boolean,
    image_medium bytea,
    mobile character varying(64),
    ref character varying(64),
    image_small bytea,
    birthdate character varying(64),
    is_company boolean,
    state_id integer,
    notification_email_send character varying NOT NULL,
    opt_out boolean,
    signup_type character varying,
    signup_expiration timestamp without time zone,
    signup_token character varying,
    last_reconciliation_date timestamp without time zone,
    debit_limit double precision,
    display_name character varying,
    vat_subjected boolean,
    section_id integer
);


CREATE TABLE my_res_partner_companies (
  id INT8 NOT NULL PRIMARY KEY,
  cid INT8 NOT NULL,
  universalName VARCHAR NOT NULL,
  employeeCountRange VARCHAR NOT NULL,
  specialties VARCHAR NOT NULL,
  twitterId VARCHAR NOT NULL,
  "check" INT4 NOT NULL DEFAULT '0',
  date TIMESTAMP NOT NULL
);

CREATE TABLE my_res_partner_search_result (
  id INT8 NOT NULL PRIMARY KEY,
  link VARCHAR NOT NULL,
  raw VARCHAR NOT NULL,
  date TIMESTAMP NOT NULL
);

Source properties:

engine.name=source-001

# The class name for the JDBC Driver
db.driver=com.mysql.jdbc.Driver

# The JDBC URL used to connect to the database
db.url=jdbc:mysql://localhost:3307/data?tinyInt1isBit=false

# The user to login as who can create and update tables
db.user=root

# The password for the user to login as
db.password=

# The HTTP URL of the root node to contact for registration
registration.url=http://localhost:8080/sync/core-000
#auto.reload.reverse=true

# Do not change these for running the demo
group.id=source
external.id=001

# This is how often the routing job will be run in milliseconds
job.routing.period.time.ms=5000
# This is how often the push job will be run.
job.push.period.time.ms=10000
# This is how often the pull job will be run.
job.pull.period.time.ms=10000

Core properties:

engine.name=core-000

# The class name for the JDBC Driver
db.driver=org.postgresql.Driver

# The JDBC URL used to connect to the database
db.url=jdbc:postgresql://localhost:5432/data2?stringtype=unspecified

# The user to login as who can create and update tables
db.user=admin

# The password for the user to login as
db.password=admin

registration.url=
sync.url=http://localhost:8080/sync/core-000
auto.reload.reverse=true
datareload.batch.insert.transactional=true

# Do not change these for running the demo
group.id=core
external.id=000

# Don't muddy the waters with purge logging
job.purge.period.time.ms=7200000

# This is how often the routing job will be run in milliseconds
job.routing.period.time.ms=5000
# This is how often the push job will be run.
job.push.period.time.ms=10000
# This is how often the pull job will be run.
job.pull.period.time.ms=10000

Main SymmetricDS configuration:

-- Nodes
insert into sym_node_group (node_group_id, description)
values ('core', 'Core Storage');
insert into sym_node_group (node_group_id, description)
values ('source', 'Source Storage');

insert into sym_node_group_link (source_node_group_id, target_node_group_id, data_event_action)
values ('source', 'core', 'P');
insert into sym_node_group_link (source_node_group_id, target_node_group_id, data_event_action)
values ('core', 'source', 'W');

insert into sym_node (node_id, node_group_id, external_id, sync_enabled)
values ('000', 'core', '000', 1);
insert into sym_node_security (node_id,node_password,registration_enabled,registration_time,initial_load_enabled,initial_load_time,initial_load_id,initial_load_create_by,rev_initial_load_enabled,rev_initial_load_time,rev_initial_load_id,rev_initial_load_create_by,created_at_node_id)
values ('000','changeme',1,current_timestamp,1,current_timestamp,null,null,0,null,null,null,'000');
insert into sym_node_identity values ('000');

-- Channels
insert into sym_channel
(channel_id, processing_order, max_batch_size, enabled, description)
values('source__acc', 1, 100000, 1, 'accounting synchronisation');

-- Triggers
insert into sym_trigger
(trigger_id,source_table_name,channel_id,last_update_time,create_time)
values('source__companies','companies','source__acc',current_timestamp,current_timestamp);

insert into sym_trigger
(trigger_id,source_table_name,channel_id,last_update_time,create_time)
values('source__search_results','search_results','source__acc',current_timestamp,current_timestamp);

-- Routers
insert into sym_router
(router_id,source_node_group_id,target_node_group_id,router_type,create_time,last_update_time)
values('source_2_core', 'source', 'core', 'default',current_timestamp, current_timestamp);

-- Trigger Router Links
insert into sym_trigger_router
(trigger_id,router_id,initial_load_order, INITIAL_LOAD_BATCH_COUNT,last_update_time,create_time)
values('source__companies','source_2_core', 100, 0, current_timestamp, current_timestamp);

insert into sym_trigger_router
(trigger_id,router_id,initial_load_order, INITIAL_LOAD_BATCH_COUNT,last_update_time,create_time)
values('source__search_results','source_2_core', 200, 0, current_timestamp, current_timestamp);

Main transformations:

-- Transform
insert into SYM_TRANSFORM_TABLE
  (transform_id, source_node_group_id, target_node_group_id, transform_point, source_table_name, target_table_name, delete_action, column_policy)
values
  ('source__companies__main', 'source', 'core', 'LOAD', 'companies', 'res_partner', 'DEL_ROW', 'SPECIFIED');
--  ('source__companies__main', 'source', 'core', 'EXTRACT', 'companies', 'res_partner', 'DEL_ROW', 'SPECIFIED');

insert into SYM_TRANSFORM_COLUMN
  (transform_id, include_on, source_column_name, target_column_name, pk, transform_type, TRANSFORM_EXPRESSION)
values
  ('source__companies__main', '*', 'id', 'id', 1, 'bsh', 'return Integer.parseInt(currentValue) + 1000000;'),
  ('source__companies__main', '*', 'name', 'name', 0, 'copy', NULL),
  ('source__companies__main', '*', 'country', 'country_id', 0, 'const', '1'),
  ('source__companies__main', '*', 'city', 'city', 0, 'copy', NULL),
  ('source__companies__main', '*', 'street', 'street', 0, 'copy', NULL),
  ('source__companies__main', '*', 'phone', 'phone', 0, 'copy', NULL),
  ('source__companies__main', '*', 'websiteUrl', 'website', 0, 'copy', NULL),
  ('source__companies__main', '*', NULL, 'notification_email_send', 0, 'const', '0'),
  ('source__companies__main', '*', NULL, 'is_company', 0, 'const', '1');

insert into SYM_TRANSFORM_TABLE
  (transform_id, source_node_group_id, target_node_group_id, transform_point, source_table_name, target_table_name, delete_action, column_policy)
values
  ('source__companies__residue', 'source', 'core', 'LOAD', 'companies', 'my_res_partner_companies', 'DEL_ROW', 'SPECIFIED');
  -- ('source__companies__residue', 'source', 'core', 'EXTRACT', 'companies', 'my_res_partner_companies', 'DEL_ROW', 'SPECIFIED');

insert into SYM_TRANSFORM_COLUMN
  (transform_id, include_on, source_column_name, target_column_name, pk, transform_type, TRANSFORM_EXPRESSION)
values
  ('source__companies__residue', '*', 'id', 'id', 1, 'bsh', 'return Integer.parseInt(currentValue) + 1000000;'),
  ('source__companies__residue', '*', 'cid', 'cid', 0, 'copy', NULL),
  ('source__companies__residue', '*', 'universalName', 'universalName', 0, 'copy', NULL),
  ('source__companies__residue', '*', 'employeeCountRange', 'employeeCountRange', 0, 'copy', NULL),
  ('source__companies__residue', '*', 'specialties', 'specialties', 0, 'copy', NULL),
  ('source__companies__residue', '*', 'twitterId', 'twitterId', 0, 'copy', NULL),
  ('source__companies__residue', '*', 'check', 'check', 0, 'copy', NULL),
  ('source__companies__residue', '*', 'date', 'date', 0, 'copy', NULL);


insert into SYM_TRANSFORM_TABLE
  (transform_id, source_node_group_id, target_node_group_id, transform_point,
   source_table_name, target_table_name, delete_action, column_policy)
values
  ('source__search_results__main', 'source', 'core', 'LOAD', 'search_results', 'res_partner', 'DEL_ROW', 'SPECIFIED');
  -- ('source__search_results__main', 'source', 'core', 'EXTRACT', 'search_results', 'res_partner', 'DEL_ROW', 'SPECIFIED');

insert into SYM_TRANSFORM_COLUMN
  (transform_id, include_on, source_column_name, target_column_name, pk, transform_type, TRANSFORM_EXPRESSION)
values
  ('source__search_results__main', '*', 'id', 'id', 1, 'bsh', 'return Integer.parseInt(currentValue) + 2000000;'),
  ('source__search_results__main', '*', 'cid', 'parent_id', 0, 'bsh', 'return Integer.parseInt(currentValue) + 1000000;'),
  ('source__search_results__main', '*', 'title', 'name', 0, 'copy', NULL),
  ('source__search_results__main', '*', 'description', 'comment', 0, 'copy', NULL),
  ('source__search_results__main', '*', NULL, 'use_parent_address', 0, 'const', '1'),
  ('source__search_results__main', '*', NULL, 'notification_email_send', 0, 'const', '0'),
  ('source__search_results__main', '*', NULL, 'is_company', 0, 'const', '0');

insert into SYM_TRANSFORM_TABLE
  (transform_id, source_node_group_id, target_node_group_id, transform_point,
   source_table_name, target_table_name, delete_action, column_policy)
values
  ('source__search_results__residue', 'source', 'core', 'LOAD', 'search_results', 'my_res_partner_search_result', 'DEL_ROW', 'SPECIFIED');
  -- ('source__search_results__residue', 'source', 'core', 'EXTRACT', 'search_results', 'my_res_partner_search_result', 'DEL_ROW', 'SPECIFIED');

insert into SYM_TRANSFORM_COLUMN
  (transform_id, include_on, source_column_name, target_column_name, pk, transform_type, TRANSFORM_EXPRESSION)
values
  ('source__search_results__residue', '*', 'id', 'id', 1, 'bsh', 'return Integer.parseInt(currentValue) + 2000000;'),
  ('source__search_results__residue', '*', 'link', 'link', 0, 'copy', NULL),
  ('source__search_results__residue', '*', 'raw', 'raw', 0, 'copy', NULL),
  ('source__search_results__residue', '*', 'date', 'date', 0, 'copy', NULL);

Simplified transformations:

-- Transform
insert into SYM_TRANSFORM_TABLE
  (transform_id, source_node_group_id, target_node_group_id, transform_point, source_table_name, target_table_name, delete_action, column_policy)
values
  ('source__companies__main', 'source', 'core', 'LOAD', 'companies', 'res_partner', 'DEL_ROW', 'SPECIFIED');
--  ('source__companies__main', 'source', 'core', 'EXTRACT', 'companies', 'res_partner', 'DEL_ROW', 'SPECIFIED');

insert into SYM_TRANSFORM_COLUMN
  (transform_id, include_on, source_column_name, target_column_name, pk, transform_type, TRANSFORM_EXPRESSION)
values
  ('source__companies__main', '*', 'id', 'id', 1, 'bsh', 'return Integer.parseInt(currentValue) + 1000000;'),
  ('source__companies__main', '*', 'name', 'name', 0, 'copy', NULL),
  ('source__companies__main', '*', 'country', 'country_id', 0, 'const', '1'),
  ('source__companies__main', '*', 'city', 'city', 0, 'copy', NULL),
  ('source__companies__main', '*', 'street', 'street', 0, 'copy', NULL),
  ('source__companies__main', '*', 'phone', 'phone', 0, 'copy', NULL),
  ('source__companies__main', '*', 'websiteUrl', 'website', 0, 'copy', NULL),
  ('source__companies__main', '*', NULL, 'notification_email_send', 0, 'const', '0'),
  ('source__companies__main', '*', NULL, 'is_company', 0, 'const', '1');


insert into SYM_TRANSFORM_TABLE
  (transform_id, source_node_group_id, target_node_group_id, transform_point,
   source_table_name, target_table_name, delete_action, column_policy)
values
  ('source__search_results__main', 'source', 'core', 'LOAD', 'search_results', 'res_partner', 'DEL_ROW', 'SPECIFIED');
  -- ('source__search_results__main', 'source', 'core', 'EXTRACT', 'search_results', 'res_partner', 'DEL_ROW', 'SPECIFIED');

insert into SYM_TRANSFORM_COLUMN
  (transform_id, include_on, source_column_name, target_column_name, pk, transform_type, TRANSFORM_EXPRESSION)
values
  ('source__search_results__main', '*', 'id', 'id', 1, 'bsh', 'return Integer.parseInt(currentValue) + 2000000;'),
  ('source__search_results__main', '*', 'cid', 'parent_id', 0, 'bsh', 'return Integer.parseInt(currentValue) + 1000000;'),
  ('source__search_results__main', '*', 'title', 'name', 0, 'copy', NULL),
  ('source__search_results__main', '*', 'description', 'comment', 0, 'copy', NULL),
  ('source__search_results__main', '*', NULL, 'use_parent_address', 0, 'const', '1'),
  ('source__search_results__main', '*', NULL, 'notification_email_send', 0, 'const', '0'),
  ('source__search_results__main', '*', NULL, 'is_company', 0, 'const', '0');

Core setup:

update sym_channel set DATA_LOADER_TYPE = 'postgres_bulk' where channel_id = 'reload';

It looks like SymmetricDS issues a COPY per record, one by one, both with the main transformations (LOAD and EXTRACT) and with the simplified transformations (LOAD and EXTRACT).

tbicr
  • From [their website](http://www.symmetricds.org/download): "SymmetricDS Pro is enterprise-class replication built on SymmetricDS open source and other proven OSS components. Enhance your experience and productivity with a web interface that simplifies configuration, monitoring, and troubleshooting. The core is extended with special features including fast, ***bulk-loading data loaders*** and support for syncing with Android devices.". So, it looks like the feature you're after is not available in the OSS version... – fvu Jan 14 '14 at 12:38
  • Just found http://www.symmetricds.org/doc/3.5/html-single/user-guide.html#ap02-postgresql; it looks like the bulk loader exists in the standard version too. – tbicr Jan 14 '14 at 12:59
  • @fvu Oh, I love half-open baitware. Really. – Craig Ringer Jan 15 '14 at 02:51
  • OSS SymmetricDS includes support for bulk data loaders. I'm a developer. – Austin Brougher Jan 15 '14 at 13:33

2 Answers


The PostgresBulkDataLoaderFactory is your answer.

If you just want to use the bulk writer for initial loads and reloads, I suggest you configure only the reload channel to use the bulk writer.

In your channel table (sym_channel by default), update the reload channel's data_loader_type column to 'postgres_bulk'.
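A minimal statement, matching what the question's "Core setup" section and the comments below confirm works:

update sym_channel set data_loader_type = 'postgres_bulk' where channel_id = 'reload';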

The user's guide briefly explains how to implement DatabaseWriters.

Austin Brougher
  • Indeed, `update sym_channel set DATA_LOADER_TYPE = 'postgres_bulk' where channel_id = 'reload';` on the postgres node enables it. But it looks like it issues a separate `COPY` for each record instead of a bulk load. Probably a problem with my transformations. My CPU load: 35% postgres, 75% SymmetricDS. – tbicr Jan 16 '14 at 13:22
  • The Postgres bulk loader works incredibly fast without transformations. With transformations it looks like records are inserted one by one with `COPY`, for both the `EXTRACT` and `LOAD` transform_point. Is there a way to check this theory? If it is true, can I set up the transformations so that bulk inserting still works? – tbicr Jan 16 '14 at 14:44
  • Can you share what type of transforms you are running? Are you querying the DB? – Austin Brougher Jan 16 '14 at 17:50
  • Load transforms do not run with the bulk loader. Your transforms will have to run on the extract side. Some transforms can take longer to run than others. If you share your transform I can help suggest optimizations. – Austin Brougher Jan 16 '14 at 18:05
  • Updated post with example. – tbicr Jan 17 '14 at 09:31
  • The BSH interpreter can be slow when you are transforming lots of data. Your script is short, but it's expensive to run the interpreter over and over. The way around this would be to create a custom transform. If you are familiar with Java it shouldn't be difficult. Before you try a custom transform, I'd try changing your BSH transforms into a 'copy' and running your initial load again to verify that is your bottleneck. http://www.symmetricds.org/doc/3.5/html-single/user-guide.html#extensions-column-transforms – Austin Brougher Jan 17 '14 at 18:58
  • When I changed `bsh` to `copy`, the transformation time improved about 2x. However, it looks like records are still inserted with `COPY` one by one. As I understand it, SymmetricDS doesn't (or `bsh` can't) precompile and cache `bsh` scripts? Also, can SymmetricDS cache lookup transformation query results? Can you describe how I can plug my own transformation (an `IColumnTransform` implementation) into SymmetricDS? – tbicr Jan 18 '14 at 07:12
  • BSH doesn't support precompiling. We are planning to support other languages in the future, including the ability to compile Java code on the fly. For now we suggest creating a custom extension point. Create an object that implements ISingleValueColumnTransform (see the sketch after this thread). You will need to compile this and drop it in the lib folder. When SymmetricDS launches it will search for anything implementing an extension point and load it. You should see a message indicating that your extension point has loaded. You can then reference your new extension by the name you assign it. Look at the CopyColumnTransform. – Austin Brougher Jan 19 '14 at 18:12
  • I checked on the COPY running on individual records. I looked at the code and didn't test it, but it looks like the default is to queue up 10,000 records before writing to the DB. You can change this setting by updating your engine's .properties file in the ./engines folder. Add the following line to increase the number of records to 100,000: "postgres.bulk.load.max.rows.before.flush=100000" – Austin Brougher Jan 20 '14 at 14:55
  • I added my own column transformation, following this bean example: http://www.symmetricds.org/doc/3.5/html-single/user-guide.html#extensions-data-loader-filter. Setting `postgres.bulk.load.max.rows.before.flush=100000` doesn't speed up my initial load. – tbicr Jan 20 '14 at 17:10
  • How many records are you loading and how fast is it running? What are your requirements/expectations? – Austin Brougher Jan 20 '14 at 17:31
  • Two tables with test data: 10,000 companies and 930,000 search_results records take about 12 minutes. Without transformations it takes about 2 minutes, and the postgres statement log looks clean. – tbicr Jan 20 '14 at 17:36
  • It was about 24 minutes with `bsh`. Now the speed is reasonable, but it looks like inserting is still not optimal. I just want to clarify: do transformations change the insert batching for the bulk insert? – tbicr Jan 20 '14 at 17:45
  • A custom transform on the extract side will not impact the bulk insert on the load side. – Austin Brougher Jan 20 '14 at 18:41
  • I created a feature request for a mathematical expression transform. I played around with this today and found it to be about 10-20 times faster than BSH on simple expressions. You can track the feature request here: [Issue 1542](http://www.symmetricds.org/issues/view.php?id=1542) – Austin Brougher Jan 22 '14 at 18:22
  • Thanks, but I am more concerned about why synchronization with transformations is 6 times slower than direct synchronization in my example. – tbicr Jan 22 '14 at 21:50
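For reference, here is a minimal sketch of the custom column transform suggested in the thread above, intended to replace the bsh expression `return Integer.parseInt(currentValue) + 1000000;`. It assumes the SymmetricDS 3.5 extension API; the package, the class name, the transform name offsetAdd, and the idea of reading the offset from transform_expression are all illustrative, so check CopyColumnTransform in the SymmetricDS source for the exact interface and register the class as described in the extension-point documentation linked above.

package com.example.sym; // hypothetical package

import java.util.Map;

import org.jumpmind.db.platform.IDatabasePlatform;
import org.jumpmind.symmetric.io.data.DataContext;
import org.jumpmind.symmetric.io.data.transform.ISingleValueColumnTransform;
import org.jumpmind.symmetric.io.data.transform.IgnoreColumnException;
import org.jumpmind.symmetric.io.data.transform.IgnoreRowException;
import org.jumpmind.symmetric.io.data.transform.TransformColumn;
import org.jumpmind.symmetric.io.data.transform.TransformedData;

// Replaces the bsh expression "return Integer.parseInt(currentValue) + 1000000;".
// Reference it from sym_transform_column with transform_type = 'offsetAdd'
// and put the offset (e.g. 1000000) in transform_expression.
public class OffsetAddColumnTransform implements ISingleValueColumnTransform {

    public String getName() {
        return "offsetAdd"; // value to use in sym_transform_column.transform_type
    }

    public boolean isExtractColumnTransform() {
        return true; // run on the extract side so the load side can stay on the bulk loader
    }

    public boolean isLoadColumnTransform() {
        return false;
    }

    public String transform(IDatabasePlatform platform, DataContext context,
            TransformColumn column, TransformedData data,
            Map<String, String> sourceValues, String newValue, String oldValue)
            throws IgnoreColumnException, IgnoreRowException {
        // transform_expression carries the offset to add to the source value
        long offset = Long.parseLong(column.getTransformExpression().trim());
        return String.valueOf(Long.parseLong(newValue) + offset);
    }
}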

The key to good insert performance with large data sets in PostgreSQL is to do it all in as few transactions as possible. By default PostgreSQL autocommits, so each insert is itself a transaction. Simply wrapping all the inserts in begin; ... commit; can give much faster throughput.

The reasons for COPY being faster are therefore twofold: 1) one big transaction, and 2) no parsing of individual insert statements. Parsing overhead is minimal, but transactional overhead, relatively speaking, is huge for single statements.

Is there a way to make it do this:

begin;
insert ...;
insert ...;
insert ...;
commit;
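For comparison, COPY gets both savings at once. A sketch against the res_partner table above (column list abbreviated, values hypothetical, fields tab-delimited):

begin;
copy res_partner (id, name, notification_email_send) from stdin;
1000001	Acme	0
1000002	Globex	0
\.
commit;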
Scott Marlowe
  • As I understand it, SymmetricDS already uses transactions (by default 5000 records per transaction). However, `COPY` is still faster than `INSERT` even inside a transaction. – tbicr Jan 16 '14 at 13:17