Database
Given the following PostgreSQL table test
(some columns omitted, e.g. data
which is used in the pipeline):
id (uuid) | updated_at (timestamp with time zone)
652d88d3-e978-48b1-bd0f-b8188054a920 | 2018-08-08 11:02:00.000000
50cf7942-cd18-4730-a65e-fc06f11cfd1d | 2018-08-07 15:30:00.000000
Logstash
Given Logstash 6.3.2 (via Docker) with the following pipeline (jdbc_*
omitted):
input {
jdbc {
statement => "SELECT id, data, updated_at FROM test WHERE updated_at > :sql_last_value"
schedule => "* * * * *"
use_column_value => true
tracking_column => "updated_at"
tracking_column_type => "timestamp"
}
}
filter {
mutate { remove_field => "updated_at" }
}
output {
elasticsearch {
hosts => "elasticsearch:9200"
index => "test"
document_id => "%{id}"
}
}
Problem
When this pipeline runs the very first time (or with clean_run => true
) I'd expect it to process both database rows (because sql_last_value
is 1970-01-01 00:00:00.000000
) and set the value of the tracking column stored in .logstash_jdbc_last_run
to 2018-08-08 11:02:00.000000000 Z
(= the latest of all found updated_at
timestamps). It'll be set to 2018-08-07 15:30:00.000000000 Z
though, which is the earlier of the two given timestamps. This means that in the 2nd run the other of the two rows will be processed again, even if it hasn't changed.
Is this the expected behaviour? Do I miss some other configuration which controls this aspect?
Edit
It seems that the updated_at
of the very last row returned will be used (just tried it with more rows). So I'd have to add an ORDER BY updated_at ASC
which I believe isn't that great in terms of DB query performance.
Logs, etc.
sh-4.2$ cat .logstash_jdbc_last_run
cat: .logstash_jdbc_last_run: No such file or directory
[2018-08-09T14:38:01,540][INFO ][logstash.inputs.jdbc ] (0.001254s) SELECT id, data, updated_at FROM test WHERE updated_at > '1970-01-01 00:00:00.000000+0000'
sh-4.2$ cat .logstash_jdbc_last_run
--- 2018-08-07 15:30:00.000000000 Z
[2018-08-09T14:39:00,335][INFO ][logstash.inputs.jdbc ] (0.001143s) SELECT id, data, updated_at FROM test WHERE updated_at > '2018-08-07 15:30:00.000000+0000'
sh-4.2$ cat .logstash_jdbc_last_run
--- 2018-08-08 11:02:00.000000000 Z
[2018-08-09T14:40:00,104][INFO ][logstash.inputs.jdbc ] (0.000734s) SELECT id, data, updated_at FROM test WHERE updated_at > '2018-08-08 11:02:00.000000+0000'
sh-4.2$ cat .logstash_jdbc_last_run
--- 2018-08-08 11:02:00.000000000 Z