
I'm quite unclear about what sql_last_value does when I give my statement like this:

statement => "SELECT * from mytable where id > :sql_last_value"

I can somewhat understand the reason behind using it: instead of browsing through the whole DB table on every run in order to update fields, it only picks up the records that were newly added. Correct me if I'm wrong.

So what I'm trying to do is, creating the index using logstash as such:

input {
    jdbc {
        jdbc_connection_string => "jdbc:mysql://hostmachine:3306/db" 
        jdbc_user => "root"
        jdbc_password => "root"
        jdbc_validate_connection => true
        jdbc_driver_library => "/path/mysql_jar/mysql-connector-java-5.1.39-bin.jar"
        jdbc_driver_class => "com.mysql.jdbc.Driver"
        schedule => "* * * * *"
        statement => "SELECT * from mytable where id > :sql_last_value"
        use_column_value => true
        tracking_column => "id"
        jdbc_paging_enabled => "true"
        jdbc_page_size => "50000"
    }
}

output {
    elasticsearch {
        #protocol => http
        index => "myindex"
        document_type => "message_logs"
        document_id => "%{id}"
        action => index
        hosts => ["http://myhostmachine:9402"]
    }
}

Once I do this, the docs aren't getting uploaded to the index at all. Where am I going wrong?

Any help would be appreciated.

Kulasangar
  • Do you have a timestamp column in your table that gets updated on each record update? – Val Nov 01 '16 at 17:07
  • @Val nope, I don't. Do I have to have one in order to update each record? – Kulasangar Nov 01 '16 at 17:09
  • It's easier to get the newly updated records that way. When you update a record the id won't change, and you might not get the updated records. – Val Nov 01 '16 at 17:12
  • In your home folder, you might try to delete the `.logstash_jdbc_last_run` file and see if it works better. – Val Nov 01 '16 at 17:13
  • I'll try it out with a timestamp value and get back. You meant the home folder of `logstash`? – Kulasangar Nov 01 '16 at 17:14

3 Answers


If you have a timestamp column in your table (e.g. last_updated), you should preferably use it instead of the ID one. That way, when a record gets updated you modify that timestamp as well, and the jdbc input plugin will pick the record up again (the ID column doesn't change on an update, so updated records would otherwise never be picked up).

input {
    jdbc {
        jdbc_connection_string => "jdbc:mysql://hostmachine:3306/db" 
        jdbc_user => "root"
        jdbc_password => "root"
        jdbc_validate_connection => true
        jdbc_driver_library => "/path/mysql_jar/mysql-connector-java-5.1.39-bin.jar"
        jdbc_driver_class => "com.mysql.jdbc.Driver"
        jdbc_paging_enabled => "true"
        jdbc_page_size => "50000"
        schedule => "* * * * *"
        statement => "SELECT * from mytable where last_updated > :sql_last_value"
    }
}
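
If you prefer to have sql_last_value track the last_updated column itself rather than the time of the last run, the jdbc input can do that too. A minimal sketch, assuming the same table and column names as above; the key point is tracking_column_type => "timestamp" for a date/time column:

input {
    jdbc {
        # ... same connection, driver and paging settings as above
        schedule => "* * * * *"
        statement => "SELECT * from mytable where last_updated > :sql_last_value order by last_updated"
        use_column_value => true
        tracking_column => "last_updated"
        # tell the plugin the tracked column holds a timestamp, not a number
        tracking_column_type => "timestamp"
    }
}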

If you decide to stay with the ID column nonetheless, you should delete the $HOME/.logstash_jdbc_last_run file and try again.
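
Alternatively, instead of deleting that file by hand, the jdbc input also has a clean_run option that discards the persisted sql_last_value. A minimal sketch; note that tracking is reset on every run while this stays enabled, so it is meant for a one-off full re-ingestion rather than permanent use:

input {
    jdbc {
        # ... same settings as above
        # when true, the persisted sql_last_value is ignored and tracking starts over
        clean_run => true
    }
}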

Val
  • I added a `varchar` timestamp column to my table and inserted values manually for testing purposes (e.g. 2016-09-01 00:00:00), then tried creating the index, but still none of the records are getting uploaded to the index. I deleted `.logstash_jdbc_last_run` as well. – Kulasangar Nov 01 '16 at 17:32
  • the timestamp column should not be a varchar one, but a timestamp, date or datetime one – Val Nov 01 '16 at 17:33
  • I recreated the scenario with a datetime type column. The value can be like this (i.e. 2016-09-01 00:00:00), right? Still the docs aren't getting uploaded to the index. I can provide the Logstash conf and the index doc count output. – Kulasangar Nov 01 '16 at 17:44
  • Can you run Logstash with the `--debug` command line switch? – Val Nov 01 '16 at 17:45
  • Yup sure, you want me to show you the output in the console? Part of the output looks like this http://pastebin.com/5XnSx2Za – Kulasangar Nov 01 '16 at 17:54
  • As you can see, the query that is run is `SELECT count(*) AS count FROM (SELECT * from TEST where time > '2016-11-01 17:45:18') AS t1 LIMIT 1`, so your timestamps need to be bigger than `2016-11-01 17:45:18` – Val Nov 01 '16 at 18:02
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/127123/discussion-between-kulasangar-and-val). – Kulasangar Nov 01 '16 at 18:20
  • @Val and what if I use `statement_filepath` and have my `query.sql` in separate file? Where do I use `> :sql_last_value`? In the `query.sql` or within `jdbc input`? – frank17 Sep 11 '21 at 20:46
  • @frank17 please ask a new question to highlight your case, it'll get much more attention than in a comment. Thanks – Val Sep 12 '21 at 16:07

In simple words, sql_last_value allows you to persist a value from your last SQL run, as its name suggests.

This value is especially useful when you schedule your query. But why? Because you can build your SQL statement's condition on the value stored in sql_last_value and avoid retrieving rows that were already ingested by your Logstash input or that were not updated since the last pipeline execution.

Things to keep in mind when using sql_last_value

  • By default, this variable stores the timestamp of the last run. That's useful when you need to ingest data based on columns like creation_date, last_update, etc.
  • You can also make sql_last_value track a specific column of your table. That's useful when you need to ingest data based on an auto-increment column. For that, you need to specify use_column_value => true and tracking_column => "column_name_to_track".

The following example stores the id of the last ingested mytable row into :sql_last_value, so that the next execution only ingests the rows that were not ingested previously, i.e. the rows whose id is greater than the last ingested id.

input {
    jdbc {
        # ...
        schedule => "* * * * *"
        statement => "SELECT * from mytable where id > :sql_last_value"
        use_column_value => true
        tracking_column => "id"
    }
}

Extremely important !!!

When you use multiple jdbc inputs in your pipeline, each input block will overwrite the sql_last_value stored by the previous one. To avoid that behaviour, use the last_run_metadata_path => "/path/to/sql_last_value/of_your_pipeline.yml" option, so that each input stores its sql_last_value in its own file.
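
A minimal sketch of that setup with two jdbc inputs (the table names and file paths here are made up for illustration):

input {
    jdbc {
        # ... connection settings for the first table
        schedule => "* * * * *"
        statement => "SELECT * from table_a where id > :sql_last_value"
        use_column_value => true
        tracking_column => "id"
        # state kept for this input only
        last_run_metadata_path => "/path/to/last_run_table_a.yml"
    }
    jdbc {
        # ... connection settings for the second table
        schedule => "* * * * *"
        statement => "SELECT * from table_b where id > :sql_last_value"
        use_column_value => true
        tracking_column => "id"
        # state kept for this input only
        last_run_metadata_path => "/path/to/last_run_table_b.yml"
    }
}

This way each input persists its own sql_last_value instead of both writing to the default file.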

Juan-Kabbali
  • Yes, that extremely important part gave me a headache, but I sorted this out thinking something like that was the reason – Muhammad Jan 23 '20 at 06:10
  • What about setting use_column_value to false? It should still allow you to incrementally update, but instead store the last checkpoint in the metadata file... or have I misunderstood? If you set it to true, every time Logstash restarts it will reset the value of sql_last_value – Vincent Teyssier Jan 10 '21 at 11:29

There are a few things to take care of:

  1. If you have run Logstash earlier without a schedule, then before running Logstash with a schedule, delete the file:

    $HOME/.logstash_jdbc_last_run
    

    In Windows, this file is found at:

    C:\Users\<Username>\.logstash_jdbc_last_run
    
  2. The "statement =>" in Logstash config should have "order by" the tracking_column.

  3. tracking_column should be set correctly, i.e. to the same column that is compared against :sql_last_value in the query.

Here is an example of the Logstash config file:

    input {
        jdbc {
            # MySQL DB jdbc connection string to our database, softwaredevelopercentral
            jdbc_connection_string => "jdbc:mysql://localhost:3306/softwaredevelopercentral?autoReconnect=true&useSSL=false"
            # The user we wish to execute our statement as
            jdbc_user => "root"
            # The user password
            jdbc_password => ""
            # The path to our downloaded jdbc driver
            jdbc_driver_library => "D:\Programs\MySQLJava\mysql-connector-java-6.0.6.jar"
            # The name of the driver class for MySQL DB
            jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
            # our query
            schedule => "* * * * *"
            statement => "SELECT * FROM student WHERE studentid > :sql_last_value order by studentid"
            use_column_value => true
            tracking_column => "studentid"
        }
    }
    output {
        stdout { codec => json_lines }
        elasticsearch {
            hosts => ["localhost:9200"]
            index => "students"
            document_type => "student"
            document_id => "%{studentid}"
        }
    }

To see a working example of the same setup, you can check my blog post: http://softwaredevelopercentral.blogspot.com/2017/10/elasticsearch-logstash-kibana-tutorial.html