
I am trying to parse an XML file in Logstash and want to use XPath to parse the XML documents. When I run my config file, the data loads into Elasticsearch, but not in the way I want: each line of the XML document is indexed as a separate event.

Structure of my XML file

(screenshot showing the structure of the XML file)
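
A simplified sketch of the structure, reconstructed from the XPath expressions in my config and the fields I want below (the actual file may contain more stations and elements):

<stations>
    <station>
        <id>1</id>
        <name>Finch</name>
    </station>
</stations>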

What I want to achieve:

Create fields in Elasticsearch that store the following:

ID = 1
Name = "Finch"

My Config file:

input{
    file{
        path => "C:\Users\186181152\Downloads\stations.xml"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        exclude => "*.gz"
        type => "xml"
    }
}
filter{
    xml{
        source => "message"
        store_xml => false
        target => "stations"
        xpath => [
            "/stations/station/id/text()", "station_id",
            "/stations/station/name/text()", "station_name"
        ]
    }
}

output{
    elasticsearch{
        codec => json
        hosts => "localhost"
        index => "xmlns"
    }
    stdout{
        codec => rubydebug
    }
}

Output in Logstash:

{
    "station_name" => "%{station_name}",
    "path" => "C:\Users\186181152\Downloads\stations.xml",
    "@timestamp" => 2018-02-09T04:03:12.908Z,
    "station_id" => "%{station_id}",
    "@version" => "1",
    "host" => "BW",
    "message" => "\t\r",
    "type" => "xml"
}
  • I don't think `dev/null` is supported on Windows. – baudsp Feb 09 '18 at 17:02
  • Is the whole xml file on the same line, i.e. no line break? Because if it's not the case, the file will be treated line by line (as indicated in the doc), thus causing the empty `station_id` and `station_name`. – baudsp Feb 09 '18 at 17:06
  • @baudsp Dev/null works fine. I tried a csv file and it loaded the data correctly – KARAN SHAH Feb 09 '18 at 17:11
  • @baudsp the whole XML file is not on the same line; the file follows standard XML conventions, one tag per line. – KARAN SHAH Feb 09 '18 at 17:13
  • What I meant is that setting `sincedb_path => "/dev/null"` will not have the same behavior as on a linux system. The purpose of setting `sincedb_path => "/dev/null"` is that the sincedb file will not be written, so logstash will not remember what in each file has been read. But it won't prevent logstash from running. – baudsp Feb 09 '18 at 17:15
  • The file input reads the file line by line, creating one message per line, which explains your result. You'll have to use the multiline codec on your input. See https://stackoverflow.com/questions/34800559/how-to-parse-multi-line-xml-in-logstash/34896295#34896295 – baudsp Feb 09 '18 at 17:18
  • @baudsp thanks. I will try the multiline codec and let you know if that works out for me – KARAN SHAH Feb 09 '18 at 18:36
  • @baudsp. I tried the multiline codec – KARAN SHAH Feb 09 '18 at 19:23
  • @baudsp I tried the multiline codec with the following pattern (placed below `type => xml`) and it does not even create an index anymore. What should the sincedb path be on the Windows operating system? codec => multiline { pattern => "" negate => "true" what => "previous" } – KARAN SHAH Feb 09 '18 at 19:37
  • From what I've read elsewhere, you can use `nul` as the sincedb path to the same effect as unix `/dev/null`. – baudsp Feb 12 '18 at 08:49
  • I think that, since logstash has already read the file, it won't do anything with it. You'll have to add lines to it or use another file. Or find the `since_db` file and delete it. Or use another sincedb path – baudsp Feb 12 '18 at 08:51
  • @baudsp the multiline filter solves the problem, and yes, I am facing the sincedb problem. If there is a possible workaround to fix sincedb, let me know. Also, I would request you to answer this question so I can mark it completed – KARAN SHAH Feb 12 '18 at 13:57
  • The `sincedb_path => "nul"` works, I've just tested it; you can use this so that logstash doesn't remember what has been read. [You can answer your own question](https://stackoverflow.com/help/self-answer), I don't know if I'll have time to answer this one. – baudsp Feb 12 '18 at 14:11
  • @baudsp Yes, sincedb_path works, but then why does my logstash load data only when my system restarts or boots up? I tried another configuration on the same file and it works flawlessly. Can you help me out with that? – KARAN SHAH Feb 12 '18 at 14:27

1 Answer


The multiline codec allows the whole XML file to be read as a single event, which we can then parse with the xml filter and XPath to ingest the data into Elasticsearch. In the multiline codec we specify a pattern (`<stations>` in the example below) that Logstash uses to scan the XML file; once the pattern matches, every following line that does not match it is folded into the same event.

The following is an example of a working config file for my data:

input {
    file {
        path => "C:\Users\186181152\Downloads\stations3.xml"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        exclude => "*.gz"
        type => "xml"
        codec => multiline {
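            # lines that do NOT match "<stations>" are appended to the previous
            # event, so the whole file is folded into a single message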
            pattern => "<stations>" 
            negate => "true"
            what => "previous"
        }
    }
}

filter {
    xml {
        source => "message"
        store_xml => false
        target => "stations"
        xpath => [
            "/stations/station/id/text()", "station_id",
            "/stations/station/name/text()", "station_name"
        ]
    }
}

output {
    elasticsearch {
        codec => json
        hosts => "localhost"
        index => "xmlns24"
    }
    stdout {
        codec => rubydebug
    }
}   
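
With the whole file read as a single event, the xpath expressions can populate the fields. The stdout output should then look roughly like the sketch below (values are illustrative for a station with id 1 and name "Finch"; note that xpath matches are stored as arrays):

{
    "station_id" => ["1"],
    "station_name" => ["Finch"],
    "path" => "C:\Users\186181152\Downloads\stations3.xml",
    "type" => "xml",
    ...
}

On Windows, `/dev/null` does not exist; as suggested in the comments above, `sincedb_path => "nul"` can be used instead so that Logstash does not remember what it has already read.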