
The problem is that I need Logstash to ingest a CSV file stored in a public GitHub repository (accessed as a raw file). The CSV has no timestamp column, and I want Logstash to add one; otherwise it does not connect to Grafana.

The following is my pipeline code in Logstash:

input {
  http_poller {
    urls => {
      csv_data => {
        method => get
        test1 => "https://raw.githubusercontent.com/shujah-TRN-Infosys/dag-test/main/dag_meta_data.csv"
        headers => {
          Accept => "text/csv"
        }
      }
    }
    request_timeout => 60
    schedule => { cron => "* * * * * UTC" } #EveryMinute
    codec => "plain"
  }
}

filter {
  csv {
    separator => ","
    columns => ["dag_id","batch", "sor", "consumer", "application", "depends_on_ingestion", "depends_on_curation", "dag_type"] # Specify column names here
  }
  ruby {
    code => "event.set('@timestamp', LogStash::Timestamp.now)"
  }
}

output {
  elasticsearch {
    hosts => [ "xxxxxxx" ]
    user => "xxxxx" 
    password => "xxxx" 
    index => "meta_csv_data-%{+YYYY.MM.dd}"
  }
  #stdout { codec => rubydebug }
}

I have tried the pipeline code (sensitive info redacted) and waited a minute to see whether any indices appeared in the Index Management section of Elastic Cloud. There were none, so I assume it is not working. Any ideas on how to approach this, or is there something wrong with my pipeline code?

Paulo
  • Hi there, why don't you first check whether your data injection is working? Just inject the *.csv data into Elastic first; if there is no issue there, then you can look at the timestamp issue. – Farkhod Abdukodirov Jun 23 '23 at 01:06

1 Answer


TL;DR:

I believe this is a simple mistake when reading the documentation. You have set a key named test1 where it should have been named url.

Indeed, as per the documentation:

input {
  http_poller {
    urls => {
      test1 => "http://localhost:9200"
      test2 => {
        # Supports all options supported by ruby's Manticore HTTP client
        method => get
        user => "AzureDiamond"
        password => "hunter2"
        url => "http://localhost:9200/_cluster/health" # <= this key must be named "url"; it cannot be custom
        headers => {
          Accept => "application/json"
        }
     }
    }
    request_timeout => 60
    # Supports "cron", "every", "at" and "in" schedules by rufus scheduler
    schedule => { cron => "* * * * * UTC"}
    codec => "json"
    # A hash of request metadata info (timing, response headers, etc.) will be sent here
    metadata_target => "http_poller_metadata"
  }
}

output {
  stdout {
    codec => rubydebug
  }
}

Also, since you are processing lines, you will want one event per line (\n). With the plain codec, the whole HTTP response body becomes a single event, so the csv filter would only parse its first row; use the line codec instead.

Solution:

Your input should look like this:

input {
  http_poller {
    urls => {
      csv_data => {
        method => get
        url => "https://raw.githubusercontent.com/shujah-TRN-Infosys/dag-test/main/dag_meta_data.csv"
      }
    }
    request_timeout => 60
    schedule => { cron => "* * * * * UTC" } #EveryMinute
    codec => "line"
  }
}
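One more detail to keep in mind: with the line codec, the CSV header row arrives as a regular event on every poll. The csv filter has a skip_header option that drops a row whose values match the configured columns (a sketch against your column list; check that the option is available in your Logstash version):

filter {
  csv {
    separator => ","
    columns => ["dag_id","batch","sor","consumer","application","depends_on_ingestion","depends_on_curation","dag_type"]
    skip_header => true # drop the line that repeats the column names
  }
}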

To reproduce

Since your URL was not accessible to me, I created a test of my own:

input {
  http_poller {
    urls => {
      csv_data => {
        method => get
        url => "https://raw.githubusercontent.com/elastic/ecs/main/generated/csv/fields.csv"
      }
    }
    request_timeout => 60
    schedule => { cron => "* * * * * UTC" } #EveryMinute
    codec => "line"
  }
}

filter {
  csv {
    separator => ","
    columns => ["ECS_Version","Indexed","Field_Set","Field","Type","Level","Normalization","Example","Description"] # Specify column names here
  }
  ruby {
    code => "event.set('@timestamp', LogStash::Timestamp.now)"
  }
}

output {
  stdout { codec => rubydebug }
}
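If events show up on stdout but no index appears in Elastic Cloud, you can keep both outputs active while testing, since Logstash allows multiple outputs in the same block (credentials redacted as in the question):

output {
  elasticsearch {
    hosts => [ "xxxxxxx" ]
    user => "xxxxx"
    password => "xxxx"
    index => "meta_csv_data-%{+YYYY.MM.dd}"
  }
  stdout { codec => rubydebug } # keep console output while debugging
}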