
Brand new to Elasticsearch. I've been doing tons of reading, but I am hoping that the experts on SO might be able to weigh in on my cluster configuration to see if there is something that I am missing.

Currently I am using ES (1.7.3) to index some very large text files (~700 million lines each), with one index per file. I am using Logstash (v2.1) to do the indexing. The config file for my first index is below:

input {
    file {
        path => "L:/news/data/*.csv"
        start_position => "beginning"
        sincedb_path => "C:/logstash-2.1.0/since_db_news.txt"
    }
}

filter {
    csv {
        separator => "|"
        columns => ["NewsText", "Place", "Subject", "Time"]
    }
    mutate {
        strip => ["NewsText"]
        lowercase => ["NewsText"]
    }
}

output {
    elasticsearch {
        action => "index"
        hosts => ["xxx.xxx.x.xxx", "xxx.xxx.x.xxx"]
        index => "news"
        workers => 2
        flush_size => 5000
    }
    stdout {}
}

My cluster consists of 3 boxes running Windows 10, each running a single node. ES is not installed as a service, and I am standing up only one master node:

Master node: 8GB RAM, ES_HEAP_SIZE = 3500m, Single Core i7

Data Node #1: 8GB RAM, ES_HEAP_SIZE = 3500m, Single Core i7

This node is currently running the Logstash instance with LS_HEAP_SIZE = 3000m

Data Node #2: 16GB RAM, ES_HEAP_SIZE = 8000m, Single Core i7

I have ES currently configured with the default 5 shards + 1 replica per index.

At present, each node is configured to write data to an external HD and logs to another.
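
For reference, the relevant per-node settings look roughly like the sketch below. The cluster/node names and drive letters are placeholders rather than my actual values; the shard/replica lines just spell out the ES 1.x defaults I am currently using.

    # elasticsearch.yml (ES 1.7.x) -- illustrative sketch only
    cluster.name: news-cluster           # placeholder
    node.name: data-node-1               # placeholder

    # data and logs on separate drives (drive letters are examples)
    path.data: E:/elasticsearch/data
    path.logs: F:/elasticsearch/logs

    # index defaults (the ES 1.x defaults, which I have not changed)
    index.number_of_shards: 5
    index.number_of_replicas: 1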

In my test run, I am averaging 10K events per second with Logstash. My main goal is to optimize the speed at which these files are loaded into ES. I am thinking that I should be closer to 80K based on what I have read.

I have played around with changing the number of workers and flush size, but can't seem to get beyond this threshold. I think I may be missing something fundamental.

My questions are twofold:

1) Is there anything that jumps out as fishy about my cluster configuration, or any advice that might improve the process?

2) Would it help if I ran an instance of Logstash on each data node, indexing separate files?

Thanks so much for any and all help in advance and for taking the time to read.

-Zinga

1 Answer


Is there anything that jumps out as fishy about my cluster configuration or some advice that may improve the process?

I'd say run Logstash on the Master node, so that it can make better use of the resources (RAM) it has, and leave the Data nodes to their primary job of indexing in ES.

You're likely going to be CPU-bound before anything else, but I could be wrong depending on the speed and kind of disks you have on the Data nodes. You mention that you write data to an external HD. If it's connected via USB, it might not keep up with the high I/O rate required to index all your documents.

Would it help if I ran an instance of Logstash on each Data node, indexing separate files?

I wouldn't think so. You're not doing a whole lot of work in Logstash (no grokking, only basic mutates), so you're going to end up being bound by the speed of your Data nodes, and asking them to do more than they already are likely isn't going to help.

As for other pointers, maybe try reducing the number of shards to 3 and not having any replicas, since that should speed things up a little. You can always reconfigure your indices to have replicas once the indexing is done.
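
For example, with the ES 1.x settings API that would look something like this (the host/port is a placeholder for one of your nodes; "news" is the index from your Logstash config):

    # create the index with 3 shards and no replicas before bulk loading
    curl -XPUT 'http://localhost:9200/news' -d '{
      "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 0
      }
    }'

    # once indexing is finished, add the replica back
    curl -XPUT 'http://localhost:9200/news/_settings' -d '{
      "index": { "number_of_replicas": 1 }
    }'

Note that number_of_shards can only be set when the index is created, so you'd want to create the index with these settings before Logstash starts writing to it; number_of_replicas can be changed at any time.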

Finally, you should watch the Resource Monitor on each system while indexing to get an idea of which resource is being taxed the most (CPU, RAM, disk, network?), then work on fixing that bottleneck and repeat until you're happy with the indexing performance.
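
Elasticsearch's own stats APIs (available in 1.x) give a similar picture from the cluster's side; the host below is a placeholder for one of your nodes:

    # per-node OS, JVM, and thread-pool stats
    curl -XGET 'http://localhost:9200/_nodes/stats?pretty'

    # quick view of bulk/index thread-pool queueing and rejections
    curl -XGET 'http://localhost:9200/_cat/thread_pool?v'

If you see rejections climbing on the bulk thread pool, Logstash is trying to index faster than the Data nodes can keep up with.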

GregL
This has been a great help. Using your suggestions I managed to tweak some settings and have now indexed just south of 1 billion documents in three days. Thank you so much for taking the time to help me. – NationalDonut Dec 28 '15 at 21:39