Brand new to Elasticsearch. I've been doing tons of reading, but I am hoping that the experts on SO might be able to weigh in on my cluster configuration to see if there is something that I am missing.
Currently I am using ES (1.7.3) to index some very large text files (~700 million lines) per file and looking for one index per file. I am using logstash (V2.1) as my method of choice for indexing the files. Config file is here for my first index:
input {
file {
path => "L:/news/data/*.csv"
start_position => "beginning"
sincedb_path => "C:/logstash-2.1.0/since_db_news.txt"
}
}
filter {
csv {
separator => "|"
columns => ["NewsText", "Place", "Subject", "Time"]
}
mutate {
strip => ["NewsText"]
lowercase => ["NewsText"]
}
}
output {
elasticsearch {
action => "index"
hosts => ["xxx.xxx.x.xxx", "xxx.xxx.x.xxx"]
index => "news"
workers => 2
flush_size => 5000
}
stdout {}
}
My cluster contains 3 boxes running on Windows 10 with each running a single node. ES is not installed as a service and I am only standing up one master node:
Master node: 8GB RAM, ES_HEAP_SIZE = 3500m, Single Core i7
Data Node #1: 8GB RAM, ES_HEAP_SIZE = 3500m, Single Core i7
This node is currently running the logstash instance with LS_HEAP_SIZE= 3000m
Data Node #2: 16GB RAM, ES_HEAP_SIZE = 8000m, Single Core i7
I have ES currently configured at the default 5 shards + 1 duplicate per index.
At present, each node is configured to write data to an external HD and logs to another.
In my test run, I am averaging 10K events per second with Logstash. My main goal is to optimize the speed at which these files are loaded into ES. I am thinking that I should be closer to 80K based on what I have read.
I have played around with changing the number of workers and flush size, but can't seem to get beyond this threshold. I think I may be missing something fundamental.
My questions are two fold:
1) Is there anything that jumps out as fishy about my cluster configuration or some advice that may improve the process?
2) Would it help if I ran an instance of logstash on each data node indexing separate files?
Thanks so much for any and all help in advance and for taking the time to read.
-Zinga