I need to write an app to account for bandwidth coming from a sensor, which reports details on data flow captured in the structure below:

[ElasticsearchType(Name = "trafficSnapshot")]
public class TrafficSnapshot
{
    // use epoch_second @ https://mixmax.com/blog/30x-faster-elasticsearch-queries
    [Date(Format = "epoch_second")]
    public long TimeStamp { get; set; }

    [Nested]
    public Sample[] Samples { get; set; }
}

[ElasticsearchType(Name = "sample")]
public class Sample
{
    public ulong Bytes { get; set; }
    public ulong Packets { get; set; }
    public string Source { get; set; }
    public string Destination { get; set; }
}

There will potentially be a lot of log entries, especially at high traffic flows, every second. I believe we can contain the growth by sharding/indexing by mm/dd/yyyy (and discard unneeded days by deleting old indexes). However, when I create an index with a date string I get the error Invalid NEST response built from a unsuccessful low level call on PUT: /15%2F12%2F2017. How should I define the index if I want to split into dates?

If I log the data in this format, is it then possible for me to perform a summation per IP address of the total data sent and total data received (over a date range which can be defined), or am I better off storing/indexing my data with a different structure before I progress further?

My full code is below, a first stab tonight; pointers appreciated (or if I am going off track and may be better off using Logstash or similar, please do let me know).

public static class DateTimeEpochHelpers
{
    public static DateTime FromUnixTime(this long unixTime)
    {
        var epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        return epoch.AddSeconds(unixTime);
    }

    public static long ToUnixTime(this DateTime date)
    {
        var epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        return Convert.ToInt64((date - epoch).TotalSeconds);
    }
}

public static class ElasticClientTrafficSnapshotHelpers
{
    public static void IndexSnapshot(this ElasticClient elasticClient, DateTime sampleTakenOn, Sample[] samples)
    {
        var timestamp = sampleTakenOn.ToUniversalTime();
        var unixTime = timestamp.ToUnixTime();
        var dateString = timestamp.Date.ToShortDateString();

        // create the index if it doesn't exist
        if (!elasticClient.IndexExists(dateString).Exists)
        {
            elasticClient.CreateIndex(dateString);
        }

        var response = elasticClient.Index(
            new TrafficSnapshot
            {
                TimeStamp = unixTime,
                Samples = samples
            },
            p => p
                .Index(dateString)
                .Id(unixTime)
        );
    }
}

class Program
{
    static void Main(string[] args)
    {
        var node = new Uri("http://localhost:9200");

        var settings = new ConnectionSettings(node);              
        var elasticClient = new ElasticClient(settings);

        var timestamp = DateTime.UtcNow;

        var samples = new[]
        {
            new Sample() {Bytes = 100, Packets = 1, Source = "193.100.100.5", Destination = "8.8.8.8"},
            new Sample() {Bytes = 1022, Packets = 1, Source = "8.8.8.8", Destination = "193.100.100.5"},
            new Sample() {Bytes = 66, Packets = 1, Source = "193.100.100.1", Destination = "91.100.100.1"},
            new Sample() {Bytes = 554, Packets = 1, Source = "193.100.100.10", Destination = "91.100.100.2"},
            new Sample() {Bytes = 89, Packets = 1, Source = "9.9.9.9", Destination = "193.100.100.20"},
        };

        elasticClient.IndexSnapshot(timestamp, samples);
    }
}
morleyc

1 Answer

// use epoch_second @ https://mixmax.com/blog/30x-faster-elasticsearch-queries
[Date(Format = "epoch_second")]
public long TimeStamp { get; set; }

I would evaluate if this still holds true in newer versions of Elasticsearch. Also, is second precision sufficient for your use case? You can index a date in multiple ways to serve different purposes e.g. for sorting, range queries, exact values, etc. You may also want to use a DateTime or DateTimeOffset type, and define a custom JsonConverter to serialize and deserialize to epoch_millis/epoch_second.
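
For example, a minimal sketch of such a converter, assuming Json.NET (the class name is my own, and how you hook it into NEST depends on the serializer setup of your client version):

public class EpochMillisDateTimeConverter : JsonConverter
{
    private static readonly DateTime Epoch =
        new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    public override bool CanConvert(Type objectType) =>
        objectType == typeof(DateTime);

    // write a DateTime out as epoch milliseconds (epoch_millis)
    public override void WriteJson(JsonWriter writer, object value, JsonSerializer serializer) =>
        writer.WriteValue(Convert.ToInt64(
            (((DateTime)value).ToUniversalTime() - Epoch).TotalMilliseconds));

    // read epoch milliseconds back into a UTC DateTime
    public override object ReadJson(JsonReader reader, Type objectType, object existingValue, JsonSerializer serializer) =>
        Epoch.AddMilliseconds(Convert.ToInt64(reader.Value));
}

If your serializer honours Json.NET attributes, decorating the TimeStamp property with [JsonConverter(typeof(EpochMillisDateTimeConverter))] and mapping it with Format = "epoch_millis" would tie the two together.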

There will potentially be a lot of log entries, especially at high traffic flows, every second. I believe we can contain the growth by sharding/indexing by mm/dd/yyyy (and discard unneeded days by deleting old indexes)

Creating indices per time interval is a very good idea for time series data. Often, newer data, e.g. the last day or the last week, is searched/aggregated on more often than older data. Indexing into time-based indices allows you to take advantage of a hot/warm architecture with shard allocation, whereby the most recent indices can live on more powerful nodes with better IOPS, and older indices can live on less powerful nodes with lower IOPS. When you no longer need to aggregate on such data, you can snapshot those indices into cold storage.
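
As a rough sketch of the allocation side (the box_type attribute name and its hot/warm values are a common convention rather than anything built in, and the index name assumes the daily naming discussed below): tag your nodes with node.attr.box_type, pin new indices to the hot tier, and relax the requirement as they age:

// create today's index on nodes tagged with node.attr.box_type: hot
elasticClient.CreateIndex("traffic-2017-12-16", c => c
    .Settings(s => s
        .NumberOfShards(1)
        .NumberOfReplicas(1)
        .Setting("index.routing.allocation.require.box_type", "hot")));

// days later, migrate the whole index to the less powerful warm nodes
elasticClient.UpdateIndexSettings("traffic-2017-12-16", u => u
    .IndexSettings(s => s
        .Setting("index.routing.allocation.require.box_type", "warm")));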

when I create an index with a date string I get the error Invalid NEST response built from a unsuccessful low level call on PUT: /15%2F12%2F2017. How should I define the index if I want to split into dates?

Don't use an index name containing /, as yours does; / is not a permitted character in Elasticsearch index names, which is why the PUT to /15%2F12%2F2017 fails. Use a format such as <year>-<month>-<day>, e.g. 2017-12-16. You'll almost certainly also want to take advantage of index templates to ensure that the correct mapping is applied to newly created indices; a sketch of both follows.
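
A sketch of both pieces, assuming NEST 5.x and a traffic- index prefix (the prefix and template name are my own choices): build the name with a literal yyyy-MM-dd format, which is culture-safe unlike ToShortDateString(), and register a template once so every new daily index picks up the TrafficSnapshot mapping automatically:

// in IndexSnapshot: the literal dashes avoid culture-dependent separators
var dateString = $"traffic-{timestamp:yyyy-MM-dd}";

// run once at startup: any index matching traffic-* gets this mapping
elasticClient.PutIndexTemplate("traffic-snapshots", t => t
    .Template("traffic-*")
    .Mappings(m => m
        .Map<TrafficSnapshot>(mm => mm.AutoMap())));

With the template in place, you can also drop the IndexExists/CreateIndex check on every call; indexing the first document of a day creates that day's index with the correct mapping.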

If I log the data in this format, is it then possible for me to perform a summation per IP address of the total data sent and total data received (over a date range which can be defined), or am I better off storing/indexing my data with a different structure before I progress further?

Yes. Consider whether it makes sense to have a collection of samples nested on one document, or to denormalize to a document per sample. Looking at the model, it looks like samples could logically be separate documents since the only shared data is timestamp. It's possible to aggregate on both top level and nested documents but there may be some queries more easily expressed with top level documents. I suggest experimenting with both approaches to see which better fits your use case. Also, take a look at the IP data type for indexing IP addresses, and also check out the ingest-geoip plugin for getting geo data from IP addresses.
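
For example, a sketch of "total bytes sent per source IP over a date range" against the nested mapping above (the traffic-* pattern follows the naming sketch earlier, and the .keyword suffix assumes the default string mapping; mapping Source and Destination as keyword or ip would let you drop it):

var response = elasticClient.Search<TrafficSnapshot>(s => s
    .Index("traffic-*")
    .Size(0) // aggregations only, no hits needed
    .Query(q => q
        .DateRange(r => r
            .Field(f => f.TimeStamp)
            .GreaterThanOrEquals("2017-12-01")
            .LessThan("2017-12-16")))
    .Aggregations(a => a
        .Nested("samples", n => n
            .Path(p => p.Samples)
            .Aggregations(na => na
                // .First() is NEST's idiom for addressing samples.source
                .Terms("by_source", t => t
                    .Field(f => f.Samples.First().Source.Suffix("keyword"))
                    .Aggregations(ta => ta
                        .Sum("bytes_sent", sm => sm
                            .Field(f => f.Samples.First().Bytes))))))));

Swapping Destination into the terms aggregation gives total data received per IP in the same way.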

My full code is below, a first stab tonight; pointers appreciated (or if I am going off track and may be better off using Logstash or similar, please do let me know).

There are many ways that you can approach this. If you're looking to do this using a client, I would suggest using the bulk API to index multiple documents per request and put a message queue in front of the indexing component, to provide a layer of buffering. Logstash can be useful here, especially if you need to perform additional enrichment and filtering. You may also want to look at Curator for index management.
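
As a rough sketch of the bulk side (reusing FromUnixTime from your helpers; the traffic- prefix is the same assumption as above):

public static void BulkIndexSnapshots(this ElasticClient elasticClient, IEnumerable<TrafficSnapshot> snapshots)
{
    var response = elasticClient.Bulk(b =>
    {
        foreach (var snapshot in snapshots)
        {
            // route each document to its daily index
            var indexName = $"traffic-{snapshot.TimeStamp.FromUnixTime():yyyy-MM-dd}";
            b.Index<TrafficSnapshot>(i => i
                .Index(indexName)
                .Id(snapshot.TimeStamp)
                .Document(snapshot));
        }
        return b;
    });

    // a 200 response can still contain per-item failures, so check both
    if (!response.IsValid || response.Errors)
    {
        // inspect response.ItemsWithErrors and retry/log as appropriate
    }
}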

Russ Cam