
I have a somewhat unique problem that looks similar to the one discussed here:

https://news.ycombinator.com/item?id=8368509

I have a high-speed traffic analysis box that is capturing at about 5 Gbps, and a C++ program picks out specific packets from this stream to save in some format. Each day there will probably be 1-3 TB written to disk. Since it's network data, it's all time series down to the nanosecond level, but it would be fine to save it at second or millisecond resolution and have another application sort out the embedded higher-resolution timestamps afterwards. My problem is deciding which format to use. My requirements are:

  1. Be able to write to disk at about 50 MB/s continuously with several different timestamped parameters.
  2. Be able to export chunks of this data into MATLAB (HDF5).
  3. Query this data once or twice a day for analytics purposes.

Another nice-to-have that isn't a hard requirement:

  1. There will be 4 of these boxes running independently, and it would be nice to query across all of them and combine the data if possible. I should mention that all 4 boxes are in physically different locations, so there is some overhead in sharing data.

The second requirement is something I cannot change because of legacy applications, but I think the first is more important. The queries I may want to export into MATLAB look like "pull metric X between time Y and Z", so the result would eventually have to end up in HDF5. There is an external library called MatIO that I could use to write MATLAB files if needed, but it would be even better if there weren't a translation step. I have read the entire thread mentioned above, and several options stand out: kdb+, Cassandra, PyTables, and OpenTSDB. All of these seem to do what I want, but I can't really figure out how easy it would be to get the data into the MATLAB HDF5 format, and whether any of them would make it harder than the others.
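
For concreteness, here is a rough sketch of what writing plain HDF5 directly from the C++ capture program might look like: one chunked, extendible 1-D dataset of {nanosecond timestamp, value} records per metric. The file name, dataset path, and chunk size are placeholders, not anything settled. MATLAB can read a file like this directly with h5read, which would avoid the MatIO translation step entirely.

    // Rough sketch: append-friendly HDF5 layout for one timestamped metric.
    // Uses the plain HDF5 C API; names and chunk size are placeholders.
    #include <hdf5.h>
    #include <cstdint>

    struct Sample {
        std::uint64_t ts_ns;   // capture timestamp, nanoseconds since epoch
        double        value;   // the metric being recorded
    };

    int main() {
        hid_t file = H5Fcreate("metrics.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        // Compound record type matching the Sample struct.
        hid_t rec = H5Tcreate(H5T_COMPOUND, sizeof(Sample));
        H5Tinsert(rec, "ts_ns", HOFFSET(Sample, ts_ns), H5T_NATIVE_UINT64);
        H5Tinsert(rec, "value", HOFFSET(Sample, value), H5T_NATIVE_DOUBLE);

        // 1-D dataset that starts empty and can grow without bound.
        hsize_t dims[1] = {0}, maxdims[1] = {H5S_UNLIMITED};
        hid_t space = H5Screate_simple(1, dims, maxdims);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        hsize_t chunk[1] = {65536};              // placeholder chunk size
        H5Pset_chunk(dcpl, 1, chunk);
        hid_t dset = H5Dcreate2(file, "/metric_x", rec, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        // Append a buffered batch: extend the dataset, then write into the
        // newly added region via a hyperslab selection.
        Sample batch[2] = {{1424870000000000000ULL, 1.5},
                           {1424870000000000123ULL, 2.5}};
        hsize_t total = 0, n = 2;
        hsize_t newsize[1] = {total + n};
        H5Dset_extent(dset, newsize);
        hid_t fspace = H5Dget_space(dset);
        hsize_t start[1] = {total}, count[1] = {n};
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, nullptr, count, nullptr);
        hid_t mspace = H5Screate_simple(1, count, nullptr);
        H5Dwrite(dset, rec, mspace, fspace, H5P_DEFAULT, batch);

        H5Sclose(mspace); H5Sclose(fspace);
        H5Dclose(dset);   H5Pclose(dcpl);
        H5Sclose(space);  H5Tclose(rec);
        H5Fclose(file);
        return 0;
    }

On the MATLAB side, h5read('metrics.h5', '/metric_x') (optionally with start/count arguments) pulls ranges back; a "metric X between Y and Z" query would still need to locate the right index range from the timestamps first.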

If anyone has experience doing something similar, it would be a big help. Thanks!

    Just in case you weren't aware, kdb+ isn't cheap (yes, there is a free 32bit version but it has a 4gb memory limit which would make it very difficult to achieve what you're looking for). All of your other options seem to be free/open-source. If money isn't an issue, then kdb+ is definitely the best option for any time-series dataset. – terrylynch Feb 25 '15 at 13:44
  • What do you mean by "isn't cheap"? I've heard that several times, but haven't seen numbers. I submitted two requests for their sales team to contact me, but haven't heard back yet. – user3324172 Feb 26 '15 at 15:01
  • Not really my place to quote numbers but I'm sure their sales team will get back to you. Ask about the starter pack too, that will be a bit cheaper (but again a bit more restricted) – terrylynch Feb 26 '15 at 16:31

1 Answer


A kdb+ tickerplant is certainly capable of capturing data at that rate; however, there are several things you need to get right (whatever solution you pick):

  • Do the machine(s) capturing the data have enough cores? It's best to taskset a tickerplant, for example, to a core that nothing else will contend with (see the pinning sketch after this list).
  • Similarly with disk: use SSDs, and make sure there is no contention on the bus.
  • Separate the workload: write different types of data (maybe packets can be partitioned by source or stream?) to different CPUs/disks/tickerplant processes.
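
On the tasksetting point, a minimal sketch (assuming Linux; the core number is just an example) of pinning the capture process to a reserved core from inside the C++ program itself; the same effect can be had externally with taskset -c:

    // Minimal sketch: pin this process to a dedicated core (Linux-specific).
    // Core 3 is an arbitrary example; pick a core nothing else contends with.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <sched.h>
    #include <cstdio>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);                                    // reserved capture core
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  // 0 = calling process
            perror("sched_setaffinity");
            return 1;
        }
        // ... run the capture / feed loop here ...
        return 0;
    }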

Basically, there are lots of ways you can cut this. I can say, though, that with the appropriate hardware kdb+ could do the job. However, given that you want HDF5 in the end, it's probably even better to have a simple process capturing the data and writing/converting it to disk on the fly.

Manish Patel
  • Thanks! The cores are not an issue and we can go as big or as small as necessary, but I'm leaning towards a 12-core Xeon. Also, the hardware should be able to do some level of filtering on the packets. – user3324172 Feb 26 '15 at 15:02