The Problem
My task is to analyze two 1-dimensional matrices. The first one is the data matrix, holding around 500 million floats. The second one is the query matrix, holding around 4000 floats.
The goal at the end is to find (sub-)sequences in data and query which are "similar". Matching (sub-)sequences are of equal length but their positions are independently of each other. Needless to say the result must be calculated with lightning speed.
First Try
To achieve this goal I've tried dynamic time warping first but failed as it searches for the whole query in total. Nevertheless, it can be very fast in finding matches like this link shows.
Actual Try
At this try I want to organize the data matrix in two ways. First as a std::multiset<float>
and secondly as an std::vector<float>
. I've choosen the boost::multi_index
container to do that:
typedef multi_index_container <
float
, indexed_by <
ordered_non_unique<identity<float>> // similar to std::multiset
, sequenced<> // similar to std::vector
>
> Floats;
The query matrix is a simple one:
std::vector<float>
Actual Idea
In human words I want to do following:
- take the n_th element (qn) of the query
- search for a set of elements (S_qn) in data
- within a range (e.g. +- 0.001) around qn
- via ordered_non_unique index (because it's fast)
- go to elements in data (N_S_qn)
- which are next to the elements of S_qn
- without searching
- but via sequence index (because it's fast)
- store the results (see example output)
- qn = qn+1
- go to step 1.
- find (sub-)sequences in results (see explanation)
Example Output
q_pos q_val d_pos d_val
0 0.00221354 416 0.000512153
1 0.000512153 413 0.00480252
1 0.000512153 419 0
1 0.000512153 418 0.00046875
1 0.000512153 417 0.000494792
1 0.000512153 415 0.00221354
2 0.000494792 413 0.00480252
2 0.000494792 419 0
2 0.000494792 418 0.00046875
2 0.000494792 417 0.000494792
2 0.000494792 415 0.00221354
3 0.241994 289 0.151168
3 0.241994 390 0.543963
3 0.241994 374 0.237778
3 0.241994 285 0.0940256
3 0.241994 153 0.136803
4 0.0940256 139 0.0823915
4 0.0940256 135 0.102841
4 0.0940256 239 0.0704384
4 0.0940256 145 0.11661
4 0.0940256 286 0.19742
4 0.0940256 193 0.136753
4 0.0940256 190 0.0665234
4 0.0940256 130 0.111161
4 0.0940256 357 0.156578
5 0.19742 350 0.202847
5 0.19742 287 0.301233
6 0.00221354 416 0.000512153
7 0.000512153 413 0.00480252
7 0.000512153 419 0
7 0.000512153 418 0.00046875
7 0.000512153 417 0.000494792
7 0.000512153 415 0.00221354
8 0.000494792 413 0.00480252
8 0.000494792 419 0
8 0.000494792 418 0.00046875
8 0.000494792 417 0.000494792
8 0.000494792 415 0.00221354
Explanation
The output shows three short subsequences
0 0.00221354 416 0.000512153
1 0.000512153 417 0.000494792
2 0.000494792 418 0.00046875
3 0.241994 285 0.0940256
4 0.0940256 286 0.19742
5 0.19742 287 0.301233
6 0.00221354 416 0.000512153
7 0.000512153 417 0.000494792
8 0.000494792 418 0.00046875
The subsequences differ from the other ones. E.g. The number right to 286
is equal to the number right to 5
AND 286-predecessor(285)==1 which is equal to 4-predecessor(3)==1
Code
#include <iostream>
#include <vector>
#include <deque>
#include <iterator>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/sequenced_index.hpp>
#include <boost/multi_index/identity.hpp>
using namespace boost::multi_index;
typedef multi_index_container <
float
, indexed_by <
ordered_non_unique<identity<float>>
, sequenced<>
>
> Floats;
Floats data {
0.0880295, 0.226261, 0.391886, 0.222645, 0.245799, 0.217122,
0.346469, 0.247485, 0.13814, 0.097832, 0.146678, 0.125512,
0.144755, 0.222828, 0.223468, 0.0831966, 0.0876042, 0.256921,
0.127172, 0.266851, 0.144833, 0.120677, 0.0984158, 0.149169,
0.193381, 0.189253, 0.0868424, 0.0873459, 0.087309, 0.0881337,
0.0682639, 0.0847938, 0.0902474, 0.0776411, 0.0660243, 0.064056,
0.180571, 0.0884874, 0.167904, 0.116372, 0.11505, 0.125082,
0.0733008, 0.0693511, 0.07523, 0.0887326, 0.103388, 0.192474,
0.146061, 0.193763, 0.148796, 0.126345, 0.310319, 0.292378,
0.15633, 0.238895, 0.314837, 0.103016, 0.229716, 0.264258,
0.214961, 0.166955, 0.282539, 0.375046, 0.117567, 0.202693,
0.159527, 0.106814, 0.114234, 0.109835, 0.125579, 0.113058,
0.105328, 0.07727, 0.0995747, 0.0735286, 0.217444, 0.12372,
0.113071, 0.136882, 0.223724, 0.0816884, 0.0726367, 0.0786393,
0.0903125, 0.14577, 0.104147, 0.0639258, 0.0664323, 0.0600347,
0.204054, 0.208231, 0.275519, 0.0914431, 0.0875499, 0.075549,
0.0773872, 0.0961089, 0.200859, 0.156671, 0.182509, 0.194792,
0.143105, 0.143895, 0.118826, 0.0815885, 0.0691775, 0.0650543,
0.077322, 0.167313, 0.0872938, 0.118657, 0.0793251, 0.235447,
0.175475, 0.108485, 0.125725, 0.0840929, 0.133828, 0.136571,
0.159488, 0.161682, 0.128987, 0.105339, 0.0722765, 0.0852756,
0.077602, 0.0875152, 0.102541, 0.0948633, 0.111161, 0.114935,
0.112914, 0.0963824, 0.0935395, 0.102841, 0.0886719, 0.107177,
0.093316, 0.0823915, 0.0716254, 0.11276, 0.110408, 0.0929405,
0.0939779, 0.11661, 0.194408, 0.302181, 0.144759, 0.475888,
0.341254, 0.259182, 0.242298, 0.136803, 0.260516, 0.301554,
0.282263, 0.193691, 0.319655, 0.265002, 0.174405, 0.213088,
0.151474, 0.117726, 0.123275, 0.177355, 0.28036, 0.220469,
0.116304, 0.171237, 0.0682943, 0.0786306, 0.0988672, 0.113257,
0.0676953, 0.0730534, 0.0844835, 0.0783919, 0.0744401, 0.118225,
0.0797266, 0.141914, 0.0825282, 0.0807509, 0.104989, 0.0984028,
0.190592, 0.13076, 0.141476, 0.0946745, 0.0665234, 0.0721311,
0.0941819, 0.136753, 0.0672287, 0.0794444, 0.0927517, 0.0761372,
0.0795877, 0.107576, 0.0785113, 0.0746376, 0.078342, 0.0666298,
0.0685286, 0.0639453, 0.0668533, 0.0527062, 0.0772743, 0.069796,
0.0708854, 0.0855469, 0.0715712, 0.0567209, 0.0524653, 0.101489,
0.12847, 0.0838455, 0.180063, 0.153225, 0.0834701, 0.0778407,
0.0876563, 0.0802648, 0.081122, 0.153535, 0.130744, 0.0985937,
0.111478, 0.130879, 0.105336, 0.12355, 0.0956402, 0.131895,
0.0957661, 0.0646072, 0.0774067, 0.0981641, 0.0936827, 0.0704384,
0.0611675, 0.0550391, 0.0525434, 0.186207, 0.186398, 0.121753,
0.131758, 0.138077, 0.131426, 0.200039, 0.281137, 0.203429,
0.178511, 0.25569, 0.155723, 0.148943, 0.1202, 0.10926,
0.183735, 0.183743, 0.255571, 0.941289, 0.89924, 0.214197,
0.17352, 0.300163, 0.186087, 0.153168, 0.156487, 0.272346,
0.112641, 0.330271, 0.177431, 0.22755, 0.247396, 0.297569,
0.229143, 0.178937, 0.377181, 0.303188, 0.107593, 0.0850846,
0.0823806, 0.189846, 0.241994, 0.0940256, 0.19742, 0.301233,
0.241302, 0.151168, 0.387852, 0.152539, 0.0814909, 0.100688,
0.12049, 0.100516, 0.110681, 0.160803, 0.192363, 0.21661,
0.126094, 0.463813, 0.422639, 0.71377, 0.394594, 0.190033,
0.744705, 0.646013, 0.534883, 0.327589, 0.468882, 0.210421,
0.804282, 0.477276, 0.686155, 0.745506, 0.678203, 0.557526,
0.585575, 0.142582, 0.194462, 1.37214, 0.838103, 0.895616,
0.902333, 0.437261, 0.529586, 0.28717, 0.16661, 0.154312,
0.409946, 0.574219, 0.617726, 0.600436, 0.439466, 0.302841,
0.315707, 0.188863, 0.116543, 0.155879, 0.201241, 0.245734,
0.181951, 0.265951, 0.286096, 0.527272, 0.564462, 0.60875,
0.173657, 0.196421, 0.202847, 0.261918, 0.257372, 0.247153,
0.175794, 0.125009, 0.0950022, 0.156578, 0.413086, 0.401313,
0.171013, 0.164028, 0.116259, 0.134299, 0.158809, 0.135697,
0.161725, 0.100851, 0.126265, 0.114529, 0.490764, 0.474581,
0.27171, 0.241534, 0.237778, 0.205924, 0.263615, 0.378711,
0.237318, 0.522326, 0.604104, 0.578088, 0.318513, 0.418655,
0.467813, 0.584076, 0.258585, 0.56207, 0.239733, 0.241463,
0.543963, 0.595384, 0.470145, 0.415202, 0.341942, 0.593281,
0.370286, 0.250712, 0.209967, 0.320931, 0.476341, 0.239861,
0.266044, 0.140451, 0.359811, 0.283312, 0.330161, 0.205534,
0.273644, 1.87506, 0.0116385, 0.0199392, 0.000299479,0.00480252,
0.000601128,0.00221354, 0.000512153,0.000494792,0.00046875
};
std::vector<float> query {
0.00221354, 0.000512153, 0.000494792, 0.241994, 0.0940256,
0.19742, 0.00221354, 0.000512153, 0.000494792
};
int main() {
Floats::nth_index<0>::type::iterator lower;
Floats::nth_index<0>::type::iterator upper;
Floats::nth_index<1>::type::iterator next;
std::cout << "q_pos q_val d_pos d_val";
for (auto i=query.begin(),j=query.end();i<j;++i) {
lower=data.get<0>().lower_bound(*i-0.001);
upper=data.get<0>().lower_bound(*i+0.001);
while (lower!=upper) {
next=++(data.project<1>(lower++));
std::cout
<< "\n" << std::distance(query.begin(),i)
<< " " << *i
<< " " << std::distance(get<1>(data).begin(),next)
<< " " << *next
;
}
}
// TODO find an intelligent way through output to detect sequences
}
Question
Is the show implementation the fastest way to achieve my goal??