2

Consider the following matrix, where the first column is the chronological index and the second and third columns contain the data.

Data=

1     5     100       
2     10    100      
3     5     300    
4     15    200     
5     5     500    
6     15    0    
7     10    400    
8     5     300   
9     10    200    
10    10    0    
11    5     300
12    10    100
13    15    1000    
...   ...   ...
T     ...   ...

These data may be thought as orders in an auction or exchange for a single good, where at each time point t (column 1) a new order with "price" (column 2) arrives and the total amount of units demanded at this price at this particular time point is in the column 3. So, for instance, consider an auction where buyers submit their bids for a good, then the data above means:

Row 1: at time t=1 the new order arrives, the price of this order is 5 and the total number of units demanded at this price is 100.

Row 2: at time t=2 ... order with price 10 - the total demand at price 10 is 100

Summary: at time t=2 there are 100 units demanded at price 5 and 100 at price 10

Row 3: at time 3 ... order with price 5 which demands additional 200 units, so the total number of units demanded at price 5 is 300

Summary: at time t=3 there are 300 units demanded at price 5 and 100 at price 10

Row 4: t=4 ... order with price 15 for 200 units, the total demand at price 15 is 200

Summary: t=4 there are 300 units demanded at price 5, 100 at 10, and 200 at 15

...

Summary: t=5 there are 500 units demanded at price 5, 100 at 10, and 200 at 15

Row 6: t=6 the price is 15 but there are 0 units in the third column, meaning that the order was canceled, and there is no demand at price 15

Summary: t=6 there are 500 units demanded at price 5 and 100 at 10

I want to allocate the data to the following two Tx3 matrices, where each row represents the "Summary:" line above:

           [Price=5][Price = 10][Price = 15]
[Time = 1]       5      NaN      NaN
[Time = 2]       5      10       NaN
[Time = 3]       5      10       NaN
[Time = 4]       5      10       15
[Time = 5]       5      10       15
[Time = 6]       5      10       NaN
[Time = 7]       5      10       NaN
[Time = 8]       5      10       NaN
[Time = 9]       5      10       NaN
[Time = 10]      5      NaN      NaN 
[Time = 11]      5      NaN      NaN 
[Time = 12]      5      10       NaN
[Time = 13]      5      10       15
      ...       ...     ...      ...
[Time = T]      ...     ...      ...

           [Price=5][Price = 10][Price = 15]
[Time = 1]       100      NaN       NaN
[Time = 2]       100      100       NaN
[Time = 3]       300      100       NaN
[Time = 4]       300      100       200
[Time = 5]       500      100       200
[Time = 6]       500      100       NaN
[Time = 7]       500      400       NaN
[Time = 8]       300      400       NaN
[Time = 9]       300      200       NaN
[Time = 10]      300      NaN       NaN 
[Time = 11]      300      NaN       NaN 
[Time = 12]      300      100       NaN
[Time = 13]      300      100       1000
      ...        ...      ...       ...
[Time = T]       ...      ...       ...

Basically, the two matrices above allow me to get the "prices" and "units" for any point of "time". Note, that each "price" may have discontinuities which appear once the "units" are 0 - so "price=15" appears only at t=4 and exists for two periods only: t=4, t=5 (the order is canceled at t=6) further reappearing at t=13.

I proceed as follows:

1.) Sort the Data matrix by column 2 ("prices") and get the index for unique values in column 2:

Data=sortrows(Data, [2 1]);
[~,~, IndexPrice]=unique(Data(:,2));

Data=                      IndexPrice=

1     5     100                   1
3     5     300                   1
5     5     500                   1
8     5     300                   1 
11    5     300                   1
2     10    100                   2
7     10    400                   2
9     10    200                   2
10    10    0                     2
12    10    100                   2
4     15    200                   3
6     15    0                     3
13    15    1000                  3
...   ...   ...                  ...
T     ...   ...                  ...

2.) Allocate values to the two output matrices:

OutputPrice=NaN(size(Data,1), max(IndexPrice));             %Preallocate matrix
for j=1:max(IndexPrice)                                     %Go column-wise
    TempData=Data(IndexPrice==j,:);                         %Submatrix for unique "price"
    for i=1:size(TempData,1)
        if TempData(i,3)~=0                                 %Check for discontinuity (0 in col 3)
            OutputPrice(TempData(i,1):end,j)=TempData(1,2); %Fill wiht values
        else
            OutputPrice(TempData(i,1):end,j)=NaN;           % If there is 0 fill with NaNs
        end
    end
end

OutputUnits=NaN(size(Data,1), max(IndexPrice)); 
for j=1:max(IndexPrice)
    TempData=Data(IndexPrice==j,:);
    for i=1:size(TempData,1)
        if TempData(i,3)~=0
            OutputUnits(TempData(i,1):end,j)=TempData(i,3); %The "units" change in contrast to the "prices"
        else
            OutputUnits(TempData(i,1):end,j)=NaN;
        end
    end
end

The key point, of course, is the performance of the code - it seems to be a "brute force" approach to the problem. I would appreciate any suggestions about more efficient ways to solve it.

Thorbjörn
  • 125
  • 6
  • 1
    Hm. You didn't actually explain how one would get to your final outputs from reading `Data`. It seems it has something to do with `NaN`s as soon as some value is zero. You should probably elaborate more using text instead of these huge matrix examples. Also: Why is `Time=13`, `Price=10` `NaN` in the first matrix, but not `NaN` in the other? What does the line `1 5 100` mean? At time t=1 there are 100 goods of type ???, each for the price of 5? – knedlsepp Feb 05 '15 at 00:07
  • Thanks, I've elaborated more on the meaning of the data making an auction example. There was a wrong entry in the first matrix - corrected it. Once a zero appears in the third column at some time t, it means that the 'order' disappears, so the columns corresponding to the price in output matrices are NaN, until a new order arrives later. – Thorbjörn Feb 05 '15 at 01:00
  • Ok. I guess this question could possibly be asked a bit simpler, but you show a lot of effort and this deserves an answer! +1 – knedlsepp Feb 05 '15 at 01:07
  • To summarize: The entries of the matrices should be the price resp. demands if there was a last demand and it was greater than zero and should be `NaN` if there was no last demand (in the beginning) or the last demand was zero. – knedlsepp Feb 05 '15 at 01:15
  • Exactly, and in the second matrix the row-entries are total demands. – Thorbjörn Feb 05 '15 at 01:22

1 Answers1

1

I don't think this version is a lot clearer than yours, but it is log-linear rather than quadratic, so it will show performance improvements for large datasets. The idea is for each price to build a vector, that has the same number of rows as Data and for each entry will give the value of the last time something at this price was ordered. This is the line posOfDemands(idxLastDemand(hasLastDemand)). [By the way: This is actually the answer to one of your earlier questions]. In your example for price==5 this will yield the vector [1 1 3 3 5 5 5 8 8 8 11 11 11]. Using this vector we get the last demands/prices and then will just have to replace them with NaN if they are zero:

%%// Rename the variables
prices = Data(:,2);
demands = Data(:,3);
%%// Find number of different prices
uniquePrices = unique(prices);
nUniquePrices = length(uniquePrices);
nData = size(prices,1);
[OutputUnits, OutputPrices] = deal(zeros(nData,nUniquePrices));
%%// For each price do:
for i = 1:nUniquePrices
    %%// Find positions of all demands
    posOfDemands = find(prices==uniquePrices(i));
    idxLastDemand = cumsum(prices==uniquePrices(i));
    hasLastDemand = idxLastDemand~=0;
    %%// Get the values of the last demands/prices
    OutputUnits(hasLastDemand,i) = demands(posOfDemands(idxLastDemand(hasLastDemand)));
    OutputPrices(hasLastDemand,i) = prices(posOfDemands(idxLastDemand(hasLastDemand)));
end
%%// Convert 0s to NaNs
OutputPrices(OutputUnits == 0) = NaN;
OutputUnits(OutputUnits == 0) = NaN;

Vectorized version:

Here is a vectorized version for even more speed:

prices = Data(:,2);
demands = Data(:,3);
uniquePrices = unique(prices);
nUniquePrices = length(uniquePrices);
%%// Introduce leading demands of value 0 to get the zeros in the beginning
isDemanded = [true(1,nUniquePrices); bsxfun(@eq, prices, uniquePrices.')];
demands = [0; demands];
%%// Find positions of all demands
[rowOfDemands,ignore_] = find(isDemanded);
idxLastDemand = reshape(cumsum(isDemanded(:)),[],nUniquePrices);
%%// Get the values of the last demands/prices
OutputUnits = demands(rowOfDemands(idxLastDemand(2:end,:)));
OutputUnits(OutputUnits == 0) = NaN;
OutputPrices = ones(size(OutputUnits,1),1)*uniquePrices(:).';
OutputPrices(isnan(OutputUnits)) = NaN;
Community
  • 1
  • 1
knedlsepp
  • 6,065
  • 3
  • 20
  • 41
  • There is a typo in `OutputPrices(OutputPrices == 0) = NaN`, it should be `OutputPrices(OutputUnits == 0) = NaN`. I have also made some performance comparisons with artificial data: let T be the number of rows in the input matrix, and P be the number of unique prices in the second column, then the output matrices are TxP. In the case of 200000x200, the results are 3.5 seconds for your code and 99 for mine. For 400000x400 - 14.5 and 494. Thanks a lot! – Thorbjörn Feb 06 '15 at 17:31
  • Corrected the NaN-part. I guess it could be improved a bit if we got rid of the for loop using vectorization, but I'm not sure if it'd be worth the hassle. :-) – knedlsepp Feb 06 '15 at 17:53
  • Tested the vectorized version. On average it 25-30% faster than your previous suggestion - I wouldn't call it 'a bit' :-) Interstingly, though, it seems to consume slightly more RAM. Btw, in the case of large input data memory becomes an issue; So I am thinking about getting only 'N' highest prices with corresponding demands for each time point, in order to avoid the complete expansion of the input data across (Time)x(unique prices) dimensions - but this is a completely another problem. – Thorbjörn Feb 09 '15 at 01:32