
I have a partitioned table in my HDB that includes a column containing large lists of floats (at most 400 floats per element), e.g. each element looks like

 (100.0 1.0 ...)

When trying to select on this column from days where there are particularly high numbers of rows, I get an error saying

'./2015.02.07/table/column# Cannot allocate memory

The same error arises from a query like:

select column[;0] from table where date=2015.02.07

even though on days with fewer rows this query returns the first value of each element in the column.

Is there a way to stream this column in a select to decrease the memory requirements of holding the whole column in memory for a large day?

EDIT

.Q.ind on large days fails with the same error.

i.e. given that I can work with 2015.02.01 but not 2015.02.02:

.Q.ind[select from table where date=2015.02.01;enlist 1]

is fine but

.Q.ind[select from table where date=2015.02.02;enlist 1]

fails with

{0!$[#.Q.pm;p3;(?).]@[x;0;p1[;y;z]]}
'./2015.02.10/table/column2#: Cannot allocate memory
@
.[?]
(+`time`sym`column1`column2!`:./2015.02.02/table;();0b;()) 

I should note that I am using the free 32-bit version.

nightTrevors

3 Answers


Nested columns make querying in the usual way difficult, as the `#` file also needs to be loaded into memory (even with a `[;0]`).

Your best bet is to map a date partition in with a select, and then query within that chunk by chunk, e.g. a million rows at a time (or whatever is sensible given the size of the nested floats).

Perhaps also consider 32-bit floats (reals), if some decimal accuracy can be sacrificed.
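If sacrificing precision is acceptable, the conversion is a one-line cast. A minimal in-memory sketch (table and column names are illustrative, not from the original post):

```
q)/ cast each nested float ("f", 8 bytes) row to reals ("e", 4 bytes),
q)/ roughly halving the column's footprint
q)tab:([] col1:(101 102 103f;104 105f;106 107 108 109 110f))
q)tab:update col1:"e"$'col1 from tab
```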

EDIT

After the comments below, I think the best approach is to go through each partition a number of rows at a time with .Q.ind
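The chunked approach could be sketched as follows — a hedged illustration only, where `table`, `column` and the chunk size are stand-in names, and .Q.cn (the per-partition row counts) is used to turn partition-local positions into the global indices that .Q.ind expects:

```
q)sz:100000                                 / rows per chunk (illustrative)
q)c:.Q.cn table                             / row count of each date partition
q)j:date?2015.02.02                         / position of the problem date
q)o:(sums 0,-1_c)j                          / global index where that date starts
q)r:raze {select column[;0] from .Q.ind[table;x]} each sz cut o+til c j
```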

Manish Patel
  • Can you please expand on your fix? I cannot access rows with the index format until the whole column for the day is in memory – nightTrevors Feb 09 '15 at 22:33
  • `select from table where date=xxx` does not pull the whole table into memory, it gets mapped. You can then use the result with `i` or, better still, use .Q.ind – Manish Patel Feb 10 '15 at 18:56
  • I hit the same error when selecting this way. Explicitly, my error: {0!$[#.Q.pm;p3;(?).]@[x;0;p1[;y;z]]} './2015.02.10/table/column2#: Cannot allocate memory @ .[?] (+`time`sym`column1`column2!`:./2015.02.10/table;();0b;()) – nightTrevors Feb 11 '15 at 01:13
  • Even if you select 1 row? It's a matter of finding that threshold given your memory limitation – Manish Patel Feb 11 '15 at 13:59
  • Yes even if I select one row from a large day using .Q.ind. I am using 32bit, is that the issue? – nightTrevors Feb 11 '15 at 14:32
  • Yes - that nested vector must be huge if it can't allocate enough even 1 row. I would rethink how you model and store this data. – Manish Patel Feb 12 '15 at 10:11
  • Yes I'm absolutely rethinking it, but each row is approximately the same size, which is why I'm confused about why the date it is from matters. Surely if I can allocate the memory for one row I can allocate it for any row. Each row contains approximately 400 floats – nightTrevors Feb 12 '15 at 13:53
  • What happens when you simply do `.Q.ind[table;enlist 0j]` (i.e. no select) – Manish Patel Feb 12 '15 at 18:51
  • also, what sort of row counts are we talking about here? – terrylynch Feb 12 '15 at 18:57
  • @user2393012 `.Q.ind[market;enlist 0]` works fine, and @terrylynch the full table (about 30 days) is close to 9 million rows. Each row has a sym column, two float columns, and two nested list of floats columns, each with at most 400 floats – nightTrevors Feb 12 '15 at 19:45
  • .Q.ind works fine but only up until I get to dates that are too large to store into memory. So if 2015.02.02 is too large, and starts on index 1000, .Q.ind[table;enlist 1000] fails – nightTrevors Feb 13 '15 at 00:18
  • I see... the reason is that it has to load in the *entire* `#columnName` file for that partition because it doesn't know where to seek and read, unlike an ordinary column file, which is a simple vector. I can't think of a way round this because even when you do ``get `:columnName`` it'll read that `#` file in. – Manish Patel Feb 13 '15 at 11:13

I think this is all just a combination of the free 32-bit memory limitation, the fact that your row counts are possibly large, and the fact that something must (unavoidably) be pulled entirely into memory when retrieving data from a column: either the column file itself (in the non-nested case) or the nested-index file (in the nested case).

Another thing to consider is that kdb uses powers-of-two (buddy) memory allocation. Even if today's table only contains one more row than yesterday's, the memory requirements per column could double. Take a simple example:

In the free 32-bit version (Windows) you can create this many floats and it only uses ~1.07GB of memory:

q)\ts 134217726?1.0
3093 1073741952

However, try to generate one extra float and you hit a memory limit:

q)\ts 134217727?1.0
wsfull

So even a small difference in row count between one day and the next can be very significant if you're near a power-of-two allocation boundary.

--DISCLAIMER-- the following is hacky and is only intended for debugging!

You can actually try to access the data from the nested list manually, though you may still have memory issues here anyway.

Create a nested table and splay it

q)tab:([] col1:(101 102 103f;104 105f;106 107 108 109 110f;111 112f))
q)tab
col1
--------------------
101 102 103f
104 105f
106 107 108 109 110f
111 112f
q)
q)`:test/ set tab
`:test/

You can try to read in the indices from the nested-index file

q)2_first (enlist "j";enlist 8)1:`:test/col1
3 5 10 12

So the indices for splitting the full list of floats (the col1# file) are 3, 5, 10, etc.

Say I want the first 3 rows

q)myrows:3#2_first (enlist "j";enlist 8)1:`:test/col1
q)myrows
3 5 10

Then I know that I need the first 10 floats from the col1# file, split at indices 3 and 5. So I can read the col1# file partially and split it correctly:

q)(0,-1_myrows) cut raze (enlist "f";enlist 8)1:(`$":test/col1#";0;8*last myrows)
101 102 103f
104 105f
106 107 108 109 110f

But this is precisely what kdb does under the covers anyway, so I suspect that you'll still have trouble even reading in the nested-index file in the first place.

Check this debug/hack and see if you can partially read that way. But obviously it's not a long-term solution!
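The steps above could be wrapped into a single helper — a hedged sketch only (the function name is made up, and it inherits all the caveats of the hack):

```
q)/ read rows i0..i0+n-1 of a splayed nested float column without
q)/ loading the whole # file; col is a file handle like `:test/col1
q)readNested:{[col;i0;n]
    ix:2_first(enlist"j";enlist 8)1:col;         / row-end offsets
    s:$[i0=0;0;ix i0-1];                         / first float needed
    f:raze(enlist"f";enlist 8)1:(`$string[col],"#";8*s;8*(ix i0+n-1)-s);
    (0,-1_(ix i0+til n)-s)cut f}
```

e.g. `readNested[`:test/col1;1;2]` should give the second and third rows of the example table.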

terrylynch
  • Thank you for your help, can you please expand on how the two functions you used under disclaimer work? I was able to read a few rows manually from a previously inaccessible date, but I'm not quite sure what's happening. Why enlist 8? Why do you read the column with no "#" first? What are the 3 variables you give after 1: in the second function? – nightTrevors Feb 13 '15 at 00:29
  • 1
    Firstly, you may already be aware that when you have a nested column of say floats, kdb stores the entire *flattened* list of floats on disk as the "#" file and then stores the indices on which to split the list in the non-# file. So to retrieve data kdb goes to the non-# file first to find out how to split the list, then grabs chunks from the # file based on those indices and then returns the relevant nested lists. I was manually replicating that above. For your questions http://code.kx.com/wiki/Reference/OneColon ....."8" because I'm reading longs/floats, 3 variables are (file;offset;length) – terrylynch Feb 13 '15 at 14:12
  • 1
    EDIT - see corrections/clarifications in my original post – terrylynch Feb 13 '15 at 16:45
  • Thank you for this very complete solution! I'm writing a utility that makes the hackiness a little safer – nightTrevors Feb 14 '15 at 21:10

Just to give my 2 cents on this: I had a similar error, but with a 64-bit instance.

  • I suspected that the memory needed to be defragmented, as the instance had been running for almost a year.

  • Bouncing the instance solved the issue and released a lot of virtual memory.

Pork Chop