I am creating a new table in a kdb+ database as a partitioned splayed table (partitioned by date). The new table's schema has a column called CCYY, which has a lot of repeating values. I am unsure whether I should save it as char or symbol. My main goal is to use the least amount of memory.

Which one should I use? What are the benefits and disadvantages of saving repeating values as either a char array or a symbol in a partitioned splayed setup?

mollmerx
stretchr

2 Answers

It sounds like you should use symbol.

There's a guide to symbols/enumerations at http://www.timestored.com/kdb-guides/strings-symbols-enumeration#when-to-use which says:

Typically you should follow the guidelines:

  1. If the column is used in where clause equality comparisons e.g. select from t where sym in `A`B -> Symbol
  2. Short, often repeated strings -> Symbol
  3. Else Long, Non-repeated strings -> String
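
To illustrate the first guideline, here is a minimal sketch that assumes a small in-memory table t with hypothetical columns sym, str and px holding the same values once as symbols and once as strings. It only shows how the where clause differs; it is not a benchmark.

```q
/ Hypothetical table: the same codes stored as a symbol column and as a string column
t:([] sym:`USD`EUR`GBP`USD; str:("USD";"EUR";"GBP";"USD"); px:1 2 3 4f)

/ Symbol column: membership is a plain atomic comparison
bySym:select from t where sym in `USD`EUR

/ String column: each row is a char vector, so match with like,
/ once per candidate value, and combine the resulting boolean vectors
byStr:select from t where any str like/:("USD";"EUR")

/ Both queries select the same rows
bySym~byStr
```

The symbol form is shorter to write and compares atoms rather than char vectors, which is the main reason for the guideline.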
Ryan Hamilton

When evaluating whether or not to use symbol for a column, the cardinality of that column is key. The length of individual values matters less. If anything, longer values might be better off as symbols, because each distinct value is stored only once in the sym file, whereas it is repeated in full in a char vector. That consideration is pretty much moot if you compress your data on disk, though.
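
As a rough in-memory illustration of the cardinality point, the sketch below enumerates a column against a domain list, which is similar in spirit to what the sym file does for a splayed table on disk. The variable names (ccy, sym, e) and the four currency codes are assumptions made for the example.

```q
/ Hypothetical data: one million rows drawn from four currency codes
n:1000000
ccy:n?`USD`EUR`GBP`JPY

/ Cardinality of the column: the number of distinct values (at most 4 here)
count distinct ccy

/ Enumerate against a domain list: each distinct value is held once in sym,
/ and the enumerated column holds indices into it
sym:distinct ccy
e:`sym$ccy

/ The domain stays tiny however many rows there are
count sym

/ The original symbols can always be recovered from the enumeration
ccy~value e
```

With low cardinality the domain list stays small no matter how long the column is. With a high-cardinality column the domain grows with the data, which is when a sym file can become a problem.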

If your values are short enough, don't forget about the possibility of using .Q.j10, .Q.x10, .Q.j12 and .Q.x12, which encode short strings (up to 10 or 12 characters respectively) as longs and decode them again. This will use less space than a char vector. It also doesn't rely on a sym file, which in complex environments can save you from having to re-enumerate if you are, say, copying tables between HDBs whose sym files are not in sync.
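
A minimal sketch of the encoding approach, assuming the values are three-letter uppercase codes (the codes list is made up for the example). .Q.j12 is used because its alphabet, .Q.nA, covers digits and uppercase letters. The decoded strings come back at a fixed width of 12 characters, so the example trims them to the known length.

```q
/ Hypothetical three-letter codes; .Q.j12 expects characters from .Q.nA (0-9, A-Z)
codes:("USD";"EUR";"USD")

/ Encode each string as a long: one 8-byte value per row, no sym file involved
enc:.Q.j12 each codes

/ Decode back to strings; the result is fixed-width (12 chars), left-padded
dec:.Q.x12 each enc

/ Keep only the last three characters to recover the original codes
trimmed:-3#'dec
codes~trimmed
```

Because the encoded column is a plain long vector, equal strings map to equal longs, so equality filters can be run on the encoded values directly (for example, comparing against .Q.j12 "USD").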

If space is a concern, always compress the data on disk.
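
One way to try this out is sketched below. It assumes a scratch file path (/tmp/ccy_compress_demo is made up) and uses the .z.zd default-compression setting with the built-in q IPC algorithm; the parameters are only an example and should be tuned for real data.

```q
/ Assumption: a scratch file path; adjust to your environment
f:`:/tmp/ccy_compress_demo

/ Default compression for subsequent writes:
/ logical block size 2 xexp 17 bytes, algorithm 1 (q IPC), level 0.
/ Algorithm 2 (gzip, levels 0-9) is a common alternative where zlib is available.
.z.zd:17 1 0

/ Write a highly repetitive char vector
f set 1000000#"USD"

/ -21! reports compression statistics for a file
s:-21!f
s`compressedLength`uncompressedLength

/ Remove the default so later writes in this session are not compressed
system "x .z.zd"
```

On repetitive data like this the compressed length should come out well below the uncompressed length, which is why the char-versus-symbol size difference matters less once compression is on.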

mollmerx