The Q for Mortals chapter on data normalisation (i.e. the task of eliminating duplication in a list) recommends using enumerations to find the distinct values in a list, since it's faster to traverse integers than symbols of variable length.

u:`g`ibm`intl`msft / unique list of tickers
v:1000000?u / list with duplicate tickers
k:u?v / positions in u
\t:10 distinct v / performing distinct on symbols 10 times and timing 
\t:10 distinct k / performing distinct on positions 10 times and timing 

However, I find that distinct v is much faster than distinct k, which is not in line with what was promised.

Thanks for the help.

Thomas Smyth - Treliant
tenticon

1 Answer

Enumeration is usually used for data saved to disk, to aid with compression etc. That's where you will see the bigger performance gain.

KDB+ 3.5 2017.04.06 Copyright (C) 1993-2017 Kx Systems

Welcome to kdb+ 32bit edition
For support please see http://groups.google.com/d/forum/personal-kdbplus
Tutorials can be found at http://code.kx.com/wiki/Tutorials
To exit, type \\
To remove this startup msg, edit q.q
q)u:`g`ibm`intl`msft / unique list of tickers
q)v:1000000?u / list with duplicate tickers
q)k:`u$v //enumerate v against u
q)k
`u$`g`g`intl`ibm`intl`ibm`intl`msft`intl`ibm`g`msft`ibm`intl`intl`ibm`g`ibm`i..
q)save `:k
`:k
q)save `:u
`:u
q)save `:v
`:v
q)\\

KDB+ 3.5 2017.04.06 Copyright (C) 1993-2017 Kx Systems

Welcome to kdb+ 32bit edition
For support please see http://groups.google.com/d/forum/personal-kdbplus
Tutorials can be found at http://code.kx.com/wiki/Tutorials
To exit, type \\
To remove this startup msg, edit q.q
q)u:get `:u
q)\ts:10 distinct get `:v
462 8388848
q)\ts:10 distinct get `:k
37 4194544
q)
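
To look at the on-disk side of this as well as the timings, the sizes of the two saved files can be compared. This is only a sketch: it assumes the files `:v and `:k written by the session above are still in the current directory. The sizes depend on the length of the symbols and on the kdb+ version, so no figures are given here.

```q
/ assumes the files `:v and `:k saved by the session above exist in the current directory
hcount `:v   / size in bytes of the plain symbol list on disk
hcount `:k   / size in bytes of the enumerated list on disk
```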

But you do raise an interesting question as to why distinct is faster on a list of symbols (in memory) than on a list of ints.
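
One thing worth separating out when testing this in memory: the k in the question is u?v, which is a plain list of integer positions, while the k in the session above is `u$v, a true enumeration. A minimal sketch for timing all three side by side follows. The name p for the positions is introduced here for illustration only. Symbols in kdb+ are interned, so a symbol list in memory is held as fixed-width references rather than variable-length strings. That may be part of the explanation, but it is an assumption and not something measured in this thread.

```q
u:`g`ibm`intl`msft   / unique list of tickers, as in the question
v:1000000?u          / list with duplicate tickers
p:u?v                / positions of v in u: a plain integer list (p is a hypothetical name)
k:`u$v               / v enumerated over u, as in the session above
type each (v;p;k)    / type codes of the symbol list, the positions and the enumeration
v~u p                / check that indexing u by the positions recovers v
v~value k            / check that resolving the enumeration recovers v
\ts:10 distinct v
\ts:10 distinct p
\ts:10 distinct k
```

Timings are left out deliberately, since they vary with the machine and the kdb+ version.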

emc211