0

I have a list of strings below. It's a simple example for now, with just 3 strings, and each string contains some keys and their corresponding values. For example, string 1 says that expiration year is 2024, birth year is 1947, issue year is 2015, height is 150 cm, etc.

show q4
"eyr:2024 pid:662406624 hcl:#cfa07d byr:1947 iyr:2015 ecl:amb hgt:150cm"
"iyr:2013 byr:1997 hgt:182cm hcl:#ceb3a1 eyr:2027 ecl:gry cid:102 pid:018128"
"hgt:61in iyr:2014 pid:916315544 hcl:#733820 ecl:oth"

From this other thread, terrylynch explained to me that I can use a built-in kdb/q function to extract the key-value pairs as dictionaries kdb/q: How to apply a string manipulation function to a vector of strings to output a vector of strings?

dict: each["S: "0:] q4

dict[0]
eyr    pid         hcl       byr    iyr    ecl   hgt
"2024" "662406624" "#cfa07d" "1947" "2015" "amb" "150cm"

Then I tried to convert this into a table. Conceptually, I understand that kdb/q tables are basically flipped dictionaries and I understand that a table's rows are basically just dictionaries.

t: flip dict

show t
`eyr`pid`hcl`byr`iyr`ecl`hgt                               `iyr`byr`hgt`hcl`e..
("2024";"662406624";"#cfa07d";"1947";"2015";"amb";"150cm") ("2013";"1997";"18..

show t[0]
`eyr`pid`hcl`byr`iyr`ecl`hgt
`iyr`byr`hgt`hcl`eyr`ecl`cid`pid
`hgt`iyr`pid`hcl`ecl

show t[1]
("2024";"662406624";"#cfa07d";"1947";"2015";"amb";"150cm")
("2013";"1997";"182cm";"#ceb3a1";"2027";"gry";"102";"018128")
("61in";"2014";"916315544";"#733820";"oth")

The issue here is that this above table has different columns for each row. How do I get this table to look something like this below, where the table columns are the union of all of the keys and where a row has missing values for a key, it just shows null?

show ideal_t
eyr   iyr    pid        cid..
---------------------------..
2024  2015   662406624  0n ..
2027  2013   018128     102..

Thanks!

JZL
  • 53
  • 9

3 Answers3

3

Here is much more performant solution if many of the dictionaries share keys, since uj is not very performant over a large input list. This solution takes advantage of the fact that a list of conforming dictionaries (same keys in same order) is automatically interpreted as a table in kdb.

By grouping together n dictionaries with the same keys, we end up with m tables, with m as the number of unique key sets over the dictionaries. Using group also allows us to skip enlisting each dictionary.

We can then use the (uj/) solution that Terry suggested to join these m differing tables together, however it will only run m times instead of n times. uj also makes sure that the nulls added to each table are of the correct type.

We then restore the original ordering using iasc on the list of indices (which group returned).

{uj/[x idx] iasc raze idx:value group key each x} (!) ./: dict

Some performance stats using both functions on different input lengths. The outputs of \ts is defined here. When the number of distinct set is low, it scales linearly

dict:n#dict (taking your input and extending it to n rows - m is always 3)

n = 1000, m = 3    
q)\ts (uj/) enlist each (!) ./: dict
27 519472

\ts {uj/[x idx] iasc raze idx:value group key each x} (!) ./: dict
1 304480


n = 10000, m = 3    
q)\ts {uj/[x idx] iasc raze idx:value group key each x} (!) ./: dict
10 3169504

q)\ts (uj/) enlist each (!) ./: dict
1150 6016432


n = 100000, m = 3
q)\ts {uj/[x idx] iasc raze idx:value group key each x} (!) ./: dict
72 37308384


n = 1000000, m = 3
q)\ts {uj/[x idx] iasc raze idx:value group key each x} (!) ./: dict
504 308049376

In the worst case where every dictionary has a different set of keys to each other, n=m and the solution converges to Terry's answer where uj is applied n times

dict:flip ((enlist each neg[n]?`3);n#enlist enlist "test")

n=m=100
q)\ts (uj/) enlist each (!) ./: dict
9 308064

q)\ts {uj/[x idx] iasc raze idx:value group key each x} (!) ./: dict
8 318064


n=m=1000
q)\ts (uj/) enlist each (!) ./: dict
1803 24594240

q)\ts {uj/[x idx] iasc raze idx:value group key each x} (!) ./: dict
1631 24683344


n=m=2000
q)\ts (uj/) enlist each (!) ./: dict
15777 97955584

q)\ts {uj/[x idx] iasc raze idx:value group key each x} (!) ./: dict
16443 98133264

As Jason mentioned, you can extract the set of keys and extract them from each dictionary, which will automatically turn into a table (as outlined in the first paragraph), however when you have different keys with different types, caution will need to be taken.

Indexing into the first dictionary below with d returns an integer null, but d in the second dictionary is a short. As a result, the column will be mixed.

/ a dictionary containing an integer, long and float. Missing keys will return integer nulls
q)(`a`b`c!(1i;2j;3f))[`d]
0Ni
/ Two dictionaries, the first containing a and the second d. Both columns end up mixed
q)dict:((`a`b`c;(1i;2j;3f));(`b`c`d;(4j;5f;6h)))
q){(distinct raze x[;0])#/:(!)./:x}dict
a  b c d  
----------
1i 2 3 0Ni
0N 4 5 6h 

uj automatically accounts for this and will make sure the missing key value are of the correct null as it calls .Q.ff.

Mark Kelly
  • 1,780
  • 7
  • 16
1

Again, this could have been a follow up question to your original topic!

But to answer here: after parsing each field you can convert to dictionary using (!/) and promote to a table using enlist. Then it's just a matter of union joining uj each one iteratively

q)(uj/)each[enlist(!/)"S: "0:] q4
eyr    pid         hcl       byr    iyr    ecl   hgt     cid
--------------------------------------------------------------
"2024" "662406624" "#cfa07d" "1947" "2015" "amb" "150cm" ""
"2027" "018128"    "#ceb3a1" "1997" "2013" "gry" "182cm" "102"
""     "916315544" "#733820" ""     "2014" "oth" "61in"  ""
terrylynch
  • 11,844
  • 13
  • 21
0

You can avoid using a union join by 'taking' the necessary keys from each dictionary

q)d:{(-10?`1;10?0N 4#.Q.a)}@'til 10000
q){(distinct raze x[;0])#/:(!)./:x}d
e      o      i      j      f      n      l    ..
-----------------------------------------------..
"qrst" "abcd" "ijkl" "efgh" "ijkl" "efgh" "ijkl..
""     "ijkl" ""     ""     "abcd" "uvwx" ""   ..
"qrst" ""     "uvwx" ""     "ijkl" "yz"   ""   ..
""     "yz"   ""     "abcd" "uvwx" "qrst" "ijkl..
""     "uvwx" ""     "efgh" "yz"   "uvwx" "qrst..
..
jasonfealy
  • 1,081
  • 3
  • 5
  • This method is great when there you only have a single type, but will need tweaking in other cases to handle null types, since the null value of missing keys in one dictionary may not match the values in the other dictionaries. I have added an example to my answer since comments do not support multiline – Mark Kelly Dec 29 '20 at 23:43