When using Key (⌸) to produce a frequency table, e.g.

      {⍺,≢⍵}⌸'mississippi'
┌→──┐
↓m 1│
│i 4│
│s 4│
│p 2│
└+──┘

I frequently find myself wanting to 'seed' the result set so that it includes a count of 0 for any items I know to be missing. For a letter frequency distribution analysis, for example, we might want 0 as the base value for every letter in ⎕A. Can anyone recommend a good way to achieve this in Dyalog APL?

Mark
xpqz
  • Not an answer, but I wonder if your needs are better served by something like `⎕A(+/∘.=)'MISSISSIPPI'`. Also I can think of `{⍺,(≢⍵)-1}⌸'MISSISSIPPI',⎕A` but maybe that's silly :) – Lynn Jun 10 '23 at 09:20
  • Not silly at all. I like your two suggestions: why don't you post it as an answer so I can award some points. – xpqz Jun 10 '23 at 09:27

2 Answers

If you're performing a frequency analysis "over" ⎕A, then maybe a result with the same shape as ⎕A is actually easier to consume than the output of ⌸:

      ⎕A (+/∘.=) 'MISSISSIPPI'

(I think Dyalog knows how to optimize this so that it doesn't actually create a 26-by-n binary array and then sum the rows.)
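To make the outer-product idea concrete, here is a minimal sketch on a shortened alphabet. The names `alpha` and `text` are only for illustration, and `'IMPSZ'` stands in for ⎕A so that the intermediate table stays small:

```apl
      alpha←'IMPSZ'              ⍝ illustrative stand-in for ⎕A; Z does not occur in text
      text←'MISSISSIPPI'
      tbl←alpha∘.=text           ⍝ Boolean table: one row per letter, one column per character
      ⍴tbl                       ⍝ → 5 11
      +/tbl                      ⍝ → 4 1 2 4 0   (row sums are the counts; Z gets 0)
      alpha,⍪+/tbl               ⍝ two-column table of letter and count
```

Because every letter of the left argument gets a row whether or not it occurs, the zeros come for free and the result follows the order of the left argument.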

Another option is to first append the alphabet to the input, and then decrease all counts by 1.

      {⍺,(≢⍵)-1}⌸ 'MISSISSIPPI',⎕A
Lynn

A common technique is to prepend the "dictionary", in order to also achieve the given sort order. Compare:

      ({⍺,≢1↓⍵}⌸(⎕C⎕A),'mississippi') ({⍺,≢1↓⍵}⌸'mississippi',(⎕C⎕A))
┌→────────────┐
│ ┌→──┐ ┌→──┐ │
│ ↓a 0│ ↓m 1│ │
│ │b 0│ │i 4│ │
│ │c 0│ │s 4│ │
│ │d 0│ │p 2│ │
│ │e 0│ │a 0│ │
│ │f 0│ │b 0│ │
│ │g 0│ │c 0│ │
│ │h 0│ │d 0│ │
│ │i 4│ │e 0│ │
│ │j 0│ │f 0│ │
│ │k 0│ │g 0│ │
│ │l 0│ │h 0│ │
│ │m 1│ │j 0│ │
│ │n 0│ │k 0│ │
│ │o 0│ │l 0│ │
│ │p 2│ │n 0│ │
│ │q 0│ │o 0│ │
│ │r 0│ │q 0│ │
│ │s 4│ │r 0│ │
│ │t 0│ │t 0│ │
│ │u 0│ │u 0│ │
│ │v 0│ │v 0│ │
│ │w 0│ │w 0│ │
│ │x 0│ │x 0│ │
│ │y 0│ │y 0│ │
│ │z 0│ │z 0│ │
│ └+──┘ └+──┘ │
└∊────────────┘

For larger data, you get better performance by adjusting the count outside of Key's operand, because {⍺,≢⍵} is special-cased: ¯1+@2⍤1{⍺,≢⍵}⌸ (see the comparison below).
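As a small sketch of that adjustment (the dictionary `'abc'`, the data `'banana'` and the name `m` are made up for illustration), Key first counts with the plain operand, and the decrement is then applied to the second item of each row:

```apl
      m←{⍺,≢⍵}⌸'abc','banana'    ⍝ counts include the one prepended occurrence of a, b and c
      m[;2]                      ⍝ → 4 2 1 2   (keys in order of first appearance: a b c n)
      r←¯1+@2⍤1⊢m                ⍝ subtract 1 from item 2 of every row
      r[;2]                      ⍝ → 3 1 0 2
```

Note that `n` is not in the prepended dictionary, so its count is decremented as well. The technique assumes the dictionary covers every item that can occur in the data.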

While simple, all of these solutions can hurt performance on large data, because a full copy of the data has to be made in memory to insert the additional elements at the front. A more performant approach is to post-process with a lookup, relying on the fact that looking up an element that is not found gives an index one beyond the last position:

      (k v)←↓⍉{⍺,≢⍵}⌸'mississippi' ⋄ a←⎕C⎕A ⋄ a,⍪(v,0)[k⍳a]
┌→──┐
↓a 0│
│b 0│
│c 0│
│d 0│
│e 0│
│f 0│
│g 0│
│h 0│
│i 4│
│j 0│
│k 0│
│l 0│
│m 1│
│n 0│
│o 0│
│p 2│
│q 0│
│r 0│
│s 4│
│t 0│
│u 0│
│v 0│
│w 0│
│x 0│
│y 0│
│z 0│
└+──┘
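The lookup step can be traced by hand on a few letters. This is only a sketch, assuming the default ⎕IO←1, with `k` and `v` written out as Key returns them for 'mississippi':

```apl
      k←'misp' ⋄ v←1 4 4 2       ⍝ unique letters and their counts
      k⍳'aims'                   ⍝ → 5 2 1 3   ('a' is not in k, so it maps to 1+≢k)
      (v,0)[k⍳'aims']            ⍝ → 0 4 1 4   (the appended 0 sits at index 1+≢k)
```

Appending a single 0 to `v` is what turns the "not found" index into a zero count, so no element of the original data needs to be copied or moved.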

Let's compare the performance, including Lynn's solution using outer product:

      'cmpx'⎕CY'dfns'
      OuterProduct←⊣,∘⍪(+/∘.=)
      Prepend1←{⍺,≢1↓⍵}⌸,
      Prepend2←¯1+@2⍤1{⍺,≢⍵}⌸⍤,
      Lookup←{(k v)←↓⍉{⍺,≢⍵}⌸⍵ ⋄ ⍺,⍪(v,0)[k⍳⍺]}
      a←⎕C⎕A

      t←'mississippi' ⋄ cmpx'a OuterProduct t' 'a Prepend1 t' 'a Prepend2 t' 'a Lookup t'
  a OuterProduct t → 7.6E¯6 |     0% ⎕⎕                                      
  a Prepend1 t     → 9.7E¯5 | +1176% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕          
  a Prepend2 t     → 1.3E¯4 | +1624% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  a Lookup t       → 1.4E¯5 |   +86% ⎕⎕⎕⎕                                    

      t←1e2⍴'mississippi' ⋄ cmpx'a OuterProduct t' 'a Prepend1 t' 'a Prepend2 t' 'a Lookup t'
  a OuterProduct t → 2.3E¯5 |    0% ⎕⎕⎕⎕⎕⎕⎕                                 
  a Prepend1 t     → 9.8E¯5 | +320% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕            
  a Prepend2 t     → 1.4E¯4 | +504% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  a Lookup t       → 1.5E¯5 |  -38% ⎕⎕⎕⎕                                    

      t←1e3⍴'mississippi' ⋄ cmpx'a OuterProduct t' 'a Prepend1 t' 'a Prepend2 t' 'a Lookup t'
  a OuterProduct t → 1.2E¯4 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕     
  a Prepend1 t     → 1.3E¯4 |  +4% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕    
  a Prepend2 t     → 1.4E¯4 | +14% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  a Lookup t       → 2.1E¯5 | -83% ⎕⎕⎕⎕⎕⎕                                  

      t←1e4⍴'mississippi' ⋄ cmpx'a OuterProduct t' 'a Prepend1 t' 'a Prepend2 t' 'a Lookup t'
  a OuterProduct t → 1.1E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  a Prepend1 t     → 2.7E¯4 | -76% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕                              
  a Prepend2 t     → 1.9E¯4 | -83% ⎕⎕⎕⎕⎕⎕⎕                                 
  a Lookup t       → 7.8E¯5 | -94% ⎕⎕⎕                                     
Adám