I'm reading a file using the "array of lines" mode of Dyalog's ⎕nget:

lines _ _ ← ⎕nget '/usr/share/dict/words' 1

And it appears to work:

          lines[1]
 10th

But the individual elements don't appear to be character arrays:

          line ← lines[1]
          line
 10th
          ≢ line
1
          ⍴ line
     

Here we see that the first line has a tally of 1 and an empty shape. I can't index into it any further; lines[1][1] or line[1] is a RANK ERROR. If I use ⊂ on the RHS, I can assign the value to multiple variables at once and get the same behavior for each variable. But if I do a multiple assignment without the left shoe (⊂), I get this:

          word rest ← line
          word
10th
          ≢ word
4
          ⍴ word
4

At last we have the character array I expected! Yet it was not evidently separated from anything else hidden in line; the other variable is identical:

          rest
10th
          ≢ rest
4
          ⍴ rest
4
          word ≡ rest
1

Significantly, when I look at word it has no leading space, unlike line. So it seems that the individual elements of the content vector returned by ⎕nget are further wrapped in something that doesn't show up in shape or tally and can't be indexed into, but a destructuring assignment unwraps them. It feels rather like the multiple-values stuff in Common Lisp.

If someone could explain what's going on here, I'd appreciate it. I feel like I'm missing something incredibly basic.

Mark Reed

2 Answers

The result of reading a file with "array of lines" mode is a nested array. It is specifically a nested vector of character vectors where each character vector is a line from your text file.

For example, take \tmp\test.txt here:

my text file
has 3
lines

If we read this in, we can inspect the contents:

      (content encoding newline) ← ⎕nget'\tmp\test.txt' 1
      ≢ content     ⍝ How many lines?
3
      ≢¨content     ⍝ How long is each line?
12 5 5
      content[2]    ⍝ Indexing returns a scalar (non-simple)
┌─────┐
│has 3│
└─────┘
      2⊃content     ⍝ Use pick to get the contents of the 2nd scalar
has 3
      ⊃content[2]   ⍝ Disclose the non-simple scalar
has 3
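
The difference between indexing and picking can also be checked without a file. The following is a minimal sketch in which the nested vector is built by hand as a stand-in for the ⎕NGET result (the name lines and its contents are assumptions for illustration), with default ⎕ML and ⎕IO assumed:

```apl
      lines ← 'my text file' 'has 3' 'lines'   ⍝ hand-built stand-in for the nested result
      ⍴ lines                ⍝ a 3-element vector
3
      ≡ lines                ⍝ depth 2: a vector of character vectors
2
      ⍬ ≡ ⍴ lines[2]         ⍝ bracket indexing gives a scalar, so its shape is empty
1
      lines[2] ≡ ⊂'has 3'    ⍝ that scalar is the enclosed character vector
1
      ⍴ 2⊃lines              ⍝ pick discloses it: a 5-character vector
5
      (2⊃lines)[1]           ⍝ which can now be indexed character by character
h
```

This is why lines[1][1] fails: the second index is applied to a scalar, which has no axes to index along.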

As you probably read from the online documentation, the default behaviour of ⎕NGET is to bring in a simple (non-nested) character vector with embedded newline characters. The line endings found in the file, which typically depend on the operating system, are reported separately as the third element of the result.

      (content encoding newline) ← ⎕nget'\tmp\test.txt' 
      newline   ⍝ Unicode code points for line endings in this file  (Microsoft Windows)
13 10
      content
my text file
has 3       
lines       
            
      content ∊ ⎕ucs 10 13
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1

But with "array of lines" mode, you get a nested result.
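
To relate the two forms, here is a rough sketch that splits a simple vector on line feeds to get something shaped like the nested result. It assumes a Dyalog version that supports partition (⊆) and uses a hand-built string rather than a real file; the names nl and parts are made up for the example:

```apl
      nl ← ⎕ucs 10                                  ⍝ line feed as a character scalar
      content ← 'my text file',nl,'has 3',nl,'lines',nl
      parts ← (content≠nl)⊆content                  ⍝ partition, dropping the line feeds
      ≢ parts                                       ⍝ three lines
3
      ≢¨ parts                                      ⍝ length of each line
12 5 5
      ≡ parts                                       ⍝ nested: a vector of character vectors
2
```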

For a quick introduction to nested arrays and the array model, see Stefan Kruger's LearnAPL book.

RikedyP
  • I understood the nested part, but was missing – or rather, describing without knowing what I was describing – the fact that each array is "enclosed" and needs to be "disclosed" to get at the actual value I sought. Clearly I have some more reading to do, so thanks for the book link. – Mark Reed Nov 12 '21 at 18:35

If you turn boxing on, it's easier to see what's happening. Each element is an enclosed character vector. Use pick (⊃) instead of bracket indexing [] to get the actual item.

  words ← ⊃⎕nget'/usr/share/dict/words'1
  ]box on -s=max
  ⍴words
┌→─────┐
│235886│
└~─────┘
  
  words[10]
┌─────────┐
│ ┌→────┐ │
│ │Aaron│ │
│ └─────┘ │
└∊────────┘
  
  10⊃words ⍝ use pick
┌→────┐
│Aaron│
└─────┘
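
The multiple assignment in the question fits the same picture. As far as I can tell, when the right-hand side of a multiple assignment is a scalar, each name receives the contents of that scalar, so an enclosed vector arrives disclosed. A small sketch, using a hand-made enclosure as a stand-in for lines[1] (all names here are illustrative):

```apl
      line ← ⊂'10th'         ⍝ stand-in for lines[1]: an enclosed character vector
      word rest ← line       ⍝ scalar on the right: every name gets its contents
      ⍴ word
4
      word ≡ rest
1
      word ≡ ⊃line           ⍝ same value as disclosing explicitly
1
      a b ← 'xy'             ⍝ by contrast, a 2-element vector is split item by item
      a
x
```

So nothing extra is hidden inside line; the assignment simply removes the one level of enclosure that indexing left in place.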
xpqz
  • I should've had boxing on, for sure. But then I would still have been confused that the word was double-boxed but I couldn't get at it with `[1][1]`. Both the ideas of enclosure being distinct from "array of" and pick having different semantics from bracket indexing were new to me... – Mark Reed Nov 12 '21 at 18:39