Given a finite character vocabulary, what is the easiest way to represent arbitrarily long sequences of characters with uniform length?

Question

I am attempting to manipulate a finite state transducer for a project. However, in constructing the FST, I need the output symbols to each be some arbitrarily long sequence of characters from the input symbols, which are simply individual unique characters from an associated corpus of text. Additionally, I need to represent these arbitrarily long sequences uniformly, such that each combination's representation has the same length. Of course, with arbitrary length, the longest possible combination has infinite length, so let us assume that no combination can be longer than the longest document from the associated corpus.

In other words, given an input_vocabulary of ['a', 'b', 'c'], an output_vocabulary of ['a', 'ab', 'acb', 'abcb'] needs each to be represented as some vector of length 4 with each item in each vector being an item from the input_vocabulary. My only idea is to do so with a padded vector, such as, for this example, [ [0, 3, 3, 3], [0, 1, 3, 3], [0, 2, 1, 3], [0, 1, 2, 1] ], where 3 is a pad token, but I am very new to this, so any help would be greatly appreciated.

To clarify, I want to know if there is a way to do this without pad tokens.

No obvious mistakes. Your vector representation seems clear & parsimonious enough, and it seems to give every combination of input characters an unambiguous representation. Beyond that, we'd have to know how these data structures are being consumed to know if they're appropriate for your application -- but, as others will tell you, this is not a forum for design critiques, and opinion-based stuff like that. — Noah, Apr 19 '21 at 01:25
As I just edited in, to clarify, I want to know if there is a way to do this without pad tokens. — DLS, Apr 19 '21 at 01:27
If it is true that each vector must be of length *n*, where *n* is the size of the longest `output_vocabulary` token, I don't see how it's possible not to use some form of padding at some level of representation. If you want to use less memory, because you foresee yourself consuming a lot of space with padding tokens, you might turn to sparse matrices from `numpy`. Or, if you're free to change the representation, you could simply assign `0` to `a`, `1` to `ab`, `2` to `acb`, and so on. — Noah, Apr 19 '21 at 01:37

Given a finite character vocabulary, what is the easiest way to represent arbitrarily long sequences of characters with uniform length?

0 Answers0