
Recently I have been reading about Natural Language Processing, its vectorization methods, and the advantages of each vectorizer.

I am interested in character-level vectorization, but the main concern with a character vectorizer is that each word's embedding must have a fixed length.

I do not want to just pad with 0s, well known as zero padding: for instance, if the target fixed length is 100 and only 72 characters exist, then 28 zeros are appended at the end.

"The example of paragraphs and phrases.... ... in vectorizer form" < with length 72

becomes

[0, 25, 60, 12, 24, 0, 19, 99, 7, 32, 47, 11, 19, 43, 18, 19, 6, 25, 43, 99, 0, 32, 40, 14, 20, 5, 37, 47, 99, 11, 29, 7, 19, 47, 18, 20, 60, 18, 19, 2, 19, 11, 31, 130, 130, 76, 0, 32, 40, 14, 20, 7, 19, 47, 18, 20, 60, 11, 37, 43, 99, 11, 29, 99, 17, 39, 47, 11, 31, 18, 19, 43, 0, 19, 77, 0, 0, 0, 0, 0, 0, 0, 0, ...., 0, 0, 0, 0, 0, 0]
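The zero-padding described above can be sketched in a few lines of Python (the helper name and the example codes are illustrative, not from any real encoder):

```python
# Zero-padding sketch: right-pad (or truncate) a variable-length list of
# character codes to a fixed target length. pad_to_length is a hypothetical
# helper; the codes below are the first few values from the example vector.
def pad_to_length(codes, target_len, pad_value=0):
    """Return `codes` padded with `pad_value` (or truncated) to `target_len`."""
    if len(codes) >= target_len:
        return codes[:target_len]
    return codes + [pad_value] * (target_len - len(codes))

codes = [0, 25, 60, 12, 24, 0, 19]   # a prefix of the 72-code sentence
padded = pad_to_length(codes, 100)   # length 100: 7 real codes + 93 zeros
```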

.

.

I want the vectors to have a fair distribution across N fixed dimensions, not like the one above.

If you know any papers or algorithms that consider this matter, or a common way to produce fixed-length vectors from variable-length vectors, please share.

.

.

Further information, added as gojomo requested:

I am trying to get character-level vectors for the words in a corpus.

Say, in the example above, "The example of paragraphs...." starts with:

T [40]
h [17]
e [3]
e [3]
x [53]
a [1]
m [21]
p [25]
l [14]
e [3]

Notice that each character has its own number (e.g., it could be its ASCII code), and a word is represented as the combination of its character vectors, for example:

The [40, 17, 3]

example [3, 53, 1, 21, 25, 14, 3]

so the vectors do not have the same dimension. In the case mentioned above, many people pad zeros at the end to make them a uniform size.

For example, if someone wants each word vector to have dimension 300, then 297 zeros are appended to the word "The" and 293 zeros to "example", like:

The [40, 17, 3, 0, 0, 0, 0, 0, ...., 0]

example [3, 53, 1, 21, 25, 14, 3, 0, 0, 0, 0, 0, ...., 0]
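This mapping-and-padding step can be sketched as follows (the `char_to_id` table simply mirrors the question's example numbers, and `word_to_vector` is a hypothetical helper):

```python
# Character-to-id table taken from the question's example (T -> 40, h -> 17,
# e -> 3, ...); in practice this would be built from the corpus vocabulary.
char_to_id = {"T": 40, "h": 17, "e": 3, "x": 53,
              "a": 1, "m": 21, "p": 25, "l": 14}

def word_to_vector(word, dim, pad_value=0):
    """Look up each character's id, then zero-pad the word vector to `dim`."""
    vec = [char_to_id[c] for c in word]
    return (vec + [pad_value] * dim)[:dim]

the_vec = word_to_vector("The", 300)          # [40, 17, 3, 0, 0, ..., 0]
example_vec = word_to_vector("example", 300)  # [3, 53, 1, 21, 25, 14, 3, 0, ..., 0]
```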

Now, I do not think this padding method is appropriate for my experiments, so I want to know whether there are methods to convert these vectors to a uniform, non-sparse form (if that term is allowed).

Even the two-word phrase "The example" is only 11 characters long, still not long enough either.

Whatever the case, I would like to know if there are well-known techniques to convert variable-length vectors to some fixed length.
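For reference, one widely used family of techniques for producing fixed-length vectors from variable-length text is feature hashing (the "hashing trick", the idea behind scikit-learn's `HashingVectorizer` with `analyzer='char'`): hash every character n-gram into one of N buckets and count. A pure-Python sketch, where the bucket count, n-gram size, and hash function are all illustrative choices:

```python
import hashlib

def char_hash_vector(text, n_buckets=32, ngram=2):
    """Count character n-grams hashed into a fixed number of buckets.

    The output length is always n_buckets, regardless of the input length,
    at the cost of possible hash collisions between different n-grams.
    """
    vec = [0] * n_buckets
    for i in range(len(text) - ngram + 1):
        gram = text[i:i + ngram]
        digest = hashlib.md5(gram.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n_buckets] += 1
    return vec

v1 = char_hash_vector("The")                        # length 32
v2 = char_hash_vector("The example of paragraphs")  # also length 32
```

The result is dense in shape (fixed N dimensions with counts spread over buckets) rather than a code sequence trailed by zeros, though it discards character order within the hashed n-grams.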

Thank you !

Isaac Sim
  • It's unclear what you are asking. Can you flesh out your question with more detail, such as the example strings/text you're working with, and your code for vectorization so far? (No commonly-used vectorizations give varying-dimension vectors for different words/texts. For example, the sklearn `DictVectorizer` you mention as a question-tag will, in its `fit()`, learn a fixed-size superset of all features seen, thus giving each later sample a fixed-size one-hot vector. So what you're facing isn't typically an issue in practice – if you're hitting it, you're doing things very idiosyncratically.) – gojomo Apr 17 '18 at 17:18
  • gojomo, I added more description to the question. Thank you for your help. – Isaac Sim Apr 19 '18 at 01:53

0 Answers