Convert string of length n to a matrix of n x len(alphabet)

Question

Suppose we are given a string "AABCD" of length n = 5, over an alphabet {'A', 'B', 'C', 'D', 'E', 'F'} of size len(alphabet) = 6. What is a Pythonic way of converting this string to a 5 x 6 matrix?

i.e.

#INPUT:
string = "AABCD"
alphabet = {'A', 'B', 'C', 'D', 'E', 'F'}

#OUTPUT
output = 
        A B C D E F
char 1[ 1 0 0 0 0 0 ]
char 2[ 1 0 0 0 0 0 ]
char 3[ 0 1 0 0 0 0 ]
char 4[ 0 0 1 0 0 0 ]
char 5[ 0 0 0 1 0 0 ]

I scoured other answers but have yet to find a question that is similar. Suggestions greatly appreciated!

I asked a similar [question](https://stackoverflow.com/questions/59474970/expanding-numpy-array-while-updating-the-values) on this. See if this helps any way. Good luck! — Yongjun Lee, Dec 28 '19 at 05:28

score 1 · Accepted Answer · answered Dec 28 '19 at 05:25

A simple double for loop will do:

string = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']

matrix = [[0 for _ in range(len(alphabet))] for _ in range(len(string))]

for i, s in enumerate(string):
    for j, a in enumerate(alphabet):
        matrix[i][j] = 1 if s == a else 0

print(matrix)

The output will be:

[
[1, 0, 0, 0, 0, 0], 
[1, 0, 0, 0, 0, 0], 
[0, 1, 0, 0, 0, 0], 
[0, 0, 1, 0, 0, 0], 
[0, 0, 0, 1, 0, 0]
]

It can also be done via itertools.product, but it won't look as clean as the for loop.

import itertools

string = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']

# enumerate yields the same (index, item) pairs as zip(range(len(x)), x)
string_iter = enumerate(string)
alphabet_iter = enumerate(alphabet)

matrix = [[0 for _ in range(len(alphabet))] for _ in range(len(string))]

for (i, s), (j, a) in itertools.product(string_iter, alphabet_iter):
    matrix[i][j] = 1 if s == a else 0

print(matrix)
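
As a more compact variant of the double for loop, a nested list comprehension builds the same matrix. This is only a minimal sketch, assuming the alphabet is an ordered list as in the snippets above:

```python
string = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']

# One row per character of the string, one column per letter of the alphabet
matrix = [[1 if s == a else 0 for a in alphabet] for s in string]

print(matrix)
```

Characters that are not in the alphabet simply produce an all-zero row here, because no comparison matches.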

Benyamin Karimi · Answer 2 · 2019-12-28T05:35:17.847


You can use this code:

string = "AABCD"
# Use a list instead of a set: a list is ordered and supports index()
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']
# Resulting matrix, one row per character
mat = []
# Length of the alphabet, i.e. the width of each one-hot vector
l = len(alphabet)
for i in string:
    indx = alphabet.index(i)
    sub = [0] * l
    sub[indx] = 1
    mat.append(sub)

Output:

[[1, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0],
 [0, 0, 0, 1, 0, 0]]

edited Dec 28 '19 at 05:35

answered Dec 28 '19 at 05:30


Thanks, this answer is understandable and in pure Python. I realize other answers are "shorter" and adhere to the spirit of "Pythonic" but they rely on other packages. This answer also has a faster run time (vs. nested for loops) – batlike Dec 28 '19 at 06:03

This answer has the same time complexity `O(len(string)*len(alphabet))` as the for loop answers, and this will not work for say `string = "AABCX"` since `alphabet.index(i)` will throw an exception – Devesh Kumar Singh Dec 28 '19 at 07:12
Sorry @DeveshKumarSingh for my mistake! You are right. – batlike Dec 28 '19 at 22:43
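
Following up on the comments above: a dictionary from letter to column index makes each lookup O(1) on average, and `dict.get` lets characters outside the alphabet be handled explicitly instead of raising an exception. Building the rows still takes O(len(string)*len(alphabet)) overall; only the lookup gets cheaper. This is a minimal sketch: the helper name `one_hot` and the choice to emit an all-zero row for unknown characters are assumptions for illustration, not something taken from the answers.

```python
def one_hot(string, alphabet):
    """Return a len(string) x len(alphabet) one-hot matrix as a list of lists."""
    # Map each letter to its column index once, up front
    index = {letter: j for j, letter in enumerate(alphabet)}
    matrix = []
    for char in string:
        row = [0] * len(alphabet)
        j = index.get(char)  # None if char is not in the alphabet
        if j is not None:
            row[j] = 1
        # Assumption: characters outside the alphabet keep an all-zero row
        matrix.append(row)
    return matrix


alphabet = ['A', 'B', 'C', 'D', 'E', 'F']
print(one_hot("AABCX", alphabet))
```

If an unknown character should be treated as an error instead, replacing `index.get(char)` with `index[char]` raises `KeyError` for it.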

Sayandip Dutta · Answer 3 · 2019-12-28T05:41:56.050

For your exact output:

string = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']

print(f'output = \n\t{" ".join(alphabet)}')
for ix,char in enumerate(string, start=1):
    x = [0]*len(alphabet)
    x[alphabet.index(char)] = 1
    print(f'char {ix} {x}'.replace(',',''))

Output:

output = 
        A B C D E F
char 1 [1 0 0 0 0 0]
char 2 [1 0 0 0 0 0]
char 3 [0 1 0 0 0 0]
char 4 [0 0 1 0 0 0]
char 5 [0 0 0 1 0 0]

score 1 · Answer 4 · answered Dec 28 '19 at 05:36


Another solution that is slightly neater and maybe more general:

import numpy as np

alphabet = ["A", "B", "C", "D", "E", "F"]

# Map each letter to its column index
alphabet_dict = {}
for i, x in enumerate(alphabet):
    alphabet_dict[x] = i

string = ["A", "A", "B", "C", "D"]

# One row per character, one column per letter: shape (n, len(alphabet))
output = np.zeros((len(string), len(alphabet)))

for i, x in enumerate(string):
    output[i][alphabet_dict[x]] = 1

Hope this helps.

answered Dec 28 '19 at 05:36

Tank


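In case others are looking at this answer: the array shape and the indices in the last line have to agree. With `np.zeros((len(alphabet), len(string)))`, the last line should be `output[alphabet_dict[x]][i]`, which gives the transpose of the matrix asked for. – batlike Dec 28 '19 at 06:02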
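
For the n x len(alphabet) layout the question asks for, the per-character loop can also be replaced by a single indexed assignment. This is a minimal sketch, assuming every character of the string is in the alphabet (otherwise the dictionary lookup raises `KeyError`); the names `col` and `indices` are made up for illustration.

```python
import numpy as np

string = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']

# Column index of every character in the string
col = {letter: j for j, letter in enumerate(alphabet)}
indices = np.array([col[c] for c in string])

# Start from zeros and set one entry per row via integer-array indexing
output = np.zeros((len(string), len(alphabet)), dtype=int)
output[np.arange(len(string)), indices] = 1

print(output)
```

Each pair `(row, indices[row])` addresses exactly one cell, so the assignment writes one 1 per row.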

Scott Boston · Answer 5 · 2019-12-28T05:49:10.637

You can use pandas and do this in very few lines:

import pandas as pd
string1 = "AABCD"
df = pd.Series([*string1]).str.get_dummies()
df = df.rename(index=lambda x: f'Char {x+1}')
print(df)

Output as pandas dataframe:

        A  B  C  D
Char 1  1  0  0  0
Char 2  1  0  0  0
Char 3  0  1  0  0
Char 4  0  0  1  0
Char 5  0  0  0  1

Note: a piece of syntactic sugar used here is unpacking a string into a list of characters; [*'string'] results in ['s','t','r','i','n','g'].
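
Note that `get_dummies` only creates columns for the letters that actually occur, so E and F are missing from the output above. One possible way to get all six columns is to reindex the columns against the full alphabet. This is a sketch that assumes the alphabet is available as an ordered list, which the snippet above does not define:

```python
import pandas as pd

string1 = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']  # assumed ordered alphabet

df = pd.Series([*string1]).str.get_dummies()
# Add the letters that never occur as all-zero columns, in alphabet order
df = df.reindex(columns=alphabet, fill_value=0)
df = df.rename(index=lambda x: f'Char {x+1}')
print(df)
```

Reindexing also drops any column whose label is not in `alphabet`, so characters outside the alphabet end up as all-zero rows.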

oppressionslayer · Answer 6 · 2019-12-28T06:56:48.887

Here's mine; it also works with strings and alphabets of different sizes, as shown:

df = pd.DataFrame(((pd.Series([*string])*len(alphabet)).str.split("", n=-1, expand=True).drop(columns=[0, len(alphabet)+1]).eq(list(sorted(alphabet)))*1)).rename(index=lambda x: f'Char {x+1}', columns=lambda x: f'{chr(x+64)}')

In [1661]: df
Out[1661]: 
        A  B  C  D  E  F
Char 1  1  0  0  0  0  0
Char 2  1  0  0  0  0  0
Char 3  0  1  0  0  0  0
Char 4  0  0  1  0  0  0
Char 5  0  0  0  1  0  0

or

string = 'AABCDEEF'
alphabet = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'}

df = pd.DataFrame(((pd.Series([*string])*len(alphabet)).str.split("", n=-1, expand=True).drop(columns=[0, len(alphabet)+1]).eq(list(sorted(alphabet)))*1)).rename(index=lambda x: f'Char {x+1}', columns=lambda x: f'{chr(x+64)}')

        A  B  C  D  E  F  G  H
Char 1  1  0  0  0  0  0  0  0
Char 2  1  0  0  0  0  0  0  0
Char 3  0  1  0  0  0  0  0  0
Char 4  0  0  1  0  0  0  0  0
Char 5  0  0  0  1  0  0  0  0
Char 6  0  0  0  0  1  0  0  0
Char 7  0  0  0  0  1  0  0  0
Char 8  0  0  0  0  0  1  0  0
