Suppose we are given a string "AABCD" of length n = 5, over an alphabet {'A', 'B', 'C', 'D', 'E', 'F'} of size len(alphabet) = 6. What is a Pythonic way of converting this string to a 5 x 6 matrix?

For example:

#INPUT:
string = "AABCD"
alphabet = {'A', 'B', 'C', 'D', 'E', 'F'}
#OUTPUT
output = 
        A B C D E F
char 1[ 1 0 0 0 0 0 ]
char 2[ 1 0 0 0 0 0 ]
char 3[ 0 1 0 0 0 0 ]
char 4[ 0 0 1 0 0 0 ]
char 5[ 0 0 0 1 0 0 ]

I scoured other answers but have yet to find a question that is similar. Suggestions greatly appreciated!

batlike
  • I asked a similar [question](https://stackoverflow.com/questions/59474970/expanding-numpy-array-while-updating-the-values) on this. See if this helps any way. Good luck! – Yongjun Lee Dec 28 '19 at 05:28

6 Answers

1

A simple double for loop will do:

string = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']

matrix = [[0 for _ in range(len(alphabet))] for _ in range(len(string))]

for i, s in enumerate(string):
    for j, a in enumerate(alphabet):
        matrix[i][j] = 1 if s == a else 0

print(matrix)

The output will be

[
[1, 0, 0, 0, 0, 0], 
[1, 0, 0, 0, 0, 0], 
[0, 1, 0, 0, 0, 0], 
[0, 0, 1, 0, 0, 0], 
[0, 0, 0, 1, 0, 0]
]

It can also be done via itertools.product, but it won't look as clean as the for loop.

import itertools

string = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']

# enumerate yields the same (index, item) pairs as zipping with a range
string_iter = enumerate(string)
alphabet_iter = enumerate(alphabet)

matrix = [[0 for _ in range(len(alphabet))] for _ in range(len(string))]

for (i, s), (j, a) in itertools.product(string_iter, alphabet_iter):
    matrix[i][j] = 1 if s == a else 0

print(matrix)
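
As a side note, the same comparison can be sketched as a nested list comprehension. This is only an equivalent restatement of the double loop above, not a different algorithm; the variable names are reused from the snippets above.

```python
string = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']

# One row per character of the string, one column per alphabet symbol.
# int(True) is 1 and int(False) is 0, so each row is a one-hot vector.
matrix = [[int(s == a) for a in alphabet] for s in string]

print(matrix)
```

Like the loop version, this does len(string) * len(alphabet) comparisons, and a character that is not in the alphabet simply produces a row of zeros.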
Devesh Kumar Singh
1

You can use this code:

string = "AABCD"
# use a list instead of a set so that the column order is fixed
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']
# resulting matrix, one row per character of the string
mat = []
# length of the alphabet, i.e. the size of each one-hot vector
l = len(alphabet)
for i in string:
    indx = alphabet.index(i)
    sub = [0] * l
    sub[indx] = 1
    mat.append(sub)

Output:

[[1, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0],
 [0, 0, 0, 1, 0, 0]]
Benyamin Karimi
  • Thanks, this answer is understandable and in pure Python. I realize other answers are "shorter" and adhere to the spirit of "Pythonic" but they rely on other packages. This answer also has a faster run time (vs. nested for loops) – batlike Dec 28 '19 at 06:03
  • This answer has the same time complexity `O(len(string)*len(alphabet))` as the for loop answers, and it will not work for, say, `string = "AABCX"`, since `alphabet.index(i)` will throw an exception – Devesh Kumar Singh Dec 28 '19 at 07:12
  • Sorry @DeveshKumarSingh for my mistake! You are right. – batlike Dec 28 '19 at 22:43
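
The two points raised in the comments above (the cost of `alphabet.index` and the exception on unknown characters) could be addressed along these lines. This is a minimal sketch, not the answerer's code: `one_hot` is a hypothetical helper name, and mapping an unknown character to an all-zero row is an assumption; raising an error may be preferable in other settings.

```python
def one_hot(string, alphabet):
    """Hypothetical helper: one one-hot row per character of `string`."""
    # Build the symbol -> column lookup once, instead of calling
    # alphabet.index() (a linear scan) for every character.
    index = {symbol: j for j, symbol in enumerate(alphabet)}
    matrix = []
    for ch in string:
        row = [0] * len(alphabet)
        j = index.get(ch)  # None if ch is not in the alphabet
        if j is not None:
            row[j] = 1
        # Assumption: an unknown character is kept as an all-zero row.
        matrix.append(row)
    return matrix


alphabet = ['A', 'B', 'C', 'D', 'E', 'F']
print(one_hot("AABCX", alphabet))
```

The overall cost is still proportional to len(string) * len(alphabet), because that many cells have to be written; only the per-character lookup becomes constant time on average.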
1

For your exact output:

string = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']

print(f'output = \n\t{" ".join(alphabet)}')
for ix,char in enumerate(string, start=1):
    x = [0]*len(alphabet)
    x[alphabet.index(char)] = 1
    print(f'char {ix} {x}'.replace(',',''))

Output:

output = 
        A B C D E F
char 1 [1 0 0 0 0 0]
char 2 [1 0 0 0 0 0]
char 3 [0 1 0 0 0 0]
char 4 [0 0 1 0 0 0]
char 5 [0 0 0 1 0 0]
Sayandip Dutta
1

Another solution that is slightly neater and maybe more general:

import numpy as np
alphabet =["A","B","C","D","E","F"]


alphabet_dict = {}
for i,x in enumerate(alphabet):
   alphabet_dict[x] = i


string = ["A", "A", "B", "C", "D"]

output = np.zeros((len(alphabet), len(string)))

for i,x in enumerate(string):
    output[i][alphabet_dict[x]] = 1

Hope this helps.

Tank
  • In case others are looking at this answer: the indices for the last line should be switched: `output[alphabet_dict[x]][i]` – batlike Dec 28 '19 at 06:02
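
For readers comparing shapes: the snippet in this answer allocates a `(len(alphabet), len(string))` array, i.e. 6 x 5, and switching the indices as the comment suggests fills it as the transpose of the layout in the question. A minimal sketch that produces the 5 x 6 layout directly is shown below; it assumes every character of the string is present in the alphabet, and the `rows` and `cols` names are introduced here for illustration only.

```python
import numpy as np

string = "AABCD"
alphabet = ["A", "B", "C", "D", "E", "F"]
alphabet_dict = {x: i for i, x in enumerate(alphabet)}

# Rows are string positions, columns are alphabet symbols: shape (5, 6).
output = np.zeros((len(string), len(alphabet)), dtype=int)

# Set one cell per row using integer array indexing.
rows = np.arange(len(string))
cols = [alphabet_dict[x] for x in string]  # KeyError if x is not in alphabet
output[rows, cols] = 1

print(output)
```

If the transposed (alphabet by position) layout is what is wanted, `output.T` gives it without refilling the array.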
1

You can use pandas to do this in very few lines:

import pandas as pd
string1 = "AABCD"
df = pd.Series([*string1]).str.get_dummies()
df = df.rename(index=lambda x: f'Char {x+1}')
print(df)

Output as pandas dataframe:

        A  B  C  D
Char 1  1  0  0  0
Char 2  1  0  0  0
Char 3  0  1  0  0
Char 4  0  0  1  0
Char 5  0  0  0  1

Note: as a piece of syntactic sugar, a string can be unpacked into a list of characters; [*'string'] results in ['s','t','r','i','n','g'].
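
The output above only has columns for characters that occur in the string, so E and F are missing. One possible way to get a column for every symbol of the alphabet is to reindex the columns afterwards. This is a sketch under the assumption that `alphabet` is an ordered list, which is not part of the snippet above.

```python
import pandas as pd

string1 = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']  # assumed ordered list of symbols

df = pd.Series([*string1]).str.get_dummies()
# Add the missing alphabet columns (all zeros) and fix the column order.
df = df.reindex(columns=alphabet, fill_value=0)
df = df.rename(index=lambda x: f'Char {x+1}')
print(df)
```

Note that `reindex` also drops any column whose label is not in `alphabet`, so a character outside the alphabet ends up as an all-zero row.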

Scott Boston
0

Here's mine; it also works with strings and alphabets of other sizes, as shown:

import pandas as pd

df = pd.DataFrame(((pd.Series([*string])*len(alphabet)).str.split("", n=-1, expand=True).drop(columns=[0, len(alphabet)+1]).eq(list(sorted(alphabet)))*1)).rename(index=lambda x: f'Char {x+1}', columns=lambda x: f'{chr(x+64)}')

In [1661]: df
Out[1661]: 
        A  B  C  D  E  F
Char 1  1  0  0  0  0  0
Char 2  1  0  0  0  0  0
Char 3  0  1  0  0  0  0
Char 4  0  0  1  0  0  0
Char 5  0  0  0  1  0  0

or

string = 'AABCDEEF'
alphabet = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'}

df = pd.DataFrame(((pd.Series([*string])*len(alphabet)).str.split("", n=-1, expand=True).drop(columns=[0, len(alphabet)+1]).eq(list(sorted(alphabet)))*1)).rename(index=lambda x: f'Char {x+1}', columns=lambda x: f'{chr(x+64)}')

        A  B  C  D  E  F  G  H
Char 1  1  0  0  0  0  0  0  0
Char 2  1  0  0  0  0  0  0  0
Char 3  0  1  0  0  0  0  0  0
Char 4  0  0  1  0  0  0  0  0
Char 5  0  0  0  1  0  0  0  0
Char 6  0  0  0  0  1  0  0  0
Char 7  0  0  0  0  1  0  0  0
Char 8  0  0  0  0  0  1  0  0

oppressionslayer