1

I have a dataframe with the following structure:

prod_sec     
A    
AA    
AAAAAAAAAAB    
AAAABCCCAA    
AACC   
ABCCCBAC

df = pd.DataFrame({'prod_sec': ['A','AA','AAAAAAAAAAB','AAAABCCCAA','AACC','ABCCCBAC']})

Each string is a sequence made up of letters (A to C in this example).
I would like to create a list for each letter that counts the occurrences in each position down the entire pandas column.

For example in the first string A is only in the first position/index and it's not in the other locations.
In the second string the A in the first two positions and it's not in the other locations
In the third string the A has all the positions until the last one. Etc...
I want a total count, for the column, by position.
Here is an example for A:

A            ->     [1,0,0,0,0,0,0,0,0,0,0]    
AA                  [1,1,0,0,0,0,0,0,0,0,0]
AAAAAAAAAAB   ->    [1,1,1,1,1,1,1,1,1,1,0] 
AAAABCCCAA          [1,1,1,1,0,0,0,0,0,0,1]
AACC                [1,1,0,0,0,0,0,0,0,0,0]
ABCCCBAC    ->      [1,0,0,0,0,0,1,0,0,0,0]

so for A, I would want an output similar to the following... A [6,4,2,2,1,1,2,1,1,1,0]
In the end, I'm trying to get a matrix with a row for each character.

                    [6,4,2,2,1,1,2,1,1,1,0]
                    [0,1,0,0,1,1,0,0,0,0,1]
                    [0,0,1,1,0,1,2,0,0,0,0]
Jonathan Hay
  • 195
  • 11

1 Answers1

1

The following should work. You can adjust the result, depending on your exact needs (numpy array, data frame, dictionary, etc). Tell me if you need more help with that.

max_length=max([len(i) for i in df.prod_sec])

d={'A':[0]*max_length, 'B':[0]*max_length, 'C':[0]*max_length}

for i in df.prod_sec:
    for k in range(len(i)):
        d[i[k]][k]+=1

result=pd.DataFrame.from_dict(d, orient='index')
IoaTzimas
  • 10,538
  • 2
  • 13
  • 30
  • You are counting the characters for each cell. I would like to have a count by position down the entire Pandas column. I have clarified my question above. – Jonathan Hay Sep 05 '20 at 19:53
  • Ok, i have updated my answer. Tell me if you need more help with the result. – IoaTzimas Sep 06 '20 at 06:26
  • Incredible. How would this be generalized instead of hardcoding the A-C? I can get all of the possibilities by `string = ''.join(set(df['prod_sec'].dropna().to_string(index = False))).strip()` but I'm not sure how to generate the dictionary. – Jonathan Hay Sep 06 '20 at 12:36
  • You can do something like this: ```for i in df['prod_sec']: for k in i: d[k]=[0]*max_length``` – IoaTzimas Sep 06 '20 at 13:34