1

I want to find the number of times each number appears at each index position in a list of 6 number sets when I do not know what the numbers will be, but they will range from 0-99 only.

Example list:

data = [['22', '45', '6', '72', '1', '65'], ['2', '65', '67', '23', '98', '1'], ['13', '45', '98', '4', '12', '65']]

Eventually I will be putting the resulting counts into a pandas DataFrame to look something like this:

num numofoccurances position numoftimesinposition
01         02            04            01
01         02            05            01
02         01            00            01
04         02            03            01
06         01            02            01
12         01            04            01
13         01            00            01
and so on...

The resulting data will be a little different due to the num repeating for each time it appears in a different index position, but hopefully this helps you understand what I'm looking for.

So far, this is what I've started:

data = json.load(f)
numbers = []
contains = []

'''
This section is simply taking the data from the json file and putting it all into a list of lists containing the 6 elements I need in each list
'''
for i in data['data']:
    item = [i[9], i[10]]
#   print(item)
    item = [words for segments in item for words in segments.split()]
    numbers.append(item)

'''
This is my attempt to count to number of occurrences for each number in the range then add it to a list.
'''
x = range(1,99)
for i in numbers:
    if x in i and not contains:
        contains.append(x)
Akshay Sehgal
  • 18,741
  • 3
  • 21
  • 51
Josh Crouse
  • 343
  • 1
  • 13
  • can you describe what each of the 3 columns are? the first value 22 in numofoccurances.. is that just the direct number that exists in the data? And what is num.. is that 0-99? or its the index for the 6 length lists? – Akshay Sehgal Jul 30 '20 at 00:34
  • @AkshaySehgal num is the number that occurs in the list. numofoccurrences is the number of times that number occurs in total. position is the index position that number occurred in. numoftimesinpostion is the number of times that number occurred in that specific index position. – Josh Crouse Jul 30 '20 at 00:40
  • 1
    Could you please edit the table so that it matches the data give? – Onyambu Jul 30 '20 at 00:41
  • @Onyambu the DataFrame example is arbitrary. It is solely for an understanding of what I will be doing with the data once i get it. I can create the DataFrame I really just need a way to get the data for the DataFrame. – Josh Crouse Jul 30 '20 at 00:49
  • whats the difference between numoccurance and numtimesinposition – Akshay Sehgal Jul 30 '20 at 00:52
  • @AkshaySehgal numofoccurrences is the total number of times that number appears in all the sets whereas numtimesinposition is the total number of times a number appears in that specific index position. – Josh Crouse Jul 30 '20 at 00:59
  • got it! do check my solution. – Akshay Sehgal Jul 30 '20 at 01:07

2 Answers2

2
import pandas as pd
num_pos = [(num,pos) for i in data for pos,num in enumerate(i)]
df = pd.DataFrame(num_pos,columns = ['number','position']).assign(numoftimesinposition = 1)
df = df.astype(int).groupby(['number','position']).count().reset_index()

df1 = df.groupby('number').numoftimesinposition.sum().reset_index().\
    rename(columns = {'numoftimesinposition':'numofoccurences'}).\
    merge(df, on='number')

print(df1)
    number  numofoccurences  position  numoftimesinposition
0        1                2         4                     1
1        1                2         5                     1
4        2                1         0                     1
7        4                1         3                     1
9        6                1         2                     1
2       12                1         4                     1
3       13                1         0                     1
5       22                1         0                     1
6       23                1         3                     1
8       45                2         1                     2
10      65                3         1                     1
11      65                3         5                     2
12      67                1         2                     1
13      72                1         3                     1
14      98                2         2                     1
15      98                2         4                     1

if the code above feels slow, then use Counter from collections:

import pandas as pd
from collections import Counter

num_pos = [(int(num),pos) for i in data for pos,num in enumerate(i)]

count_data = [(num,pos,occurence) for (num,pos), occurence in Counter(num_pos).items()]

df = pd.DataFrame(count_data, columns = ['num','pos','occurence']).sort_values(by='num')

df['total_occurence'] = [Counter(df.num).get(num) for num in df.num]
print(df)
Onyambu
  • 67,392
  • 3
  • 24
  • 53
  • So, when I run this I only get 3 columns in the DataFrame. number, position, and numoftimesinposition. Is there another step that needs to be taken? – Josh Crouse Jul 30 '20 at 02:06
  • I got it, but I like the edit too. In the previous the DataFrame wasnt updating with the additional column. I just added df= to the df.groupby section and it works. I'll try out Counter too and see how that goes. – Josh Crouse Jul 30 '20 at 02:11
1

This should solve your query (should be faster than the extremely slow groupby (which you will need 2 of) and other pandas operations for larger data) -

#get the list of lists into a 2d numpy array
dd = np.array(data).astype(int)

#get vocab of all unique numbers
vocab = np.unique(dd.flatten())

#loop thru vocab and get sum of occurances in each index position
df = pd.DataFrame([[i]+list(np.sum((dd==i).astype(int), axis=0)) for i in vocab])

#rename cols
df.columns = ['num', 0, 1, 2, 3, 4, 5] 

#create total occurances of the item
df['numoccurances'] = df.iloc[:,1:].sum(axis=1)  
 
#Stack the position counts and rename cols
stats = pd.DataFrame(df.set_index(['num','numoccurances']).\
                     stack()).reset_index().\
                     set_axis(['num', 'numoccurances', 'position', 'numtimesinposition'], axis=1)

#get only rows with occurances
stats = stats[stats['numtimesinposition']>0].reset_index(drop=True) 
stats
    num  numoccurances  position  numtimesinposition
0     1              2         4                   1
1     1              2         5                   1
2     2              1         0                   1
3     4              1         3                   1
4     6              1         2                   1
5    12              1         4                   1
6    13              1         0                   1
7    22              1         0                   1
8    23              1         3                   1
9    45              2         1                   2
10   65              3         1                   1
11   65              3         5                   2
12   67              1         2                   1
13   72              1         3                   1
14   98              2         2                   1
15   98              2         4                   1

As the results show -

1 comes a total of 2 times in the sample data you shared, and it occurs 1 time each in 5th and 6th position. Similarly 2 comes 1 times in total and that too at the 1st position.

Akshay Sehgal
  • 18,741
  • 3
  • 21
  • 51