0

I was actually trying to solve analytics vidya recent Hackathon LTFS(Bank Data), and there I faced something unique problem, actually not too unique. Let me explain

Problem

There are few columns in a Bureau dataset named REPORTED DATE - HIST, CUR BAL - HIST, AMT OVERDUE - HIST & AMT PAID - HIST which consists blank value ,, or more than one value in a row, and also there is not the same number of value in each row

Here is the part of the dataset (it's not original data, because of the big row size)

**Requested Date - Hist**                                                                   
20180430,20180331,
20191231,20191130,20191031,20190930,20190831,20190731,20190630,20190531,20190430,20190331
,
20121031,20120930,20120831,20120731,20120630,20120531,20120430,

----------------x-----------2nd column------------x-----------------------------------

**AMT OVERDUE**
37873,,
,,,,,,,,,,,,,,,,,,,,1452,,
0,0,0,
,,
0,,0,0,0,0,3064,3064,3064,2972,0,2802,0,0,0,0,0,2350,2278,2216,2151,2087,2028,1968,1914,1663,1128,1097,1064,1034,1001,976,947,918,893,866

-----x--other columns are similar---x---------------------

Seeking for a better option, if possible

Previously when I solved this kind of problem, it was genres of Movielens project and there I use used dummy column concept, it worked there because there had not too many values in genres columns and also some of the values are repeating value in many rows, so it was quite easy. But here it seems quite hard here because of two reasons

1st reason because it contains lots of value and at the same time it may contain no value

2nd reason how to create a column for each unique value or a row like in Movielens genre case

**genre**
action|adventure|comedy
carton|scifi|action
biopic|adventure|comedy
Thrill|action

# so here I had extracted all unique value and created columns 

**genre**                 | **action** | **adventure**| **Comedy**| **carton**| **sci-fi**| and so on...
action|adventure|comedy   |   1        |     1        |      1    |     0     |      0    |    
carton|scifi|action       |   1        |     0        |      0    |     1     |      1    |
biopic|adventure|comedy   |   0        |     1        |      1    |     0     |      0    |
Thrill|action             |   1        |     0        |      0    |     0     |      0    |

# but here it's different how can I deal with this, I have no clue
**AMT OVERDUE**
37873,,
,,,,,,,,,,,,,,,,,,,,1452,,
0,0,0,
,,
0,,0,0,0,0,3064,3064,3064,2972,0,2802,0,0,0,0,0,2350,2278,2216,2151,2087,2028,1968,1914,1663,1128,1097,1064,1034,1001,976,947,918,893,866
Ch3steR
  • 20,090
  • 4
  • 28
  • 58
Darkstar Dream
  • 1,649
  • 1
  • 12
  • 23

1 Answers1

0

When in recommender is common to have sparse matrixes. Those can be very consuming space (too many zeros, or empty spaces), perhaps good to move to sparse matrix scipy representation, as in here. As mentioned it is common in recommenders, please find here excellent example.

Unfortunately I cannot use the original data, perhaps good to have a smaller example in csv. So I will use the example from recommender, since is as well very commmon.

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

df = pd.DataFrame({
    'genres' : ["action|adventure|comedy", "carton|scifi|action","biopic|adventure|comedy","Thrill|action"],
})
print(df)
                    genres
0  action|adventure|comedy
1      carton|scifi|action
2  biopic|adventure|comedy
3            Thrill|action

Let see how that looks like as a matrix:

# To identify the genres so we can create our columns
genres = []
for G in df['genres'].unique():
    for i in G.split("|"):
        print(i)
        genres.append(i)
# To remove duplicates
genres = list(set(genres))

# Create a column for each genere
for g in genres:
    df[g] = df.genres.transform(lambda x: int(g in x))

# This is the sparse matrix with many 0
movie_genres = df.drop(columns=['genres'])
print(movie_genres)
   comedy  carton  adventure  Thrill  biopic  action  scifi
0       1       0          1       0       0       1      0
1       0       1          0       0       0       1      1
2       1       0          1       0       1       0      0
3       0       0          0       1       0       1      0

We do not need to create that matrix, as a matter of fact, it is better to avoid same it could be very resource consuming.

We should convert that into a csr_matrix, with just a portion of the size:

from scipy.sparse import csr_matrix

M = df.index.__len__()
N = genres.__len__()

user_mapper = dict(zip(np.unique(df.index), list(range(M))))
genres_mapper = dict(zip(genres, list(range(N))))

user_inv_mapper = {user_mapper[i]:i for i in user_mapper.keys()}
genres_inv_mapper = {genres_mapper[i]:i for i in genres_mapper.keys()}

user_index = []
genre_index = []
for user in df.index:
    print(user)
    print(df.loc[user,'genres'])
    for genre in df.loc[user,'genres'].split('|'):
        genre_index.append(genres_mapper[genre])
        user_index.append(user_mapper[user])

X = csr_matrix((np.ones(genre_index.__len__()),
                (user_index,genre_index)), shape=(M,N))

That looks like:

print(X)
  (0, 0)    1.0
  (0, 2)    1.0
  (0, 5)    1.0
  (1, 1)    1.0
  (1, 5)    1.0
  (1, 6)    1.0
  (2, 0)    1.0
  (2, 2)    1.0
  (2, 4)    1.0
  (3, 3)    1.0
  (3, 5)    1.0

With the above you can see the process with a smaller data set.

Rafael Valero
  • 2,736
  • 18
  • 28
  • as I said, previously I handled that genre column, so my problem is not with the genre column, it's with the AMT OVERDUE, do you have any idea to solve that. – Darkstar Dream Feb 15 '21 at 04:08
  • Using scipy.sparse matrixes – Rafael Valero Feb 15 '21 at 07:28
  • oh, okay but I don't understand the concept of, what you are doing here, can you explain me little bit here pls, also if possible pls explain me about spare matrix or any reference will be highly appreciated – Darkstar Dream Feb 15 '21 at 08:55
  • You mentioned movielens --> so I create the movie_genres dataframe, you can see above how it looks like. It is a sparse matrix because it has everythin zeroes but a few 1. It is the same that your problem but intead of spliting by ```|``` you could by ```,```. – Rafael Valero Feb 15 '21 at 08:58
  • However that is too much spaces, and I guess in your case too. So people in recommenders create sparse.scipy matrixes that looks like ```X``` above. The idea behind those is that you asume everything is zeros but a few parts, so you have lots of size. Using Scipy.sparse matrix you could run operations in that matrix easily too. – Rafael Valero Feb 15 '21 at 09:00
  • by the way, did you see the original data which I linked, you may get an idea, what I am talking about – Darkstar Dream Feb 15 '21 at 09:07