
After applying the Levenshtein distance algorithm I get a DataFrame like this:

Elemento_lista  Item_ID  Score  idx  ITEM_ID_Coincidencia
             4   691776    100    5                691777
             4   691776    100    6                691789
             4   691776    100    7                691791
             5   691777    100    4                691776
             5   691777    100    6                691789
             5   691777    100    7                691791
             6   691789    100    4                691776
             6   691789    100    5                691777
             6   691789    100    7                691791
             7   691791    100    4                691776
             7   691791    100    5                691777
             7   691791    100    6                691789
             9  1407402    100   10               1407424
            10  1407424    100    9               1407402

The Elemento_lista column is the index of the element that is compared to the others; Item_ID is the ID of that element; Score is the score generated by the algorithm; idx is the index of the element that was found to be similar (same meaning as Elemento_lista, but for the matched element); ITEM_ID_Coincidencia is the ID of the element found to be similar.

It's a small sample of the real DF (more than 300,000 rows). I need to drop rows that express the same match in reverse. For example, if Elemento_lista 4 matches idx 5, 6, and 7, then they are all the same element, so I don't need the rows where 5 matches 4, 6, and 7; where 6 matches 4, 5, and 7; or where 7 matches 4, 5, and 6. The same goes for each Elemento_lista: value 9 matches idx 10, so I don't need the row where Elemento_lista 10 matches idx 9. How could I drop these rows in order to reduce the DF length?

Final DF should be:

Elemento_lista  Item_ID  Score  idx  ITEM_ID_Coincidencia
             4   691776    100    5                691777
             4   691776    100    6                691789
             4   691776    100    7                691791
             9  1407402    100   10               1407424

I don't know how to do this... Is it possible?

Thanks in advance


2 Answers


First, prepare the data as in the example:

import pandas as pd

a = [
[4,691776,100,5,691777],
[4,691776,100,6,691789],
[4,691776,100,7,691791],
[5,691777,100,4,691776],
[5,691777,100,6,691789],
[5,691777,100,7,691791],
[6,691789,100,4,691776],
[6,691789,100,5,691777],
[6,691789,100,7,691791],
[7,691791,100,4,691776],
[7,691791,100,5,691777],
[7,691791,100,6,691789],
[9,1407402,100,10,1407424],
[10,1407424,100,9,1407402]
]
c = ['Elemento_lista', 'Item_ID', 'Score', 'idx', 'ITEM_ID_Coincidencia']
df = pd.DataFrame(data = a, columns = c)
df

Now insert a column holding, for each row, the sorted pair of the two indexes:

# build the sorted [min, max] pair from the 'Elemento_lista' (column 0) and 'idx' (column 3) values
tuples_of_indexes = [sorted([x[0], x[3]]) for x in df.values]
df.insert(5, 'tuple_of_indexes', tuples_of_indexes)

Then sort the whole DataFrame by the inserted column:

df = df.sort_values(by=['tuple_of_indexes'])

Then drop the rows whose pair repeats an earlier row (the lists are converted to tuples because lists are not hashable and cannot be checked for duplicates directly):

df = df[~df['tuple_of_indexes'].apply(tuple).duplicated()]

Finally, drop the inserted helper column 'tuple_of_indexes':

# note: drop is not in-place by default, so assign the result
df = df.drop(['tuple_of_indexes'], axis=1)

The output is:

    Elemento_lista  Item_ID  Score  idx  ITEM_ID_Coincidencia
0                4   691776    100    5                691777
1                4   691776    100    6                691789
2                4   691776    100    7                691791
4                5   691777    100    6                691789
5                5   691777    100    7                691791
8                6   691789    100    7                691791
12               9  1407402    100   10               1407424


  • Thanks for the comment, but it's not working for me. In your example, a is a list... I have a DF with lots of columns (more than 40). How could I do the same with a DF, not a list? – Maximiliano Vazquez May 14 '22 at 01:39
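Regarding the comment above: the same sorted-pair idea can be applied directly to the DataFrame columns, with no list construction, so it works however many columns the DataFrame has. A minimal sketch (not part of the original answer), assuming the two index columns are named as in the example:

import numpy as np
import pandas as pd

# canonical (low, high) pair per row, built column-wise
lo = np.minimum(df['Elemento_lista'], df['idx'])
hi = np.maximum(df['Elemento_lista'], df['idx'])

# keep only the first occurrence of each unordered pair
pair_key = pd.DataFrame({'lo': lo, 'hi': hi})
df_dedup = df[~pair_key.duplicated()]

As with the answer above, this keeps one row per unordered pair; it does not collapse whole groups such as {4, 5, 6, 7} down to a single element (the graph approach below does that).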

This can be approached using graph theory.

You have the following relationships between your IDs:

[figure: graph of the relationships between the IDs]

So what you need to do is find the subgraphs.

For this we can use networkx's connected_components function:

# pip install networkx
import networkx as nx
G = nx.from_pandas_edgelist(df, source='Elemento_lista', target='idx')

# get "first" (arbitrary) node for each subgraph
# note that sets (unsorted) are used
# so there is no guarantee on any node being "first" item
nodes = [tuple(g)[0] for g in nx.connected_components(G) if g]
# [4, 9]

# filter DataFrame
df2 = df[df['Elemento_lista'].isin(nodes)]

output:

    Elemento_lista  Item_ID  Score  idx  ITEM_ID_Coincidencia
0                4   691776    100    5                691777
1                4   691776    100    6                691789
2                4   691776    100    7                691791
12               9  1407402    100   10               1407424
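Note that the representative node per component is arbitrary here, since sets are unordered (as the comments in the code point out). If you prefer a deterministic choice, a small variation (not in the original answer) is to take the smallest node of each component:

# deterministic representative: smallest node of each component
nodes = [min(g) for g in nx.connected_components(G)]
# [4, 9] for the sample data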

Update: real data

Your real data is highly connected, ultimately forming only 2 groups.

[figure: graph of the real data]

You can change strategy here and use a directed graph with strongly_connected_components:

import networkx as nx
#df = pd.read_csv('ADIDAS_CALZADO.csv', index_col=0)
G = nx.from_pandas_edgelist(df, source='Elemento_lista', target='idx', create_using=nx.DiGraph)

# len(list(nx.strongly_connected_components(G)))
# 150 subgraphs

nodes = [tuple(g)[0] for g in nx.strongly_connected_components(G) if g]

df2 = df[df['Elemento_lista'].isin(nodes)]

# len(df2)
# only 2,910 rows left out of the initial 25,371

New graph on the filtered df2:

[figure: graph of the filtered df2]
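One way to visualize the filtered graph (a minimal sketch, not from the original answer; assumes matplotlib is installed):

import matplotlib.pyplot as plt

# rebuild the graph from the filtered DataFrame and draw it
G2 = nx.from_pandas_edgelist(df2, source='Elemento_lista', target='idx')
nx.draw(G2, node_size=20)
plt.show()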

  • mozway, it's a great contribution! Yes, this is graph theory. But I think it's not working fine... I have a DF with 1332 unique Elemento_lista values, and as nodes I'm getting only elements 0 and 388... I should upload the real DF... – Maximiliano Vazquez May 14 '22 at 02:14
  • 1
    what is the output of `G = nx.from_pandas_edgelist(df, source='Elemento_lista', target='idx') ; print(list(nx.connected_components(G)))`? My guess is, you have actually highly connected components and you thus remove most of them. – mozway May 14 '22 at 02:18
  • 1
    You can provide your whole dataset if you want me to have a look. – mozway May 14 '22 at 02:20
  • print(G) = Graph with 1333 nodes and 18079 edges / print(nodes) = [0, 388] / print(list(nx.connected_components(G))) = [{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, ... (too long for comment) – Maximiliano Vazquez May 14 '22 at 02:24
  • 1
    That's what I thought, you have so many edges that it forms a single subgraph. Can you provide the dataset, depending on the connections I might have an alternative. – mozway May 14 '22 at 02:28
  • 1
    I uploaded the .csv : https://github.com/mvazquezmassaro/Files it´s named ADIDAS_CALZADO.csv – Maximiliano Vazquez May 14 '22 at 02:32
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/244731/discussion-between-maximiliano-vazquez-and-mozway). – Maximiliano Vazquez May 14 '22 at 02:39
  • 1
    I updated the answer, this leaves ~3K nodes out of the initial ~25K. You can probably improve the filtering depending on your exact use case, but this is probably a different question now ;) – mozway May 14 '22 at 03:00
  • 1
    It has been of great help!!! You have no idea what u have done for me!. Thanks a lot!. Have a nice a day. I ll probably ask again about this topic in a different question!! – Maximiliano Vazquez May 14 '22 at 03:50