
I currently have this DataFrame, and I would like to group together the rows whose lists share a value with the list of another row.

dataframe = 
foo    [1,2,3,4]
bar    [4]
bob    [2]
ere    [7]

I would like my dataframe to look like this:

dataframe = 
foo,bar,bob    [1,2,3,4]
ere            [7]

Thank you!

Here is the code to create the dataframe. The data comes from a FASTA-like file like this:

>foo
1
2
3
4
>bar
4
>bob
2
>ere
7

My code to create the df:

import itertools

import pandas as pd


input1 = "final.fasta"
# read the file, dropping the trailing newline from each line
with open(input1, "r") as fasta:
    records = [record.strip() for record in fasta]

# gets the numbers in a list
ids = [list(x[1]) for x in itertools.groupby(records, lambda x: '>' in x) if not x[0]]
# gets the name in a list
ref_seqs = [list(x[1]) for x in itertools.groupby(records, lambda x: '>' not in x) if not x[0]]


# transform into a df
df = pd.DataFrame({'refseq': ref_seqs, 'ids': ids})

1 Answer

This is more of a network problem: after explode, the groups are the connected components of the graph that links each name to its values.

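import networkx as nx

# One row per (name, value) pair; col1 holds the names, col2 the lists
s = df.explode('col2')
# Build a graph with an edge between each name and each of its values
G = nx.from_pandas_edgelist(s, 'col1', 'col2')
# Each connected component is one group of names plus the values they share
l = list(nx.connected_components(G))
# Map every node (name or value) to the index of its component
L = [dict.fromkeys(y,x) for x, y in enumerate(l)]
d = {k: v for d in L for k, v in d.items()}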

out = s.groupby(s['col1'].map(d)).agg({'col1':lambda x : ','.join(set(x)),'col2':'unique'})
Out[334]: 
             col1          col2
col1                           
0     bar,foo,bob  [1, 2, 3, 4]
1             ere           [7]
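
The connected-components idea can also be sketched without networkx, using a small union-find over the names. This is only an illustration, not the method used in the answer: the `data` dict and the `group_by_shared_values` helper are hypothetical names, and the input is assumed to be a plain mapping from each name to its list of values, mirroring the data in the question.

```python
from collections import defaultdict

# Hypothetical input mirroring the question's data: name -> list of values.
data = {"foo": [1, 2, 3, 4], "bar": [4], "bob": [2], "ere": [7]}


def group_by_shared_values(data):
    """Merge names that share at least one value, directly or transitively."""
    # Every name starts out as the root of its own group.
    parent = {name: name for name in data}

    def find(x):
        # Follow parent links up to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # value -> first name seen holding that value
    for name, values in data.items():
        for v in values:
            if v in first_owner:
                # Another name already holds this value: merge the two groups.
                parent[find(name)] = find(first_owner[v])
            else:
                first_owner[v] = name

    # Collect the names and the union of values for each root.
    groups = defaultdict(lambda: ([], set()))
    for name, values in data.items():
        names, vals = groups[find(name)]
        names.append(name)
        vals.update(values)

    return [(",".join(names), sorted(vals)) for names, vals in groups.values()]


result = group_by_shared_values(data)
print(result)
```

If a dataframe is needed, the resulting list of tuples could be passed to pd.DataFrame(result, columns=['refseq', 'ids']).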
BENY