I have a dataframe:
sample1 0 0 0 0 0 1 1 1 1 1 1 1 1 L1
sample2 0 0 0 0 0 1 1 1 1 1 0 0 0 L1-1
sample3 0 0 0 0 0 1 1 0 0 0 0 0 0 L1-1-1
sample4 0 0 0 0 0 1 0 0 0 0 0 0 0 L1-1-1-1
sample5 0 0 0 0 0 0 0 1 1 0 0 0 0 L1-1-2
sample6 0 0 0 0 0 0 0 1 0 0 0 0 0 L1-1-2-1
sample7 0 0 0 0 0 0 0 0 0 1 0 0 0 L1-1-3
sample8 0 0 0 0 0 0 0 0 0 0 1 1 1 L1-2
sample9 0 0 0 0 0 0 0 0 0 0 1 1 0 L1-2-1
sample10 0 0 0 0 0 0 0 0 0 0 0 0 1 L1-2-2
sample11 1 1 1 1 1 0 0 0 0 0 0 0 0 L2
sample12 1 1 1 0 0 0 0 0 0 0 0 0 0 L2-1
sample13 1 1 0 0 0 0 0 0 0 0 0 0 0 L2-1-1
sample14 1 0 0 0 0 0 0 0 0 0 0 0 0 L2-1-1-1
sample15 0 0 0 1 0 0 0 0 0 0 0 0 0 L2-2
sample16 0 0 0 0 1 0 0 0 0 0 0 0 0 L2-3
As you can see, the rows are clustered.
I want to assign a "lineage-based" label to each sample.
For example, sample1 will be lin1 because it is the first to appear, and sample2 will be lin1-1.
Sample3 will be lin1-1-1, sample4 will be lin1-1-1-1.
Next, sample5 will be lin1-1-2, sample6 will be lin1-1-2-1, and so on.
Sample11 will start a new lineage, lin2.
My original idea for the naming was:
"sample1 is lin1. If the next sample is included in the previous sample, append "-1" to the previous label; if not, increment the number instead (lin1 becomes lin2)."
sample1 -> lin1
sample2 -> lin1-1 (sample2 is included in sample1)
sample3 -> lin1-1-1 (sample3 is included in sample2)
sample4 -> lin1-1-1-1 (sample4 is included in sample3)
sample5 -> lin1-1-2 (sample5 is not included in sample4 or sample3, but is included in sample2), and so on.
I couldn't turn this logic into a Python script.
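For reference, here is a minimal sketch of how the rule might be written in plain Python. It rests on two assumptions that are not stated explicitly in the question: "included in" means every column that is 1 in the current row is also 1 in the other row, and the rows are already in the clustered order shown above. The names lineage_labels, parse_rows and DATA are made up for this illustration. The idea is to keep a stack of "open" ancestors, drop ancestors until one contains the current row, and then number the row as the next child of that ancestor (or as a new top-level lineage if none is left).

```python
# The 0/1 columns from the question, without the last (label) column.
DATA = """\
sample1 0 0 0 0 0 1 1 1 1 1 1 1 1
sample2 0 0 0 0 0 1 1 1 1 1 0 0 0
sample3 0 0 0 0 0 1 1 0 0 0 0 0 0
sample4 0 0 0 0 0 1 0 0 0 0 0 0 0
sample5 0 0 0 0 0 0 0 1 1 0 0 0 0
sample6 0 0 0 0 0 0 0 1 0 0 0 0 0
sample7 0 0 0 0 0 0 0 0 0 1 0 0 0
sample8 0 0 0 0 0 0 0 0 0 0 1 1 1
sample9 0 0 0 0 0 0 0 0 0 0 1 1 0
sample10 0 0 0 0 0 0 0 0 0 0 0 0 1
sample11 1 1 1 1 1 0 0 0 0 0 0 0 0
sample12 1 1 1 0 0 0 0 0 0 0 0 0 0
sample13 1 1 0 0 0 0 0 0 0 0 0 0 0
sample14 1 0 0 0 0 0 0 0 0 0 0 0 0
sample15 0 0 0 1 0 0 0 0 0 0 0 0 0
sample16 0 0 0 0 1 0 0 0 0 0 0 0 0
"""


def parse_rows(text):
    """Turn whitespace-separated lines into (name, [0/1, ...]) pairs."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, *values = line.split()
        rows.append((name, [int(v) for v in values]))
    return rows


def lineage_labels(rows, prefix="lin"):
    """Label rows hierarchically; assumes rows arrive in clustered order."""
    labels = {}
    stack = []   # open ancestors: [label, set of columns equal to 1, children so far]
    n_roots = 0  # number of top-level lineages seen so far
    for name, values in rows:
        cols = {i for i, v in enumerate(values) if v}
        # Drop ancestors that do not contain the current row (subset test).
        while stack and not cols <= stack[-1][1]:
            stack.pop()
        if stack:
            # Nearest containing ancestor found: become its next child.
            parent = stack[-1]
            parent[2] += 1
            label = f"{parent[0]}-{parent[2]}"
        else:
            # No ancestor contains this row: start a new lineage.
            n_roots += 1
            label = f"{prefix}{n_roots}"
        labels[name] = label
        stack.append([label, cols, 0])
    return labels


labels = lineage_labels(parse_rows(DATA))
for name, label in labels.items():
    print(name, label)
```

With the rows above, this should reproduce the labels in the last column of the dataframe, with "lin" in place of "L". If the data lives in a pandas DataFrame, the same function could be fed (name, values) pairs built from the 0/1 columns only. An all-zero row would count as included in any row under this subset test, so such rows may need separate handling.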