I have a list of more than 10k strings that look like variations of HN5ML6A02FL4UI_3 (14 letters or digits, then an underscore and a digit from 1 to 6), where some entries are duplicates except for the _1 to _6 suffix.
I am trying to find a way to go through the list and remove the entries whose 14-character prefix (the part before the _1 to _6) has already appeared.
Example of part of the list:
HN5ML6A02FL4UI_3
HN5ML6A02FL4UI_1
HN5ML6A01BDVDN_6
HN5ML6A01BDVDN_1
HN5ML6A02GVTSV_3
HN5ML6A01CUDA2_1
HN5ML6A01CUDA2_5
HN5ML6A02JPGQ9_5
HN5ML6A02JI8VU_1
HN5ML6A01AJOJU_5
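To make the goal concrete, here is a minimal sketch of the result I am after on the sample above. It assumes that the first entry seen for each 14-character prefix is the one to keep (I have not settled on that rule), and the names `sample_ids` and `dedupe_by_prefix` are made up for illustration.

```python
# Sample IDs copied from the list above.
sample_ids = [
    "HN5ML6A02FL4UI_3",
    "HN5ML6A02FL4UI_1",
    "HN5ML6A01BDVDN_6",
    "HN5ML6A01BDVDN_1",
    "HN5ML6A02GVTSV_3",
    "HN5ML6A01CUDA2_1",
    "HN5ML6A01CUDA2_5",
    "HN5ML6A02JPGQ9_5",
    "HN5ML6A02JI8VU_1",
    "HN5ML6A01AJOJU_5",
]


def dedupe_by_prefix(items):
    """Keep the first item seen for each prefix (the text before '_').

    Hypothetical helper; "first one wins" is an assumption.
    """
    seen = set()   # prefixes already encountered
    kept = []      # items to keep, in their original order
    for item in items:
        prefix = item.split("_")[0]
        if prefix not in seen:
            seen.add(prefix)
            kept.append(item)
    return kept


unique_ids = dedupe_by_prefix(sample_ids)
for entry in unique_ids:
    print(entry)
```

On the ten sample lines this keeps seven entries and drops HN5ML6A02FL4UI_1, HN5ML6A01BDVDN_1 and HN5ML6A01CUDA2_5.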
I have tried versions of scripts using regular expressions (for example: var n = /\d+/.exec(info)[0];) that were posted in answers to my previous question.
I also used a modified version of the code from: How can I strip the first 14 characters in an list element using python?
More recently I used the script below, and I am still not getting the correct output.
import os, re

def trunclist(infile, outfile):
    # Keep the first line seen for each ID prefix (the part before '_').
    with open(infile, 'r') as f:
        newlist = []
        seen = []
        for line in f:
            prefix = line.strip().split('_')[0]
            if prefix not in seen:
                newlist.append(line)
                seen.append(prefix)
    print(newlist, seen)
    # write newlist to file, with carriage returns
    with open(outfile, 'w') as out:
        for line in newlist:
            out.write(line)

trunclist('rhodopsins_play', 'hope4')
My inputfile.txt contains more than 10k lines that look like the list above, where the important part is the characters in front of the '_' (underscore). I want to output a file of the uniquified entries in the same form (for example ABCD12356_1), with one entry per prefix.
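For the file-to-file case, something along these lines is what I have in mind. It is only a sketch: `write_unique_ids` is a hypothetical name, it assumes one ID per line, and it keeps the first line seen for each prefix. A set is used for the prefixes because membership tests on a list get slow with 10k+ entries. The demo at the bottom writes a tiny temporary input file so the snippet can be run on its own.

```python
import os
import tempfile


def write_unique_ids(in_path, out_path):
    """Copy lines from in_path to out_path, one per prefix before '_'.

    Hypothetical helper; assumes one ID per line and "first one wins".
    Returns the number of lines written.
    """
    seen = set()
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = line.strip()
            if not record:
                continue  # skip blank lines
            prefix = record.split("_")[0]
            if prefix not in seen:
                seen.add(prefix)
                dst.write(record + "\n")
                kept += 1
    return kept


# Small self-contained demo using a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_in = os.path.join(demo_dir, "rhodopsins_play")
demo_out = os.path.join(demo_dir, "hope4")
with open(demo_in, "w") as f:
    f.write("HN5ML6A02FL4UI_3\nHN5ML6A02FL4UI_1\nHN5ML6A01BDVDN_6\n")

kept_count = write_unique_ids(demo_in, demo_out)
with open(demo_out) as f:
    demo_result = f.read().splitlines()
print(kept_count, demo_result)
```

With the real data the call would be write_unique_ids('rhodopsins_play', 'hope4').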
Can someone help?
Thank you for your help