
I have a list of more than 10k strings that look like variants of HN5ML6A02FL4UI_3 (14 letters or digits, followed by _1 to _6). Some are duplicates except for the _1 to _6 suffix.

I am trying to find a way to list these and remove the entries whose 14-character prefix (the part before the _1 to _6) has already appeared.

Example of part of the list:

HN5ML6A02FL4UI_3

HN5ML6A02FL4UI_1

HN5ML6A01BDVDN_6

HN5ML6A01BDVDN_1

HN5ML6A02GVTSV_3

HN5ML6A01CUDA2_1

HN5ML6A01CUDA2_5

HN5ML6A02JPGQ9_5

HN5ML6A02JI8VU_1

HN5ML6A01AJOJU_5

I have tried versions of scripts using regular expressions, such as var n = /\d+/.exec(info)[0];, that were posted in answers to my previous question.

I also used a modified version of the code from: How can I strip the first 14 characters in an list element using python?

More recently I used this script and I am still not getting the correct output.

import os, re

def trunclist('rhodopsins_play', 'hope4'):
    with open('rhodopsins_play','r') as f:
        newlist=[]
        trunclist=[]
        for line in f:
            if line.strip().split('_')[0] not in trunclist:
                newlist.append(line)
                trunclist.append(line.split('_')[0])
    print newlist, trunclist

    # write newlist to file, with carriage returns
    with open('hope4','w') as out:
        for line in newlist:
            out.write(line)

My inputfile.txt contains more than 10k lines that look like the list above. The important part is the characters in front of the '_' (underscore). I want to output a file with one entry per unique prefix (for example, ABCD12356_1).

Can someone help?

Thank you for your help

hdliv
  • For the example, I would want an output file containing HN5ML6A02FL4UI_3 HN5ML6A01BDVDN_6 HN5ML6A02GVTSV_3 HN5ML6A01CUDA2_1 HN5ML6A02JPGQ9_5 HN5ML6A02JI8VU_1 HN5ML6A01AJOJU_5 – hdliv Dec 08 '14 at 13:32

1 Answer

Run this Python script, which is similar to the one above. It splits each line at the '_' and keeps only the first line seen for each prefix. This worked on my file.

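def trunclist(inputfile, outputfile):
    """Keep only the first line seen for each prefix before the '_'."""
    newlist = []    # lines to keep, in input order
    trunclist = []  # prefixes already seen
    with open(inputfile, 'r') as f:
        for line in f:
            prefix = line.strip().split('_')[0]
            # skip blank lines and prefixes that were already kept
            if prefix and prefix not in trunclist:
                newlist.append(line)
                trunclist.append(prefix)
    print(newlist, trunclist)

    # write newlist to file; each kept line still ends with its own newline
    with open(outputfile, 'w') as out:
        for line in newlist:
            out.write(line)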
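With more than 10k lines, testing membership in a list gets slower as the list grows, so a set may be a better fit for the "already seen" prefixes. Below is a minimal sketch of that variant, not a definitive version. The function name unique_by_prefix and the sample list are my own, for illustration only; it assumes every entry has the form PREFIX_N and that blank lines should be ignored.

```python
def unique_by_prefix(lines):
    """Return the first line seen for each prefix before the first '_'."""
    seen = set()   # prefixes already kept (set lookup is constant time on average)
    kept = []      # kept lines, in input order, without trailing whitespace
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        prefix = line.split('_', 1)[0]
        if prefix not in seen:
            seen.add(prefix)
            kept.append(line)
    return kept


# Sample taken from the list in the question (hypothetical in-memory input).
sample = [
    "HN5ML6A02FL4UI_3\n", "HN5ML6A02FL4UI_1\n",
    "HN5ML6A01BDVDN_6\n", "HN5ML6A01BDVDN_1\n",
    "HN5ML6A02GVTSV_3\n",
    "HN5ML6A01CUDA2_1\n", "HN5ML6A01CUDA2_5\n",
    "HN5ML6A02JPGQ9_5\n", "HN5ML6A02JI8VU_1\n", "HN5ML6A01AJOJU_5\n",
]

for entry in unique_by_prefix(sample):
    print(entry)
```

Because an open file yields its lines one at a time, the same function can be called as unique_by_prefix(f) inside a with open(...) block, and the result written out with one entry per line.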
hdliv