
I am trying to remove lines in a file that start with the same 5 characters. However, those first 5 characters are random (I don't know in advance what they will be).

I have code that reads the last 5 characters of the first line of a file and matches them to the first 5 characters of another line in the file. The problem is that when two or more lines start with the same 5 characters, the code messes up. I need something that reads all the lines in the file and removes one of the two lines that have the same first 5 characters.

Example (issue):

CCTGGATGGCTTATATAAGAT***GTTAT***

***GTTAT***ATAATATACCACCGGGCTGCTT

***GTTAT***ATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT

What I need as the result after one of them is taken out of the file:

CCTGGATGGCTTATATAAGAT***GTTAT***

***GTTAT***ATAATATACCACCGGGCTGCTT

(no third line)

I would greatly appreciate it if you could also explain in words how to go about this.

quant
Alpa Luca
  • Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. [On topic](http://stackoverflow.com/help/on-topic), [how to ask](http://stackoverflow.com/help/how-to-ask), and [... the perfect question](https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) apply here. StackOverflow is not a design, coding, research, or tutorial resource. However, if you follow whatever resources you find on line, make an honest coding attempt, and run into a problem, you'd have a good example to post. – Prune Nov 15 '18 at 20:12
  • Hi and welcome to SO. Your posted question does not appear to include any attempt at all to solve the problem. StackOverflow expects you to try to solve your own problem first, as your attempts help us to better understand what you want. Please edit the question to show what you've tried, so as to illustrate a specific problem you're having in a [MCVE]. For more information, please see [Ask] and take the [Tour]. – quant Nov 15 '18 at 20:16
  • Show us the code you wrote so far so we can see how it can be improved – Milo Bem Nov 15 '18 at 20:19

1 Answer

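You can do this, for example, like so: read the file line by line, remember the first 5 characters of every line you output in a set, and skip any later line whose first 5 characters are already in that set.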

FILE_NAME = "data.txt"                       # the name of the file to read in
NR_MATCHING_CHARS = 5                        # the number of characters that need to match

lines = set()                                # the beginnings of the lines that have already been printed
with open(FILE_NAME, "r") as inF:            # open the file
    for line in inF:                         # for every line
        line = line.strip()                  # remove surrounding whitespace and the newline
        if line == "":                       # skip empty lines
            continue
        beginOfSequence = line[:NR_MATCHING_CHARS]   # the first NR_MATCHING_CHARS characters
        if beginOfSequence not in lines:     # if no printed line starts with these characters yet
            print(line)                      # print the line
            lines.add(beginOfSequence)       # and remember its beginning so later matches are skipped
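
The snippet above only prints the kept lines. If you want them removed from the file itself, one possible variant is sketched below. The names `dedupe_by_prefix` and `rewrite_file` are made up for this illustration, and the sketch assumes the file is small enough to hold in memory and that overwriting it in place is acceptable.

```python
def dedupe_by_prefix(lines, n=5):
    """Return the non-empty lines, keeping only the first line for each n-character prefix."""
    seen = set()   # prefixes of the lines kept so far
    kept = []      # lines to keep, in their original order
    for line in lines:
        line = line.strip()
        if not line:
            continue                 # skip blank lines
        prefix = line[:n]
        if prefix not in seen:
            seen.add(prefix)
            kept.append(line)
    return kept


def rewrite_file(path, n=5):
    """Hypothetical helper: replace the file's contents with the de-duplicated lines."""
    with open(path, "r") as in_file:
        kept = dedupe_by_prefix(in_file, n)
    with open(path, "w") as out_file:   # reopening in "w" mode overwrites the original file
        out_file.writelines(line + "\n" for line in kept)
```

Calling `rewrite_file("data.txt")` would then leave only the first line for each distinct 5-character beginning. Keep a backup of the file while testing, since the original contents are overwritten.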
quant