Read a text file into python by splitting the file into list items according to a set of characters

Question

I have a plain text file with the following contents:

@M00964: XXXXX
YYY
+
ZZZZ 
@M00964: XXXXX
YYY
+
ZZZZ
@M00964: XXXXX
YYY
+
ZZZZ

and I would like to read this into a list split into items according to the ID code @M00964, i.e. :

['@M00964: XXXXX
YYY
+
ZZZZ' 
'@M00964: XXXXX
YYY
+
ZZZZ'
'@M00964: XXXXX
YYY
+
ZZZZ']

I have tried using

in_file = open(fileName,"r")
sequences = in_file.read().split('@M00964')[1:]
in_file.close()

but this removes the ID sequence @M00964. Is there any way to keep this ID sequence in?

As an additional question, is there any way of maintaining whitespace in a list (rather than having \n symbols)?

My overall aim is to read in this set of items, take the first 2, for example, and write them back to a text file maintaining all of the original formatting.

`.read()` reads 1 line. Try with `.readlines()` which reads all the lines, and then split on `'\n'` — fredtantini, Mar 25 '14 at 15:23
Can you elaborate on what you mean by "maintaining white space in a list (rather than have \n symbols)"? The `\n` is just shorthand for "the newline character", which is white space. — Two-Bit Alchemist, Mar 25 '14 at 15:23
What does an item look like? You example list above has 1 item only. — Simon, Mar 25 '14 at 15:24
Looks like FASTQ, where a record is _always_ 4 lines. If you want the first 2 records, just print the first `2*4` lines. — Brave Sir Robin, Mar 25 '14 at 15:24
@fredtantini Err, I think you need to check the [documentation](http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects). Read takes a size argument and, given none, reads _the entire file_. — Two-Bit Alchemist, Mar 25 '14 at 15:25
Hi, sorry by maintaining whitespace I mean when I have taken the top 2 items and print it back to a text file it has whitespace rather than the actual newline character "/n". As an example currently when I print it off again I get @M00964: XXXXX/nYYY/n+ZZZZ/n@M00964 and so on. Does that make sense? — PaulBarr, Mar 25 '14 at 15:27
It is a FASTQ file, the one I have contains 155,000 sequences and I want the user to input a percentage (10% for example) and a new file to be made with 10% of the reads, i.e. the first 15,500 — PaulBarr, Mar 25 '14 at 15:28
@fredtantini rly, read docs. It reads all lines (whole file) if you don't give argument, as it have been pointed out already. — m.wasowski, Mar 25 '14 at 15:32
@Two-BitAlchemist my bad. I was thinking of `readline` indeed. — fredtantini, Mar 25 '14 at 15:32
@fredtantini Just correcting the record. If you had every single method in every single language memorized, I would be worried about you. :) — Two-Bit Alchemist, Mar 25 '14 at 15:39

score 3 · Answer 1 · answered Mar 25 '14 at 15:33

If your file is large and you don't want to hold the whole thing in memory, you can iterate over individual records using this helper function:

def chunk_records(filepath):
    with open(filepath, 'r') as f:
        record = []
        for line in f:
            # could use regex for more complicated matching
            if line.startswith('@M00964') and record:
                # a new ID line closes the previous record
                yield ''.join(record)
                record = []
            record.append(line)  # keep every line, including the ID line
        if record:
            yield ''.join(record)

Use it like

for record in chunk_records('/your/filename.txt'):
    ...

Or if you want the whole thing in memory:

records = list(chunk_records('/your/filename.txt'))
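If only the first few records are needed, the generator does not have to be exhausted. The following is a minimal sketch, assuming each record is a string that already ends with its own newline (as the records yielded above do); the name `write_first_records` and the output path are hypothetical.

```python
from itertools import islice

def write_first_records(records, out_path, n):
    """Write the first n records from an iterable to out_path; return how many were written."""
    count = 0
    with open(out_path, 'w') as out:
        for record in islice(records, n):
            out.write(record)  # records keep their own newlines, so nothing is added
            count += 1
    return count

# Possible usage with the helper from this answer:
# write_first_records(chunk_records('/your/filename.txt'), 'subset.txt', 2)
```

Because `islice` stops after `n` items, the rest of the input file is never read.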

answered Mar 25 '14 at 15:33

Steven Rumbalski


The newlines in the file are important, so OP should join them with `'\n'` instead of `''` – Brave Sir Robin Mar 25 '14 at 15:37
@Steven Rumbalski, thankyou I will have a look at using this now. – PaulBarr Mar 25 '14 at 15:43
@rmartinjak do you mean replace '' in the above code with '/n'? – PaulBarr Mar 25 '14 at 15:44
@martinjak: The newlines are never stripped so they need not be added back in. When iterating over lines in a file python leaves the line ending in. – Steven Rumbalski Mar 25 '14 at 15:51
Oh indeed, I totally forgot about that – Brave Sir Robin Mar 27 '14 at 14:15
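To illustrate the point about newlines: the `\n` that shows up when a whole list is printed is only how Python displays a string inside a list. The string itself still holds real newline characters, and printing or writing the string produces real line breaks. A small sketch with a made-up record:

```python
record = '@M00964: XXXXX\nYYY\n+\nZZZZ\n'
records = [record]

# Printing the list shows the repr of each item, so escapes like \n are visible.
shown_in_list = repr(records)
print(shown_in_list)

# Printing (or writing) the string itself gives four separate lines.
print(records[0])

line_count = len(record.splitlines())
```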

Simon · Answer 2 · 2014-03-25T15:35:50.190

0

Just split on the @ sign instead:

with open(fileName,"r") as in_file:
    # the piece before the first "@" is an empty string, so drop it with [1:]
    sequences = in_file.read().replace("@","###@").split('###')[1:]
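An alternative sketch that avoids the placeholder string: split on the full ID and then put the ID back at the front of each piece. The function name `split_keep_marker` and the sample text are illustrative only. FASTQ quality lines may themselves contain `@`, so splitting on the whole ID is somewhat safer than splitting on `@` alone, though it still assumes the ID never appears elsewhere in a record.

```python
def split_keep_marker(text, marker='@M00964'):
    """Split text on marker and restore the marker at the start of each piece."""
    pieces = text.split(marker)
    # pieces[0] is whatever precedes the first marker (empty if the text starts with it)
    return [marker + piece for piece in pieces[1:]]

sample = "@M00964: XXXXX\nYYY\n+\nZZZZ\n@M00964: XXXXX\nYYY\n+\nZZZZ\n"
sequences = split_keep_marker(sample)
print(len(sequences))
print(sequences[0])
```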

edited Mar 25 '14 at 15:35

answered Mar 25 '14 at 15:23

Simon


Hi, This still removes the @ symbol, ideally I want to keep the entire ID of @M00946 – PaulBarr Mar 25 '14 at 15:30
@Steven Rumbalski's is more RAM efficient if the file is large. I went for the simplest solution. – Simon Mar 25 '14 at 15:36

sshashank124 · Accepted Answer · 2014-03-25T16:25:54.467

0

Specific to your example, can't you just do something as follows:

with open(fileName, 'r') as in_file:
    lines = in_file.readlines()  # each line keeps its trailing '\n'

# Join every 4 consecutive lines into one record
new_list = [''.join(lines[i*4:(i+1)*4]) for i in range(len(lines) // 4)]
list_no_n = [item.replace('\n', '') for item in new_list]

print(new_list)
print(list_no_n)

[EXPANDED FORM]

new_list = []
for i in range(len(lines) // 4):  # Iterates through 1/4 of the number of lines in the file.
                                  # This is because we will be dealing in groups of 4 lines
    new_list.append(''.join(lines[i*4:(i+1)*4]))  # Joins four lines together into a string and adds it to new_list

[Writing to new file]

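# Each item in new_list already ends with '\n', so the items can be written unchanged.
# Splitting on '\n' first would strip the newlines, and writelines() does not add them back.
with open(filename, 'w') as output_file:
    output_file.writelines(new_list)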
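For the percentage case mentioned in the comments on the question (for example, keeping the first 10% of the reads), one possible sketch is below. It assumes every record is exactly 4 lines, as in the sample file; the function name `copy_fraction` and its parameters are hypothetical, and the input is read twice so that it never has to be held in memory.

```python
from itertools import islice

def copy_fraction(in_path, out_path, percent, lines_per_record=4):
    """Copy the first `percent` percent of fixed-size records; return the number of records kept."""
    # First pass: count lines to learn how many complete records there are.
    with open(in_path) as src:
        total_lines = sum(1 for _ in src)
    total_records = total_lines // lines_per_record
    keep = int(total_records * percent / 100)  # whole records only, rounded down

    # Second pass: copy the leading lines unchanged, newlines included.
    with open(in_path) as src, open(out_path, 'w') as dst:
        dst.writelines(islice(src, keep * lines_per_record))
    return keep
```

With 155,000 records and `percent=10`, this would keep the first 15,500 records.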

edited Mar 25 '14 at 16:25

answered Mar 25 '14 at 15:25

sshashank124


Hi, sorry I have only learnt python for the last month or so, could you explain what this code actually does? Thankyou – PaulBarr Mar 25 '14 at 15:31
@user3460300, The code groups every four lines together and joins them into a string. Then it takes the groups of four lines and combines them into a list. I will update my answer to show an expanded form of what I am doing. – sshashank124 Mar 25 '14 at 15:32
@user3460300, Note: I have also created another variable called `list_no_n`. This is the same as `new_list` but does not contain any `\n` characters and therefore can be used for data processing within the script. If you want to write the string to another file, you can use the original `new_list` string with its retained `\n` characters and it will keep its formatting. Of course, you can choose to write maybe only the first two sets of data since it is a list. – sshashank124 Mar 25 '14 at 15:42
so when I write the first two back to a text file how would I do that in a way that would maintain the formatting in the text file rather than having it all one on line with /n characters? Also, thankyou I have managed to get the first bit to work perfectly!! – PaulBarr Mar 25 '14 at 15:54
@user3460300, You would split the list back based on the \n strings using split('\n') and then you would write it line by line to your new file – sshashank124 Mar 25 '14 at 15:56
sorry I know this is a really basic question now but how do I do that. I have a list called new_list where each item is the sequence plus the ID as I needed. I have used raw input and some quick calculations to get the first 10% of items into a subset list. I then what to write this item to a new file line by line. I have then done the following: fileName = raw_input("What do you want to call the new filtered file? ") How do I then write the entire subset list line by line? – PaulBarr Mar 25 '14 at 16:12
@paulbarr, there I updated my answer. Hope that helps. – sshashank124 Mar 25 '14 at 16:31
thank you, finally finished my script! Would upvote you but only joined stackoverflow today – PaulBarr Mar 25 '14 at 16:34
@paulbarr, no problem. And welcome to stackoverflow. Hope you have a nice time – sshashank124 Mar 25 '14 at 16:37
