I'm trying to read a big file (1.1 GB) into Python. The word 'HERE' appears somewhere in the file, but I don't know on which line. I read the file in chunks, and my first chunk is the data up to the word 'HERE'. My code works fine up to that point (that is, storing the data before 'HERE' and processing it). However, I'm unable to proceed with reading the data after 'HERE' because it is too large. Is there any way to read the data after 'HERE' line by line?

I referred to this question: Reading a file until a specific character in python

My code is:

def each_chunk(stream, separator):
  buffer = ''
  while True:  # until EOF
    chunk = stream.read()  # I propose 4096 or so
    if not chunk:  # EOF?
      yield buffer
      break
    buffer += chunk
    while True:  # until no separator is found
      try:
        part, buffer = buffer.split(separator, 1)
      except ValueError:
        break
      else:
        yield part

def first_chunk(chunk):
    .... #my function

def chunk_after(data_line_by_line):
    .... #my function

global This_1st_chunk
This_1st_chunk=True

myFile= open(r"C:\Users\Mavis\myFile.txt","r")
for chunk in each_chunk(myFile, separator='HERE'):
    if This_1st_chunk:
        first_chunk(chunk)
        This_1st_chunk=False
    elif not This_1st_chunk:
        print('*******after 1st chunk*********')
        #**I WANT TO READ THE DATA LINE BY LINE HERE.**
        chunk_after(data_line_by_line)

3 Answers

As I understand the question, you want to split a text file into smaller chunks at the HERE markers. If that is right, try this:

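# myFile is the path to the text file
with open(myFile, "r") as file:
    Data = file.read()  # note: this loads the entire file into memory
    # will create a list where each item is the text between
    # HERE's, not including them
    DataList = Data.split("HERE")
    for n, i in enumerate(DataList):
        # one numbered file per chunk, so earlier chunks are not overwritten
        # (the "Random_<n>.txt" naming is just an example)
        with open(f"Random_{n}.txt", "w") as f:
            f.write(i)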

This separates the chunks into files. You can do the same with newlines to split the text into lines:

DataList = Data.split("\n") # a list containing every line
for i in DataList:
    print(i)  # will print every line
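
Note that Data.split("\n") needs the whole file in memory first. For a file as large as the one in the question, iterating over the file object itself may be a better fit, since it hands back one line at a time. A minimal sketch, assuming io.StringIO as a stand-in for the real open file and lines_seen as an illustrative name:

```python
import io

# io.StringIO stands in for an open text file (an assumption for this demo)
file = io.StringIO("line 1\nline 2\nline 3\n")

lines_seen = []
for line in file:  # the file object yields one line per iteration
    lines_seen.append(line.rstrip("\n"))  # drop the trailing newline
```

With a real file, the same loop would go inside the with open(...) block.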

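You can also read one line at a time from the file object (not from the string Data):

file.readline() # returns 1 line; call it while the file is still open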

You can rejoin the items of the list with str.join:

"string between the items".join(DataList)

Hope this helps!

12ksins

The problem is that the .read() method reads the whole file by default. If the file is large enough, it will not fit in memory. As the official documentation says:

to read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string (in text mode) or bytes object (in binary mode). size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory. Otherwise, at most size characters (in text mode) or size bytes (in binary mode) are read and returned. If the end of the file has been reached, f.read() will return an empty string ('').

You can find further information here: https://docs.python.org/3/tutorial/inputoutput.html.

Instead, as the documentation suggests, you can either pass a size argument to read() or use readline() to get one line at a time.

Examples from the documentation:

>>> f.read()
'This is the entire file.\n'
>>> f.read()
''
>>> f.readline()
'This is the first line of the file.\n'
>>> f.readline()
'Second line of the file\n'
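
As a rough illustration of passing a size to read(), the each_chunk generator from the question could be adapted as below. This is only a sketch: the size default of 4096 and the io.StringIO demo input are assumptions, not part of the original code.

```python
import io

def each_chunk(stream, separator, size=4096):
    """Yield the pieces of `stream` that lie between occurrences of `separator`."""
    buffer = ""
    while True:
        chunk = stream.read(size)  # at most `size` characters per call
        if not chunk:  # an empty string means EOF
            yield buffer
            return
        buffer += chunk
        # Emit every complete piece currently sitting in the buffer.
        while separator in buffer:
            part, buffer = buffer.split(separator, 1)
            yield part

# Demo input (hypothetical); a tiny size forces the separator to span reads.
demo = io.StringIO("headerHEREline 1\nline 2\n")
pieces = list(each_chunk(demo, "HERE", size=4))
```

The buffer still grows until the next separator or EOF is reached, so everything after the last 'HERE' ends up in memory as one piece. If that part is too large, it has to be read line by line instead.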
MaBekitsur

It's probably simpler to read the file line by line up to the first chunk (delimited by "HERE"), then gather all the lines, process that chunk, and keep reading the file line by line afterwards.

Something like this:

with open(r"C:\Users\Mavis\myFile.txt", "r") as myFile:
    chunk = []
    remainder = ""
    first_chunk_found = False
    while not first_chunk_found:
        line = myFile.readline()
        if not line:  # EOF reached without finding "HERE"
            break
        if "HERE" in line:
            first_chunk_found = True
            # split only at the first "HERE" on this line
            line, remainder = line.split("HERE", 1)
            line += "HERE"  # current line up to "HERE"
        chunk.append(line)
    chunk = ''.join(chunk)
    # do whatever you want with the first chunk here.
    # also, the variable remainder has the rest of the line
    # that contained the word "HERE", in case you want it
    for line in myFile:
        # now we process the rest of the file line by line
        pass  # replace with your per-line processing
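
The same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: read_until_marker is a hypothetical name, io.StringIO stands in for the real file, and unlike the code above it leaves the marker itself out of the first chunk.

```python
import io

def read_until_marker(stream, marker="HERE"):
    """Return (text before the marker, rest of the marker's line).

    The stream is left positioned at the line after the marker, so the
    caller can keep iterating over it line by line.
    """
    lines = []
    for line in iter(stream.readline, ""):  # stops at EOF
        before, found, after = line.partition(marker)
        lines.append(before)
        if found:  # marker was on this line
            return "".join(lines), after
    return "".join(lines), ""  # marker never appeared

# Demo input (hypothetical) in place of the 1.1 GB file
demo = io.StringIO("a\nb HERE c\nd\ne\n")
first, remainder = read_until_marker(demo)
rest = [line.rstrip("\n") for line in demo]  # remaining lines, read lazily
```

Only the first chunk is held in memory here. The lines after the marker are consumed one at a time, which is what the question asks for.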
jfaccioni