I am working with a very large text file (500MB+). My code produces the right output, but I am getting a lot of duplicates. What I want to do is check whether a piece of output already exists in the output file before writing it. I am sure it is just one line in an if statement, but I do not know Python well and cannot figure out the syntax. Any help would be greatly appreciated.

Here is the code:

authorList = ['Shakes.','Scott']

with open('/Users/Adam/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Adam/Desktop/2e.txt','r') as open_file:
        the_whole_file = open_file.read()
        for x in authorList:
            start_position = 0
            while True:
                start_position = the_whole_file.find('<A>'+x+'</A>', start_position)
                if start_position < 0:
                    break
                end_position = the_whole_file.find('</W>', start_position)
                output_file.write(the_whole_file[start_position:end_position+4])
                output_file.write("\n")
                start_position = end_position + 4
English Grad

4 Answers

I suggest that you simply keep track of which author data you have already seen, and only write it if you haven't seen it before. You can use a dict to keep track.

authorList = ['Shakes.','Scott']
already_seen = {} # dict to keep track of what has been seen

with open('/Users/Adam/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Adam/Desktop/2e.txt','r') as open_file:
        the_whole_file = open_file.read()
        for x in authorList:
            start_position = 0
            while True:
                start_position = the_whole_file.find('<A>'+x+'</A>', start_position)
                if start_position < 0:
                    break
                end_position = the_whole_file.find('</W>', start_position)
                if end_position < 0:
                    break # no closing </W> found: stop to avoid an endless loop
                author_data = the_whole_file[start_position:end_position+4]
                if author_data not in already_seen:
                    output_file.write(author_data + "\n")
                    already_seen[author_data] = True
                start_position = end_position + 4
steveha
  • +1 for the best answer so far. But using a `set` would be better than a `dict`. – Thomas K Jul 22 '11 at 23:31
  • @steveha According to what English Grad wrote in this post (http://stackoverflow.com/questions/6790915/searching-txt-files-in-python), his file is so big that he can't do ``the_whole_file = open_file.read()``. So I don't understand why he considers the code in his question to be working perfectly. – eyquem Jul 23 '11 at 00:05
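
Following the comment above about preferring a `set`, the same bookkeeping might look like the sketch below. The function name `unique_records` and its parameters are hypothetical, introduced only for illustration. It assumes, as in the question, that the whole text fits in memory and that each record runs from `<A>author</A>` to the next `</W>`.

```python
def unique_records(text, authors, end_tag='</W>'):
    """Yield each distinct record for the given authors, author by author."""
    seen = set()  # records already yielded; `in` on a set is O(1) on average
    for author in authors:
        start_tag = '<A>' + author + '</A>'
        start = 0
        while True:
            start = text.find(start_tag, start)
            if start < 0:
                break
            end = text.find(end_tag, start)
            if end < 0:
                break  # unterminated record: stop rather than loop forever
            record = text[start:end + len(end_tag)]
            if record not in seen:
                seen.add(record)
                yield record
            start = end + len(end_tag)


# Made-up sample text with one duplicated record.
sample = ('<A>Scott</A><W>Marmion</W>'
          '<A>Shakes.</A><W>Hamlet</W>'
          '<A>Scott</A><W>Marmion</W>')
for record in unique_records(sample, ['Shakes.', 'Scott']):
    print(record)
```

A `set` stores only the keys, so there is no dummy `True` value to carry around, which is the main practical difference from the `dict` version.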

Create a list holding every string you write. Before appending a new string, check whether it is already in the list, and skip it if it is.

Niklas R
  • A `dict` is better than a `list` because the `dict` is O(1) access (due to hashing) while the `list` is O(n) where n is the length of the list. A `set` would also be O(1). – steveha Jul 22 '11 at 23:26
  • What does the 'O' mean? The time to access an item? I can't imagine a dictionary being faster than a list. Its keys and values have to be stored in a list as well, or is that wrong? – Niklas R Jul 22 '11 at 23:31
  • Yes, O is to do with time - O(n) means the time increases in proportion with n. With a list, you have to check every entry to see if a value is in there. With a set or a dict, you only check those with the same `hash()`. – Thomas K Jul 22 '11 at 23:34
  • @Niklas Rosenstein, please see the Wikipedia page about the O notation: http://en.wikipedia.org/wiki/Big_O_notation – steveha Jul 25 '11 at 20:27
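
To make the O(n) versus O(1) point in these comments concrete, here is a small sketch that deduplicates the same strings once with a `list` and once with a `set`. The function names and the sample data are hypothetical. Both functions return the same result; only the cost of each `in` test differs, and the timings printed will vary by machine.

```python
import timeit


def dedupe_with_list(items):
    """Keep first occurrences; `in` on a list scans it, so each test is O(n)."""
    kept = []
    for item in items:
        if item not in kept:
            kept.append(item)
    return kept


def dedupe_with_set(items):
    """Keep first occurrences; `in` on a set hashes the item, O(1) on average."""
    seen = set()
    kept = []
    for item in items:
        if item not in seen:
            seen.add(item)
            kept.append(item)
    return kept


# 2,000 distinct strings, each appearing twice (made-up data for the comparison).
data = ['record %d' % i for i in range(2000)] * 2

if __name__ == '__main__':
    print(timeit.timeit(lambda: dedupe_with_list(data), number=1))
    print(timeit.timeit(lambda: dedupe_with_set(data), number=1))
```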

My understanding is that you wish to skip the lines in open_file that contain the names of your authors when writing to output_file. If that is what you intend, do it this way.

authorList = ['Shakes.','Scott']

with open('/Users/Adam/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Adam/Desktop/2e.txt','r') as open_file:
        for line in open_file:
            # skip any line that mentions one of the authors
            if not any(author in line for author in authorList):
                output_file.write(line)
Senthil Kumaran
  • There are no newlines in his text. English Grad mentioned this important detail in a post in another thread. – eyquem Jul 22 '11 at 23:41

I think you should process your file with a tool designed for working with text: regular expressions.

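import re

# bytes pattern, because the files are opened in binary mode
regx = re.compile(b'<A>(.+?)</A>.*?<W>.*?</W>')

with open('/Users/Desktop/2e.txt','rb')         as open_file,\
     open('/Users/Desktop/Poetrylist.txt','wb') as output_file:

    remain = b''
    seen = set()

    while True:
        chunk = open_file.read(65536) # 65536 == 16 x 16 x 16 x 16
        if not chunk:  break
        buf = remain + chunk
        last_end = 0 # end of the last complete match in buf
        for mat in regx.finditer(buf):
            # key on the whole record, so different works by one author are all kept
            if mat.group() not in seen:
                output_file.write( mat.group() + b'\n' )
                seen.add(mat.group())
            last_end = mat.end()
        remain = buf[last_end:] # carry the unmatched tail into the next chunk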
eyquem
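
The chunked-reading idea in this last answer can be tried without a 500MB file. The sketch below wraps it in a hypothetical generator, `iter_unique_records`, and feeds it an in-memory `io.BytesIO` with a deliberately tiny chunk size so that records straddle chunk boundaries. It assumes the `<A>`, `</A>`, `<W>`, `</W>` layout implied by the answer's pattern, and it deduplicates on the whole record, which is an assumption about what counts as a duplicate.

```python
import io
import re

# Bytes pattern: author name in group 1, full match ends at the closing </W>.
RECORD = re.compile(b'<A>(.+?)</A>.*?<W>.*?</W>')


def iter_unique_records(stream, chunk_size=65536):
    """Yield each distinct record from a binary stream, reading chunk by chunk."""
    seen = set()
    remain = b''  # unmatched tail carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = remain + chunk
        last_end = 0
        for mat in RECORD.finditer(buf):
            record = mat.group()
            if record not in seen:
                seen.add(record)
                yield record
            last_end = mat.end()
        remain = buf[last_end:]


# Made-up sample data with one duplicated record.
data = (b'<A>Scott</A><W>Marmion</W>'
        b'<A>Shakes.</A><W>Hamlet</W>'
        b'<A>Scott</A><W>Marmion</W>')
records = list(iter_unique_records(io.BytesIO(data), chunk_size=7))
print(records)
```

Because the unmatched tail is kept until a complete record arrives, a long stretch of text with no match makes the buffer grow; for real data a cap on its size may be worth adding.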