
I have about 30 files, each around 300 MB. Each file contains some information I'm interested in, such as usernames. I want to extract the usernames with a regex and then find the most common ones. Here's my code:

import os
import re
from collections import Counter

rList = []
for fname in os.listdir("."):
    with open(fname, 'r') as f:
        for line in f:
            m = re.search('PATTERN TO FIND USERNAME', line)
            if m:
                rList.append(m.group())
c = Counter(rList)
print c.most_common(10)

Now as you can see, I add every username I find to a list and then call Counter(). This way it takes several minutes to finish. I've tried removing the c=Counter(rList) and calling c.update() every time I finish reading a file, but it won't make any difference, will it?
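For reference, the per-file update variant I tried looks roughly like this (just a sketch; the pattern is still a placeholder):

import os
import re
from collections import Counter

c = Counter()
for fname in os.listdir("."):
    found = []
    with open(fname, 'r') as f:
        for line in f:
            m = re.search('PATTERN TO FIND USERNAME', line)
            if m:
                found.append(m.group())
    c.update(found)  # update the counter once per file instead of once at the end
print c.most_common(10)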

So, is this the best practice? Are there any ways to improve the performance? Thanks!

ChandlerQ
  • It's possible that the bottleneck is reading 9 GB from disk. You should definitely precompile the regex, though. – Sneftel Sep 08 '13 at 15:05
  • Try to run it through a profiler, it will give you some idea about which part you need to optimize. – fjarri Sep 08 '13 at 15:12
  • You seem to have a mistake in the code, the `with open...` line and the next 4 lines should be indented I think. You can use the [`timeit`](http://docs.python.org/2/library/timeit.html) module to figure out where the bottleneck in your code is. @Ben Actually, python will cache the most recent values of regex, so precompiling won't be necessary for this snippet, see [the docs](http://docs.python.org/2/library/re.html#re.compile). However, if there are other regexes in the whole program, compiling might help. – darthbith Sep 08 '13 at 15:15
  • Huh, you learn something new every day. Good tip! – Sneftel Sep 08 '13 at 15:17
  • Use [mmap](http://docs.python.org/2/library/mmap.html) to read lines from files. [Example](http://stackoverflow.com/a/8152106/1288306) on SO. – P̲̳x͓L̳ Sep 08 '13 at 15:19
  • @darthbith yeah, my mistake with editing, and thanks for the regex caching tip! – ChandlerQ Sep 08 '13 at 15:25
  • No problem! I think it is also more "pythonic" to say `if m is not None:` when you want to see if the regex had any output, but that may be just a style thing, see [PEP8](http://www.python.org/dev/peps/pep-0008/#programming-recommendations). Also, sorry I don't have a real answer :-P – darthbith Sep 08 '13 at 15:44
  • Off on a tangent, but have you compared the speed of `grep` from the shell instead of going line-by-line through the file...? – beroe Sep 09 '13 at 03:33
  • @beroe well, some of my mates suggest using linux shell and `grep`, I'm also interested in that but now I have to implement it in Python. I'll give `grep` a try later. Thank you for the suggestion. – ChandlerQ Sep 09 '13 at 16:02

2 Answers


Profiling will show you that there is significant overhead in looping over each line of the file one by one. If the files are always around the size you specified and you can spare the memory, read each one into memory with a single call to .read() and then use a more complex, pre-compiled regex (one that takes line breaks into account) to extract all usernames at once. Then .update() your Counter object with the groups from the matches. This will be about as efficient as it can get.
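A minimal sketch of that approach (the pattern is a placeholder for your real username regex, and re.MULTILINE is one assumption about how it handles line breaks):

import os
import re
from collections import Counter

# Placeholder pattern; replace with the real username regex.
# re.MULTILINE lets ^ and $ match at line breaks inside the whole-file string.
pattern = re.compile(r'PATTERN TO FIND USERNAME', re.MULTILINE)

c = Counter()
for fname in os.listdir("."):
    with open(fname, 'r') as f:
        data = f.read()              # one read per file instead of a per-line loop
    c.update(pattern.findall(data))  # all matches in the file at once

print(c.most_common(10))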

user2722968

If you have the memory then:

  1. Use mmap
  2. Use implicit loops as much as possible

The following fragment should be fast but needs memory:

import os
import re
import mmap
from collections import Counter

patternString = rb'\b[a-zA-Z]+\b'  # byte string creating a byte pattern
pattern = re.compile(patternString)
c = Counter()

for fname in os.listdir("."):
    with open(fname, "rb") as f:
        # Map the whole file read-only; the regex then scans it
        # without an explicit per-line loop.
        mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        c.update(pattern.findall(mm))
print(c.most_common(10))

patternString should be your own username pattern; note that it must be a bytes pattern, since the regex is matched against the mmap's raw bytes.
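One consequence of using a bytes pattern is that findall() returns bytes objects, so the Counter keys are bytes. A small sketch of decoding them for display (assuming the files are UTF-8):

# Counter keys are bytes because the regex matched over the mmap;
# decode for display (assuming UTF-8 encoded files).
for name, count in c.most_common(10):
    print(name.decode("utf-8"), count)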

Paddy3118