Why use regex finditer() rather than findall()

Question

What is the advantage of using finditer() if findall() is good enough? findall() returns all of the matches, while finditer() yields match objects, which can't be processed as directly as a static list.

For example:

import re
CARRIS_REGEX = (r'<th>(\d+)</th><th>([\s\w\.\-]+)</th>'
                r'<th>(\d+:\d+)</th><th>(\d+m)</th>')
pattern = re.compile(CARRIS_REGEX, re.UNICODE)
# Read the whole file, closing the handle when done.
with open("test.txt") as f:
    mailbody = f.read()
for match in pattern.finditer(mailbody):
    print(match)
print()
for match in pattern.findall(mailbody):
    print(match)

Output:

<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>
<_sre.SRE_Match object at 0x00A63758>
<_sre.SRE_Match object at 0x00A63F98>

('790', 'PR. REAL', '21:06', '04m')
('758', 'PORTAS BENFICA', '21:10', '09m')
('790', 'PR. REAL', '21:14', '13m')
('758', 'PORTAS BENFICA', '21:21', '19m')
('790', 'PR. REAL', '21:29', '28m')
('758', 'PORTAS BENFICA', '21:38', '36m')
('758', 'SETE RIOS', '21:49', '47m')
('758', 'SETE RIOS', '22:09', '68m')

I ask this out of curiosity.

The match object contains a great deal more information about the match. — BrenBarn, Sep 10 '16 at 01:53
Why use `(i for i in iterable)` and not `[i for i in iterable]`? Also, as BrenBarn commented, with finditer you can use the match object's many methods, such as `.start()` and `.end()`. It is basically horses for courses: it depends on what you want to do, and in some situations it matters little which one you choose. — Padraic Cunningham, Sep 10 '16 at 01:54
`findall` is indeed sufficient for many use cases. There is no need to use `finditer` unless you want more details from the match object or are processing data large enough that short-circuiting the search or reducing the size of the returned data matters. Usually it doesn't. — tdelaney, Sep 10 '16 at 02:02
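To illustrate the comments above, here is a minimal sketch of what a match object offers beyond the captured groups. The sample string is made up for this example and only mimics the shape of the question's data; the pattern is the one from the question.

```python
import re

# Hypothetical sample text, shaped like one row from the question.
sample = "<th>790</th><th>PR. REAL</th><th>21:06</th><th>04m</th>"
pattern = re.compile(r'<th>(\d+)</th><th>([\s\w\.\-]+)</th>'
                     r'<th>(\d+:\d+)</th><th>(\d+m)</th>')

# findall() gives only the captured groups, one tuple per match.
groups_only = pattern.findall(sample)

# finditer() yields match objects, which also know where they matched.
match = next(pattern.finditer(sample))
print(match.groups())   # the same tuple that findall() produced
print(match.span())     # (start, end) offsets of the whole match
print(match.group(2))   # a single group, selected by index
```

If only the group tuples are needed, calling match.groups() inside the finditer() loop gives the same values that findall() returns.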

Soviut · Answer 1 · 2016-09-10T02:18:23.977

finditer() returns an iterator, while findall() returns a list. An iterator only does work when you ask it for the next item, by calling next() on it (.next() in Python 2). A for loop does this for you, so if you break out of the loop early, the remaining matches are never searched for. A list, on the other hand, has to be fully populated, meaning every match must be found up front.

Iterators can be far more memory- and CPU-efficient, since they only need to hold one item at a time. If you were matching against a very large string (an encyclopedia can be several hundred megabytes of text), finding all matches at once could make the program hang while it searched, and it could run out of memory.
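The early-exit behaviour described above can be sketched roughly as follows. The input text and the stopping condition are invented for illustration; they are not from the question.

```python
import re

# Hypothetical input: a long run of numbers separated by spaces.
text = " ".join(str(n) for n in range(100000))
pattern = re.compile(r"\d+")

# Stop at the first number with more than two digits. After the break,
# this loop does no further searching in the rest of the string.
first_long = None
for match in pattern.finditer(text):
    if len(match.group()) > 2:
        first_long = match.group()
        break

# findall() has to build the complete list before any item can be inspected.
all_numbers = pattern.findall(text)
print(first_long, len(all_numbers))
```

With finditer() the loop stops after roughly a hundred matches, whereas findall() scans the whole string regardless of how many results are actually used.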

score 3 · Accepted Answer · edited May 23 '17 at 12:08

Answer 2 · score 3

Sometimes it's superfluous to retrieve all matches. If the number of matches is really high you could risk filling up your memory loading them all.

Using iterators or generators is an important concept in modern Python. That said, if you have a small text (e.g., this web page), the optimization is minuscule.
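As a rough illustration of the memory point, the sketch below counts matches in two ways. The text is a made-up example; on an input this small the difference is negligible, as noted above.

```python
import re

# Hypothetical text containing many short matches.
text = "ab " * 50000
pattern = re.compile(r"ab")

# Builds a list holding every matched string, then counts it.
count_from_list = len(pattern.findall(text))

# Consumes one match object at a time; no list of results is kept.
count_streamed = sum(1 for _ in pattern.finditer(text))

print(count_from_list, count_streamed)
```

Both approaches give the same count; only the second avoids holding all results in memory at once.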

Here is a related question about iterators: Performance Advantages to Iterators?

answered Sep 10 '16 at 01:56 · Jonatan · edited May 23 '17 at 12:08

You would probably take a memory hit more than a performance hit. – Padraic Cunningham Sep 10 '16 at 01:59
