Extracting numbers from a string using regex in python

Question

I have a list of urls that I would like to parse:

['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']

I would like to use a regular expression to create a new list containing the numbers at the end of each string, plus any letters that follow them before the punctuation (some strings contain numbers in two positions, as the first string in the list above shows). The new list would look like:

['20170303', '20160929a', '20161005a']

This is what I've tried with no luck:

code = re.search(r'?[0-9a-z]*', urls)

Update:

Running -

[re.search(r'(\d+)\D+$', url).group(1) for url in urls]

I get the following error -

AttributeError: 'NoneType' object has no attribute 'group'

Also, it doesn't seem like this will pick up a letter after the numbers if a letter is there..!

Maybe [`re.search(r'.*\D(\d\w*)', s)`](https://regex101.com/r/gZpX4t/2) will do. — Wiktor Stribiżew, Jun 23 '17 at 16:41
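
A quick way to try the pattern from the comment above is sketched below. It also guards against the None result that caused the AttributeError in the update. The helper name last_number and the fourth URL (one without digits) are assumptions added for illustration only.

```python
import re

urls = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
    'http://www.federalreserve.gov/newsevents/speech/index.htm',  # hypothetical URL with no digits
]

# Pattern from the comment: the greedy .* pushes the match to the last place
# where a non-digit is followed by a digit; \d\w* then captures the digits
# and any trailing word characters.
pattern = re.compile(r'.*\D(\d\w*)')

def last_number(url):
    """Return the captured group, or None when nothing matches."""
    m = pattern.search(url)
    return m.group(1) if m else None

codes = [last_number(url) for url in urls]
print(codes)  # ['20170303', '20160929a', '20161005a', None]
```

Checking the match object before calling .group() is what avoids the AttributeError: re.search returns None when a string has no match.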

dawg · Answer 1 · 2017-06-23T16:57:10.800

Given:

>>> lios=['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']

You can do:

import re

for s in lios:
    m = re.search(r'(\d+\w*)\D+$', s)
    if m:
        print(m.group(1))

Prints:

20170303
20160929a
20161005a

Which is based on this regex:

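(\d+\w*)\D+$
 ^              \d+  one or more digits
    ^           \w*  zero or more word characters (picks up a trailing letter)
        ^       \D+  one or more non-digits (the file extension)
           ^    $    end of string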
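
For readability, the same regex can be written with re.VERBOSE so that each part carries its own comment. This is only a sketch of an alternative spelling; the names TRAILING_ID, samples and found are made up for the example.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE (whitespace and
# comments inside the pattern are ignored by the regex engine).
TRAILING_ID = re.compile(r"""
    (\d+\w*)   # group 1: digits, then any word characters (e.g. a trailing 'a')
    \D+        # one or more non-digits, such as '.pdf' or '.htm'
    $          # end of string
""", re.VERBOSE)

samples = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
]

# Keep only the strings that actually match, then take group 1.
found = [m.group(1) for m in map(TRAILING_ID.search, samples) if m]
print(found)  # ['20170303', '20160929a', '20161005a']
```

The earlier 2017 in the first URL is skipped because \D+$ requires that only non-digits remain between the captured group and the end of the string.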

edited Jun 23 '17 at 16:57

answered Jun 23 '17 at 16:47

Did you look at the expected output? – logi-kal Jun 23 '17 at 16:48

score 0 · Answer 2 · answered Jun 23 '17 at 16:51

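You can use this regex: (\d+[a-z]*)\.

It captures a run of digits plus any lowercase letters that directly follow it, provided the next character is a dot.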

Output:

20170303
20160929a
20161005a
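
The answer gives only the pattern, so here is one possible way to apply it in Python. The variable names and the None fallback are assumptions, not part of the original answer.

```python
import re

urls = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
]

# Digits, optional lowercase letters, then a literal dot (not captured).
pattern = re.compile(r'(\d+[a-z]*)\.')

ids = []
for url in urls:
    m = pattern.search(url)
    # Fall back to None instead of raising when a URL has no match.
    ids.append(m.group(1) if m else None)

print(ids)  # ['20170303', '20160929a', '20161005a']
```

One caveat: re.search returns the leftmost match, so a hypothetical host such as www2.example.com would yield '2' rather than the number in the file name. Applying the pattern to the file name only would avoid that.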

answered Jun 23 '17 at 16:51

Youcef LAIDANI

glegoux · Accepted Answer · 2017-06-23T17:04:22.847

# python3

import re
from os.path import basename
from urllib.parse import urlparse

def extract_id(url):
    path = urlparse(url).path
    resource = basename(path)
    # first digit in the file name, then everything up to the first dot
    _id = re.search(r'\d[^.]*', resource)
    if _id:
        return _id.group(0)

urls =['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']

# Note: ids contains None for any URL whose file name has no digit
ids = [extract_id(url) for url in urls]

print(ids)

Output:

['20170303', '20160929a', '20161005a']

edited Jun 23 '17 at 17:04

answered Jun 23 '17 at 16:53

This worked well, except that the output for the first string in the example did not skip the first 2017 -- the output was: ['2017/pdf/lacker_speech_20170303', '20160929a', '20161005a'] – Graham Streich Jun 23 '17 at 16:57
You must have changed something in the regex because now it works, thanks! – Graham Streich Jun 23 '17 at 16:57
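
The earlier revision of the answer is not shown in this thread, so the following is only a guess at what happened. Applying the same pattern to the whole URL path instead of the file name happens to reproduce the output reported in the first comment, which suggests why the basename step matters.

```python
import re
from os.path import basename
from urllib.parse import urlparse

url = ('https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/'
       'president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf')

path = urlparse(url).path

# On the full path, the first digit belongs to the /2017/ directory,
# and [^.]* keeps going until the first dot.
on_path = re.search(r'\d[^.]*', path).group(0)

# On the file name alone, the first digit belongs to the trailing number.
on_basename = re.search(r'\d[^.]*', basename(path)).group(0)

print(on_path)      # 2017/pdf/lacker_speech_20170303
print(on_basename)  # 20170303
```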

Jugurtha Hadjar · Answer 4 · 2017-06-23T17:42:11.567

import re

patterns = {
    'url_refs': re.compile(r"(\d+[a-z]*)\."),  # YCF_L
}

def scan(iterable, pattern=None):
    """Scan for matches in an iterable."""
    for item in iterable:
        # if you want only one, add a comma:
        # reference, = pattern.findall(item)
        # but it's less reusable.
        matches = pattern.findall(item)
        yield matches

You can then do:

hits = scan(urls, pattern=patterns['url_refs'])
references = (item[0] for item in hits)

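Feed references to your other functions. Both scan and references are generators, so items are processed lazily, one at a time, which lets you go through larger collections without building intermediate lists; with the pattern compiled once, I suppose it is also faster. Note that item[0] raises IndexError for any item with no match.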
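
A self-contained run of the approach above might look like the sketch below. The URL_REFS name, the fourth URL (which has no digits) and the None fallback are assumptions added for illustration; the fallback sidesteps the IndexError that indexing an empty findall result would raise.

```python
import re

URL_REFS = re.compile(r"(\d+[a-z]*)\.")

def scan(iterable, pattern):
    """Lazily yield the list of matches for each item."""
    for item in iterable:
        yield pattern.findall(item)

urls = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
    'http://www.federalreserve.gov/newsevents/speech/index.htm',  # hypothetical URL with no digits
]

# Take the first match per URL, or None when findall returned an empty list.
references = [matches[0] if matches else None for matches in scan(urls, URL_REFS)]
print(references)  # ['20170303', '20160929a', '20161005a', None]
```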
