
I have a list of urls that I would like to parse:

['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']

I would like to use a regular expression to build a new list containing the digits at the end of each file name, plus any letters that follow them before the punctuation. (Some strings contain digits in two places, as the first string in the list above shows.) The new list would look like:

['20170303', '20160929a', '20161005a']

This is what I've tried with no luck:

code = re.search(r'?[0-9a-z]*', urls)

Update:

Running -

[re.search(r'(\d+)\D+$', url).group(1) for url in urls]

I get the following error -

AttributeError: 'NoneType' object has no attribute 'group'

Also, it doesn't look like this will pick up a letter after the numbers when one is present.

Youcef LAIDANI
Graham Streich

4 Answers


Given:

lios = ['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']

You can do:

import re

for s in lios:
    # Capture the digits plus any trailing word characters before the extension.
    m = re.search(r'(\d+\w*)\D+$', s)
    if m:
        print(m.group(1))

Prints:

20170303
20160929a
20161005a

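Which is based on this regex:

(\d+\w*)\D+$
 ^              \d+  one or more digits
    ^           \w*  zero or more word characters (letters, digits, underscore)
        ^       \D+  one or more non-digits (here, the file extension)
           ^    $    end of string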
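
The same pattern can be wrapped in a small helper so that a URL without a match yields None instead of raising AttributeError. This is only a sketch, assuming Python 3; the names PATTERN and extract_code are made up for illustration.

```python
import re

# Digits, then any trailing word characters, then non-digits up to the end.
PATTERN = re.compile(r'(\d+\w*)\D+$')

def extract_code(url):
    """Return the trailing code in url, or None if the pattern does not match."""
    m = PATTERN.search(url)
    return m.group(1) if m else None

lios = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
]

codes = [extract_code(s) for s in lios]
print(codes)  # ['20170303', '20160929a', '20161005a']
```

Checking the match object before calling group() is what avoids the 'NoneType' error from the question.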
dawg

You can use the regex (\d+[a-z]*)\. which captures a run of digits, plus any lowercase letters that follow it, immediately before a dot.

Outputs

20170303
20160929a
20161005a
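
In Python, that regex could be applied roughly as follows. This is a sketch, assuming Python 3 and the three URLs from the question; the variable names are illustrative.

```python
import re

urls = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
]

# Group 1 holds the digits and any lowercase letters right before a dot.
pattern = re.compile(r'(\d+[a-z]*)\.')

for url in urls:
    m = pattern.search(url)
    if m:  # search() returns None when nothing matches
        print(m.group(1))
```

One caveat: search() returns the leftmost match, so a host name with digits before a dot (for example www2.example.com) would match before the file name does.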
Youcef LAIDANI
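# python3

import re
from urllib.parse import urlparse
from os.path import basename

def extract_id(url):
    # Search only the last path component (the file name), not the whole URL.
    path = urlparse(url).path
    resource = basename(path)
    # First digit, then everything up to (not including) the next dot.
    _id = re.search(r'\d[^.]*', resource)
    if _id:
        return _id.group(0)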

urls =['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']

# Note: ids contains None for any URL where the pattern does not match.
ids = [extract_id(url) for url in urls]

print(ids)

Output:

['20170303', '20160929a', '20161005a']
glegoux
  • This worked well, except that the output for the first string in the example did not skip the first 2017 -- the output was: ['2017/pdf/lacker_speech_20170303', '20160929a', '20161005a'] – Graham Streich Jun 23 '17 at 16:57
  • You must have changed something in the regex because now it works, thanks! – Graham Streich Jun 23 '17 at 16:57
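
One plausible explanation for the output in the first comment (an assumption, not a record of what the earlier revision of the answer did) is that the pattern was applied to the whole path rather than to the file name alone. The sketch below, assuming Python 3, compares the two.

```python
import re
from os.path import basename
from urllib.parse import urlparse

url = ('https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/'
       'president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf')
path = urlparse(url).path

# Searching the whole path starts at the first digit: the "2017" directory.
from_path = re.search(r'\d[^.]*', path).group(0)
# Searching only the file name skips the directory components.
from_name = re.search(r'\d[^.]*', basename(path)).group(0)

print(from_path)  # 2017/pdf/lacker_speech_20170303
print(from_name)  # 20170303
```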
import re

patterns = {
    'url_refs': re.compile(r"(\d+[a-z]*)\."),  # YCF_L
}

def scan(iterable, pattern=None):
    """Scan for matches in an iterable."""
    for item in iterable:
        # if you want only one, add a comma:
        # reference, = pattern.findall(item)
        # but it's less reusable.
        matches = pattern.findall(item)
        yield matches

You can then do:

hits = scan(urls, pattern=patterns['url_refs'])
references = (item[0] for item in hits)

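Feed references to your other functions. Because scan and references are both generators, URLs are processed lazily, one at a time, so you can work through larger inputs without building intermediate lists. Note that item[0] raises IndexError for any URL with no match.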
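
A self-contained usage sketch, assuming Python 3, might look like this. The fourth URL is a hypothetical entry with no reference, added only to show how empty match lists can be skipped.

```python
import re

url_refs = re.compile(r"(\d+[a-z]*)\.")

def scan(iterable, pattern):
    """Yield the list of matches found in each item."""
    for item in iterable:
        yield pattern.findall(item)

urls = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
    'http://www.federalreserve.gov/newsevents/speech/',  # hypothetical: no reference
]

# Keep the first match per URL and skip URLs with no match at all.
references = [hits[0] for hits in scan(urls, url_refs) if hits]
print(references)  # ['20170303', '20160929a', '20161005a']
```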

Jugurtha Hadjar