Extracting 'year' from a column

Question

I have working code, but I suspect my logic isn't on the right path (although it works). I'd like some help optimizing it: is what I did an acceptable way of doing this, or is there a better way? I suspect the latter, because I know what I did isn't the "right" way.

I have a pandas column of strings, most of which contain a year, and I am trying to extract that year. The problem is that a few entries do not have a year listed. So something like this:

Index	string_values
0	String A (1995)
1	String B (1995)
2	String C (1995)
3	String D has no year
4	String E has (something in braces) AND also the year (2003)

re.search('\d{4}', df['string_values'][0]).group(0) works, but in a for loop it throws this error (I guess when it hits a string without four digits): AttributeError: 'NoneType' object has no attribute 'group'. I think so because len(_temp) gives 15036 and it contains the years; it is just that the loop then throws this error.

Here's the for loop:

_temp = []
for i in df['string_values']:
    year = re.search("\d{4}", i)
    if year.group():
        _temp.append(year.group())
    else:
        _temp.append(None)

Then I also tried the try-except way of doing it, and that works: len(<var>) gives 62423, which is also the total number of rows in the df. Here's the code:

_without_year = []
_with_year = []
for i in df['string_values']:
    year = re.search("\d{4}", i)
    try:
        if year.group():
            # _with_year.append(year.group())
            pass
    except:
        _without_year.append(i)

I just need to know if what I did is acceptable. It works, like I said. _without_year does display all the entries without the year.

The thing with the try-except block is that I am just passing in the if branch and relying on the except to catch the error for the i-th entry.
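For reference, re.search returns None when nothing matches, which is where the AttributeError in the first loop comes from. A minimal sketch of the same loop with an explicit None check instead of try-except follows; the sample strings are taken from the table above, and the names with_year and without_year are only illustrative.

```python
import re

# Sample strings copied from the table in the question.
strings = [
    "String A (1995)",
    "String D has no year",
    "String E has (something in braces) AND also the year (2003)",
]

with_year = []      # extracted 4-digit strings
without_year = []   # original strings in which nothing matched

for s in strings:
    match = re.search(r"\d{4}", s)
    # re.search returns None when there is no match, so test the
    # match object itself before calling .group() on it.
    if match is not None:
        with_year.append(match.group())
    else:
        without_year.append(s)

print(with_year)
print(without_year)
```

This keeps both lists in a single pass and avoids a bare except, which would also hide unrelated errors.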

I think a better approach is to create a function, something like year_search(row), and use apply to grab your information, e.g. df['Year'] = df['string_values'].apply(year_search). This is much cleaner than a loop. – Robert, Jun 01 '22 at 06:35
@Robert - Yes, but I'd like to find the strings that do not have a `year` in them. I suppose a loop is the only way to do it? – Anonymous Person, Jun 01 '22 at 06:50
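A rough sketch of what the apply approach from the comment above might look like. The body of year_search is an assumption (the comment only names the function), and the small DataFrame simply rebuilds the sample from the question. It also suggests a loop is not strictly needed to find the strings without a year, since the rows where the function returned None can be selected with isna.

```python
import re

import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    "string_values": [
        "String A (1995)",
        "String B (1995)",
        "String C (1995)",
        "String D has no year",
        "String E has (something in braces) AND also the year (2003)",
    ]
})


def year_search(text):
    """Hypothetical helper: return the first 4-digit run in text, or None."""
    match = re.search(r"\d{4}", text)
    return match.group() if match else None


df["Year"] = df["string_values"].apply(year_search)

# Rows where no year was found have a missing value in the new column.
no_year = df[df["Year"].isna()]
print(no_year["string_values"].tolist())
```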

Nick · Accepted Answer · 2022-06-01T07:05:47.340

3

You can use extract to extract the year values directly:

df['string_values'].str.extract(r'(?<=\()(\d{4})(?=\))', expand=False)

Output:

0    1995
1    1995
2    1995
3     NaN
4    2003
Name: string_values, dtype: object

Note that I've used lookbehind and lookahead assertions to require that the year occurs inside parentheses. If you don't want that and just want to match a standalone 4-digit string, replace them with \b (word boundary), e.g.:

df['string_values'].str.extract(r'\b(\d{4})\b', expand=False)
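To illustrate how the two patterns differ, here is a small sketch. The last two sample strings are made up for this comparison and do not come from the question.

```python
import pandas as pd

s = pd.Series([
    "String A (1995)",                    # year in parentheses
    "String D has no year",               # no digits at all
    "Released in 2003, no parentheses",   # hypothetical: bare year
    "Serial 123456",                      # hypothetical: longer digit run
])

# Lookarounds: only a 4-digit run directly wrapped in ( ) matches.
in_parens = s.str.extract(r"(?<=\()(\d{4})(?=\))", expand=False)

# Word boundaries: any standalone 4-digit run matches, but digits that
# are part of a longer number do not.
standalone = s.str.extract(r"\b(\d{4})\b", expand=False)

print(pd.DataFrame({"in_parens": in_parens, "standalone": standalone}))
```

With this data, only the first string matches the parenthesised pattern, while the \b pattern also picks up the bare 2003 and still skips the 6-digit serial.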

To convert the output to a list, you can use tolist:

df['string_values'].str.extract(r'(?<=\()(\d{4})(?=\))', expand=False).tolist()

Output:

['1995', '1995', '1995', nan, '2003']

To find the string values that don't contain a year, you can use contains to find matches and invert that to use as an index:

df[~df['string_values'].str.contains(r'(?<=\()\d{4}(?=\))')]

Output:

   Index          string_values
3       3  String D has no year
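One caveat, sketched under the assumption that the column might contain missing values (the question does not say whether it does): str.contains returns a missing value for those rows, and inverting such a result with ~ can fail. Passing na=False treats them as "no match", so they end up among the rows without a year.

```python
import pandas as pd

# Hypothetical data: the last entry is missing entirely.
df = pd.DataFrame({
    "string_values": ["String A (1995)", "String D has no year", None]
})

pattern = r"(?<=\()\d{4}(?=\))"

# na=False makes missing entries count as "does not contain a year",
# so the mask is purely boolean and safe to invert with ~.
has_year = df["string_values"].str.contains(pattern, na=False)
missing_year = df[~has_year]

print(missing_year)
```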

edited Jun 01 '22 at 07:05

answered Jun 01 '22 at 06:35

Nick


Ok yes. This is nice, but I'd like to find out the strings that do not have the year in it. Not that `they do not have a year in it so NaN`. I hope it makes sense. – Anonymous Person Jun 01 '22 at 06:49

@AnonymousPerson perhaps `df[~df['string_values'].str.contains('(?<=\()\d{4}(?=\))')]`? – Nick Jun 01 '22 at 07:00
If you can edit your answer, I'd accept it. – Anonymous Person Jun 01 '22 at 07:02
Accepted. Thank you, kind stranger named Nick. – Anonymous Person Jun 01 '22 at 07:42
