
I've come across a lot of similar questions. However, the answers provided there didn't help me.

I'm trying to run a topic modeling analysis on roughly 8,000 media articles, but I'm getting this error:

Traceback (most recent call last):
  File "extract.py", line 23, in <module>
    if re.compile('^(.*?) - \d{2} [a-zA-Z]{3}. \d{4}$').match(lines[1]):
IndexError: list index out of range

Line 23, which the traceback refers to, is this:

if re.compile('^(.*?) - \d{2} [a-zA-Z]{3}. \d{4}$').match(lines[1]):
    media = lines[1].split(' - ')[0].replace('*', '')
    article = article.replace('\n' + lines[1], '')
    if article.find(media) > -1:
        containsMediaName.write(filename + '\n')

Can anyone help me ignore this error somehow?

Full code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import re
import string
import textract
import unicodedata
from unidecode import unidecode

if not os.path.isdir('./raw'):
    os.mkdir('./raw')

names = open('./deleted-names.txt', 'w')
containsMediaName = open('./contains-media-name.txt', 'w')

for filename in os.listdir('./data'):
    article = unidecode(textract.process('./data/' + filename).decode('utf-8'))

    article = re.sub('<<', '', article)
    article = re.sub('>>', '', article)
    lines = article.split('\n')
    if re.compile('^(.*?) - \d{2} [a-zA-Z]{3}. \d{4}$').match(lines[1]):
        media = lines[1].split(' - ')[0].replace('*', '')
        article = article.replace('\n' + lines[1], '')
        if article.find(media) > -1:
            containsMediaName.write(filename + '\n')

    if re.match('^Pagina \d{1,5}$', lines[2]):
        article = article.replace('\n' + lines[2], '')

    article = re.sub('\nCopyright(.*?)Alle rechten voorbehouden\n', '\n', article)
    article = re.sub('\n\(Foto:(.*?)\)\n', '\n', article)
    article = re.sub('\n\(Fotograaf:(.*?)\)\n', '\n', article)
    article = article.strip().rstrip(' \t\r\n\0')

    lines = article.split('\n')
    name = lines.pop()
    if len(name.split(' ')) <= 3:
        article = re.sub('\n' + name, '', article)
        names.write(name + ',' + filename + '\n')

        initials = '('
        for namePart in name.split(' '):
            initials += namePart[0]
        initials += ')'

        article = article.strip()
        if(article.endswith(initials)):
            article = re.sub(re.escape(initials), '', article)

    article = article.strip().rstrip(' \t\r\n\0')
    f = open('./raw/' + filename + '.txt', 'w')
    f.write(article)
    f.close()

names.close()
containsMediaName.close()
Maarten Fabré
  • Ignoring the error is not the proper solution; you should understand it and resolve it. – omri_saadon Jul 12 '17 at 08:46
  • Post the output of lines. From the list index error I suspect that it doesn't have a 2nd element, i.e. `lines[1]`. – Abhishek L Jul 12 '17 at 08:47
  • Yes, you are right @David. Will correct it. – Abhishek L Jul 12 '17 at 08:50
  • @AbhishekL you mean this?: lines = article.split('\n') name = lines.pop() if len(name.split(' ')) <= 3: article = re.sub('\n' + name, '', article) names.write(name + ',' + filename + '\n') initials = '(' for namePart in name.split(' '): initials += namePart[0] initials += ')' article = article.strip() if(article.endswith(initials)): article = re.sub(re.escape(initials), '', article) – M. M. Van Hulle Jul 12 '17 at 08:59
  • No @M.M.VanHulle. I was referring to line 23 from your exception trace. Just add `print(lines)` before this line and check whether it is a list of length 2 or more. – Abhishek L Jul 12 '17 at 09:08

1 Answer

The line that's failing is attempting to perform a match against some text fed into it.

The regex in there is looking for text that conforms to a pattern like the following examples:

something - 10 marc 1974
something else - 11 apri 2001
another match - 99 xxxx 2004

These look somewhat like article header lines, perhaps? The rest of that block appears to separate the name from the date.

Interestingly, what I'm guessing is a date-matching regex appears to match a 3-letter month identifier plus one additional character: [a-zA-Z]{3} is followed by an unescaped full stop, which matches any single character. This seems like an unlikely date format. Does your data fit this pattern?

If so, then great, but if not, you might want to drop the full-stop after the curly-bracketed {3} and edit the regex to:

'^(.*?) - \d{2} [a-zA-Z]{3} \d{4}$'

On the assumption that your dates are in the more traditional dd Mon yyyy format.
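To see what that trailing full stop actually does, here is a small sketch comparing three variants of the pattern. The sample strings are made up for illustration and are not taken from your data.

```python
import re

# Three variants of the header pattern; only the part after the month differs.
any_char = re.compile(r'^(.*?) - \d{2} [a-zA-Z]{3}. \d{4}$')      # original: '.' matches any character
literal_dot = re.compile(r'^(.*?) - \d{2} [a-zA-Z]{3}\. \d{4}$')  # escaped: only a real full stop
no_dot = re.compile(r'^(.*?) - \d{2} [a-zA-Z]{3} \d{4}$')         # plain dd Mon yyyy

# Hypothetical sample lines, invented for this example.
samples = [
    'Some Paper - 12 jan. 2017',
    'Some Paper - 12 janu 2017',
    'Some Paper - 12 Jan 2017',
]

# For each sample: (matches original, matches escaped dot, matches no dot)
results = {
    s: (bool(any_char.match(s)), bool(literal_dot.match(s)), bool(no_dot.match(s)))
    for s in samples
}

for sample, flags in results.items():
    print(sample, flags)
```

If your dates are abbreviated with a real full stop, the escaped variant is probably the closest fit, since it rejects four-letter "months" that the original pattern lets through.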

As others have said, you may also want to explore more graceful exception handling to cover situations where your code does not find a date-match, or where the lines object contains fewer than 2 elements.

This is well documented; a search for "exception handling in python" will turn up plenty of material.

But to get you up and running, something simple like this would do:

try:
    if re.compile(r'^(.*?) - \d{2} [a-zA-Z]{3}. \d{4}$').match(lines[1]):
        media = lines[1].split(' - ')[0].replace('*', '')
        article = article.replace('\n' + lines[1], '')
        if article.find(media) > -1:
            containsMediaName.write(filename + '\n')
except IndexError:
    # lines has fewer than 2 elements, so lines[1] does not exist
    print("The length of the lines object is {ln}".format(ln=len(lines)))

This will report the length of the lines object whenever the element at index 1 (i.e. the 2nd element) doesn't exist, which is what the error message suggests is the problem.
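An alternative to catching the exception is to check the length of lines before indexing into it. The following is a minimal sketch of that idea; `extract_media` and `HEADER_RE` are hypothetical names introduced here, not part of your script, and the sample text is invented.

```python
import re

# Same pattern as in the question, written as a raw string and compiled once.
HEADER_RE = re.compile(r'^(.*?) - \d{2} [a-zA-Z]{3}. \d{4}$')


def extract_media(article):
    """Return (media, cleaned_article); media is None when no header line is found.

    Hypothetical helper, not part of the original script.
    """
    lines = article.split('\n')
    # Only touch lines[1] when it exists, so short documents are skipped.
    if len(lines) > 1 and HEADER_RE.match(lines[1]):
        media = lines[1].split(' - ')[0].replace('*', '')
        return media, article.replace('\n' + lines[1], '')
    return None, article


print(extract_media('Title\nSome Paper - 12 jan. 2017\nBody text'))
print(extract_media('single line without newline'))
```

The same kind of check (`len(lines) > 2`) would be needed before the `lines[2]` test further down in your script, since that line can fail in the same way.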

[edit] Seeing your code, it appears that for at least one file the extracted text contains no newline characters, so split returns a single-element list, and this produces the error when attempting to fetch the element at index 1 from lines. I also note that the text you're reading is Dutch rather than English, so my comment about regex reformatting is probably not relevant.
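To find out which files are affected, something along these lines might help. It is only a sketch: `short_documents` is a hypothetical helper, and the dictionary of texts stands in for whatever textract returns for your files.

```python
def short_documents(texts, minimum=3):
    """Return the names whose text splits into fewer than `minimum` lines.

    The default of 3 reflects that the script reads lines[1] and lines[2].
    """
    return [name for name, text in texts.items()
            if len(text.split('\n')) < minimum]


# Invented sample data: file name -> extracted text.
texts = {
    'a.pdf': 'Title\nSome Paper - 12 jan. 2017\nPagina 3\nBody',
    'b.pdf': 'only one line',
    'c.pdf': '',
}

# str.split always returns at least one element, even for an empty string,
# so both 'b.pdf' and 'c.pdf' produce a list of length 1.
print(short_documents(texts))
```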

Thomas Kimber