IndexError: list index out of range while using AWS

Question

When I ran this code in Jupyter and on a virtual machine, it ran smoothly. But when I run it on AWS, it always fails with "list index out of range". I would like to know how to fix this problem. Thanks!

Code:

from datetime import datetime, timedelta
from time import strptime
import requests
from lxml import html
import re
import time
import os
import sys

from pandas import DataFrame
import numpy as np
import pandas as pd

import sqlalchemy as sa
from sqlalchemy import create_engine
from sqlalchemy.sql import text as sa_text
import pymysql


date_list=[]
for i in range(0,2):
    duration=datetime.today() - timedelta(days=i)
    forma=duration.strftime("%m-%d")
    date_list.append(forma)

print(date_list)



def curl_topic_url_hot():
    url = 'https://www.xxxx.com/topiclist.php?f=397&p=1'
    headers = {'User-Agent': 'aaaaaaaaaaaaaaa'}
    response = requests.get(url, headers=headers)
    tree = html.fromstring(response.text)
    output = tree.xpath("//div[@class='pagination']/a[7]")
    maxPage = int(output[0].text)
    print('There are', maxPage, 'pages.')

    return [maxPage]

topic_url_hot = curl_topic_url_hot()

AWS log:

['02-12', '02-11']
Traceback (most recent call last):
  File "/home/hadoop/ellen_crawl/test0211_mobile.py", line 167, in <module>
    topic_url_hot = curl_topic_url_hot()
  File "/home/hadoop/ellen_crawl/test0211_mobile.py", line 48, in curl_topic_url_hot
    maxPage = int(output[0].text)
IndexError: list index out of range

When I run this code in Jupyter, it prints:

['02-12', '02-11']
There are 818 pages.

@PatrickArtner when I try to print tree, it shows – Lara19 Feb 12 '19 at 08:05

score 3 · Answer 1 · answered Feb 12 '19 at 08:03

You could either use

if len(output) > 0:
    maxPage = int(output[0].text)

Or

try:
    maxPage = int(output[0].text)
except IndexError:
    pass  # handle the error here, e.g. log it

In either case, your original code does not produce the result you think it does.
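
As a minimal sketch of the first option, the check can live in a small helper that returns None when the XPath result is empty, so the caller can decide what to do. The name parse_max_page is hypothetical, and the SimpleNamespace objects only stand in for lxml elements so the example runs without a network request:

```python
from types import SimpleNamespace


def parse_max_page(output):
    """Return the page count from an XPath result list, or None if it is empty."""
    if not output:  # empty list: the pagination link was not in the HTML
        return None
    return int(output[0].text)


# Stand-ins for lxml elements; only the .text attribute matters here.
found = [SimpleNamespace(text="818")]
missing = []

print(parse_max_page(found))
print(parse_max_page(missing))
```

A None result then means the page you received did not contain the pagination block, which is the thing to investigate.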

score 3 · Answer 2 · Patrick Artner · 2019-02-12T08:24:50.137

You can get rid of the error either by checking that the result is non-empty before indexing into it, or by catching the error with try/except:

if len(output) > 0:
    maxPage = int(output[0].text)

try:
    maxPage = int(output[0].text)
except IndexError as e:
    pass  # log the error or handle it some other way

Your real problem lies elsewhere:

Your request does not return what you think it does. Maybe AWS does not support what you want to do, so the request is blocked and returns nothing? Maybe you have a typo in your URL?

Some ideas:

inspect the content of tree
inspect your AWS logs
inspect the response for its status code
try the URL manually (you did that; this is more for others who find this later on)
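
The response checks above can be sketched as a small helper that you call right after requests.get and before parsing. This is only an illustration: summarize_response and preview_chars are hypothetical names, and the 403 response below is made-up sample data standing in for whatever your AWS host actually receives. The helper only relies on the status_code and text attributes that a requests response provides:

```python
from types import SimpleNamespace


def summarize_response(response, preview_chars=200):
    """Build a short diagnostic string from a requests-style response object."""
    body = response.text or ""
    # Collapse whitespace so the preview fits on one log line.
    preview = " ".join(body[:preview_chars].split())
    return "status={} length={} preview={!r}".format(
        response.status_code, len(body), preview
    )


# Hypothetical blocked response; a real one would come from requests.get(...).
blocked = SimpleNamespace(
    status_code=403,
    text="<html><body>Access denied</body></html>",
)
print(summarize_response(blocked))
```

If the status code is not 200, or the preview shows an error or block page instead of the topic list, the XPath has nothing to match and output stays empty.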

edited Feb 12 '19 at 08:24

answered Feb 12 '19 at 08:05

Patrick Artner


It prints nothing if I use try/except. I should probably try it with BeautifulSoup! – Lara19 Feb 12 '19 at 08:12
@Lara have a look at https://stackoverflow.com/a/14896505/7505395 - this could help you get to the content you want (not an answer to this question though :o) – Patrick Artner Feb 12 '19 at 08:15

score -2 · Answer 3 · answered Feb 12 '19 at 07:59

When your AWS instance requests this page, the site returns an error HTML page. Check the response: https://www.xxxx.com/topiclist.php?f=397&p=1

DivideBy0


do you really think that www.xxxx.com is the _real_ url used here? – Patrick Artner Feb 12 '19 at 08:00
Of course not. You can check your AWS log and print the HTML result. – DivideBy0 Feb 12 '19 at 08:01
Lara has given only an example URL, not the actual URL @DivideBy0 – Nihal Feb 12 '19 at 08:03
The OP obfuscated the URL she is using - the obfuscated URL does not lead to a valid target so it throws an error. Your answer based on that error does not make sense. – Patrick Artner Feb 12 '19 at 08:12
What I mean is: check the HTML result, not the URL – DivideBy0 Feb 12 '19 at 08:18
