When I ran this code in Jupyter and on a virtual machine, it ran smoothly. But when I run it on AWS, it always fails with IndexError: list index out of range. I would like to know how to fix this problem. Thanks!

Code:

from datetime import datetime, timedelta
from time import strptime
import requests
from lxml import html
import re
import time
import os
import sys

from pandas import DataFrame
import numpy as np
import pandas as pd

import sqlalchemy as sa
from sqlalchemy import create_engine
from sqlalchemy.sql import text as sa_text
import pymysql


date_list=[]
for i in range(0,2):
    duration=datetime.today() - timedelta(days=i)
    forma=duration.strftime("%m-%d")
    date_list.append(forma)

print(date_list)



def curl_topic_url_hot():
    url = 'https://www.xxxx.com/topiclist.php?f=397&p=1'
    headers = {'User-Agent': 'aaaaaaaaaaaaaaa'}
    response = requests.get(url, headers=headers)
    tree = html.fromstring(response.text)
    output = tree.xpath("//div[@class='pagination']/a[7]")
    maxPage = int(output[0].text)
    print('There are', maxPage, 'pages.')

    return [maxPage]

topic_url_hot = curl_topic_url_hot()

AWS log:

['02-12', '02-11']
Traceback (most recent call last):
  File "/home/hadoop/ellen_crawl/test0211_mobile.py", line 167, in <module>
    topic_url_hot = curl_topic_url_hot()
  File "/home/hadoop/ellen_crawl/test0211_mobile.py", line 48, in curl_topic_url_hot
    maxPage = int(output[0].text)
IndexError: list index out of range

When I ran this code in Jupyter, it showed:

['02-12', '02-11']
There are 818 pages.
Lara19

3 Answers

You could either use

if len(output) > 0:
    maxPage = int(output[0].text)

Or

try:
    maxPage = int(output[0].text)
except IndexError:
    pass  # do something with the error, e.g. log it

In either case, your original code does not produce the result you think it does.
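
As a rough sketch of how the two options above could be combined so that the failure explains itself: the helper below checks for an empty XPath result and raises an error that carries the HTTP status and the start of the body. Note that extract_max_page is a hypothetical helper name of mine, not part of lxml or requests, and I am assuming you would pass in response.status_code and response.text from your own function.

```python
from types import SimpleNamespace


def extract_max_page(output, status_code, body):
    """Return the page count from an XPath result list.

    Raises ValueError with diagnostic details when nothing matched,
    instead of the bare IndexError raised by output[0].
    """
    if not output:
        raise ValueError(
            f"XPath matched nothing (HTTP {status_code}, "
            f"{len(body)} characters received): {body[:200]!r}"
        )
    return int(output[0].text)


# Stand-in for an lxml element, only to show the call shape without a network request.
fake_match = [SimpleNamespace(text="818")]
print(extract_max_page(fake_match, 200, "<html>...</html>"))
```

Inside curl_topic_url_hot you would call it as extract_max_page(output, response.status_code, response.text), so the AWS log shows what the server actually sent back.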

Jan

You can avoid the error either by checking that the result is non-empty before indexing into it, or by catching the exception with try/except:

if len(output) > 0:
    maxPage = int(output[0].text)

try:
    maxPage = int(output[0].text)
except IndexError as e:
    pass  # log it or do something with it

Your real problem lies elsewhere:

Your request does not return what you think it does. Maybe AWS does not support what you are trying to do, so the request is blocked and returns nothing? Maybe you have a typo in your URL?

Some ideas:

  • inspect the content of tree
  • inspect your AWS logs
  • inspect the response for its status code
  • try the URL manually (you did that; this is more for others who find this later on)
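
The checks in the list above could be sketched roughly as follows. This is only an illustration: summarize_response and fetch_and_summarize are hypothetical names of mine, and looking for the string "pagination" in the body is my assumption based on the XPath in the question, not something I know about the site.

```python
def summarize_response(status_code, final_url, body, marker="pagination"):
    """Collect the facts worth comparing between the working and the failing host."""
    return {
        "status_code": status_code,    # 200 on success; 403 or 5xx often means blocked
        "final_url": final_url,        # differs from the requested URL after a redirect
        "body_length": len(body),
        "has_marker": marker in body,  # is the expected markup present at all?
        "body_preview": body[:300],
    }


def fetch_and_summarize(url, headers):
    """Fetch the page and summarize it; run this on both machines and compare."""
    # Imported here so that summarize_response can be used without requests installed.
    import requests

    response = requests.get(url, headers=headers, timeout=10)
    return summarize_response(response.status_code, response.url, response.text)


# Example with a made-up blocked response (no network access needed).
print(summarize_response(403, "https://example.invalid/denied", "<html>Access denied</html>"))
```

If has_marker is False on AWS but True locally, the site is serving AWS a different page, and no change to the XPath will fix that.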
Patrick Artner
  • It prints nothing if I use try/except; I should probably try it with BeautifulSoup! – Lara19 Feb 12 '19 at 08:12
  • @Lara have a look at https://stackoverflow.com/a/14896505/7505395 - this could help you get to the content you want (not an answer to this question though :o) – Patrick Artner Feb 12 '19 at 08:15

When your AWS machine requests this page, the site returns an error HTML page. Check what the response contains: https://www.xxxx.com/topiclist.php?f=397&p=1

DivideBy0