Trying to scrape Apply Now and Learn More URLs but not able to get them using Beautiful Soup and Python

Question

I am scraping this link : https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds

and want to get the Apply Now and Learn More URLs.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import json, requests, re


AMEXurl = ['https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds']
identity = ['filmstrip_container']



html_1 = urlopen(AMEXurl[0])
soup_1 = BeautifulSoup(html_1,'lxml')
address = soup_1.find('div',attrs={"class" : identity[0]})


for x in address.find_all('a',id = 'html-link'):
    print(x)

The output contains links that do not work:

<a href="https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge&amp;intlink=in-amex-cardshop-allcards-apply-AmericanExpressPlatinum-carousel&amp;cpid=100370494&amp;sourcecode=A0000FCRAA" id="html-link"><div><span>Apply Now</span></div></a>
<a href="charge-cards/platinum-card/?linknav=in-amex-cardshop-allcards-learn-AmericanExpressPlatinum-carousel&amp;cpid=100370494&amp;sourcecode=A0000FCRAA" id="html-link"><div><span>Learn More</span></div></a>
<a href="https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge&amp;intlink=in-amex-cardshop-allcards-apply-AmericanExpressPlatinum-carousel&amp;cpid=100370494&amp;sourcecode=A0000FCRAA" id="html-link"><div><span>Apply Now</span></div></a>
<a href="charge-cards/platinum-card/?linknav=in-amex-cardshop-allcards-learn-AmericanExpressPlatinum-carousel&amp;cpid=100370494&amp;sourcecode=A0000FCRAA" id="html-link"><div><span>Learn More</span></div></a>

The following is an image of the HTML from which I am trying to get the Apply Now and Learn More URLs:

This is the section of page from where I'd like to get the urls:

I'd like to know what changes are needed in the code so that I get the Apply Now and Learn More URLs for all 7 cards.

score 2 · Answer 1 · answered Feb 12 '21 at 05:25

You can adapt this to your own lists and syntax, but it gets the links I believe you want. Note that find does not return what is needed, whereas find_all with href=True, taking the first match, does.

import re

import requests
from bs4 import BeautifulSoup

nurl = 'https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds'
npage = requests.get(nurl)
nsoup = BeautifulSoup(npage.text, "html.parser")

# First "Apply Now" link (already an absolute URL)
for link in nsoup.find_all('a', string=re.compile('Apply Now'), href=True)[0:1]:
    print(link.get('href'))
# First "Learn More" link (relative, so prepend the site's base URL)
for link in nsoup.find_all('a', string=re.compile('Learn'), href=True)[0:1]:
    print('https://www.americanexpress.com/in/' + link.get('href'))

Output

https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge&intlink=in-amex-cardshop-allcards-apply-AmericanExpressPlatinum-carousel&cpid=100370494&sourcecode=A0000FCRAA
https://www.americanexpress.com/in/charge-cards/platinum-card/?linknav=in-amex-cardshop-allcards-learn-AmericanExpressPlatinum-carousel&cpid=100370494&sourcecode=A0000FCRAA
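
On why the links printed in the question look broken: printing a whole tag serialises it back to HTML, so `&` appears as `&amp;`, whereas reading the `href` attribute returns the decoded URL. Below is a minimal sketch, assuming a shortened copy of two anchors from the question's output as stand-in HTML (the names `sample_html` and `links` are illustrative only). It also drops the `[0:1]` slice, so every matching anchor present in the static HTML is kept.

```python
import re

from bs4 import BeautifulSoup

# Stand-in HTML: two anchors copied (shortened) from the question's output.
sample_html = (
    '<a href="https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do'
    '?perform=IntlEapp:IND:platinum_charge&amp;cpid=100370494" id="html-link">'
    '<div><span>Apply Now</span></div></a>'
    '<a href="charge-cards/platinum-card/?cpid=100370494&amp;sourcecode=A0000FCRAA" '
    'id="html-link"><div><span>Learn More</span></div></a>'
)
soup = BeautifulSoup(sample_html, "html.parser")

# No [0:1] slice, so every anchor whose text matches is kept.
links = soup.find_all("a", string=re.compile("Apply Now|Learn More"), href=True)

for link in links:
    # str(link) would show '&amp;' because the tag is serialised back to HTML;
    # link["href"] is the decoded attribute value with a plain '&'.
    print(link.get_text(), "-", link["href"])
```

This only covers anchors that are actually in the downloaded HTML; links that the page loads later are not reachable this way.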

I appreciate the code that you wrote, but I want to somehow get all the Apply Now URLs as well as all the Learn More URLs of all the 7 cards present. — Ali Baba, Feb 12 '21 at 09:15

Martin Evans · Answer 2 · 2021-02-12T17:54:48.233

Not all of the URLs you are looking for are in the HTML. A further request is required, which returns the card information as JSON. That request needs a session ID, which can be read from a script in the page. For example:

from bs4 import BeautifulSoup
import requests
import json
    
url = 'https://www.americanexpress.com/in/credit-cards/all-cards/?sourcecode=A0000FCRAA&cpid=100370494&dsparms=dc_pcrid_408453063287_kword_american%20express%20credit%20card_match_e&gclid=Cj0KCQiApY6BBhCsARIsAOI_GjaRsrXTdkvQeJWvKzFy_9BhDeBe2L2N668733FSHTHm96wrPGxkv7YaAl6qEALw_wcB&gclsrc=aw.ds'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')

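# Find the inline script holding the page data and read the session ID from it
session_id = None   # avoids shadowing the built-in id()
for script in soup.find_all('script'):
    if script.contents and "intlUserSessionId" in script.contents[0]:
        # The JSON object starts at the first '{' in the script text
        json_raw = script.contents[0][script.contents[0].find('{'):]
        json_data = json.loads(json_raw)
        session_id = json_data["pageData"]["pageValues"]["intlUserSessionId"]

# Request the card data as JSON, passing the session ID
url2 = 'https://acquisition-1.americanexpress.com/api/acquisition/digital/v1/shop/us/cardshop-api/api/v1/intl/content/compare-cards/in/default'
r2 = requests.get(url2, params={'sessionId': session_id})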
json_data = r2.json()

for entry in json_data:
    cta_group = entry["ctaGroup"][0]
    click_url = cta_group['clickUrl']
    print(f"{cta_group['text']} - {click_url}")

    learn_more = entry['learnMore']['ctaGroup'][0]
    print(f"{learn_more['text']} - {learn_more['clickUrl']}")

This would give you the following links:

Apply Now - https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:membershiprewards_credit&feePay=P1
Learn more - credit-cards/membership-rewards-card/
Apply Now - https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:travel_platinum&feePay=T1
Learn more - credit-cards/platinum-travel-credit-card/
Apply Now - https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:gold_charge&feePay=G4&intlink=mainapplynow
Learn more - charge-cards/gold-card/
Apply Now - https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_reserve&feePay=LV&intlink=mainapplynow
Learn more - credit-cards/platinum-reserve-credit-card/
Learn more - credit-cards/jet-airways-platinum-credit-card/
Learn more - credit-cards/jet-airways-platinum-credit-card/
Apply Now - https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge
Learn more - charge-cards/platinum-card/
Learn more - credit-cards/payback-card/
Learn more - credit-cards/payback-card/
Apply Now - https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:smart_earn&feepay=ES1
Learn more - credit-cards/smart-earn-credit-card/

The Learn more URLs are relative, so the site's base URL needs to be prepended to them.
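
One way to turn the relative Learn more links into full URLs is `urljoin` from the standard library. This is a minimal sketch, assuming the base is the `/in/` site root used in the first answer; `BASE_URL` and `absolute_url` are illustrative names, not part of either answer.

```python
from urllib.parse import urljoin

# Assumed base: the /in/ site root used in the first answer.
BASE_URL = "https://www.americanexpress.com/in/"


def absolute_url(href):
    """Resolve a relative href against BASE_URL; absolute URLs pass through unchanged."""
    return urljoin(BASE_URL, href)


# A relative Learn more link gets the base prepended.
print(absolute_url("credit-cards/membership-rewards-card/"))
# An Apply Now link is already absolute and is returned as-is.
print(absolute_url("https://global.americanexpress.com/acq/intl/dpa/japa/ind/pers/begin.do?perform=IntlEapp:IND:platinum_charge"))
```

Using `urljoin` rather than string concatenation means the same helper can be applied to both kinds of link without checking which one it is.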

@Martin Evans I highly appreciate the code that you wrote, but it misses the links of the 7th card. I'd appreciate it if I could get those as well. — Ali Baba, Feb 12 '21 at 16:13
@Martin Evans I tried removing the if statement but the problem still persists. — Ali Baba, Feb 12 '21 at 16:35
@Martin Evans I'd be glad if you could upvote some of my questions so that I get a chance to ask more questions. — Ali Baba, Feb 13 '21 at 05:48
