
I'm making a web crawler for a recipe website and I would like to get the link for a recipe, then use that link to get the ingredients. I am able to do that, but only by manually entering the link to the recipe. Is there a way to get the link and then use it to look at the ingredients? Also, I'll take any suggestions on how to make this code better!

import requests
from bs4 import BeautifulSoup

def trade_spider():
    url = 'https://tasty.co/topic/best-vegetarian'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')

    # Print the href of every recipe link on the topic page
    for link in soup.find_all('a', {'class': 'feed-item analyt-internal-link-subunit'}):
        test = link.get('href')
        print(test)

def ingredient_spider():
    url1 = 'https://tasty.co/recipe/peanut-butter-keto-cookies'
    source_code1 = requests.get(url1)
    new_text = source_code1.text
    soup1 = BeautifulSoup(new_text, 'lxml')
    # Print every ingredient listed on the recipe page
    for ingredients in soup1.find_all("li", {"class": "ingredient xs-mb1 xs-mt0"}):
        print(ingredients.text)

3 Answers


To do this, ensure that your function returns its output rather than printing it (to understand the difference, try reading the top answer on this post: What is the formal difference between "print" and "return"?).
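
As a minimal illustration of the difference (the function name here is hypothetical):

def get_links():
    return ['a', 'b']   # return hands the value back to the caller

links = get_links()     # links now holds the list and can be reused
print(links)            # print only displays a value; it produces nothing reusable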

You can then use the output of the function either as a variable, or pass it directly into the next function. For example:

x = trade_spider()

or

newFunction(trade_spider())
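
Applied to your crawler, a minimal sketch of that idea could look like this (it assumes ingredient_spider is reworked to accept a URL, as shown in the next answer):

def trade_spider():
    url = 'https://tasty.co/topic/best-vegetarian'
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    # Return the hrefs instead of printing them
    return [link.get('href')
            for link in soup.find_all('a', {'class': 'feed-item analyt-internal-link-subunit'})]

for recipe_url in trade_spider():
    ingredient_spider(recipe_url)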

You need to call the ingredient_spider function for every link you get from your topic page. Using your example, it would look like this:

def trade_spider():
    url = 'https://tasty.co/topic/best-vegetarian'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')

    for link in soup.find_all('a', {'class': 'feed-item analyt-internal-link-subunit'}):
        test = link.get('href')
        ingredient_spider(test)

def ingredient_spider(url):
    source_code1 = requests.get(url)  # receive url from trade_spider function
    new_text = source_code1.text
    soup1 = BeautifulSoup(new_text, 'lxml')
    for ingredients in soup1.find_all("li", {"class": "ingredient xs-mb1 xs-mt0"}):
        print(ingredients.text)

For each link you get from test = link.get('href'), you call ingredient_spider(), passing the test variable as the argument.
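
To run the whole crawl you then only need to call the top-level function, for example behind a standard main guard:

if __name__ == '__main__':
    trade_spider()  # fetches the topic page, then prints the ingredients of every linked recipe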


I am honestly not sure I understand correctly what you are asking, but if I do, you could go with something like this:

  • first create a list of URLs
  • second create a function that can work on a URL
  • last create a worker that works off that list piece by piece


def first():
    URLs = []
    ...
    for link in soup.find_all('a', {'class': 'feed-item analyt-internal-link-subunit'}):
        URLs.append(link.get('href'))
    return URLs

def second(url):
    source_code1 = requests.get(url)
    new_text = source_code1.text
    soup1 = BeautifulSoup(new_text, 'lxml')
    # Collect all ingredients; a bare return inside the loop
    # would hand back only the first one
    ingredients = []
    for ingredient in soup1.find_all("li", {"class": "ingredient xs-mb1 xs-mt0"}):
        ingredients.append(ingredient.text)
    return ingredients

def third(URL_LIST):
    for URL in URL_LIST:
        tmp = second(URL)
        print(tmp)

URL_LIST = first()
third(URL_LIST)