I am using django-dynamic-scraper in one of my applications. I have gone through the docs, and my setup is as follows.

The object class URL I am using is: http://www.example.com/products/brandname_products.html

The pagination on the site is something like the following.

page 1: http://www.example.com/products/brandname_products.html
page 2: http://www.example.com/products/brandname_products2.html
page 3: http://www.example.com/products/brandname_products3.html
page 4: http://www.example.com/products/brandname_products4.html

The brandname in the above URLs is dynamic and depends on the brand's products page. I cannot have a different scraper for each brand, as there are over 10,000 brands, so I am trying to use a single scraper object.

In the scraper object that I am using I have defined the pagination options as follows:

pagination_type: RANGE_FUNCT
pagination_append_str: _products{page}.html
pagination_page_replace: 1,100,2

But the scraper requests the following pagination URLs:

http://www.example.com/products/brandname_products.html_products2.html
http://www.example.com/products/brandname_products.html_products3.html
http://www.example.com/products/brandname_products.html_products4.html

Instead of

http://www.example.com/products/brandname_products2.html
http://www.example.com/products/brandname_products3.html
http://www.example.com/products/brandname_products4.html

Q: Why is it appending the pagination string to the end of the URL instead of replacing _products.html in the object class URL? What am I doing wrong, and how can I fix this?

karthikr
Amyth

1 Answer

The pagination_append_str option is named that way because the string is appended to the base URL rather than replacing part of it! :-)

So everything is working as intended. You just have to remove _products.html from your base URL so that the final URL is built without duplicating URL parts.
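
To illustrate the append behaviour, here is a minimal plain-Python sketch. It is not django-dynamic-scraper's actual code: build_pagination_urls is a hypothetical helper, and the page numbers 2 to 4 are assumed only to match the URLs in the question. It simply concatenates the base URL with the formatted append string, which is the behaviour described above.

```python
def build_pagination_urls(base_url, append_str, pages):
    """Hypothetical helper (not part of django-dynamic-scraper):
    append the formatted pagination string to the base URL for each page."""
    return [base_url + append_str.format(page=page) for page in pages]


append_str = "_products{page}.html"
pages = range(2, 5)  # assumed page numbers 2, 3, 4, as in the question

# Base URL that still ends in _products.html: the suffix is duplicated.
doubled = build_pagination_urls(
    "http://www.example.com/products/brandname_products.html", append_str, pages
)

# Base URL with _products.html removed: the URLs come out as intended.
fixed = build_pagination_urls(
    "http://www.example.com/products/brandname", append_str, pages
)

for url in doubled + fixed:
    print(url)
```

With the shortened base URL, the appended string supplies the whole _products{page}.html suffix, so nothing is repeated.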