I am new to web scraping and need to learn it quickly for work. I am having trouble scraping a client's web page: the content I need sits on a separate child page for each record on the main page (300+ records), some fields on the child pages are not wrapped in tags, and the markup is a bit of a mess. What would be the best logic for getting that information? (Also, if anyone knows of any newer scraping tools that are free and worth looking into, that would be awesome.) I am able to get all of the records on the parent page. I just don't know how to hop through each record to access its child page information and grab it before moving to the next row on the parent page.
-
Without an example of your client's HTML, or something you've tried and got stuck on, we can't help. This site is not about "give me kodz", it's about "help me kodz". – Lawrence Cherone Dec 12 '11 at 15:40
-
Scraping is not easy, and I don't think you can learn it quickly. It often involves normalizing the input using the tidy extension, parsing the DOM with XPath, and on top of that often one or more regexes. A good knowledge of curl is needed too. You need to know all of these tools well to become a master scraper. – Dmitri Snytkine Dec 12 '11 at 15:54
-
See https://github.com/rajanrx/php-scrape/blob/master/Tests/Unit/Extractor/Types/MultipleRowExtractorTest.php#LC53, which might help you get some idea of how it can be done. – Rx Seven Jul 02 '17 at 03:06
1 Answer
    foreach top level page {
        html = fetch page
        data = process html
        while (there are more descendant pages) {
            html = fetch next page using data
            data = process html
        }
        save this data chain
    }
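As a rough illustration only, here is how that loop might look in Python for the two-level case in the question (one parent page linking to one child page per record). It uses only the standard library. The names `scrape`, `extract_links`, `LinkCollector`, and the `process` callback are made up for this sketch, and it assumes every `<a>` link on the parent page points at a record, which will probably not hold for a real site without some filtering.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html, base_url):
    """Return absolute URLs for every link found in the HTML."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def scrape(parent_url, process, fetch_page=fetch):
    """Visit the parent page, then each child page it links to, in order.

    `process` is a hypothetical callback that turns one child page's HTML
    into a record; `fetch_page` can be swapped for a fake when testing.
    """
    parent_html = fetch_page(parent_url)
    records = []
    seen = set()
    for child_url in extract_links(parent_html, parent_url):
        if child_url in seen:  # skip duplicate links to the same record
            continue
        seen.add(child_url)
        child_html = fetch_page(child_url)
        records.append({"url": child_url, "data": process(child_html)})
    return records
```

Passing the fetch function in as a parameter means the loop can be tried against saved copies of the pages before it is pointed at the client's live site. For 300+ records it would also be sensible to pause between requests, which this sketch does not do.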
But if you're struggling with the above logic, I'd recommend you skip the code and focus your time on learning one of the existing tools. You're almost certain to save time, especially if you'll be scraping often.

goat