I am learning PHP scraping. I have just started to scrape the following website:
**[URL]="http://www.youramazingplaces.com/"**
So far I have scraped all the titles, image sources and link addresses of each post. I am a little confused about how to scrape the <p> tags: I need the description for each title, and that description is spread over 2 or 3 <p> tags, but all the images on the page are also inside <p> tags. I am using RegexBuddy. I want to create a regex for each post that extracts the description from the page but skips the paragraph tags that contain images or have other classes. Right now my regex extracts all the paragraph tags, but I don't want all of them. I just need the tags that contain the description only.
So far I have written the following regex to get all the paragraphs from that page: '%<p>(?P<description>.*?)</p>%m'.
The output looks like this: "Jodhpur, is the second largest city in the Indian state of Rajasthan. Total population is 851,051 people. It is one of the most beautiful and most visited places in India. This city has two nicknames: “Sun City” for the bright, sunny weather and “Blue City” due to the vivid blue-painted houses around the Mehrangarh Fort. There you can see amazing old building, beautiful landscapes, astonishing architecture… Points of interest in Jodhpur is: Mehrangarh For, Jaswant Thada, Rao Jodha Desert Rock Park, Umaid Bhawan Palace, Mandore and Mandore Gardens and a lot of another interesting places. For those who love to travel and explore some new different places that definitely should go to Jodhpur in India. Below you can see some photos of places from there and enjoy in it. Also this brilliant pictures will gonna make you feel like you are there and enjoy in the beauty of Jodhpur. If you want to have unforgettable vacation visit Jodhpur. Image by Girish Suryawanshi via Flickr Image by Michael Foley via Flickr"
It contains the image paragraphs as well, and I don't need them. I just need to scrape only the description from each page.
Here is my code:
//$url="http://www.youramazingplaces.com/";
//$curl_scraped_page=initCurl($url);
$pagenumber = 1;
while ($pagenumber <= 1)
{
    $url = "http://www.youramazingplaces.com/page/{$pagenumber}/";
    // initCurl() is a helper (not shown here) that fetches a URL and returns its HTML
    $curl_scraped_page = initCurl($url);

    ////////// LINKS //////////
    // Collect the "read more" link of every post on the listing page
    preg_match_all('%<a href="(?P<links>.*?)"><b>(?P<readmore>.*?)</b></a>%m',
        $curl_scraped_page, $link_array);

    for ($x = 0; $x < count($link_array['links']); $x++)
    {
        // Fetch the individual post page
        $curldata = initCurl($link_array['links'][$x]);

        preg_match_all('%<h1 class="(.*?)">(?P<title>.*?)</h1>%s', $curldata, $title);
        preg_match_all('%<p><img class="(?P<imageclass>.*?)" src="(?P<imgsrc>.*?)"alt=" (?P<alt>.*?)"/> </p>%m', $curldata, $img_src_array);
        // This matches every <p>...</p>, including the ones that only wrap images
        preg_match_all('%<p>(?P<description>.*?)</p>%m', $curldata, $description_array);

        print_r($description_array['description'][1]);
    }

    // Move on to the next listing page only after all of its posts are processed
    $pagenumber++;
}
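For illustration, here is a rough sketch of one way the image paragraphs might be skipped. It swaps the regex for PHP's DOMDocument and DOMXPath, which is a different technique from the code above. The sample markup and the "entry-content" class are hypothetical stand-ins for a post page; the real site's HTML may be structured differently.

```php
<?php
// Hypothetical sample markup standing in for one post page (not the real site HTML).
$html = '<div class="entry-content">'
      . '<p>Jodhpur is the second largest city in Rajasthan.</p>'
      . '<p><img class="aligncenter" src="a.jpg" alt="fort" /></p>'
      . '<p class="credit">Image by Someone via Flickr</p>'
      . '<p>It is known as the <b>Blue City</b>.</p>'
      . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; keep warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select <p> elements that have no class attribute and contain no <img> descendant.
$nodes = $xpath->query('//p[not(@class) and not(.//img)]');

$descriptions = [];
foreach ($nodes as $node) {
    // textContent is the paragraph text with all inner tags removed.
    $descriptions[] = trim($node->textContent);
}

print_r($descriptions);
```

In the loop from the question, `$curldata` would take the place of `$html`. The XPath condition is an assumption about what separates description paragraphs from the rest and would need adjusting to the actual markup.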
tag then use strip_tags() on the content?
– Scuzzy Aug 24 '14 at 21:48
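As a minimal sketch of how strip_tags() could fit in, the example below keeps the regex from the question, drops any matched paragraph that contains an <img> tag, and then strips the remaining inline tags. The sample markup is hypothetical, and the "no <img> inside" test is an assumption about what marks a description paragraph on the real pages.

```php
<?php
// Hypothetical sample markup standing in for one post page (not the real site HTML).
$html = <<<HTML
<p>Jodhpur is the second largest city in Rajasthan.</p>
<p><img class="aligncenter" src="a.jpg" alt="fort" /></p>
<p class="credit">Image by Someone via Flickr</p>
<p>It is known as the <b>Blue City</b>.</p>
HTML;

// Match only bare <p> tags (no attributes); the "s" flag lets "." span newlines.
preg_match_all('%<p>(?P<description>.*?)</p>%s', $html, $matches);

// Drop paragraphs that wrap an image.
$textOnly = array_filter(
    $matches['description'],
    function ($paragraph) {
        return stripos($paragraph, '<img') === false;
    }
);

// Remove leftover inline tags such as <b> or <a>, then reindex from 0.
$descriptions = array_values(array_map('strip_tags', $textOnly));

print_r($descriptions);
```

Because the pattern only matches a bare <p>, paragraphs that carry a class attribute are never captured in the first place, so only the image check has to be done afterwards.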