Unable to get specific <p> tag for php scraping

Question

I am learning PHP scraping. I just started to scrape the following website:

**URL:** http://www.youramazingplaces.com/

So far I have scraped all the titles, image sources, and link addresses of each post. I am a little confused about how to scrape the <p> tags: I need the description of each title, and that description is spread over two or three <p> tags, but all the images on the page are also wrapped in <p> tags. I am using RegexBuddy. I want to create a regular expression that extracts the description of each post from the page, excluding the paragraph tags that contain images or have other classes. Right now my regex extracts all the paragraph tags, but I don't want all of them; I only need the ones that contain the description.

So far I have written the following regex to get all the paragraphs from that page: '%<p>(?P<description>.*?)</p>%m'.

The output looks like this: "Jodhpur, is the second largest city in the Indian state of Rajasthan. Total population is 851,051 people. It is one of the most beautiful and most visited places in India. This city has two nicknames: “Sun City” for the bright, sunny weather and “Blue City” due to the vivid blue-painted houses around the Mehrangarh Fort. There you can see amazing old building, beautiful landscapes, astonishing architecture… Points of interest in Jodhpur is: Mehrangarh For, Jaswant Thada, Rao Jodha Desert Rock Park, Umaid Bhawan Palace, Mandore and Mandore Gardens and a lot of another interesting places. For those who love to travel and explore some new different places that definitely should go to Jodhpur in India. Below you can see some photos of places from there and enjoy in it. Also this brilliant pictures will gonna make you feel like you are there and enjoy in the beauty of Jodhpur. If you want to have unforgettable vacation visit Jodhpur. Image by Girish Suryawanshi via Flickr Image by Michael Foley via Flickr"

It contains the image tags as well, and I don't need them. I just need to scrape the description from each page.

Here is my code:

    // $url = "http://www.youramazingplaces.com/";
    // $curl_scraped_page = initCurl($url);

    // initCurl() is my own helper (not shown); it returns the page HTML as a string.
    $pagenumber = 1;

    while ($pagenumber <= 1) {
        $url = "http://www.youramazingplaces.com/page/{$pagenumber}/";
        $curl_scraped_page = initCurl($url);

        // Links: collect the link of every post on the listing page
        preg_match_all('%<a href="(?P<links>.*?)"><b>(?P<readmore>.*?)</b></a>%m', $curl_scraped_page, $link_array);

        for ($x = 0; $x < count($link_array['links']); $x++) {
            $curldata = initCurl($link_array['links'][$x]);

            // Title of the post
            preg_match_all('%<h1 class="(.*?)">(?P<title>.*?)</h1>%s', $curldata, $title);

            // Images that are wrapped in <p> tags
            preg_match_all('%<p><img class="(?P<imageclass>.*?)" src="(?P<imgsrc>.*?)"\s*alt="\s*(?P<alt>.*?)"\s*/>\s*</p>%m', $curldata, $img_src_array);

            // Every <p> tag, including the ones I do not want
            preg_match_all('%<p>(?P<description>.*?)</p>%m', $curldata, $description_array);

            print_r($description_array['description'][1]);
        }

        // Advance once per listing page, after all of its posts are processed
        $pagenumber++;
    }

@Scuzzy, if I use strip_tags, can I filter between those tags? I need the specific two or three paragraph tags that contain the description only. I don't need the rest of the information, and I am confused about that. :( – waleeds37, Aug 24 '14 at 21:56

score 1 · Accepted Answer · answered Aug 24 '14 at 22:00


Do yourself a favor and never try to parse HTML with regular expressions. Use something like phpQuery or Simple HTML DOM.

Then you just cherry-pick pieces of the consumed HTML using selectors, like you would in jQuery.
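
As a rough illustration of the selector idea, here is a minimal sketch that uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than a third-party selector library, so the query is XPath instead of a jQuery-style selector. The helper name `extractDescriptions()`, the sample markup, and the `wp-caption-text` class are assumptions made for illustration; the real page structure may differ, so the XPath expression would need adjusting after inspecting the actual HTML.

```php
<?php
// Sketch only: keep <p> elements that have no class attribute and contain no <img>.
// extractDescriptions() is a hypothetical helper name, not part of any library.
function extractDescriptions(string $html): array
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $descriptions = [];

    // not(@class): skip paragraphs that carry a class (e.g. captions).
    // not(.//img): skip paragraphs that wrap an image.
    foreach ($xpath->query('//p[not(@class) and not(.//img)]') as $p) {
        $text = trim($p->textContent);
        if ($text !== '') {
            $descriptions[] = $text;
        }
    }

    return $descriptions;
}

// Assumed sample markup, loosely modeled on the post pages described in the question.
$sample = <<<HTML
<h1 class="entry-title">Jodhpur</h1>
<p>Jodhpur is the second largest city in Rajasthan.</p>
<p><img class="size-full" src="photo.jpg" alt="Jodhpur" /></p>
<p class="wp-caption-text">Image by Someone via Flickr</p>
<p>It is known as the Blue City.</p>
HTML;

print_r(extractDescriptions($sample));
```

In the question's loop, `$curldata` could be passed to this function in place of the `'%<p>(?P<description>.*?)</p>%m'` match. Pages containing non-ASCII text may need their encoding declared before `loadHTML()` is called.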

answered Aug 24 '14 at 22:00

TunaMaxx


Thanks, all of you. Can anyone tell me where I should learn the DOM? I am a beginner, so I need a little help here. – waleeds37 Aug 24 '14 at 22:35
The docs at those two links will help. If you can use jQuery selectors, you can use either of those solutions. – TunaMaxx Aug 24 '14 at 23:10
Actually, phpQuery uses CSS selectors (pro), while Simple HTML DOM uses its own format (con). – pguardiario Aug 25 '14 at 01:07
