I have a project to make shopee product scraping. Scraping for some products is successful, but if there are thousands of products, only hundreds of products are successful, the rest fail and the error is "forbidden". I've tried using three php methods for scraping, namely curl_init, curl_multi_init, and curl class.
- php curl_init() This method returns an array
function scrapcurl($data){
$result = [];
foreach ($data as $key => $value) {
$url = $value;
$ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13';
$handle = curl_init();
// Set the url
curl_setopt($handle, CURLOPT_URL, $url);
curl_setopt($handle, CURLOPT_USERAGENT, $ua);
curl_setopt($handle, CURLOPT_HEADER, 0);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($handle);
curl_close($handle);
array_push($result, $output);
}
return $result;
}
- php curl_multi_init() This method returns an array of json in string
ex:
{"error":null,"error_msg":null,"data":{"itemid":14513803134,"shopid":40261202,"userid":0,...}
then i convert to array associative with another function
function multiRequest($data, $options = array()) {
// array of curl handles
$curly = array();
// data to be returned
$result = array();
// multi handle
$mh = curl_multi_init();
// loop through $data and create curl handles
// then add them to the multi-handle
foreach ($data as $id => $d) {
$curly[$id] = curl_init();
$url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
curl_setopt($curly[$id], CURLOPT_URL, $url);
curl_setopt($curly[$id], CURLOPT_HEADER, 0);
curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);
// post?
if (is_array($d)) {
if (!empty($d['post'])) {
curl_setopt($curly[$id], CURLOPT_POST, 1);
curl_setopt($curly[$id], CURLOPT_POSTFIELDS, $d['post']);
}
}
// extra options?
if (!empty($options)) {
curl_setopt_array($curly[$id], $options);
}
curl_multi_add_handle($mh, $curly[$id]);
}
// execute the handles
$running = null;
do {
curl_multi_exec($mh, $running);
} while($running > 0);
// get content and remove handles
foreach($curly as $id => $c) {
$result[$id] = curl_multi_getcontent($c);
curl_multi_remove_handle($mh, $c);
}
// all done
curl_multi_close($mh);
return $result;
}
- Curl class This method returns an array
use Curl;
function scrap($data)
{
$resultawal=[];
$result=[];
$image=[];
foreach ($data as $key => $value) {
# code...
$curl = new Curl();
$curl->get($value);
if ($curl->error) {
# code...
echo 'Error: ' . $curl->errorCode . ': ' . $curl->errorMessage . "\n";
}
else {
# code...
$js = $curl->response;
foreach ($js->data->images as $key => $value) {
$image["img$key"] = $value;
};
$gambar1 = json_encode($image);
$harga = substr($js->data->price_max, 0, -5);
$stok = $js->data->stock;
$nama = str_replace("'", "", $js->data->name);
$catid = $js->data->catid;
$deskripsi = str_replace("'", "", $js->data->description);
if ($js->data->video_info_list != '') {
$video = $js->data->video_info_list;
$video1 = json_encode($video);
} else {
$video1 = null;
}
$linkss = "https://shopee.co.id/" . str_replace(" ", "-", $nama) . "-i." . $js->data->shopid . "." . $js->data->itemid;
$berat = 0; // berat
$min = 1; // minimum_pemesanan
$etalase = NULL; // etalase
$preorder = 1; //preorder
$kondisi = "Baru";
$sku = NULL;
$status = "Aktif";
$asuransi = "optional";
$item_id = $js->data->itemid;
$resultawal = array(
'item_id'=>$item_id,
'linkss'=>$linkss,
'nama'=>$nama,
'deskripsi'=>$deskripsi,
'catid'=>$catid,
'berat'=>$berat,
'min'=>$min,
'etalase'=>$etalase,
'preorder'=>$preorder,
'kondisi'=>$kondisi,
'gambar1'=>$gambar1,
'video1'=>$video1,
'sku'=>$sku,
'status'=>$status,
'stok'=>$stok,
'harga'=>$harga,
'asuransi'=>$asuransi,
);
array_push($result, $resultawal);
}
}
return $result;
}
My Question From the three methods above, when the link is thousands, why does a 403 forbidden error appear with methods 1 and 2, and error: 403: HTTP/2 403 with method 3??
Additional info: Input of the program is thousand of link of products. For example:
5Pcs-pt4115-4115-sot-89-IC-Power-IC-LED-i.41253123.1355347598.sp_atk=09264df0-bb8d-4ca5-8970-719bbb2149dd
and then i take the shopid=41253123 and itemid=1355347598. Then i put to this link:
$link = "https://shopee.co.id/api/v4/item/get?itemid=" . $item_id . "&shopid=" . $shop_id;
and then use three methods above to scrape the product data.