I have a project that scrapes Shopee product data. Scraping works for a small number of products, but when there are thousands of products only a few hundred succeed; the rest fail with a "forbidden" error. I have tried three PHP methods for scraping: curl_init, curl_multi_init, and a Curl class.

  1. PHP curl_init(): this method returns an array.
function scrapcurl($data){
   $result = [];
   foreach ($data as $key => $value) {
      $url = $value;
      $ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13';
      $handle = curl_init();
                
      // Set the url
      curl_setopt($handle, CURLOPT_URL, $url);
      curl_setopt($handle, CURLOPT_USERAGENT, $ua);
      curl_setopt($handle, CURLOPT_HEADER, 0);
      curl_setopt($handle, CURLOPT_RETURNTRANSFER, 1);
      $output = curl_exec($handle);
      curl_close($handle);
      array_push($result, $output);
   }
   return $result;
}
  2. PHP curl_multi_init(): this method returns an array of JSON strings, e.g. {"error":null,"error_msg":null,"data":{"itemid":14513803134,"shopid":40261202,"userid":0,...}, which I then convert to an associative array with another function.
function multiRequest($data, $options = array()) {
    // array of curl handles
    $curly = array();
    // data to be returned
    $result = array();

    // multi handle
    $mh = curl_multi_init();

    // loop through $data and create curl handles
    // then add them to the multi-handle
    foreach ($data as $id => $d) {
        $curly[$id] = curl_init();

        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        curl_setopt($curly[$id], CURLOPT_URL,            $url);
        curl_setopt($curly[$id], CURLOPT_HEADER,         0);
        curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);

        // post?
        if (is_array($d)) {
            if (!empty($d['post'])) {
                curl_setopt($curly[$id], CURLOPT_POST,       1);
                curl_setopt($curly[$id], CURLOPT_POSTFIELDS, $d['post']);
            }
        }

        // extra options?
        if (!empty($options)) {
            curl_setopt_array($curly[$id], $options);
        }

        curl_multi_add_handle($mh, $curly[$id]);
    }

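    // execute the handles, waiting for socket activity instead of busy-looping
    $running = null;
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh);
        }
    } while ($running > 0 && $status === CURLM_OK);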


    // get content and remove handles
    foreach($curly as $id => $c) {
        $result[$id] = curl_multi_getcontent($c);
        curl_multi_remove_handle($mh, $c);
    }

    // all done
    curl_multi_close($mh);

    return $result;
}
  3. Curl class: this method returns an array.
use Curl;

function scrap($data)
{
    $resultawal=[];
    $result=[];
    $image=[];
    foreach ($data as $key => $value) {
        # code...
        $curl = new Curl();
        $curl->get($value);
        if ($curl->error) {
            # code...
            echo 'Error: ' . $curl->errorCode . ': ' . $curl->errorMessage . "\n";
        }
        else {
            # code...
            $js = $curl->response;
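            // reset per product so images from a previous item do not leak in
            $image = [];
            // distinct names avoid shadowing the outer loop's $key and $value
            foreach ($js->data->images as $imgKey => $imgUrl) {
                $image["img$imgKey"] = $imgUrl;
            }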
            $gambar1 = json_encode($image);
            $harga = substr($js->data->price_max, 0, -5);
            $stok = $js->data->stock;
            $nama = str_replace("'", "", $js->data->name);
            $catid = $js->data->catid;
            $deskripsi = str_replace("'", "", $js->data->description);
            if ($js->data->video_info_list != '') {
                $video = $js->data->video_info_list;
                $video1 = json_encode($video);
            } else {
                $video1 = null;
            }
            $linkss = "https://shopee.co.id/" . str_replace(" ", "-", $nama) . "-i." . $js->data->shopid . "." . $js->data->itemid;
            $berat = 0; // berat
            $min = 1; // minimum_pemesanan
            $etalase = NULL; // etalase
            $preorder = 1; //preorder
            $kondisi = "Baru";
            $sku = NULL;
            $status = "Aktif";
            $asuransi = "optional";
            $item_id = $js->data->itemid;

            $resultawal = array(
                'item_id'=>$item_id,
                'linkss'=>$linkss,
                'nama'=>$nama,
                'deskripsi'=>$deskripsi,
                'catid'=>$catid,
                'berat'=>$berat,
                'min'=>$min,
                'etalase'=>$etalase,
                'preorder'=>$preorder,
                'kondisi'=>$kondisi,
                'gambar1'=>$gambar1,
                'video1'=>$video1,
                'sku'=>$sku,
                'status'=>$status,
                'stok'=>$stok,
                'harga'=>$harga,
                'asuransi'=>$asuransi,
            );
            array_push($result, $resultawal);
        }
    }
    return $result;
}

My question: with the three methods above, when there are thousands of links, why do I get a 403 Forbidden error with methods 1 and 2, and "Error: 403: HTTP/2 403" with method 3?

Additional info: the input to the program is thousands of product links. For example:

5Pcs-pt4115-4115-sot-89-IC-Power-IC-LED-i.41253123.1355347598.sp_atk=09264df0-bb8d-4ca5-8970-719bbb2149dd

I then take shopid=41253123 and itemid=1355347598 from it and put them into this link:

$link = "https://shopee.co.id/api/v4/item/get?itemid=" . $item_id . "&shopid=" . $shop_id;

and then use the three methods above to scrape the product data.
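
The ID extraction described above can be sketched as follows. This is only an illustration: the helper names parseShopeeIds and buildItemApiUrl are hypothetical, and the regular expression assumes every link contains the "-i.<shopid>.<itemid>" pattern seen in the example.

```php
<?php
// Hypothetical helper: pull shopid and itemid out of a Shopee product link.
// Assumes the link contains "-i.<shopid>.<itemid>" as in the example above.
function parseShopeeIds(string $link): ?array
{
    if (preg_match('/-i\.(\d+)\.(\d+)/', $link, $m) !== 1) {
        return null; // link does not follow the expected pattern
    }
    // Keep the IDs as strings so large values are never truncated.
    return ['shopid' => $m[1], 'itemid' => $m[2]];
}

// Hypothetical helper: build the item API URL used in the question.
function buildItemApiUrl(array $ids): string
{
    return 'https://shopee.co.id/api/v4/item/get?' . http_build_query([
        'itemid' => $ids['itemid'],
        'shopid' => $ids['shopid'],
    ]);
}
```

Links for which parseShopeeIds returns null can be skipped or logged before any request is sent.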

  • The simple answer seems to be that the API has a limit to how many items you can fetch in one go, or in X amount of time. Batch process with a smaller volume per request. – Markus AO Mar 28 '22 at 07:26
  • Because I converted the input links into an array, I split it with array_chunk. I tried chunks of 25, 50, 100, 200, and 500 links. It still fails. I have not yet tried spacing the requests out over time. I'll report the result when I do, @Markus – Abdullah Al-Karim Mar 28 '22 at 09:07
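
The time-based batching discussed in the comments above might look roughly like the sketch below. It is an assumption that the 403 responses come from rate limiting; if the site blocks requests for other reasons, spacing them out may not help. The names fetchThrottled and curlFetch, the delay values, and the choice to retry on 403 and 429 are all hypothetical.

```php
<?php
// Hypothetical sketch: fetch URLs one at a time, pause between requests,
// and back off with a doubling delay when the server answers 403 or 429.
// $fetch is any callable(string $url): array with 'status' and 'body' keys.
function fetchThrottled(array $urls, callable $fetch, int $delayMs = 500, int $maxRetries = 3): array
{
    $results = [];
    foreach ($urls as $id => $url) {
        $wait = $delayMs * 2; // first back-off pause, doubled after each retry
        for ($attempt = 0; ; $attempt++) {
            $response = $fetch($url);
            $blocked = in_array($response['status'], [403, 429], true);
            if (!$blocked || $attempt >= $maxRetries) {
                break; // success, a different error, or retries exhausted
            }
            usleep($wait * 1000);
            $wait *= 2;
        }
        $results[$id] = $response;
        usleep($delayMs * 1000); // regular pause before the next URL
    }
    return $results;
}

// Plain cURL fetcher that also reports the HTTP status code.
function curlFetch(string $url): array
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($handle);
    $status = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);
    return ['status' => $status, 'body' => $body === false ? '' : $body];
}
```

A call such as fetchThrottled($links, 'curlFetch', 1000) would then send roughly one request per second. Recording the status code per URL also shows exactly when the 403 responses begin.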
