I have a pastebin scraper script, which is designed to find leaked emails and passwords, to make a website like HaveIBeenPwned.
Here is what my script is doing:
- Scraping Pastebin links from https://psbdmp.ws/dumps
- Getting a random proxy using this Random Proxy API (because Pastebin bans your IP if you hammer too many requests): https://api.getproxylist.com/proxy
- Doing a CURL request to the Pastebin links, then doing a preg_match_all
to find all the email addresses and passwords in the format email:password
.
The actual script seems to be working alright, but it isn't optimized enough, and is giving me a 524 timeout error after some time, which I suspect is because of all those CURL requests.
Here is my code:
api.php
function comboScrape_CURL($url) {
// Get random proxy
$proxies->json = file_get_contents("https://api.getproxylist.com/proxy");
$proxies->decoded = json_decode($proxies->json);
$proxy = $proxies->decoded->ip.':'.$proxies->decoded->port;
list($ip,$port) = explode(':', $proxy);
// Crawl with proxy
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
comboScrape('email:pass',$curl_scraped_page);
}
index.php
require('api.php');
$expression = "/(?:https\:\/\/pastebin\.com\/\w+)/";
$extension = ['','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20'];
foreach($extension as $pge_number) {
$dumps = file_get_contents("https://psbdmp.ws/dumps/".$pge_number);
preg_match_all($expression,$dumps,$urls);
$codes = str_replace('https://pastebin.com/','',$urls[0]);
foreach ($codes as $code) {
comboScrape_CURL("https://pastebin.com/raw/".$code);
}
}