
I'm scraping Google search results for a given query, but the titles I scrape come back encoded as ISO-8859-1, and I need them in UTF-8 because the results are in Spanish. I get the first 10 titles, but some of them are displayed like this: Software Java | Oracle M�xico

The axios request returns the text/html of the Google search page, from which I scrape the titles and URLs.

I've tried the following:

const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");

async function scrape (req, res) {
    try {
        const query = req.params.query;
        const encodedQuery = encodeURIComponent(query);
        // Set the number of search results you want (e.g., 10 in this case)
        const numResults = 10;
        // `num` asks for a result count; `start` would instead skip that many results
        const url = `https://www.google.com.mx/search?q=${encodedQuery}&num=${numResults}`;
        
        await axios.get(url, {
            responseType: "arraybuffer",
            headers: {
                "Content-Type": "text/html; charset=UTF-8"
            }
        }).then((response) => {
            console.log(response.headers['content-type'])
            const $ = cheerio.load(response.data, { decodeEntities: false });
            const data = [...$(".egMi0")]
                .map(e => ({
                    title: $(e).find("h3").text().trim(),
                    href: $(e).find("a").attr("href"),
                }));
            console.log(data);
            fs.writeFileSync("test.html", response.data);
        })
        
        res.status(200).json({
            message: "Scraping successful",
            output: 10,
        });
    } catch (error) {
        // Handle any errors that occurred during the request
        console.error('Error while scraping website:', error.message);
        res.status(500).json({
            message: "Error while scraping website. Contact support.",
            error: "Internal Server Error",
        });
    }
}

module.exports = {
    scrape,
}

Even when I force the Content-Type in the request headers, the logged response.headers['content-type'] is still text/html; charset=ISO-8859-1.
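
For context, a `Content-Type` header on a request describes the request body, so it would not be expected to change how the server encodes its response. One possible workaround, sketched here under the assumption that the body arrives as raw bytes (as it does with `responseType: "arraybuffer"`), is to read the charset from the response header and decode the bytes accordingly. `decodeBody` is a hypothetical helper name, not part of axios or cheerio, and it only handles the two charsets relevant to this question.

```javascript
// Hypothetical helper (not from axios or cheerio): decode a raw response body
// using the charset reported in the response's Content-Type header.
function decodeBody(body, contentType) {
  // Pull the charset out of a header such as "text/html; charset=ISO-8859-1".
  const match = /charset=([^;]+)/i.exec(contentType || "");
  const charset = match
    ? match[1].trim().replace(/["']/g, "").toLowerCase()
    : "utf-8"; // assumption: fall back to UTF-8 when no charset is given

  const buffer = Buffer.from(body);

  // Node's "latin1" Buffer encoding maps each byte to the code point with the
  // same value, which is exactly how ISO-8859-1 is defined.
  if (charset === "iso-8859-1" || charset === "latin1") {
    return buffer.toString("latin1");
  }
  return buffer.toString("utf8");
}

// "México" encoded as ISO-8859-1: the "é" is the single byte 0xE9.
const sample = Buffer.from([0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f]);
const html = decodeBody(sample, "text/html; charset=ISO-8859-1");
console.log(html); // prints "México"
```

In the scraper above, the decoded string (for example `decodeBody(response.data, response.headers['content-type'])`) could then be passed to `cheerio.load` in place of the raw buffer.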

  • I suggest hardcoding a sample test URL and removing Express, which isn't relevant to the problem. Thanks. For reference, this is a follow-up to [Scrape google search using axios and cheerio (Node js)](https://stackoverflow.com/questions/77026858/scrape-google-search-using-axios-and-cheerio-node-js/77026985?noredirect=1#comment135793026_77026985). Did you see [How can I get the value in utf-8 from an axios get receiving iso-8859-1 in node.js](https://stackoverflow.com/questions/52863124/how-can-i-get-the-value-in-utf-8-from-an-axios-get-receiving-iso-8859-1-in-node)? – ggorlen Sep 02 '23 at 18:57
  • As an aside, I suggest [not mixing async/await and `.then`](https://stackoverflow.com/a/75785234/6243352). – ggorlen Sep 02 '23 at 19:00
  • @ggorlen yeah I already tried the solution of that post, but it throws an error `SyntaxError: Unexpected token < in JSON at position 0 at JSON.parse ()` I haven't been able to fix it. In the header, I specified that I needed text/html response. Regarding mixing async/await with then I already adjusted my code. Thank you. – La Bola Al Riel Sep 02 '23 at 19:15
  • Don't try to parse HTML as JSON. Use `response.text()` in place of `response.json()` and skip calling `JSON.parse()`--we're not dealing with JSON in this case. Feel free to [edit] your post to improve it since answers haven't been posted yet. – ggorlen Sep 02 '23 at 19:16
  • @ggorlen thank you, with that I was able to solve it. I'll post the answer. – La Bola Al Riel Sep 02 '23 at 19:46
  • Does this answer your question? [Scrape google search using axios and cheerio (Node js)](https://stackoverflow.com/questions/77026858/scrape-google-search-using-axios-and-cheerio-node-js) – ggorlen Sep 02 '23 at 22:59
