I am scraping a Chinese website.

I have

FEED_EXPORT_ENCODING='utf-8'

in my settings.py file.

If I run my scraper via

scrapy crawl myscraper -o output.json

then my output file shows the Chinese text correctly.

But if I start my scraper via Scrapyd, then the items created in http://my-website:6800/jobs are not encoded correctly.

Why is FEED_EXPORT_ENCODING='utf-8' not working with Scrapyd?

Then I set FEED_URI='files/output.json' and ran the scraper via Scrapyd again.

Now the output file at files/output.json has the correct encoding.

What could be going wrong?

Umair Ayub
  • `FEED_EXPORT_ENCODING` comes with version 1.2.0 ([in this specific commit](https://github.com/scrapy/scrapy/commit/33a39b368ffab6641997e7611d588487176716de)). Which version of Scrapy is being used within your Scrapyd environment? – starrify May 15 '17 at 07:32
  • @starrify `Scrapy 1.3.3` and `twistd (the Twisted daemon) 16.4.1` – Umair Ayub May 15 '17 at 07:34

1 Answer

So far I haven't found anything that Scrapyd does wrong with FEED_EXPORT_ENCODING: it should respect (to be precise, not touch) that setting.

> But if I start my scraper via Scrapyd, then the items created in http://my-website:6800/jobs are not encoded correctly.

Did you just view the items in a browser window, or did you download the full content to your local disk and view it with a UTF-8-aware tool?
Scrapyd's web service does not specify an encoding when serving items (code), so a browser may misinterpret the bytes. But the generated item files on the server (sample path) should be fine. Can you verify that?
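
To illustrate the kind of misinterpretation I mean, here is a minimal sketch. The item content is made up, and Latin-1 is only an assumed fallback; which encoding a particular browser actually guesses may differ.

```python
import json

# A hypothetical item, serialized the way a UTF-8 JSON-lines feed would store it.
item = {"title": "你好，世界"}
raw_bytes = (json.dumps(item, ensure_ascii=False) + "\n").encode("utf-8")

# With no declared charset, a client may fall back to a legacy encoding.
# Latin-1 is used here purely as an assumed example of such a fallback.
garbled = raw_bytes.decode("latin-1")

# Decoding the very same bytes as UTF-8 recovers the original text.
correct = raw_bytes.decode("utf-8")

print(json.loads(correct)["title"])  # the readable Chinese title
print(garbled == correct)            # the two decodings differ
```

The bytes on disk are identical in both cases; only the decoding step changes, which is why downloading the file and opening it as UTF-8 should show the expected text.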

starrify
  • You are right, I was viewing it from browser, I downloaded it and it looks fine... Ok but its JSONLines... ... another thing I want is to beautify JSON to make it readable ... – Umair Ayub May 15 '17 at 10:46
  • @Umair I personally strongly recommend the JSON-lines format over JSON for storing scraping results, as a large JSON object is rather memory-unfriendly. Regarding readability, there are quite a few tools for that. You may try [`jq`](https://stedolan.github.io/jq/manual/), e.g. `cat results.jl | jq`. – starrify May 15 '17 at 11:47
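
If `jq` is not at hand, a rough Python equivalent for making JSON-lines output readable might look like the sketch below. The helper name `prettify_jsonlines` and the sample records are hypothetical, chosen only for illustration.

```python
import json

def prettify_jsonlines(lines):
    """Yield an indented rendering of each non-empty JSON-lines record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # ensure_ascii=False keeps Chinese characters readable
        # instead of emitting \uXXXX escapes.
        yield json.dumps(record, ensure_ascii=False, indent=2)

# Hypothetical sample standing in for the lines of a downloaded .jl file.
sample = ['{"title": "你好", "price": 12}', '', '{"title": "世界", "price": 8}']
pretty = list(prettify_jsonlines(sample))
for block in pretty:
    print(block)
```

For a real file, something like `open("results.jl", encoding="utf-8")` could be passed in place of `sample`, since a file object iterates line by line. Records are processed one at a time, so the memory advantage of JSON-lines is kept.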