Trying to fix a ruby script to get japanese news to the kindle

Question

Hello Any help to fix this old script would be greatly appreciated. I'm trying to get NHK easy news to kindle script working. I know nothing about Ruby and not much more about programming in general. At the bottom of the post are all the links needed for the script, kindlegen, etc...

The steps I did, until now:

Downloaded and installed Ruby
CMD Installed gems nokogiri and Trollop
Downloaded and unzipped the JapNewsToKindle script
Downloaded and unzipped the Kindlegen
Put the kindlegen exe next to the JapNewsToKindle file.
CMD Run ruby JapNewsToKindle -u https://www3.nhk.or.jp/news/easy/k10013643691000/k10013643691000.html -O NHK01

After running the command got some errors about Trollop being deprecated and now being replaced? by Optimist.

Cmd Installed gem optimist
Replaced 3 ocurrences of trollop to optimist from lines 8/205/247
Changed lines 55 and 64 from u/doc = Nokogiri::HTML(open(url)) to u/doc = Nokogiri::HTML(URI.open(url))

In the instructions said that I needed to use the -O --out option to give a title name to the file because windows had a problem with title name.

*CMD Run ruby JapNewsToKindle -u https://www3.nhk.or.jp/news/easy/k10013643691000/k10013643691000.html -O NHK01

This time I got the next error:

JapNewsToKindlemod:183:in \`gsub!': no implicit conversion of Nokogiri::XML::NodeSet into String (TypeError)
from JapNewsToKindlemod:183:in `initialize'
from JapNewsToKindlemod:235:in `new'
from JapNewsToKindlemod:235:in `block in <main>'
from JapNewsToKindlemod:227:in `each'
from JapNewsToKindlemod:227:in `<main>'

Googling without results. CHECK.

Links:

JapNewsToKindle original post from 9 years ago. https://www.reddit.com/r/LearnJapanese/comments/1h4y3c/reading_nhk_easy_news_on_your_kindle/

Post with the instructions I followed at first https://www.reddit.com/r/LearnJapanese/comments/1h4y3c/reading_nhk_easy_news_on_your_kindle/caqz3yi/

GitPage where I downloaded "kindlegen" because it got changed in amazon for "kindle previewer" https://github.com/ciromattia/kcc/issues/371

Direct Link to the kindlegen zip in the github https://github.com/ciromattia/kcc/files/5133667/kindlegen_win32_v2_9.zip

Code at the moment :

    #!/usr/bin/env ruby
# encoding: utf-8
# Version: 0.2a 2013-06-28

require 'nokogiri'
require 'open-uri'
require 'tmpdir'
require 'optimist'
require 'rbconfig'
$is_windows = (RbConfig::CONFIG['host_os'] =~ /mswin|mingw|cygwin/)

def clean_string (str)
  str.tr('0-9', '０-９').sub('h２', 'h2').sub('h３', 'h3').sub('h４', 'h4')
end

def strip_element_tags (node, element_name)
  node.search('.//' + element_name).each do |e|
    e.replace e.inner_html
  end
end

def strip_ruby_tags (node)
  node.search('.//rt').remove
  strip_element_tags(node, 'ruby')
end

class Article
  def get_title (options = {})
    @doc.xpath(@XPath_title).each do |lines|
      strip_ruby_tags lines if not options[:ruby]
      return lines.content.to_s if options[:clean]
      return clean_string(lines.to_s)
    end
  end

  def get_date (options = {})
    @doc.xpath(@XPath_time).each do |lines|
      strip_element_tags lines, 'span'
      return clean_string(lines.to_s)
    end
  end

  def get_content (options = {:ruby => false})
    @doc.xpath(@XPath_article).each do |lines|
      strip_ruby_tags lines if not options[:ruby]
      strip_element_tags lines, 'span'
      strip_element_tags lines, 'a'
      return clean_string(lines.inner_html.to_s)
    end
  end
end

# class NHKEasyArticle < Article
  # def initialize (url)
    # @doc = Nokogiri::HTML(URI.open(url))
    # @XPath_title = '//*[@id="newstitle"]/h2'
    # @XPath_time = '//*[@id="newsDate"]'
    # @XPath_article = '//*[@id="newsarticle"]'
  # end
# end

#Added to modify class on line 53 because nhk data change over time. I kept the previous class for reference above this one.
class NHKEasyArticle < Article
  def initialize (url)
    @doc = Nokogiri::HTML(URI.open(url))
    @XPath_title = '//*[@class="article-main__title"]'
    @XPath_time = '//*[@id="js-article-date"]'
    @XPath_article = '//*[@id="js-article-body"]'
  end
end

class NHKArticle < Article
  def initialize (url)
    @doc = Nokogiri::HTML(URI.open(url))
    @XPath_title = '//*[@id="news"]/div[2]/div/div/div[1]/h1/span'
    @XPath_time = '//*[@id="news"]/div[2]/div/div/div[1]/h1/div'
    @XPath_article = '//*[@id="news"]/div[2]/div/div/div'
  end

  def get_title (options = {})
    super.gsub 'span', 'h2'
  end

  def get_date (options = {})
    super.gsub('<div class="time">', '<p id="newsDate">[').gsub('</div>', ']</p>')
  end

  def get_content (options = {:ruby => false})
    c = ''
    @doc.xpath(@XPath_article).each do |lines|
      break if lines.attribute('id').to_s == "news_mkanren"
      strip_ruby_tags lines if not options[:ruby]
      strip_element_tags lines, 'span'
      strip_element_tags lines, 'a'
      c += clean_string(lines.inner_html.to_s)
    end
    c.sub(/.*<p id="news_textbody">/m, '<p id="news_textbody">')
  end
end

class HTMLOutput
  def initialize (article, fileName, options = {})
    title = article.get_title(:ruby => false, :clean => true)

    @horizontal_css = <<eos
body {
  font-family: serif; }
h2, h3 {
  font-weight: bold;
  padding-top: 2em;
  margin-right: 1em;
  margin-left: 1em; }
h2 {
  font-size: 120%; }
p {
  text-indent: 1em; }
#newsDate {
  font-size: 90%;
  font-weight:bold;
  line-height: 1.5; }
eos

    @vertical_css = <<eos
body {
  -webkit-writing-mode: vertical-rl; }
#newsDate {
  padding-top: 10em;
  text-indent: -4em; }
eos
  @vertical_css = @horizontal_css + @vertical_css

    @html_header = <<eos
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xml:lang="ja" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title>{{TITLE}}</title>
  <link rel="stylesheet" href="{{CSS_FILE}}" type="text/css" />
  <link rel="Schema.DC" href="http://purl.org/dc/elements/1.1/" />
  <meta name="DC.Title" content="{{TITLE}}" />
  <meta name="DC.Creator" content="NHK" />
  <meta name="DC.Publisher" content="NHK" /></head>
<body>
eos

    @html_footer = <<eos
</body>
</html>
eos

    @html_header.gsub! '{{TITLE}}', title
    @html_header.gsub! '{{CSS_FILE}}', fileName + ".css"

    File.open(fileName + ".css", 'w') { |file|
      file.write(@horizontal_css) if options[:horizontal]
      file.write(@vertical_css) if not options[:horizontal]
    }

    File.open(fileName + ".html", 'w') { |file|
      file.write(@html_header.sub('{{CSS_FILE}}', fileName + ".css"))
      file.write(article.get_title(options))
      file.write(article.get_date(options))
      file.write(article.get_content(options))
      file.write(@html_footer)
    }
  end
end

class KindleOutput
  def initialize (article, fileName, options = {})
    title = article.get_title(:ruby => false, :clean => true)

    @opf_file = <<eos
<?xml version="1.0" encoding="UTF-8"?>
<package version="3.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="BookId">
 <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">
   <dc:title>{{TITLE}}</dc:title> 
   <dc:contributor>NHK</dc:contributor>
   <dc:language>ja</dc:language>
   <dc:publisher>NHK</dc:publisher>
 </metadata>
 <manifest>
  <item id="style" href="{{CSS_FILE}}" media-type="text/css" />
  <item id="titlepage" href="{{FILENAME}}.html" media-type="application/xhtml+xml" />
 </manifest>
 <spine toc="tocncx" page-progression-direction="rtl">
  <itemref idref="titlepage" />
 </spine>
</package>
eos
    @opf_file.gsub! '{{TITLE}}', title
    @opf_file.gsub! '{{FILENAME}}', fileName
    @opf_file.gsub! '{{CSS_FILE}}', fileName + ".css"

    Dir.mktmpdir { |dir|
      HTMLOutput.new(article, dir + "/" + fileName, options)
      
      File.open(dir + "/" + fileName + ".opf", 'w') { |file|
        file.write(@opf_file)
      }
      if $is_windows
        system "kindlegen.exe \"#{dir + "/" + fileName}.opf\""
      else
        system "kindlegen \"#{dir + "/" + fileName}.opf\""
      end
      FileUtils.cp dir + "/" + fileName + ".mobi", fileName + ".mobi"
    }
  end
end

# main part

opts = Optimist::options do
  version "JapNewsToKindle 0.2a (c) 2013 Patrick Lerner [PatrickLerner@me.com]"
  banner <<-EOS
This program dumps Japanese News websites into a kindle compatible mobi file using Amazon's kindlegen (needs to be in path!).

Usage:
       JapNewsToKindle [options]
where [options] are:
EOS

  opt :ruby, "Get furigana if possible", :short => 'r'
  opt :url, "The URL that is supposed to be dumped", :type => String, :short => 'u'
  opt :out, "The output filename", :type => String, :short => 'O'
  opt :horizontal, "Use a horizontal layout instead of the default vertical one", :default => false, :short => 'n'
  opt :open, "Open the generated file in the Kindle Application", :default => false, :short => 'o'
end

backends = [
  [/nhk.or.jp\/news\/easy\/k[0-9]+\/k[0-9]+\.html/, NHKEasyArticle],
  [/nhk.or.jp\/news\/html\/[0-9]+\/[a-z][0-9]+\.html/, NHKArticle]
]

backends.each { |b|
  if b[0].match(opts[:url])
    article = b[1].new(opts[:url])
    if opts[:out]
      fileName = opts[:out]
    else
      fileName = article.get_title(:ruby => false, :clean => true)
    end
    KindleOutput.new(article, fileName, {:ruby => opts[:ruby], :horizontal => opts[:horizontal]})

    if opts[:open] and not $is_windows
      system "killall Kindle"
      kindleFilePath = ENV['HOME'] + "/Library/Application Support/Kindle/My Kindle Content/#{fileName}.mobi"
      FileUtils.rm kindleFilePath if File.exists? (kindleFilePath)
      system "open \"#{fileName.to_s}.mobi\""
    end
   exit
  end
}

Optimist::die :url, "must match against a backend supported by this program"

Answer given to me in reddit by a "Mimicry2311" allowing the code to run again.

It tries to find the title of the article in line 29:
@doc.xpath(@XPath_title)
but this returns an empty Nokogiri::XML::NodeSet (which for some reason can't be converted to a string, ultimately causing the error you mentioned)

However if I change the search patterns in line 53 and following to
class NHKEasyArticle < Article
  def initialize (url)
    @doc = Nokogiri::HTML(URI.open(url))
    @XPath_title = '//*[@class="article-main__title"]'
    @XPath_time = '//*[@id="js-article-date"]'
    @XPath_article = '//*[@id="js-article-body"]'
  end
end
it seems to run more smoothly.

_"I know nothing about Ruby and not much more about programming in general"_ – what kind of answer do you expect? What could help you? — Stefan, May 29 '22 at 09:25
Saying I don't know, was just to say don't use too complicated terms when posible so I can try and follow. As you can see I tried fixing it and searching for the answer but with my current knowledge can't find where in the code is the problem, just some guidance to help me narrow down the answer would have been good, but someone already got it to a usable state. Meanwhile, I'm since yesterday learning ruby and nokogiri to make my own scrapper or modify this one that was abandoned but looks beautiful. I will edit my post to shot the answer given to me in reddit, that allows the code to run agn. — Itochi, May 29 '22 at 15:40
If your fix makes the script work again, you can post that as an answer. It is totally okay and welcome to answer your own question! — Josien, May 30 '22 at 13:05

score 0 · Answer 1 · answered Jun 11 '22 at 08:46

Answer given to me in reddit by a "Mimicry2311" allowing the code to run again.

It tries to find the title of the article in line 29:
@doc.xpath(@XPath_title)
but this returns an empty Nokogiri::XML::NodeSet (which for some reason can't be converted to a string, ultimately causing the error you mentioned)

However if I change the search patterns in line 53 and following to
class NHKEasyArticle < Article
  def initialize (url)
    @doc = Nokogiri::HTML(URI.open(url))
    @XPath_title = '//*[@class="article-main__title"]'
    @XPath_time = '//*[@id="js-article-date"]'
    @XPath_article = '//*[@id="js-article-body"]'
  end
end
it seems to run more smoothly.

Trying to fix a ruby script to get japanese news to the kindle

1 Answers1