
I am making a Rails application that crawls flight information from a specific website. The app can be found here: https://vemaybay.herokuapp.com/. It only takes around 4-5 seconds to respond locally, but 15-20 seconds when running on Heroku. Is there any way to speed up this response time? I have already changed the dyno type from free to hobby to avoid spin-up delays, but I believe the DB connection and queries are not the root cause. Is it a hosting problem? If so, I can think about buying a dedicated host.

Below is my example code:

FlightService

def crawl(from, to, date)
  return if flight_not_available?(from, to)

  selected_day = date.day - 1
  browser = ::Ferrum::Browser.new
  browser.headers.set({ "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36" })

  browser.goto("https://www.abay.vn/")
  browser.at_css("input#cphMain_ctl00_btnSearch").click
  browser.back

  browser.execute("document.getElementById('cphMain_ctl00_txtFrom').setAttribute('value','#{from}')")
  browser.execute("document.getElementById('cphMain_ctl00_txtTo').setAttribute('value','#{to}')")
  browser.execute("document.getElementById('cphMain_ctl00_cboDepartureDay').selectedIndex = #{selected_day}")
  browser.at_css("input#cphMain_ctl00_btnSearch").click
  # browser.execute("document.querySelectorAll('a.linkViewFlightDetail').forEach(btn => btn.click())")
  sleep(1)
  body = Nokogiri::HTML(browser.body)

  flight_numbers = body.css("table.f-result > tbody > tr.i-result > td.f-number").map(&:text)
  depart_times   = body.css("table.f-result > tbody > tr.i-result > td.f-time").map { |i| i.text.split(" - ").first }
  arrival_times  = body.css("table.f-result > tbody > tr.i-result > td.f-time").map { |i| i.text.split(" - ").second }
  base_prices    = body.css("table.f-result > tbody > tr.i-result > td.f-price").map(&:text)

  prices = base_prices
  store_flight(flight_numbers, from, to, date, depart_times, arrival_times, base_prices, prices)
rescue StandardError => e
  Rails.logger.error(e.message)
  fail_with_message(e.message)
ensure
  # Quit in `ensure` so Chrome is closed even when an error is raised;
  # `&.` guards the case where Browser.new itself failed.
  browser&.quit
end

Then in my controller I just call the crawl method to fetch the data:

service = FlightService.new(from: @from, to: @to, departure_date: @departure_date, return_date: @return_date)
service.crawl_go_flights
@go_flights = service.go_flights
Tiktac
  • Probably this answers your question. https://stackoverflow.com/questions/2606190/why-are-my-basic-heroku-apps-taking-two-seconds-to-load?answertab=active#tab-top – Lalu Jan 12 '20 at 07:39
  • Could you provide more information? e.g. Ruby/Rails version, production environment configuration, maybe what your app does and what exactly is slow. – Edward Jan 12 '20 at 07:40
  • @Edward, I'm using Ruby 2.6.3 with Rails 6.0.1. What configuration should I consider? I only turned on the config.assets.compile flag in production.rb to serve assets. – Tiktac Jan 12 '20 at 07:58
  • @Lalu it seems you are referring to the first-load problem with the free dyno type, but I'm paying for the hobby type, so my app is not unloaded. – Tiktac Jan 12 '20 at 08:02
  • @Tiktac - Your app loads within 700ms for me, so I thought you were talking about first-load time. If it is slow on every request for you, perhaps it's a network, bandwidth, or DNS issue on your side? – Lalu Jan 12 '20 at 08:30
  • @Lalu, it is the request response time when I click the orange button, not the first-load time. On my local machine it responds in under 5 seconds, but something is really wrong when the Heroku app takes 20 seconds :( – Tiktac Jan 12 '20 at 08:42
  • @Tiktac are you crawling or performing time-consuming operations in controller methods? – Marslan Jan 12 '20 at 09:21
  • It's under 500ms for me when testing your site, so I think you have an internet issue. – Vibol Jan 12 '20 at 09:31
  • @Marslan, as I said, the controller method calls a service to crawl the information; it takes at most 5 seconds to respond on localhost, but as you can see in the app above it takes approximately 20 seconds. That is definitely unacceptable. – Tiktac Jan 12 '20 at 12:24
  • @Coco I think you are talking about the first-load time of my site; the issue happens when you fill in the information and click the orange button. – Tiktac Jan 12 '20 at 12:26
  • Heroku machines are just not very powerful. You should probably move the actual work into a background process instead of running costly business logic in your controller. You can use a framework like [fie](https://fie.eranpeer.co/guide) to easily update your views when new results come in. – Marcus Ilgner Jan 12 '20 at 14:18
  • Thanks @milgner, I will consider updating the view from a background job as you said. – Tiktac Jan 12 '20 at 17:52
  • One thing you might also check is the region Heroku is serving the dyno from. The latency may very well come from the dyno trying to reach the target site. – Edward Jan 12 '20 at 22:27
  • OK, I understand. Scraping another site in real time is never a good idea; good practice is to have a background job do the heavy work X times per day and save the results to the database, so your controller only interacts with your own database. If you worry about accuracy, run the job more frequently. – Vibol Jan 13 '20 at 00:01
  • @Coco I understand my approach is not good practice, but I still wonder about the Heroku problem. – Tiktac Jan 13 '20 at 06:08

1 Answer


I would try adding the New Relic Heroku add-on; it will show you what takes the most time. Most likely it will be your Ruby code doing HTTP requests in a controller action to crawl a page.

Heroku tends to be slower than running code on your own development machine because Heroku resources are shared across users, unless you buy the expensive performance M/L dynos.

Without you sharing the crawling code, we don't know much about how it works or where the bottleneck is. Do you crawl a single page or many pages (the latter might be slow)?

You can try moving the crawl logic to a background worker, for instance with the Sidekiq gem. You could crawl the page from time to time and store the results in your DB; your controller action would then only read data from your DB instead of crawling the page on every request. Alternatively, you could run a rake task every 10 minutes via Heroku Scheduler instead of Sidekiq (this might be faster to set up). I don't know whether data that is up to 10 minutes stale is good enough for your use case; you need to pick the technical solution that fits your business needs. With Sidekiq you could run jobs more often, e.g. every minute, by triggering them with the clockwork gem.
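A minimal sketch of that background-worker split might look like this. All names here are illustrative: `CrawlFlightsJob` is hypothetical, and `FlightService`/`Flight` are assumed to be the classes from the question.

```ruby
require "date"

# Sketch of the background-worker split (class names are illustrative).
# With the real gem you would `require "sidekiq"`, `include Sidekiq::Worker`,
# and enqueue with `CrawlFlightsJob.perform_async(...)`; the job body itself
# is plain Ruby either way.
class CrawlFlightsJob
  # Sidekiq serializes job arguments to JSON, so pass simple values
  # (strings), not Date or ActiveRecord objects.
  def perform(from, to, date_string)
    date = Date.parse(date_string)
    # FlightService#crawl is the method from the question; it stores
    # the scraped flights in the DB via store_flight.
    FlightService.new(from: from, to: to, departure_date: date)
                 .crawl(from, to, date)
  end
end

# The controller action then only enqueues and reads its own tables:
#   CrawlFlightsJob.perform_async(@from, @to, @departure_date.to_s)  # with Sidekiq
#   @go_flights = Flight.where(from: @from, to: @to, date: @departure_date)
```

The point of the split is that the slow Ferrum/Chrome work never runs inside the request cycle, so the action stays well under Heroku's 30-second limit.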

Artur Trzop
  • Thanks for your excellent suggestion. I added the New Relic add-on to check, and as you said, most of the time is spent in the controller action; I don't know why it takes so long on Heroku. I did once end up using clockwork to schedule the crawling, but I faced other technical issues with that solution: I had to run very long tasks to crawl all flights on all days between every source and destination airport, and then run an update job to refresh all of them too. So I decided to crawl the page only once, after a user request. – Tiktac Jan 12 '20 at 17:12
  • My code only crawls the information from a single page, and that page responds very fast. That's why I said it takes only 4 seconds locally. – Tiktac Jan 12 '20 at 17:17
  • I just attached my example code to my question for your reference. – Tiktac Jan 12 '20 at 17:26
  • You are using a real browser via Ferrum::Browser to crawl the page; this can be CPU-intensive and RAM-consuming. Double-check in Heroku Metrics whether your RAM usage exceeds the limit; if you start swapping, your dyno will be super slow. – Artur Trzop Jan 12 '20 at 19:23
  • I would consider rewriting the code to scrape the page in a background worker, for instance with Sidekiq, saving the results in the DB, and then checking with AJAX from time to time whether results have shown up. You could generate a UUID on the client side and start a Sidekiq job from the controller action to crawl the page, passing the UUID to the job. The client browser can then periodically check with AJAX whether results for that particular UUID are in the DB; when the Sidekiq job completes, it stores the results in the DB under that UUID. – Artur Trzop Jan 12 '20 at 19:27
  • Thanks @Artur Trzop, its RAM usage never reaches 300 MB, so I think that is not too much. Let me consider updating the view via a Sidekiq background job. Do you know any good example of this approach? – Tiktac Jan 13 '20 at 02:32
  • I understand my approach is not good practice, but I still wonder about the Heroku question. – Tiktac Jan 13 '20 at 02:42
  • If you are wondering why Heroku is simply slow, that's expected because its resources are shared. Heavy CPU operations can take several times longer on Heroku than on your machine. I used to work for a client who had to buy a very expensive performance-L dyno just to process a request within Heroku's 30-second request limit, because on a cheaper dyno the request exceeded 30 s and timed out. We wanted to move the logic to a background job, but the app was so complex that refactoring it was impractical, and it was simply easier to overpay for fast CPU with the expensive performance-L dyno. – Artur Trzop Jan 13 '20 at 22:28
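  • The UUID handshake described above could be sketched, framework-free, like this (an in-memory Hash stands in for a Flight table tagged with a hypothetical crawl_uuid column; all method names are illustrative):

```ruby
require "securerandom"

RESULTS = {} # stands in for Flight rows tagged with a crawl_uuid column

# "Controller action": generate a UUID, kick off the background job,
# and return the UUID to the client immediately.
def start_crawl(from, to)
  uuid = SecureRandom.uuid
  # In the app: CrawlFlightsJob.perform_async(uuid, from, to, date)
  # When the job finishes it writes its rows under this uuid; simulated here:
  RESULTS[uuid] = ["#{from}-#{to} VN123 06:00 1,200,000 VND"]
  uuid
end

# "Polling endpoint" the client hits with AJAX every few seconds.
def poll(uuid)
  if RESULTS.key?(uuid)
    { done: true, flights: RESULTS[uuid] }
  else
    { done: false } # job still running; client retries later
  end
end

uuid = start_crawl("HAN", "SGN")
poll(uuid)         # done: true, with the stored flights
poll("unknown-id") # done: false
```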