
I have an array of Google News article URLs. These URLs redirect immediately to the real article URLs (e.g. CNBC.com/...). I am trying to extract the real, redirected URL. I thought I could loop through the list, load each Google News link in a WKWebView, and then read webView.url in a DispatchQueue block after 1 second to get the real URL, but this doesn't work.

How can I fetch a list of redirected URLs quickly?

Here's my code you could use to reproduce the problem:

        let webView = WKWebView()
        let myList = [URL(string: "https://news.google.com/articles/CAIiEDthIxbgofssGWTpXgeJXzwqGQgEKhAIACoHCAow2Nb3CjDivdcCMJ_d7gU?hl=en-US&gl=US&ceid=US%3Aen"), URL(string: "https://news.google.com/articles/CAIiEP5m1nAOPt-LIA4IWMOdB3MqGQgEKhAIACoHCAowocv1CjCSptoCMPrTpgU?hl=en-US&gl=US&ceid=US%3Aen")]

        for url in myList {
            guard let link = url else { continue }
            self.webView.load(URLRequest(url: link))

            DispatchQueue.main.asyncAfter(deadline: .now() + 1.0) {
                let redirectedLink = self.webView.url
                print("HERE redirected url: ", redirectedLink as Any) // this does not work
            }
        }
Ryan
  • Isn’t scraping content from somebody else’s site kinda slimy? – Caleb Apr 14 '20 at 02:44
  • Last I checked that's literally what Google News is... a mass aggregator / scraper. – Ryan Apr 14 '20 at 03:26
  • Aggregating isn't the same as scraping. Google News is likely driven by RSS feeds, and when you click on a headline you go to the site that created that content. When you scrape Google's page, though, you're taking advantage of the content that Google created and using it as though it were your own. Google has APIs for a million different things, so maybe there's one for their aggregated news -- if so, use that, and you won't need to scrape anything. If not, then maybe you should consider curating your own set of sources instead. – Caleb Apr 14 '20 at 14:57

1 Answer

There are two problems with your attempt:

1) You reuse a single web view in the loop, and nothing inside the loop waits for it to finish loading, so each iteration cancels the request started by the previous one.

2) Even if you did block inside the loop, accessing the URL after a second won't work reliably since the navigation could easily take longer than that.

I recommend keeping a single web view (to save resources) but using its navigation delegate to resolve the URLs one at a time.

This is a crude example to give you a basic idea:

import UIKit
import WebKit

@objc class RedirectResolver: NSObject, WKNavigationDelegate {

    private var urls: [URL]
    private var resolvedURLs = [URL]()
    private let completion: ([URL]) -> Void
    private let webView = WKWebView()

    init(urls: [URL], completion: @escaping ([URL]) -> Void) {
        self.urls = urls
        self.completion = completion
        super.init()
        webView.navigationDelegate = self
    }

    func start() {
        resolveNext()
    }

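    private func resolveNext() {
        // popLast() takes URLs from the end, so results arrive in reverse input order.
        guard let url = urls.popLast() else {
            completion(resolvedURLs)
            return
        }
        let request = URLRequest(url: url)
        webView.load(request)
    }

    // Called when a page has finished loading; webView.url is then the redirected URL.
    func webView(_ webView: WKWebView, didFinish navigation: WKNavigation!) {
        // Avoid force-unwrapping: skip the entry if the web view reports no URL.
        if let url = webView.url {
            resolvedURLs.append(url)
        }
        resolveNext()
    }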

}


class ViewController: UIViewController {

    private var resolver: RedirectResolver!

    override func viewDidLoad() {
        super.viewDidLoad()

        resolver = RedirectResolver(
            urls: [URL(string: "https://news.google.com/articles/CAIiEDthIxbgofssGWTpXgeJXzwqGQgEKhAIACoHCAow2Nb3CjDivdcCMJ_d7gU?hl=en-US&gl=US&ceid=US%3Aen")!, URL(string: "https://news.google.com/articles/CAIiEP5m1nAOPt-LIA4IWMOdB3MqGQgEKhAIACoHCAowocv1CjCSptoCMPrTpgU?hl=en-US&gl=US&ceid=US%3Aen")!],
            completion: { urls in
                print(urls)
            })
        resolver.start()
    }

}

This outputs the following resolved URLs:

[https://amp.cnn.com/cnn/2020/04/09/politics/trump-coronavirus-tests/index.html, https://www.cnbc.com/amp/2020/04/10/asia-markets-coronavirus-china-inflation-data-currencies-in-focus.html]

Note that the redirection of these particular URLs seems to rely on JavaScript, which means you do need a web view. Otherwise, kicking off URLRequests manually and observing the responses would have been enough.
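For links that redirect at the HTTP level (3xx responses) rather than through JavaScript, the URLRequest approach could look roughly like the sketch below. It relies on URLSession following redirects by default and on `URLResponse.url` holding the URL the response was finally served from. The names `finalURL`, `ResolvedStore` and `resolveHTTPRedirects` are made up for this illustration, and using `HEAD` assumes the servers involved answer HEAD requests. For the Google News links above it would most likely just hand back the Google URLs, since their redirect appears to be JavaScript-driven.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking // URLSession lives in this module on Linux
#endif

// Hypothetical helper: the URL a response was finally served from,
// or `fallback` when the request produced no response.
func finalURL(from response: URLResponse?, fallback: URL) -> URL {
    return response?.url ?? fallback
}

// Hypothetical thread-safe store, since completion handlers may run concurrently.
final class ResolvedStore: @unchecked Sendable {
    private var urls: [URL]
    private let lock = NSLock()

    init(_ urls: [URL]) { self.urls = urls }

    func set(_ url: URL, at index: Int) {
        lock.lock(); defer { lock.unlock() }
        urls[index] = url
    }

    var snapshot: [URL] {
        lock.lock(); defer { lock.unlock() }
        return urls
    }
}

// Hypothetical helper: resolves HTTP-level redirects in parallel and reports
// the results in input order. Failed requests keep their original URL.
func resolveHTTPRedirects(for urls: [URL],
                          completion: @escaping @Sendable ([URL]) -> Void) {
    let store = ResolvedStore(urls)
    let group = DispatchGroup()

    for (index, url) in urls.enumerated() {
        var request = URLRequest(url: url)
        request.httpMethod = "HEAD" // assumption: the server answers HEAD requests
        group.enter()
        URLSession.shared.dataTask(with: request) { _, response, _ in
            store.set(finalURL(from: response, fallback: url), at: index)
            group.leave()
        }.resume()
    }

    // Deliver the results on the main queue once every request has finished.
    group.notify(queue: .main) {
        completion(store.snapshot)
    }
}
```

In an app this could be called as `resolveHTTPRedirects(for: urls) { print($0) }`. Because the requests run in parallel, it should be noticeably faster than loading each page in a web view, at the cost of not executing any JavaScript.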

hennes