I have a file that contains a long list of URLs. I want to use Google Refine to get the HTTP status code returned when each URL is opened. The URLs are stored in one column, one URL per cell. The HTTP status codes should be stored in a new column. There are three languages available in Google Refine: Clojure, Jython and GREL. I am pretty new to programming.

M Novakova
    It's hard for those of us who are comfortable with these languages, but not with OpenRefine (formerly Google Refine), to give answers without some more code context. Would a snippet of Clojure that fetches the headers for a URL be useful to you? – Arthur Ulfeldt Nov 20 '15 at 21:14

1 Answer

In Clojure, to get a response code you can open a connection and then ask it for the response code. Here is an example that uses only the built-in java.net classes, so you won't have to include any libraries (I don't know how easy that is from within this program):

hello.core> (.. (java.net.URL. "http://google.com/index.html")
                openConnection
                getResponseCode)
200

It would be more normal for a Clojure application to use an HTTP library such as http-kit, which does this more cleanly. So if you can easily include libraries, I would take that route and save a couple of lines of code.

PS: you may also want to close the connection afterwards:

hello.core> (let [connection (.openConnection (java.net.URL. "http://google.com/index.html"))
                  response (.getResponseCode connection)]
              (.. connection      ;; yep, Java's strange
                  getInputStream  ;; closing the input stream closes its connection
                  close)          ;; so most people use http-kit
              response)
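
As a hedged variation on the snippet above: `getInputStream` can itself throw for error responses such as 404, which is awkward when the goal is to record those codes. The sketch below calls `disconnect` instead and returns `nil` when the request fails outright. The function name `status-code`, the HEAD request method and the 5-second timeouts are illustrative assumptions, not part of the original answer, and some servers reject HEAD requests.

```clojure
(import '(java.net URL HttpURLConnection))

(defn status-code
  "Returns the HTTP status code for url-string, or nil if the request fails.
  Hypothetical helper: HEAD and the timeouts are assumptions for illustration."
  [url-string]
  (try
    (let [^HttpURLConnection conn (.openConnection (URL. url-string))]
      (try
        (.setRequestMethod conn "HEAD")  ; ask for headers only, no body
        (.setConnectTimeout conn 5000)   ; milliseconds
        (.setReadTimeout conn 5000)
        (.getResponseCode conn)
        (finally
          (.disconnect conn))))          ; release the connection either way
    (catch Exception _ nil)))            ; malformed URL, TLS or network error
```

From a REPL this would be called as `(status-code "http://google.com/index.html")`. Inside OpenRefine the argument would presumably be the cell's `value`, though that has not been checked here.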
Arthur Ulfeldt
  • Arthur, thanks for your help, but this script (only the second one worked in OpenRefine) does not give satisfactory results in my case. It works with http addresses, but my URLs are https. When I type an address manually in the search bar as "http...", it redirects me to "https..." and then I can see the webpage (so the response code should be 200). However, using this script in OpenRefine with "http..." shows code 301, and with "https..." it shows an empty box, which means an error. Sorry for the delayed answer; I appreciate your help! – M Novakova Dec 30 '15 at 21:03
  • Oracle has some opinions about which TLS/SSL certs are to be trusted that differ from other places (like everywhere else), so debugging TLS connections is best done interactively from the REPL. It's almost always possible to figure it out by fiddling with the parameters to openConnection and friends. Best of luck! – Arthur Ulfeldt Dec 30 '15 at 21:57
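
One possible explanation for the 301 reported in the comments: `HttpURLConnection` follows redirects by default, but not when the redirect switches protocol from http to https. A minimal sketch of following the `Location` header by hand is shown below. The names `redirect?`, `fetch-once` and `final-status`, the hop limit of 5 and the timeouts are all hypothetical choices for illustration. It does not address the TLS error on https URLs; that case still yields `nil`.

```clojure
(import '(java.net URL HttpURLConnection))

(defn redirect?
  "True for the status codes that normally carry a Location header."
  [code]
  (contains? #{301 302 303 307 308} code))

(defn fetch-once
  "Makes one request without following redirects.
  Returns a map with the status :code and the :location header (may be nil)."
  [url-string]
  (let [^HttpURLConnection conn (.openConnection (URL. url-string))]
    (try
      (.setInstanceFollowRedirects conn false) ; redirects are handled by the caller
      (.setConnectTimeout conn 5000)           ; milliseconds, assumed values
      (.setReadTimeout conn 5000)
      {:code     (.getResponseCode conn)
       :location (.getHeaderField conn "Location")}
      (finally
        (.disconnect conn)))))

(defn final-status
  "Follows up to max-hops redirects and returns the last status code seen,
  or nil if any request fails (malformed URL, TLS or network error)."
  ([url-string] (final-status url-string 5))
  ([url-string max-hops]
   (try
     (loop [url url-string, hops 0]
       (let [{:keys [code location]} (fetch-once url)]
         (if (and (redirect? code) location (< hops max-hops))
           ;; resolve a possibly relative Location against the current URL
           (recur (str (URL. (URL. url) location)) (inc hops))
           code)))
     (catch Exception _ nil))))
```

If the hop limit is reached, the last redirect code is returned as-is, so a redirect loop shows up as a 3xx value rather than hanging.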