I need to log in to Jenkins with a crawler to collect some data, but Net/HTTPS returns an incomplete page compared with the source Jenkins serves to a browser. Here are both sources:

Net/HTTPS' HTML

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <meta http-equiv="refresh" content="1;url=/login?from=%2F">
  <script>
    window.location.replace('/login?from=%2F');
  </script>
</head>

<body style="background-color:white; color:white;">Authentication required</body>

</html>

Nokogiri's XML

=> #
<Nokogiri::HTML::Document:0x1a11444 name="document" children=[#<Nokogiri::XML::DTD:0x1a109b8 name="html">, #
  <Nokogiri::XML::Element:0x1a101ac name="html" children=[#<Nokogiri::XML::Element:0x2047ee4 name="head" children=[#<Nokogiri::XML::Element:0x2047d04 name="meta" attributes=[#<Nokogiri::XML::Attr:0x2047ca0 name="http-equiv" value="refresh">, #
    <Nokogiri::XML::Attr:0x2047c8c name="content" value="1;url=/login?from=%2F">]>, #
      <Nokogiri::XML::Element:0x2047660 name="script" children=[#<Nokogiri::XML::CDATA:0x2047480 "window.location.replace('/login?from=%2F');">]>]>, #
        <Nokogiri::XML::Element:0x20471ec name="body" attributes=[#<Nokogiri::XML::Attr:0x2047188 name="style" value="background-color:white; color:white;">] children=[#
          <Nokogiri::XML::Text:0x2046d50 "Authentication required">]>]>]>

Jenkins' source

<!DOCTYPE html>
<html>

<head resURL="/static/98ff49d3">


  <title>Jenkins</title>
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/css/style.css" />
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/css/color.css" />
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/css/responsive-grid.css" />
  <link rel="shortcut icon" type="image/vnd.microsoft.icon" href="/static/98ff49d3/favicon.ico" />
  <script>
    var isRunAsTest = false;
    var rootURL = "";
    var resURL = "/static/98ff49d3";
  </script>
  <script src="/static/98ff49d3/scripts/prototype.js" type="text/javascript"></script>
  <script src="/static/98ff49d3/scripts/behavior.js" type="text/javascript"></script>
  <script src='/adjuncts/98ff49d3/org/kohsuke/stapler/bind.js' type='text/javascript'></script>
  <script src="/static/98ff49d3/scripts/yui/yahoo/yahoo-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/dom/dom-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/event/event-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/animation/animation-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/dragdrop/dragdrop-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/container/container-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/connection/connection-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/datasource/datasource-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/autocomplete/autocomplete-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/menu/menu-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/element/element-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/button/button-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/storage/storage-min.js"></script>
  <script src="/static/98ff49d3/scripts/hudson-behavior.js" type="text/javascript"></script>
  <script src="/static/98ff49d3/scripts/sortable.js" type="text/javascript"></script>
  <script>
    crumb.init("", "");
  </script>
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/container/assets/container.css" />
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/assets/skins/sam/skin.css" />
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/container/assets/skins/sam/container.css" />
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/button/assets/skins/sam/button.css" />
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/menu/assets/skins/sam/menu.css" />
  <meta name="ROBOTS" content="INDEX,NOFOLLOW" />
  <script src="/static/98ff49d3/scripts/yui/cookie/cookie-min.js"></script>
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/plugin/sectioned-view/sectioned-view.css" />
</head>

<body id="jenkins" data-version="jenkins-1.596.1" class="yui-skin-sam jenkins-1.596.1"><a href="#skip2content" class="skiplink">Skip to content</a>
  <div id="page-head">
    <div id="header">
      <div class="logo">
        <a id="jenkins-home-link" href="/">
          <img id="jenkins-head-icon" alt="title" src="/static/98ff49d3/images/headshot.png" />
          <img id="jenkins-name-icon" height="34" alt="title" width="139" src="/static/98ff49d3/images/title.png" />
        </a>
      </div>
      <div class="login"> <a href="/login?from=%2F"><b>log in</b></a>
        |
        <a href="/signup"><b>sign up</b></a>
      </div>
      <div class="searchbox hidden-xs">
        <form style="position:relative;" name="search" action="/search/" class="no-json" method="get">
          <div id="search-box-minWidth"></div>
          <div id="search-box-sizer"></div>
          <div id="searchform">
            <input id="search-box" placeholder="search" name="q" class="has-default-text" />
            <a href="http://wiki.jenkins-ci.org/display/JENKINS/Search+Box">
              <img style="width: 16px; height: 16px; " class="icon-help icon-sm" src="/static/98ff49d3/images/16x16/help.png" />
            </a>
            <div id="search-box-completion"></div>
            <script>
              createSearchBox("/search/");
            </script>
          </div>
        </form>
      </div>
    </div>
    <div id="breadcrumbBar">
      <tr id="top-nav">
        <td id="left-top-nav" colspan="2">
          <link rel='stylesheet' href='/adjuncts/98ff49d3/lib/layout/breadcrumbs.css' type='text/css' />
          <script src='/adjuncts/98ff49d3/lib/layout/breadcrumbs.js' type='text/javascript'></script>
          <div class="top-sticker noedge">
            <div class="top-sticker-inner">
              <div id="right-top-nav"></div>
              <ul id="breadcrumbs">
                <li class="item"><a class="model-link inside" href="/">Jenkins</a>
                </li>
                <li class="children" href="/"></li>
              </ul>
              <div id="breadcrumb-menu-target"></div>
            </div>
          </div>
        </td>
      </tr>
    </div>
  </div>
  <div id="page-body">
    <div class="row">
      <div id="side-panel">
        <div id="side-panel-content"></div>
      </div>
      <div id="main-panel">
        <div id="main-panel-content">
          <a name="skip2content"></a>
          <div style="margin: 2em;">
            <form style="text-size:smaller" name="login" action="j_acegi_security_check" method="post">
              <table>
                <tr>
                  <td>User:</td>
                  <td>
                    <input type="text" name="j_username" id="j_username" />
                  </td>
                </tr>
                <tr>
                  <td>Password:</td>
                  <td>
                    <input type="password" name="j_password" />
                  </td>
                </tr>
                <tr>
                  <td align="right">
                    <input id="remember_me" type="checkbox" name="remember_me" />
                  </td>
                  <td>
                    <label for="remember_me">Remember me on this computer</label>
                  </td>
                </tr>
              </table>
              <input name="from" value="/" type="hidden" />
              <input name="Submit" value="log in" class="submit-button primary" type="submit" />
              <script>
                $('j_username').focus();
              </script>
            </form>
            <div style="margin-top:2em"><a href="signup">Create an account</a> if you are not a member yet.</div>
          </div>
        </div>
      </div>
    </div>
  </div>
  <div id="footer-container" class="hidden-xs">
    <div id="footer"><span class="page_generated">
          Page generated:
          May 5, 2015 1:09:35 PM</span><span class="rest_api"><a href="api/">REST API</a></span><span class="jenkins_ver"><a href="http://jenkins-ci.org/">Jenkins ver. 1.596.1</a></span>
      <div id="l10n-dialog" class="dialog"></div>
      <div id="l10n-footer" style="display:none; float:left">
        <a href="#" onclick="return showTranslationDialog();">
          <img src="/static/98ff49d3/plugin/translation/flags.png" />Help us localize this page
        </a>
      </div>
      <script>
        var footer = document.getElementById('l10n-footer');
        var f = document.getElementById('footer');
        f.insertBefore(footer, f.firstChild);
        footer.style.display = "block";

        var translation = {};
        translation.bundles = "6CPNEARN8E/l4k/4nMQznROeAYoCO7auJUGWM6qMGBK2/ELamFqR7whqOnrQ+pYEU4X6xVw11/3WEM16VclDS66Hi2QY5S41H0NSwFiE07KHND+iP3c2Zb4MiiqIOrGRLMJEPdu/j3QYQ5Yp2rkj/ISZWOGFVY86zs/0JsDEw+VJN9dlaSkRcelDKNfziTE/8K7Sabhhd0we7ATzNTgNrfenUCaCdwR7BqPc7354m+fmVz7/8DpcYBMzl78E3+DpUF6sJa18uD7OkgPMNYz8lIM9Bx1ZXanyOk49M8Sea9qj+teMndv9kiyawWnloiBlg3KdK0DfZs1v+RbCQ/HnYcIcjAZVgKTYD2S0GpSj5oHMFQeTemQRnbj6WMon3u7Z8q3np+0Ucgxcs1LfKqprNmeugoD5jIxCuHhHCQvaHdw=";
        translation.detectedLocale = "";

        function showTranslationDialog() {
          if (!translation.launchDialog)
            loadScript("/static/98ff49d3/plugin/translation/dialog.js");
          else
            translation.launchDialog();
          return false;
        }
      </script>
    </div>
  </div>
</body>

</html>

I need these lines from the Jenkins source so that I can fill in the form and log in:

<input type="text" name="j_username" id="j_username" />
<input type="password" name="j_password" />
<input name="Submit" value="log in" class="submit-button primary" type="submit" />

and here is the code I'm running to fetch this data:

require 'rubygems'
require 'nokogiri'
require 'net/https'
require 'openssl'
require 'mechanize'

class JenkinsTest
  # Request the Jenkins webpage
  def request_jenkins_webpage
    uri = URI.parse("https://jenkinspage.com:8443")
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE
    request = Net::HTTP::Get.new(uri.request_uri)
    response = http.request(request)
    @@page = Nokogiri::HTML(response.body)
  end

  def print_jenkins_webpage
    puts @@page
  end
end

A few extra notes: the network sits behind a proxy that requires no login or password, and the Jenkins certificate is self-signed.

My question is: why does this happen, and how can I fix it?

Thanks in advance!

  • First, Nokogiri has nothing to do with retrieving the content, that would be Net::HTTP in your case. Next, Jenkins is using JavaScript to replace the page's content. You can't do what you want with Nokogiri or Mechanize. Instead, you have to use something like Watir, which can drive a browser, which can process the JavaScript which will then retrieve the content you want. Or, you can replicate what the JavaScript will do and grab the necessary URL and then retrieve it. – the Tin Man May 05 '15 at 19:28
  • Thanks @theTinMan, I'll do it and update it here. – Filipe Gorges Reuwsaat May 06 '15 at 12:14
  • To follow up on @TheTinMan's comment, the JS `window.location.replace('/login?from=%2F')` is simply a redirect to a login page. You should fetch that URL instead and log in. Who knows, you may be able to use Mechanize after your login cookie is set. – Mark Thomas May 06 '15 at 13:13
  • @theTinMan I've just tested with Watir; I request a Firefox browser, go to the Jenkins page and fill in the login, but Jenkins won't allow me to fill in the password through the script. Is there a way to force it? – Filipe Gorges Reuwsaat May 06 '15 at 18:58
  • And @Mark Thomas, I've attempted to go straight to it, but it still leads me to the same page. – Filipe Gorges Reuwsaat May 06 '15 at 18:58
  • There are lots of ways for the server side to sniff what is connecting, and to disallow it. It's impossible to tell what the problem is from our side unless you tell us a lot more about what you're seeing. – the Tin Man May 06 '15 at 19:03
  • What page URL would that be? You haven't specified. – Mark Thomas May 06 '15 at 20:07
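
The redirect described in the comments can also be read out of the stub page itself rather than hard-coded. Below is a minimal sketch, assuming the stub always carries a `<meta http-equiv="refresh">` tag shaped like the one in the question; `refresh_target` is a hypothetical helper name, and the regular expression is only meant for this small, known page, not for HTML in general.

```ruby
require 'uri'

# Hypothetical helper: extract the target of a <meta http-equiv="refresh"> tag
# and resolve it against the URL that was originally requested.
# Returns nil when the page has no such tag.
def refresh_target(html, base_url)
  pattern = /<meta[^>]+http-equiv=["']refresh["'][^>]*content=["']\s*\d+\s*;\s*url=([^"']+)["']/i
  match = html.match(pattern)
  return nil unless match

  URI.join(base_url, match[1]).to_s
end

# Trimmed copy of the stub page that Net/HTTPS returned in the question.
stub = <<~HTML
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta http-equiv="refresh" content="1;url=/login?from=%2F">
  </head>
  <body>Authentication required</body>
HTML

puts refresh_target(stub, "https://jenkinspage.com:8443")
```

The returned URL is the login page to request next. Whether Jenkins then serves the login form to a given client is a separate matter that this sketch does not address.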

1 Answer

Thanks to the help of @theTinMan, @MarkThomas and a colleague, I managed to log in to Jenkins and collect the page's XML using Mechanize and Nokogiri:

require 'rubygems'
require 'nokogiri'
require 'net/https'
require 'openssl'
require 'mechanize'

# JenkinsXML logs into Jenkins and gets an XML version of the HTML page.
class JenkinsXML
  # Jenkins' URIs.
  JENKINS_LOGIN_URI = "https://jenkinspage.com:8443/login?from=%2F"
  JENKINS_PAGE_URI = "https://jenkinspage.com:8443"

  # Log into Jenkins. The agent keeps its cookies for later requests.
  def log_into_jenkins
    @mechanize_agent = Mechanize.new
    @mechanize_agent.user_agent_alias = "Windows IE 7"
    # The certificate is self-signed, so verification is disabled.
    @mechanize_agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
    page = @mechanize_agent.get(JENKINS_LOGIN_URI)

    # forms[0] is the search box; the login form is the second form on the page.
    form = page.forms[1]

    form.j_username = "username-here"
    form.j_password = "password-here"
    @mechanize_agent.submit(form)
  end

  # Get Jenkins' HTML (call log_into_jenkins first).
  def get_jenkins_html
    @jenkins_html = @mechanize_agent.get(JENKINS_PAGE_URI).body
  end

  # Get Jenkins' XML (call get_jenkins_html first).
  def get_jenkins_xml
    @jenkins_xml = Nokogiri::HTML(@jenkins_html)
  end
end
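
For comparison, here is a rough sketch of the request that submitting the login form amounts to, built with plain Net::HTTP. The field names and the `j_acegi_security_check` action are taken from the Jenkins source in the question; `build_login_request` and `LOGIN_PAGE` are hypothetical names, and I have not run this against a live Jenkins. The sketch only builds the request and prints it; nothing is sent.

```ruby
require 'net/http'
require 'uri'

# Login page URL from the answer above.
LOGIN_PAGE = "https://jenkinspage.com:8443/login?from=%2F"

# Hypothetical helper: build the POST that the login form would send.
def build_login_request(username, password)
  # The form's relative action resolves against the login page URL.
  target = URI.join(LOGIN_PAGE, "j_acegi_security_check")
  request = Net::HTTP::Post.new(target.request_uri)
  # Field names copied from the login form in the question.
  request.set_form_data(
    "j_username" => username,
    "j_password" => password,
    "from"       => "/",
    "Submit"     => "log in"
  )
  request
end

request = build_login_request("username-here", "password-here")
puts request.path
puts request.body
```

To actually send it, the request would go through a `Net::HTTP` object configured as in the question (`use_ssl = true` and the same `verify_mode`). Any session cookie returned in the `Set-Cookie` header would then have to be copied onto later requests by hand, which is the bookkeeping Mechanize does automatically.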