I am scraping a website that uses an Oracle ADF loopback script, which keeps redirecting me back to my own page. How can I bypass it?

Here is my PHP code:

<?php
    $url = 'https://www.mywebsite.com/faces/index.jspx';
    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__) . '/cookie.txt');
    curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . '/cookie.txt');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Use one variable name for the header list ($header was built, but $headers was passed)
    $headers = ['User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'];
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $data = curl_exec($ch);
    if (curl_errno($ch)) { // check for execution errors before closing the handle
      echo 'Scraper error: ' . curl_error($ch);
      curl_close($ch);
      exit;
    }
    curl_close($ch);
    echo $data;
?>

When I run the above code, I am redirected back to the same page, and query string parameters such as ?_afrLoop=39478247795404&_afrWindowMode=0&_afrWindowId=null are appended to the URL.

On the actual site, _afrWindowId holds a random alphanumeric string, but I get null.

After stopping the redirection manually, I found that the page contains the following Oracle loopback script, which causes the redirection. What should I do?

Loopback script:

    <html lang="el-GR"><head><script>
/*
** Copyright (c) 2008, Oracle and/or its affiliates. All rights reserved.
*/

/**
 * This is the loopback script to process the url before the real page loads. It introduces
 * a separate round trip. During this first roundtrip, we currently do two things: 
 * - check the url hash portion, this is for the PPR Navigation. 
 * - do the new window detection
 * the above two are both controled by parameters in web.xml
 * 
 * Since it's very lightweight, so the network latency is the only impact. 
 * 
 * here are the list of will-pass-in parameters (these will replace the param in this whole
 * pattern: 
 *        viewIdLength                           view Id length (characters), 
 *        loopbackIdParam                        loopback Id param name, 
 *        loopbackId                             loopback Id,
 *        loopbackIdParamMatchExpr               loopback Id match expression, 
 *        windowModeIdParam                      window mode param name, 
 *        windowModeParamMatchExpr               window mode match expression, 
 *        clientWindowIdParam                    client window Id param name, 
 *        clientWindowIdParamMatchExpr           client window Id match expression, 
 *        windowId                               window Id, 
 *        initPageLaunch                         initPageLaunch, 
 *        enableNewWindowDetect                  whether we want to enable new window detection
 *        jsessionId                             session Id that needs to be appended to the redirect URL
 *        enablePPRNav                           whether we want to enable PPR Navigation
 *
 */

var id = null; 
var query = null; 
var href = document.location.href; 
var hashIndex = href.indexOf("#"); 
var hash = null;

/* process the hash part of the url, split the url */
if (hashIndex > 0) 
{ 
  hash = href.substring(hashIndex + 1); 
  /* only analyze hash when pprNav is on (bug 8832771) */
  if (false && hash && hash.length > 0) 
  { 
    hash = decodeURIComponent(hash); 
    if (hash.charAt(0) == "@") 
    { 
      query = hash.substring(1); 
    } 
    else 
    { 
      var state = hash.split("@"); 
      id = state[0]; 
      query = state[1]; 
    } 
  } 
  href = href.substring(0, hashIndex); 
} 

/* process the query part */
var queryIndex = href.indexOf("?"); 
if (queryIndex > 0) 
{
  /* only when pprNav is on, we take in the query from the hash portion */
  query = (query || (id && id.length>0))? query: href.substring(queryIndex); 
  href = href.substring(0, queryIndex); 
} 

var jsessionIndex = href.indexOf(';');
if (jsessionIndex > 0)
{
  href = href.substring(0, jsessionIndex);
}

/* we will replace the viewId only when pprNav is turned on (bug 8832771) */
if (false) 
{
  if (id != null && id.length > 0) 
  { 
    href = href.substring(0, href.length - 11) + id;
  } 
}

var isSet = false; 
if (query == null || query.length == 0) 
{ 
  query = "?"; 
} 
else if (query.indexOf("_afrLoop=") >= 0) 
{ 
  isSet = true; 
  query = query.replace(/_afrLoop=[^&]*/, "_afrLoop=39279593944826"); 
} 
else 
{ 
  query += "&"; 
} 
if (!isSet) 
{ 
  query = query += "_afrLoop=39279593944826"; 
} 

/* below is the new window detection logic */
var initWindowName = "_afr_init_"; // temporary window name set to a new window
var windowName = window.name;

// if the window name is "_afr_init_", treat it as redirect case of a new window
if ((true) && (!windowName || windowName==initWindowName || 
    windowName!="null"))  
{ 
  /* append the _afrWindowMode param */
  var windowMode;
  if (true) 
  {
    /* this is the initial page launch case, 
       also this could be that we couldn't detect the real windowId from the server side */
    windowMode=0;
  }
  else if ((href.indexOf("/__ADFvDlg__") > 0) || (query.indexOf("__ADFvDlg__") >= 0))
  {
    /* this is the dialog case */
    windowMode=1;
  }
  else 
  {
    /* this is the ctrl-N case */
    windowMode=2;
  }

  if (query.indexOf("_afrWindowMode=") >= 0) 
  { 
    query = query.replace(/_afrWindowMode=[^&]*/, "_afrWindowMode="+windowMode); 
  } 
  else 
  { 
    query = query += "&_afrWindowMode="+windowMode; 
  } 

  /* append the _afrWindowId param */
  var clientWindowId;
  /* in case we couldn't detect the windowId from the server side */
  if (!windowName || windowName == initWindowName) 
  {
    clientWindowId = "null";

    // set window name to an initial name so we can figure out whether a page is loaded from
    // cache when doing Ctrl+N with IE
    window.name = initWindowName;
  }
  else 
  {
    clientWindowId = windowName;
  }  

  if (query.indexOf("_afrWindowId=") >= 0) 
  { 
    query = query.replace(/_afrWindowId=\w*/, "_afrWindowId="+clientWindowId); 
  } 
  else 
  { 
    query = query += "&_afrWindowId="+clientWindowId; 
  } 

}

var sess = "";

if (sess.length > 0)
  href += sess; 

/* if pprNav is on, then the hash portion should have already been processed */
if ((false) || (hash == null))
  document.location.replace(href + query);
else 
  document.location.replace(href + query + "#" + hash);
</script>
</head>
</html>
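
For reference, the URL rewriting that the loopback script performs can be approximated in PHP. This is only a sketch under assumptions: buildLoopbackUrl() is a hypothetical helper (not part of ADF or cURL), it assumes window.name is empty as it would be outside a browser, and whether the server then returns the real page is not confirmed here.

```php
<?php
// Sketch only: mirrors the query-string rewriting done by the loopback script.
// buildLoopbackUrl() is a hypothetical helper, not an ADF or cURL API.
function buildLoopbackUrl(string $pageUrl, string $loopbackHtml): ?string
{
    // The script hard-codes the loop id, e.g. "_afrLoop=39279593944826".
    if (!preg_match('/_afrLoop=(\d+)/', $loopbackHtml, $m)) {
        return null; // no loop id found, so this is not a loopback page
    }
    $loopId = $m[1];

    // Drop the fragment, then split off the query string.
    $base  = explode('#', $pageUrl, 2)[0];
    $parts = explode('?', $base, 2);
    $href  = $parts[0];
    $query = $parts[1] ?? '';

    // Drop any ";jsessionid=..." path parameter, as the script does.
    $href = explode(';', $href, 2)[0];

    // Set or replace the three parameters the script manages.
    $params = [
        '_afrLoop'       => $loopId,
        '_afrWindowMode' => '0',    // the "initial page launch" branch
        '_afrWindowId'   => 'null', // assumption: window.name is empty
    ];
    foreach ($params as $name => $value) {
        $pattern = '/(^|&)' . preg_quote($name, '/') . '=[^&]*/';
        if (preg_match($pattern, $query)) {
            $query = preg_replace($pattern, '${1}' . $name . '=' . $value, $query, 1);
        } else {
            $query .= ($query === '' ? '' : '&') . $name . '=' . $value;
        }
    }

    return $href . '?' . $query;
}
```

A scraper could call this on the body returned by curl_exec() and, when the result is not null, request that URL with the same cookie jar instead of echoing the script to the browser.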
Umar Abdullah
Haritsinh Gohil

1 Answer

The right way to crawl ADF pages is to pass the following parameter in the URL:

*domain.com*?org.apache.myfaces.trinidad.outputMode=webcrawler

Add it to every GET request the script makes. Keep in mind that when you switch to crawler mode, the pages will look different, since that mode is not meant for human consumption, but they should still contain all the raw details you would want to crawl.
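
With PHP cURL, this simply means changing the URL string before the request is made. A minimal sketch, assuming a hypothetical helper named withCrawlerMode() (not part of any library); it appends the parameter by hand because parse_str() would rewrite the dots in the parameter name as underscores:

```php
<?php
// Hypothetical helper: appends the crawler output-mode parameter to a URL.
function withCrawlerMode(string $url): string
{
    $param = 'org.apache.myfaces.trinidad.outputMode=webcrawler';

    // Keep any fragment at the very end of the URL.
    $fragment = '';
    $hashPos = strpos($url, '#');
    if ($hashPos !== false) {
        $fragment = substr($url, $hashPos);
        $url = substr($url, 0, $hashPos);
    }

    // Choose the separator based on whether a query string already exists.
    if (strpos($url, '?') === false) {
        $sep = '?';
    } elseif (in_array(substr($url, -1), ['?', '&'], true)) {
        $sep = '';
    } else {
        $sep = '&';
    }

    return $url . $sep . $param . $fragment;
}
```

For example, curl_init(withCrawlerMode('https://www.mywebsite.com/faces/index.jspx')) would request the page with the parameter attached. Whether a given ADF version still honors the parameter is not guaranteed.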

Although this is an old question and the OP may have long since moved on, I am answering here to help anyone else who hits the same problem.

Ashwin Prabhu
  • Ashwin, I am using the PHP cURL library and cannot set the output mode as you described. I think it can be set in ADF but not in PHP. – Haritsinh Gohil Mar 01 '19 at 11:59
  • I am referring to the URL parameter that you make the request with. – Ashwin Prabhu Mar 01 '19 at 15:05
  • Interesting... is this documented somewhere officially? Or is it an unpublished "feature"? – Ryan May 07 '21 at 20:12
  • I can no longer find this in the repo; it was most likely discontinued. I remembered it from the time I used to contribute there years ago. If someone still needs this kind of ability, email mode seems to be the closest, since I recollect that it inlines all CSS and turns off all types of dynamic scripts. The email param appears to be "org.apache.myfaces.trinidad.agent.email", from https://github.com/apache/myfaces-trinidad/blob/master/trinidad-impl/src/main/java/org/apache/myfaces/trinidadinternal/agent/AgentFactoryImpl.java – Ashwin Prabhu May 08 '21 at 09:19