
I am using BeautifulSoup to get data from a webpage. The webpage shows a date, which I can see when I open the page in Firefox. However, when I view the page source there is no date, just some JavaScript that generates it. I have seen related questions here with references to AJAX and JSON, but I am only an amateur programmer and remain confused. Here is the part of the HTML that contains the JavaScript with the date I need.

<div class="match-details">
  <p class="floatleft">
    BARCLAYS PREMIER LEAGUE 

    <span>
      <script type="text/javascript">
        (function(){
        var d = new Date(1345489200000);

        var year = d.getFullYear();
        var month = d.getMonth() + 1;
        var day = d.getDate();
        var minutes = d.getMinutes();
        var hours = d.getHours();                                        

        if (minutes < 10) { minutes = '0' + minutes; }
        var dmy = [day, month, year];
        var hm = [hours, minutes];
        if (SITE_EDITION == 'us/en') {
            var dmy = [month, day, year];    
        }
        var matches_local = dmy.join('/') + " " + hm.join(':'); 
        matches_local += "<span class='live-red'>*</span>";

        document.write(matches_local);
        })();                                                       
      </script>
    </span>

  </p>
</div>
entropy
appleLover
  • So what is your question? – Burhan Khalid Mar 01 '13 at 20:25
  • Could you outdent the code a bit? There's no need for pushing it off the page... – Tim Pietzcker Mar 01 '13 at 20:28
  • @BurhanKhalid That code outputs a date to the page when run in a browser. He wants to know how to get that date programmatically when screen-scraping with Python. – entropy Mar 01 '13 at 20:30
  • @TimPietzcker I edited it to fix indentation but we'd have to wait for people to review the edit and accept it – entropy Mar 01 '13 at 20:30
  • @appleLover As far as I know, this might not be possible without a full-fledged JavaScript engine like those that run in browsers. Have a look at http://phantomjs.org/, which provides a browser you can access programmatically. – entropy Mar 01 '13 at 20:32
  • Are you just trying to find the string `new Date(1345489200000);` and turn that into a Python `datetime` object? Or are you trying to read the page rendered by this JavaScript and extract a date from the resulting HTML? – abarnert Mar 01 '13 at 20:58
  • abarnert, that is exactly what I want to do, turn the string new Date(1345489200000); into a Python datetime object. At first I assumed those were useless numbers as I don't see a date inside of that. Even now I still don't see how to turn that into a date. – appleLover Mar 02 '13 at 01:32
  • I opened a new thread, since it seems I needed a better understanding of JavaScript rather than some crazy new library to solve this simple problem. Problem solved here: http://stackoverflow.com/questions/15179738/parsing-javascript-date – appleLover Mar 02 '13 at 22:17
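
For readers landing here with the same problem, the conversion discussed in the comments can be sketched with the standard library alone. A numeric argument to JavaScript's `new Date(...)` is the number of milliseconds since the Unix epoch (UTC), so dividing by 1000 gives a value Python's `datetime` accepts. The function name `extract_js_date` and the regular expression are illustrative assumptions, not part of the original page, and the result is reported in UTC rather than in the viewer's local time zone as the page's script would do.

```python
import re
from datetime import datetime, timezone

# Script text as it appears in the page source (shortened to the relevant line).
script_text = "var d = new Date(1345489200000);"


def extract_js_date(text):
    """Hypothetical helper: find `new Date(<milliseconds>)` and convert it.

    Returns a timezone-aware UTC datetime, or None if no match is found.
    """
    match = re.search(r"new Date\((\d+)\)", text)
    if match is None:
        return None
    millis = int(match.group(1))
    # JavaScript timestamps are in milliseconds; Python expects seconds.
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


kickoff = extract_js_date(script_text)
print(kickoff.isoformat())  # 2012-08-20T19:00:00+00:00
```

In practice `script_text` would come from whatever the scraper returns for the `<script>` element; the regular expression only assumes the timestamp is written as a plain integer literal.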

1 Answer


BeautifulSoup is an HTML processing library. You need an HTML + JavaScript processing library.

Read up on this question: Programmatic Python Browser with JavaScript

As that Q&A states, you basically need to either drive a real browser (via Selenium) or use a Python browser that supports JavaScript (like Spynner).
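
To illustrate why an HTML-only parser cannot see the rendered date, here is a minimal sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup (a substitution made so the example has no third-party dependencies). The class name `ScriptCollector` and the shortened HTML snippet are assumptions for illustration. The parser returns the script's source text untouched, because nothing ever executes `document.write`.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Hypothetical helper: collects the raw text of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies arrive as plain text; the JavaScript is never run.
        if self.in_script:
            self.scripts[-1] += data


# Shortened version of the markup from the question.
html = """<div class="match-details"><p class="floatleft">BARCLAYS PREMIER LEAGUE
<span><script type="text/javascript">
var d = new Date(1345489200000);
document.write(d.getDate());
</script></span></p></div>"""

collector = ScriptCollector()
collector.feed(html)
collector.close()
print(collector.scripts[0].strip())
```

The printed text is the JavaScript source itself, not a formatted date, which is the gap a real browser or a JavaScript-capable Python browser fills.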

Jonathan Vanasco
  • Thanks for the response. I am looking at PyV8, but unfortunately I am having a hard time getting it set up on Ubuntu. The people maintaining the PyV8 site recommend using the prebuilt version, but there is no prebuilt version for Linux. I am going to open a new thread specifically asking how JavaScript parses the line above. I think that will be simpler. – appleLover Mar 02 '13 at 21:34
  • Sorry, I was not clear and have edited my response. You need an HTML + JavaScript processing library. PyV8 will only let you run JavaScript. It won't parse the page and tell you which JavaScript to run. You need a JavaScript-supporting HTML browser to trigger the correct events and allow the DOM to be manipulated. – Jonathan Vanasco Mar 03 '13 at 20:18