2

The requests library works perfectly in retrieving csv or txt files from google docs (from How do you save a Google Sheets file as CSV from Python 3 (or 2)?)

But when i tried to do the same for pdf files in google doc, i only manage to get the HTML file, is there any way for me to download pdf files from google doc? e.g. https://docs.google.com/file/d/0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj/edit

I've tried using requester and i've got this:

>>> import requests # https://pypi.python.org/pypi/requests
>>> gdoc = 'https://docs.google.com/file/d/0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj/edit'
>>> print requests.get(gdoc).text

outputs:

<!DOCTYPE html><html><head><meta name="google" content="notranslate"><meta http-equiv="X-UA-Compatible" content="IE=edge;"><meta name="fragment" content="!"><title>The Starfish Story (Translation in Navajo).pdf - Google Drive</title><style type="text/css">#gbar,#guser{font-size:13px;padding-right:8px;padding-top:4px !important;}#gbar{padding-left:8px;height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-right:.5em;vertical-align:top}#gbar{float:left}}a.gb1,a.gb4{text-decoration:underline !important}a.gb1,a.gb4{color:#00c !important}.gbi .gb4{color:#dd8e27 !important}.gbf .gb4{color:#900 !important}</style><script>_docs_flag_initialData={"jobset":"prod","docs-aiiws":"docs_warm_sdf","info_params":{},"uls":"","icso":false,"docs_eoal":true,"docs_oogt":"NONE","docosEmbedApiJs":"\/\/docs.google.com\/comments\/d\/AAHRpnXu2c4T_cvcH9MyrCUHPNj25CBhn1z7azmidPK7l5vEFT86M59YW7kmm6hTnmTuic9OmYbD43mFsbHo7FXIzRxICAm6TFMaL7q9d34z6-gL59HUNgpG3DAaoty1Q1eA5v7R_WCJU\/api\/js?hl=de","docosUnreadCommentsEnabled":false,"docs-egc":true,"docs-chat_base_url":"talkgadget.google.com\/talkgadget\/","docs-chat_domain_rotation":true,"docs-ce":true,"docs-ut":2,"promo_url":"","promo_title":"","promo_title_prefix":"","promo_content_html":"","promo_element_id":"","promo_orientation":1,"promo_show_on_click":false,"promo_show_on_load":false,"show_promo":false,"docs-encp":false,"buildLabel":"texmex_2013-49-Thu_RC1","buildClNumber":"57718063","debugTask":"ve_32","docs-show_debug_info":false,"dcau":"https:\/\/chrome.google.com\/webstore\/detail\/apdfllckaahabafndbhieahigkjlhalf","ondlburl":"\/\/docs.google.com","drive_url":"\/\/drive.google.com","docs-sup":"\/file","docs-uptc":["lsrp","usp","urp","utm_source","utm_medium","utm_campaign","utm_term","utm_content"],"docs-cwsd":"","docs-al":[0,0,0,1,0]
,"docs-ndt":"Untitled Texmex","docs-eit":false,"docs-spfe":true,"docs-mriim":1800000,"docs-ecc":false,"docs-mnumea":false,"docs-ess":false,"ecbsl":true,"ecid":true,"eod":true,"docs-eilb":false,"docs-pedd":true,"docs-evr":true,"docs-eir":false,"docs-enmr":false,"docs-esrd":false,"share_ui":"jfk","server_time_ms":1387227430022,"gaia_session_id":"","enable_iframed_embed_api":true,"cup":"\/folder\/d\/{folderId}\/edit","docs-fut":"\/\/docs.google.com\/#folders\/{folderId}","esid":true,"esubid":false,"docs-etbs":true,"enable_kennedy":true,"onePickImportDocumentUrl":"","opbu":"https:\/\/docs.google.com\/picker","opru":"https:\/\/docs.google.com\/relay.html","opdu":false,"ophi":"texmex","opuci":"","docs-se":false,"docs-ebcrsct":false,"docs-iror":false,"xdbcmUri":"https:\/\/docs.google.com\/file\/xdbcm.html","xdbcfAllowXpc":true,"docs-corsbc":false,"xdbcfAllowHostNamePrefix":true,"docs-spdy":false,"enable_client_docos":true,"enable_anchored_docos":true,"enable_docos_tickle":true,"gv_int_native":true,"enable_a11y":true,"tpc":true,"enable_pinned_revisions":false,"enable_edit_blob_revisions":false,"upload_url":"https:\/\/docs.google.com\/upload\/resumableupload","enable_toolbar":true,"enable_feedback_button":false,"enable_microscope":true,"enable_manage_timed_text":true,"video_embed_type":"PREFER_FLASH","enable_maps_embed":true,"maps_api_uri":"https:\/\/maps.googleapis.com\/maps\/api\/js?key=AIzaSyBCjpnguVjzi6vS67NdBtyYuvCYz3yBxCY&sensor=false","maps_display_uri":"https:\/\/maps.google.com\/maps","docs_abuse_link":"https:\/\/docs.google.com\/abuse?id=0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj","enable_csi":true,"csi_service_name":"texmex","third_party_default_icon_urls":{"icon16":"\/\/ssl.gstatic.com\/docs\/doclist\/images\/generic_app_icon_16.png","icon32":"\/\/ssl.gstatic.com\/docs\/doclist\/images\/generic_app_icon_32.png","icon64":"\/\/ssl.gstatic.com\/docs\/doclist\/images\/generic_app_icon_64.png","icon128":"\/\/ssl.gstatic.com\/docs\/doclist\/images\/generic_app_icon_128.png"},"enable_chrome_webstore_link":true};</script><script type="text/javascript">(function(){(function(){function e(a){this.t={};this.tick=function(a,c,b){var d=void 0!=b?b:(new Date).getTime();this.t[a]=[d,c];if(void 0==b)try{window.console.timeStamp("CSI/"+a)}catch(e){}};this.tick("start",null,a)}var a;window.performance&&(a=window.performance.timing);var f=a?new e(a.responseStart):new e;window.jstiming={Timer:e,load:f};if(a){var c=a.navigationStart,d=a.responseStart;0<c&&d>=c&&(window.jstiming.srt=d-c)}if(a){var b=window.jstiming.load;0<c&&d>=c&&(b.tick("_wtsrt",void 0,c),b.tick("wtsrt_",
"_wtsrt",d),b.tick("tbsd_","wtsrt_"))}try{a=null,window.chrome&&window.chrome.csi&&(a=Math.floor(window.chrome.csi().pageT),b&&0<c&&(b.tick("_tbnd",void 0,window.chrome.csi().startE),b.tick("tbnd_","_tbnd",c))),null==a&&window.gtbExternal&&(a=window.gtbExternal.pageT()),null==a&&window.external&&(a=window.external.pageT,b&&0<c&&(b.tick("_tbnd",void 0,window.external.startE),b.tick("tbnd_","_tbnd",c))),a&&(window.jstiming.pt=a)}catch(g){}})();})();
</script><link rel="stylesheet" href="/static/file/client/css/1508097430-edit_css_ltr.css">
<link rel="shortcut icon" href="https://ssl.gstatic.com/docs/doclist/images/icon_11_pdf_favicon.ico"><link rel="chrome-webstore-item" href="https://chrome.google.com/webstore/detail/apdfllckaahabafndbhieahigkjlhalf"></head><body dir="ltr" role="application" onload='_onload()'itemscope itemtype="http://schema.org/CreativeWork/DocumentObject"><noscript><div class="docs-butterbar-container"><div class="docs-butterbar-wrap"><div class="jfk-butterBar jfk-butterBar-shown jfk-butterBar-warning">Die Datei kann in Ihrem Browser nicht geöffnet werden, da JavaScript nicht aktiviert ist. Aktivieren Sie JavaScript und laden Sie die Seite noch einmal.</div></div><br></div></noscript><meta itemprop="name" content="The Starfish Story (Translation in Navajo).pdf"><meta itemprop="faviconUrl" content="https://ssl.gstatic.com/docs/doclist/images/icon_11_pdf_favicon.ico"><meta itemprop="url" content="https://docs.google.com/file/d/0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj/edit"><meta itemprop="embedURL" content="https://docs.google.com/file/d/0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj/preview"><div id="docs-chrome" class="docs-vis-ref-chrome" tabindex="0"><div><div id="docs-header"><div id="docs-branding-container"class="docs-branding-default"><a title="Google Drive öffnen" href="//drive.google.com" target="_blank"><div id="docs-drive-logo"></div><div id="docs-branding-logo"></div></a></div><div id=gbar><nobr><a target=_blank class=gb1 href="https://www.google.com/webhp?tab=ow">Suche</a> <a target=_blank class=gb1 href="http://www.google.com/imghp?hl=de&tab=oi">Bilder</a> <a target=_blank class=gb1 href="https://maps.google.com/maps?hl=de&tab=ol">Maps</a> <a target=_blank class=gb1 href="https://play.google.com/?hl=de&tab=o8">Play</a> <a target=_blank class=gb1 href="https://www.youtube.com/?tab=o1">YouTube</a> <a target=_blank class=gb1 href="https://news.google.com/nwshp?hl=de&tab=on">News</a> <a target=_blank class=gb1 href="https://mail.google.com/mail/?tab=om">Gmail</a> <b class=gb1>Drive</b> <a target=_blank class=gb1 style="text-decoration:none" href="http://www.google.com/intl/de/options/"><u>Mehr</u> &raquo;</a></nobr></div><div id=guser width=100%><nobr><span id=gbn class=gbi></span><span id=gbf class=gbf></span><span id=gbe><a  target='_blank' href="https://docs.google.com/abuse?id=0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj" class=gb4>Missbrauch melden</a> | </span><a  target='_blank' href="https://docs.google.com/settings" class=gb4>Einstellungen</a> | <a target=_top id=gb_70 href="https://www.google.com/accounts/ServiceLogin?service=wise&passive=1209600&continue=https://docs.google.com/file/d/0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj/edit&followup=https://docs.google.com/file/d/0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj/edit" class=gb4>Anmelden</a></nobr></div><div class=gbh style=left:0></div><div class=gbh style=right:0></div><div style="clear:both"></div><div id="docs-titlebar-container"><div id="docs-titlebar"><div class="docs-title-outer"><div class="docs-title-widget goog-inline-block" id="docs-title-widget"><span class="docs-title" id="docs-title" role="button"><div class="docs-title-inner" id="docs-title-inner">The Starfish Story (Translation in Navajo).pdf</div></span></div><div class="docs-star-container goog-inline-block"><div id="docs-star" class="goog-inline-block" style="display:none"></div></div><div class="docs-activity-indicator-container goog-inline-block"></div></div></div><div class="docs-titlebar-buttons"><div id="docs-presence-container" class="goog-inline-block docs-titlebar-button"><div id="docs-presence" class="goog-inline-block"></div><div role="button" id="docs-chat" class="goog-inline-block jfk-button jfk-button-standard jfk-button-narrow docs-chat jfk-button-disabled" aria-disabled="true" style="display: none"><div class="docs-icon goog-inline-block "><div class="docs-icon-img-container docs-icon-img docs-icon-chat">&nbsp;</div></div></div></div><div class="goog-inline-block"><div role="button" id="docs-docos-commentsbutton" class="goog-inline-block jfk-button jfk-button-standard docs-titlebar-button jfk-button-disabled" aria-disabled="true">Kommentare</div><div id="docs-docos-caret" style="display: none"><div class="docs-docos-caret-outer"></div><div class="docs-docos-caret-inner"></div></div></div><span vsjson="{&quot;role&quot;:20,&quot;summary&quot;:&quot;Jeder, der über den Link verfügt&quot;,&quot;visibilityState&quot;:&quot;unlisted&quot;,&quot;restrictedToDomain&quot;:false,&quot;visibilityEntries&quot;:[{&quot;role&quot;:20,&quot;summary&quot;:&quot;Jeder, der über den Link verfügt&quot;,&quot;visibilityState&quot;:&quot;unlisted&quot;,&quot;restrictedToDomain&quot;:false,&quot;details&quot;:&quot;Alle Nutzer, die über den Link verfügen, sind zum Zugriff berechtigt. Es ist keine Anmeldung erforderlich.&quot;}],&quot;restrictedToSingleUserScope&quot;:false}" id="docs-titlebar-share-client-button" class="goog-inline-block"><div role="button" class="goog-inline-block jfk-button jfk-button-action docs-titlebar-button jfk-button-disabled" aria-disabled="true"><span class="goog-inline-block apps-share-sprite scb-button-icon  scb-unlisted-icon-white">&nbsp;</span>Freigeben</div></span></div></div></div><div class="docs-butterbar-container"><div class="docs-butterbar-wrap"><div class="jfk-butterBar jfk-butterBar-shown jfk-butterBar-info">Der von Ihnen verwendete Browser wird nicht mehr unterstützt. Einige Funktionen sind daher möglicherweise nicht wie gewünscht verfügbar. Führen Sie ein Upgrade auf einen <a href="http://whatbrowser.org" target="_blank" class="docs-butterbar-link-no-pad">modernen Browser</a> wie <a href="https://www.google.com/chrome/?&brand=CHVN&utm_campaign=en&utm_source=en-et-na-us-docs-ug&utm_medium=et" target="_blank" class="docs-butterbar-link-no-pad">Google Chrome</a> aus.<a href="#" onclick="this.parentNode.parentNode.removeChild(this.parentNode);return false;" class="docs-butterbar-link">Schließen</a></div></div><br></div></div><div id="docs-bars"><div id="docs-menubars"><div id="docs-menubar" role="menubar" class="docs-menubar goog-container goog-container-horizontal" tabIndex="0"><div id="docs-file-menu" role="menuitem" class="menu-button goog-control goog-control-disabled goog-inline-block">Datei</div><div id="docs-edit-menu" role="menuitem" class="menu-button goog-control goog-control-disabled goog-inline-block">Bearbeiten</div><div id="docs-view-menu" role="menuitem" class="menu-button goog-control goog-control-disabled goog-inline-block">Ansicht</div><div id="docs-help-menu" role="menuitem" class="menu-button goog-control goog-control-disabled goog-inline-block">Hilfe</div></div><div id="docs-chat-message-a11y" aria-live="polite" class="docs-offscreen" style="height: 0; width: 0; overflow: hidden"></div><div id="docs-presence-menubar"></div></div></div><div id="docs-help-anchor-wrapper"><div id="docs-help-anchor"></div><div id="docs-help-anchor-right"></div></div><div id="docs-additional-bars"></div></div><div id="docs-editor-container" class="docs-vis-ref-editor-container"><div id="docs-editor" tabindex="1" ><iframe id="gview-embed-content"class="gview-embed-iframe"src="https://docs.google.com/viewer?srcid=0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj&amp;pid=explorer&amp;efh=false&amp;a=v"></iframe></div></div><script type="text/javascript" src="/static/file/client/js/612129255-edit_core__de.js"></script>
<script>DOCS_initializeModules({"core":[],"app":["core"]},{"core":["\/static\/file\/client\/js\/612129255-edit_core__de.js"],"app":["\/static\/file\/client\/js\/4052761810-edit_app__de.js"]}, 'core');</script><script type="text/javascript">_main('\/file\/d\/0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj', {'sid': '48231c7ba8cb2d29','id': '0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj', 'email': '', 'title': 'The Starfish Story (Translation in Navajo).pdf', 'description': '', 'mimetype': 'application\/pdf', 'fileExtension': 'pdf', 'mediaType': 'pdf', 'revisions': [{"tags":[],"creatorDisplayName":"Terry Teller","pinned":true,"filename":"The Starfish Story (Translation in Navajo).pdf","downloadUrl":"https:\/\/docs.google.com\/uc?id=0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj&export=download&revid=0BxcsLDhZbUBBQmJUS0dhMVV4YWZmZStPa05xWlgxd3ZzaVJrPQ","sizeInBytes":33421,"docId":"0BxcsLDhZbUBBQmJUS0dhMVV4YWZmZStPa05xWlgxd3ZzaVJrPQ","creationDateString":"09.01.12","creator":{"isMe":false,"nickname":"Terry Teller","iconUrl":"images\/doclist\/contact_nopicture.png","editProfileUrl":"editProfile"}}],'obfuscatedUserId': 'ANONYMOUS_17612595759507348808','userDomain': '', 'embedPreviewUri': 'https:\/\/docs.google.com\/file\/d\/0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj\/preview','syncUpdates': [],'contentRenderer': 'gviewembed'},{"description":{"raw":"","formatted":""},"download":{"isMissingBlobRef":false,"filename":"The Starfish Story (Translation in Navajo).pdf","url":"https:\/\/docs.google.com\/uc?id=0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj&export=download"},"revision":{"swfUrl":"\/static\/doclist\/client\/css\/1531528182-uploaderapi.swf","busyIconImageUrl":"https:\/\/ssl.gstatic.com\/docs\/doclist\/images\/loading_small.gif"},"sharing":{"is_private":false,"visibility_is_restricted_to_domain":false,"visibility_domain_display_name":""},"basicdetails":{"mimeType":"application\/pdf","lastModifiedDateString":"28.06.12","creationDateString":"09.01.12","fileSize":"33421"},"thumbnail":{"thumbnail_128":"https:\/\/lh5.googleusercontent.com\/lpPbLs1Ej7u889Xoa2e15WTjJJ1nQZMZYEfYlE5tIq-kyhOLFz-33NfIxbrTFLcVA4YyPU6cpVkdhUaXG30aCt3u0nKvWVZw3xdt4A=s128","thumbnail_full":"https:\/\/lh5.googleusercontent.com\/lpPbLs1Ej7u889Xoa2e15WTjJJ1nQZMZYEfYlE5tIq-kyhOLFz-33NfIxbrTFLcVA4YyPU6cpVkdhUaXG30aCt3u0nKvWVZw3xdt4A=s1600"},"gviewembed":{"url":"https:\/\/docs.google.com\/viewer?srcid=0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj&pid=explorer&efh=false&a=v","embeduri":"https:\/\/docs.google.com\/viewer?srcid=0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj&pid=explorer&efh=false&a=v&chrome=false&embedded=true","nonredirectedgviewurl":"https:\/\/docs.google.com\/viewer?srcid=0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj&pid=explorer&efh=false&a=v&chrome=true&redirect=false","isNativeGView":false},"webstoreui":{"mimeType":"application\/pdf","fileExtension":"pdf","moreDriveAppsUrl":"https:\/\/chrome.google.com\/webstore\/category\/collection\/drive_apps"}});</script></body></html>

I've tried using urllib and I've got:

>>> import urllib, codecs
>>> urllib.urlretrieve('https://docs.google.com/file/d/0BxcsLDhZbUBBMWY1MzRkZGQtMjQxNC00NzQ3LWFmNzEtNzNmMzYzYmU2MDRj/edit')
('/tmp/tmpQ5tDwR', <httplib.HTTPMessage instance at 0x16fbbd8>)
>>> codecs.open('/tmp/tmpQ5tDwR','r').read()

I got this this output: http://pastebin.com/D2FM1VMU

Community
  • 1
  • 1
alvas
  • 115,346
  • 109
  • 446
  • 738
  • 4
    It looks like you're not actually using the Google Docs API, but trying to talk to Google Docs as a browser. That's a bad idea for multiple reasons. For one thing, it means Google will try to do "friendly" things that you don't want, like give you a HTML/JS-based PDF viewer instead of the PDF itself, which is exactly what you don't want. For another, it probably violates the terms of service. – abarnert Dec 16 '13 at 21:02

2 Answers2

3

The right answer here is to use the Google Drive API to access your documents, instead of trying to write a script that talks to Google Docs like a normal user-facing web browser.

The way you're doing things, Google thinks you want to view the page. And, since you don't look like a browser that can natively view PDFs, it's being nice to you and creating an HTML viewer page to let you read the PDF. That viewer page does have a "download" feature, and you could try to parse the HTML and JavaScript and trigger the download, but that's a lot of work.

Also, I've willing to bet the terms of service for Google Drive specifically disallow scripting and scraping the web interface.

The API does require you to create an API key, and you may also need to OAuth to handle logging in as the right user. But once you do that, it's as easy to use as what you're trying to do—and it actually works. You make a Files: get request to fetch information about the file from its ID (the long string of garbage from your existing attempt), which includes a downloadUrl field, and you just fetch that URL. Something like this, in pure stdlib:

url = 'https://www.googleapis.com/drive/v2/files/' + fileid
r = urllib2.urlopen(url)
filesinfo = json.load(r)
downloadurl = filesinfo['downloadUrl']
r2 = urllib2.urlopen(downloadurl)
data = r2.read()

requests will make your life a tiny bit easier when you start adding the API key and possibly the OAuth—e.g., you can just pass {'key': API_KEY} instead of calling urllib.urlencode on the dict to add it as a query string.

The Google API Client Library for Python will make things even easier—you can see the sample code right there on the docs page.

abarnert
  • 354,177
  • 51
  • 601
  • 671
0
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from oauth2client.service_account import ServiceAccountCredentials
import io

credentials = ServiceAccountCredentials.from_json_keyfile_name(
    'gdrive.json', [
        'https://spreadsheets.google.com/feeds',
        'https://www.googleapis.com/auth/drive'
    ]
)
drive_service = build('drive', 'v3', credentials=credentials)

file_id = 'abc123'
request = drive_service.files().export_media(fileId=file_id, mimeType='application/pdf')
#fh = io.BytesIO() # this can be used to keep in memory
fh = io.FileIO('file.tar.gz', 'wb') # this can be used to write to disk
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))

https://developers.google.com/drive/api/v3/manage-downloads#download_a_document

Steve Robbins
  • 13,672
  • 12
  • 76
  • 124