I need to migrate a digital repository to a new platform, but I lack access to the old platform, so I have resorted to retrieving the objects over the web.
Some objects contain other objects. For most objects of this type, identifying and retrieving the components and their metadata is straightforward. For some PDF files, however, the components appear to be references to individual pages within a single file rather than separate files.
For example, http://content.wwu.edu/cdm4/document.php?CISOROOT=/wfront&CISOPTR=2711 gives me an object with 4 pages. http://content.wwu.edu/cgi-bin/showfile.exe?CISOROOT=/wfront&CISOPTR=2711&CISOMODE=print lets me retrieve the entire document. http://content.wwu.edu/cgi-bin/showfile.exe?CISOROOT=/wfront&CISOPTR=2711 returns an XML document listing the identifiers of the component pages, but when I try to curl those identifiers, I get zero-length responses. The same method returns actual files when non-PDF documents are involved, which is why I think the PDF components are only page references within one file rather than separate files.
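For reference, the requests described above look roughly like this as a minimal Python sketch. The helper names (`showfile_url`, `fetch`, `empty_pages`) are my own and not part of the platform, and the list of page identifiers has to be filled in by hand from the XML, since I am not assuming anything about its structure here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the URLs above.
BASE = "http://content.wwu.edu/cgi-bin/showfile.exe"


def showfile_url(collection, pointer, mode=None):
    """Build a showfile.exe URL for one object (hypothetical helper)."""
    params = {"CISOROOT": collection, "CISOPTR": pointer}
    if mode:
        params["CISOMODE"] = mode
    # safe="/" keeps the slash in "/wfront" unescaped, matching the URLs above.
    return BASE + "?" + urlencode(params, safe="/")


def fetch(url):
    """Return the raw response body for a URL."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def empty_pages(collection, page_pointers):
    """Return the page identifiers whose response body is zero length."""
    return [
        p for p in page_pointers
        if len(fetch(showfile_url(collection, p))) == 0
    ]


if __name__ == "__main__":
    # Whole document (this works for me):
    print(showfile_url("/wfront", 2711, mode="print"))
    # XML listing the component pages:
    print(showfile_url("/wfront", 2711))
    # Fill in with the page identifiers read from that XML.
    page_pointers = []
    print(empty_pages("/wfront", page_pointers))
```

For the PDF object above, every page identifier I try ends up in the list returned by `empty_pages`, whereas for non-PDF objects the list comes back empty.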
How can I retrieve the individual pages, as I must store these as individual objects in the new platform? Thanks