
I'm thinking about the best way to structure a large App Engine site (1M+ URLs).

I need a sitemaps.xml file in the root path of the domain that links to sitemap[n].xml files.

The sitemaps.xml file can link to up to 1,000 sitemap[n].xml files, and each of those sitemap[n].xml files can have up to 50K URLs.

Is there a way to dynamically generate the files with the 50K URLs?

Any other way to do it without fetching 50K entities at a time?

Thanks!

PS: The files cannot be static because they have to be placed in the root path of the domain :(

ana

2 Answers


Your best bet is to generate them ahead of time. Maybe run a map-reduce over your data and store each sitemap[n].xml as a blob in a separate datastore entity. Then the handler (mapped from - url: /sitemap(.*) ) simply returns the blob from the corresponding entity.
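
A minimal sketch of what that handler could look like (Python 2 webapp; the Sitemap model and its key_name scheme are made-up names for illustration, assuming its xml blob was filled in ahead of time, e.g. by the map-reduce job):

    from google.appengine.ext import db, webapp
    from google.appengine.ext.webapp.util import run_wsgi_app

    class Sitemap(db.Model):
        # Pre-generated sitemap[n].xml contents; key_name is 'sitemap<n>'.
        xml = db.BlobProperty()

    class SitemapHandler(webapp.RequestHandler):
        def get(self, n):
            # Look up the pre-generated sitemap for this index and return it as-is.
            entity = Sitemap.get_by_key_name('sitemap%s' % n)
            if entity is None:
                self.error(404)
                return
            self.response.headers['Content-Type'] = 'application/xml'
            self.response.out.write(entity.xml)

    application = webapp.WSGIApplication([(r'/sitemap(\d+)\.xml', SitemapHandler)])

    def main():
        run_wsgi_app(application)

    if __name__ == '__main__':
        main()

Keep in mind that a single datastore entity is capped at 1 MB, so a 50K-URL sitemap may need to be split or compressed before it is stored this way.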

All of this really depends on how your URLs are stored and/or generated.

You could also generate all the URLs offline and put them in one huge file. Upload that file to the Blobstore along with a record of the offsets for each group of 50K URLs in that file. In the handler, simply read the corresponding group of 50K URLs from the Blobstore.
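
A rough sketch of that variant (again Python 2 webapp; the SitemapOffsets model, its fields, and the choice to keep the offsets in a small datastore entity rather than a second file are all assumptions for illustration):

    from google.appengine.ext import blobstore, db, webapp

    class SitemapOffsets(db.Model):
        # Metadata written when the big URL file was uploaded to the Blobstore.
        blob_key = blobstore.BlobReferenceProperty()
        offsets = db.ListProperty(long)   # start byte of the nth 50K-URL chunk
        lengths = db.ListProperty(long)   # byte length of the nth chunk

    class SitemapChunkHandler(webapp.RequestHandler):
        def get(self, n):
            n = int(n)
            meta = SitemapOffsets.all().get()
            if meta is None or n >= len(meta.offsets):
                self.error(404)
                return
            # Seek straight to the nth chunk instead of reading the whole blob.
            reader = blobstore.BlobReader(meta.blob_key.key(),
                                          position=meta.offsets[n])
            urls = reader.read(meta.lengths[n])
            self.response.headers['Content-Type'] = 'application/xml'
            self.response.out.write(
                '<?xml version="1.0" encoding="UTF-8"?>\n'
                '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
                '%s</urlset>' % urls)

This assumes each chunk in the big file already contains ready-made <url> entries, so the handler only has to wrap them in the urlset envelope.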

Also realize that it's probably not that useful (with respect to SEO) to have such huge sitemaps.

Amir
  • There's really no reason to upload a single blob, then serve parts of it - just upload one blob per file to be served and serve them directly, instead. – Nick Johnson Jan 31 '11 at 00:07
  • Agreed, but it all depends on your workflow. If you are given a huge file with all the urls, you can simply read 50k urls and send them, and remember the spot for the next 50k. Then you don't need to do any preprocessing ahead of time. But... You are right...we really don't have enough info from the question to give a good answer. – Amir Jan 31 '11 at 04:18

Why can't you add an entry to your app.yaml that maps the URL to wherever the files actually live? robots.txt has to be served from the root, but I keep the file itself in /img:

- url: /robots.txt  
  static_files: img/robots.txt
  upload: img/robots.txt

To any crawler it looks exactly the same.
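
The same trick should work for the sitemap files themselves; here is a sketch, where the sitemaps/ directory and the sitemap_handler.py script are made-up names for wherever you actually keep or generate them:

    - url: /sitemaps\.xml
      static_files: sitemaps/sitemaps.xml
      upload: sitemaps/sitemaps.xml

    - url: /sitemap(\d+)\.xml
      script: sitemap_handler.py

The crawler only ever sees the root-level URLs, regardless of where the content actually comes from.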

mcotton