I have downloaded a website using website copier software, and I want to extract some information from all of its pages.

Suppose there are many product pages and I want to gather only the product information from every page and store it in an Excel file.

I want to know the possible ways of doing this. My friend told me that he could write a script to do it, but I don't understand how a script can handle the whole task.

Is there any free software or code that can do this job? I know Java pretty well; if this can be done by writing Java code, please provide some guidance.

Abhinav

2 Answers

You probably want JavaScript rather than Java here: the product pages are web pages, so a browser-native language is likely to be the more comfortable fit. If it were me, I'd approach it this way:

1 - Write a master JS script that loads all the pages, one at a time.

2 - For each page, select the product information (probably with something like $('#productID'), etc.)

3 - Put the results into JSON format and export them to CSV with a third-party library (or write the conversion code yourself). An example of one such library: http://www.zachhunter.com/2011/06/json-to-csv/

Lim H.
  • Can you please elaborate on the 1st step or provide an example? I don't have much knowledge of or experience with JS. – Abhinav Dec 23 '12 at 09:32
  • First, you need a JS library called jQuery. Second, suppose you store the pages at `home/page1.html`, `home/page2.html`, etc. To load the content of `#productID` from each page into a `#jsonResult` div in your `result.html`, you just need to put this in your `result.html`: `$('#jsonResult').load('home/page1.html div#productID')`. That's the general idea. Of course, you still need to convert the content to JSON as well. Ref: http://api.jquery.com/load/ – Lim H. Dec 23 '12 at 09:44
  • Thank you, I am now able to fetch data in result.html. I have one last problem: I have hundreds of product pages. How can I extract data from all of them at once, or have it fetch the data automatically from one page after another? – Abhinav Dec 23 '12 at 13:07
  • This is not a JS question. It's a regex question. It really depends on how the pages are named. As a dummy example, if they are stored as page1, page2, etc. like I mentioned above, you just need a for loop that calls `load('home/page' + i + '.html div#productID')` with i as the loop index. – Lim H. Dec 23 '12 at 13:50
  • If the naming scheme is random and there is no way to parse it, you can use a server-side language, say Java, to rename them or at least to iterate through the directory, but I think this is getting too clumsy. http://docs.oracle.com/javase/tutorial/essential/io/find.html – Lim H. Dec 23 '12 at 13:57
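
The directory iteration mentioned in the last comment might look roughly like the sketch below. It uses only the standard `java.nio.file` API and assumes all downloaded pages sit directly inside one folder and end in `.html`; the class name `HtmlFileLister`, the method name `listHtmlFiles`, and the default folder `home` (borrowed from the comments above) are placeholders for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class HtmlFileLister {

    /** Returns every entry named *.html directly inside dir, sorted by path. */
    public static List<Path> listHtmlFiles(Path dir) throws IOException {
        List<Path> pages = new ArrayList<>();
        // The glob matches on the file name only, so the pages do not
        // need to follow a page1, page2, ... naming scheme.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.html")) {
            for (Path page : stream) {
                pages.add(page);
            }
        }
        Collections.sort(pages);
        return pages;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the folder is passed as the first argument,
        // falling back to the example folder "home".
        Path dir = Paths.get(args.length > 0 ? args[0] : "home");
        for (Path page : listHtmlFiles(dir)) {
            System.out.println(page);
        }
    }
}
```

Each returned path can then be handed to whatever code extracts the product information, so the file names never have to be guessed or renamed.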
Take a look at jsoup, a Java library for parsing HTML documents.

You'll find plenty of documentation on their website.

You will want to learn about CSS selectors to pick specific elements out of the document; for examples, see http://jsoup.org/cookbook/extracting-data/selector-syntax

Then write the collected data as comma-separated values (CSV) to a text file that you can open in Excel.
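
The CSV step could be sketched with plain Java as follows. This is only an illustration under assumptions: the class name `CsvExport`, the output file `products.csv`, and the sample rows are made up, and the quoting follows the common convention of wrapping a field in double quotes when it contains a comma, a quote, or a line break.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class CsvExport {

    /** Quotes a field if it contains a comma, a double quote, or a line break. */
    public static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            // Embedded quotes are doubled, then the whole field is wrapped.
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Joins the escaped fields of one row with commas. */
    public static String toCsvLine(List<String> fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(escape(fields.get(i)));
        }
        return line.toString();
    }

    /** Writes all rows to the given file as UTF-8, one row per line. */
    public static void write(Path out, List<List<String>> rows) throws IOException {
        try (PrintWriter writer =
                 new PrintWriter(Files.newBufferedWriter(out, StandardCharsets.UTF_8))) {
            for (List<String> row : rows) {
                writer.print(toCsvLine(row));
                writer.print("\r\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data standing in for the extracted product info.
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("name", "price"),
            Arrays.asList("Desk, \"large\"", "199.00"));
        write(Paths.get("products.csv"), rows);
    }
}
```

In practice, each row would be filled with the values selected from one product page instead of the hard-coded sample.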

akuhn