
I have thousands of web pages (they require login with a username and password) like https://XXX.incometax.XXX/Preview/ViewDetail?TIN_INFO_NO=11935#, where only the trailing digits (11935 in this example) change for each URL. Each URL retrieves tax information for a taxpayer in different types of tables. The tables are served based on the information entered in the system for each taxpayer: for example, the information table shows a National Identity Card (NID) number for those who created their electronic Taxpayer's Identification Number (eTIN) using an NID, and a passport number for those who created their eTIN using a passport. The bottom line is that the information table differs from taxpayer to taxpayer. I need an automation that extracts those tables in such a way that every newly found column is created and each piece of data is placed under its respective column.
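In case it helps to pin down what I mean by "automation", the fetching step would look roughly like the following minimal Python sketch. The login endpoint, the form-field names, and the two-column label/value table layout are my assumptions, not the site's real ones, and would need adapting:

    import requests
    from bs4 import BeautifulSoup

    BASE = "https://XXX.incometax.XXX/Preview/ViewDetail?TIN_INFO_NO={}"

    session = requests.Session()
    # Hypothetical login endpoint and form fields -- adjust to the real site.
    session.post("https://XXX.incometax.XXX/login",
                 data={"username": "USER", "password": "PASS"})

    def fetch_record(tin_no):
        """Fetch one taxpayer page and return its table as a {label: value} dict."""
        resp = session.get(BASE.format(tin_no))
        resp.raise_for_status()
        record = {"TIN_INFO_NO": tin_no}
        soup = BeautifulSoup(resp.text, "html.parser")
        # Assumes a two-column label/value table; the real markup may differ.
        for row in soup.select("table tr"):
            cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
            if len(cells) == 2:
                record[cells[0]] = cells[1]
        return record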

For example, suppose a taxpayer can create an eTIN using either an NID or a passport number, but not both. Say that on the first pass the automation finds NID information and on the second pass it finds passport information; it should then create a new column named Passport and place the respective information under it. If on the third pass it finds NID information again, it should place that information under the NID column created on the first pass. Finally, the automation should generate a single CSV file.
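In other words, the final CSV header is the union of every column ever seen, with a blank cell wherever a taxpayer has no value for a column. Sketched in Python (again only an illustration; the file name and ID range are placeholders):

    import csv

    def write_csv(records, path="taxpayers.csv"):
        # Build the header as the union of all columns, in order of discovery,
        # so a column such as Passport is appended the first time it appears.
        fieldnames = []
        for rec in records:
            for key in rec:
                if key not in fieldnames:
                    fieldnames.append(key)
        with open(path, "w", newline="", encoding="utf-8") as fh:
            # restval="" leaves a blank cell where a taxpayer lacks a column.
            writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
            writer.writeheader()
            writer.writerows(records)

    # records = [fetch_record(n) for n in range(11000, 12000)]
    # write_csv(records)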

N.B. There are no legal restrictions on my extracting information from that site. I would prefer a non-programmatic solution.

  • I think this might be more appropriate for stackoverflow.com. If I understand your question, then what you need to do is something like this (assuming it is SQL based): (a) select * from the table; (b) then, if COLUMN1 is not empty, display it. This way only the columns with information will be displayed. – Tux_DEV_NULL Oct 16 '17 at 10:32
  • @Tux_DEV_NULL Ah! I do not have a database to run SQL commands on; rather, I am building a database from plain HTML web pages! – Learner Oct 16 '17 at 10:36
  • Ah, then that is tricky. We need some sort of web-scraping tool, depending on what API you use; for instance, Perl has a few (Mojo::UserAgent is one). – Tux_DEV_NULL Oct 16 '17 at 10:45
  • You need to make a program for this. There isn't any ready-made tool for a problem like this, since each scraping scenario is highly individual. – Tero Kilkanen Oct 17 '17 at 06:14

0 Answers