
I am currently scraping a website that provides tables of data. The structure is as follows:

<table>
  <tr> <!-- this is the first row -->
    <td> data 1 </td>
    .....
  </tr>
  ....
</table>

Let's say each table ends up having around 20 rows and 10 columns. My script has to go from one table to the next, and there are between 100 and 1000 tables.

So, with XPath I locate each row, insert its data into 2 database tables, and move on to the next one. In pseudocode:

for table in tables:  # between 100 and 1000 tables
  for row in table:
    # get each td tag from the row and build a list
    # insert half of the data into table 1 and get the id of the inserted row
    # insert the other half into table 2 along with that id, to link both rows
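To make the structure concrete, here is a rough runnable version of that pseudocode. The sqlite3 database, the table names (table1/table2), the placeholder URL and the 7/8 column split are just assumptions for illustration, not my real schema:

    import sqlite3
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get('https://example.com/page-with-tables')  # placeholder URL

    conn = sqlite3.connect('scrape.db')  # assumed database
    cur = conn.cursor()

    tables = driver.find_elements_by_xpath('//table')   # between 100 and 1000 tables
    for table in tables:
        for row in table.find_elements_by_xpath('.//tr'):
            # build the list of cell texts for this row
            cells = [td.text for td in row.find_elements_by_xpath('.//td')]
            # first half into table 1, keeping the id of the inserted row
            cur.execute('INSERT INTO table1 (c1, c2, c3, c4, c5, c6, c7) '
                        'VALUES (?, ?, ?, ?, ?, ?, ?)', cells[:7])
            row_id = cur.lastrowid
            # second half into table 2, linked to table 1 via that id
            cur.execute('INSERT INTO table2 (table1_id, c8, c9, c10, c11, c12, c13, c14, c15) '
                        'VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)', [row_id] + cells[7:])
    conn.commit()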

I've been timing it to see where all that time goes, and I got the following:

Overall time for one table: 16 seconds
  Getting the data and building the list for one row: 0.453 s

  Inserting data in table 1: 0.006 s
  Inserting data in table 2: 0.0067 s

This means that scraping all 1000 tables would take more than 10 hours, which is way too much, considering that when I used Beautiful Soup the overall time was between half an hour and an hour and a half.

Since the problem is in getting the text from each td tag in each row, is there any way to speed it up? Essentially, what I am doing in that part of the script is:

data_in_position_1 = row.find_element_by_xpath('.//td[1]').text
.....
data_in_position_15 = row.find_element_by_xpath('.//td[15]').text

row_data = [data_in_position_1, ....., data_in_position_15]

return row_data
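For what it's worth, one variant I could try is collapsing those 15 separate XPath lookups into a single find_elements call per row. A rough, untested sketch:

    def get_row_data(row):
        # one lookup fetches every cell of the row at once,
        # instead of 15 separate .//td[i] searches
        cells = row.find_elements_by_xpath('.//td')
        return [cell.text for cell in cells]

Each .text access is still a separate call to the driver, though, so I am not sure how much this alone would save.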

Well, I don't know whether scraping the whole table at once, or some different approach, will give different results, but I need some way to speed this up.
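By "scraping the whole table at once" I mean something along these lines (a rough sketch, assuming lxml; BeautifulSoup should work the same way): pull each table's HTML out of the driver in a single call and parse it locally, so there is only one round trip per table instead of one or more per cell.

    from lxml import html

    for table in driver.find_elements_by_xpath('//table'):
        source = table.get_attribute('outerHTML')   # one driver call per table
        parsed = html.fromstring(source)
        for row in parsed.xpath('.//tr'):
            # all cell texts are extracted locally, without touching the driver
            cells = [td.text_content().strip() for td in row.xpath('.//td')]
            # ... insert cells into the two database tables as before ...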

Thanks

  • Try [bulk](https://stackoverflow.com/questions/44041143/why-bulk-import-is-faster-than-bunch-of-inserts) inserting it. And also, where is your db hosted? Connection can also be the bottleneck. – Adrian Apr 21 '18 at 10:16
  • On the same computer. But anyway, I timed it and the issue was in getting the info stored in each td tag. – puppet Apr 22 '18 at 10:56

0 Answers