I'm working with articles scraped from online newspapers, using a MySQL database and Python. I want to use pandas' to_sql method on a DataFrame to append recently scraped articles to a MySQL table. It works pretty well, but I'm having a problem with the following:
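
For context, my setup looks roughly like this (the credentials, database name, and sample data below are placeholders, and I'm assuming SQLAlchemy with the pymysql driver):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; the real one points at my news database
con = create_engine("mysql+pymysql://user:password@localhost/newsdb")

# Stand-in for the scraped articles
df = pd.DataFrame({
    'title': ['Some headline'],
    'description': ['Short summary'],
    'content': ['Full article text'],
    'link': ['http://example.com/article'],
})

df.to_sql('press', con=con, index=False, if_exists='append')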

Since the articles are automatically scraped from news sites, about 1% of them have issues (encoding problems, text that is too long, and so on) and don't fit in the MySQL table fields. For some reason, pandas' to_sql method IGNORES these errors and silently discards the rows that do not fit. For example, I have the following MySQL table:

+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | int(11)      | NO   | PRI | NULL    | auto_increment |
| title        | varchar(255) | YES  |     | NULL    |                |
| description  | text         | YES  |     | NULL    |                |
| content      | text         | YES  |     | NULL    |                |
| link         | varchar(300) | YES  |     | NULL    |                |
+--------------+--------------+------+-----+---------+----------------+

And I also have a DataFrame that contains 15 rows and 4 columns (title, description, content, link).

If one of those rows has a title longer than 255 characters, it won't fit in the MySQL table. I expected an error when running df.to_sql('press', con=con, index=False, if_exists='append'), so that I would know I have a problem to fix; but the actual result was that 14 rows were appended instead of 15.
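
To make it concrete, this is roughly how I can reproduce it (the over-long title is artificial, and con is the same placeholder engine as above):

import pandas as pd
from sqlalchemy import create_engine

con = create_engine("mysql+pymysql://user:password@localhost/newsdb")

# 14 well-formed rows plus 1 row whose title exceeds varchar(255)
rows = [{'title': f'Article {i}', 'description': 'desc',
         'content': 'text', 'link': f'http://example.com/{i}'}
        for i in range(14)]
rows.append({'title': 'x' * 300, 'description': 'desc',
             'content': 'text', 'link': 'http://example.com/14'})
df = pd.DataFrame(rows)

# This is the behavior I'm seeing: no exception is raised here
df.to_sql('press', con=con, index=False, if_exists='append')

# ...but the table ends up with 14 new rows, not 15
print(pd.read_sql('SELECT COUNT(*) FROM press', con))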

This could work for me, but I need to know which row was discarded so I can flag it for later revision. Is it possible to tell pandas to let me know which indexes are ignored?
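
In the meantime, the best workaround I can think of is checking the fields myself before the insert, something like this (just a sketch; 255 mirrors the varchar(255) limit on title, and too_long is a name I made up):

import pandas as pd

df = pd.DataFrame({
    'title': ['ok title', 'x' * 300],
    'description': ['d1', 'd2'],
    'content': ['c1', 'c2'],
    'link': ['http://example.com/1', 'http://example.com/2'],
})

# Indexes of rows whose title would not fit in varchar(255)
too_long = df.index[df['title'].str.len() > 255]
print(too_long)  # these are the rows I would flag for later revision

# Then insert only the rows that pass the check, e.g.:
# df.drop(too_long).to_sql('press', con=con, index=False, if_exists='append')

But this only catches the length problem, not the encoding failures, which is why I'd rather have to_sql itself report the discarded rows.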

Thanks!

ebravo
