I've been trying to scrape 2020 California election results out of PDFs, purely for my own morbid curiosity.

I need to scrape many tables, each of which spans many pages. In some cases the rows continue onto the next page, and in others additional columns for the same rows appear on later pages. I've included a link to one example. I'm comfortable with R, but I can also use Python if that is better suited to scraping. I haven't found many resources, for either language, on how to handle tables that carry over onto additional pages. I need to get these tables into CSV or XLSX format.

Thank you in advance!

In this example, pages 15-28 should be one table: https://www.co.tehama.ca.us/images/images/Elections/StatementOfVotesCastNOV2020v2excel.pdf

pkpto39

1 Answer

I was able to get the entire table using the following procedure.

  1. Open the PDF in MS Word - not Adobe Acrobat. Word will convert the document.
  2. After the conversion has completed, select all and copy. (Both the conversion and the selection may take some time.)
  3. Paste into a blank Excel worksheet. Save and enjoy.
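
If the pasted result (or the output of any other extraction tool) still comes out as one piece per page, the pieces can be stitched back together programmatically. The following is only a minimal Python sketch, not part of the procedure above. It assumes each page has already been reduced to a header plus a list of rows, that the first column (a precinct name, say) uniquely identifies a row, and that this column has the same name on every page. The function names `stitch_pages` and `to_csv` and the sample values are made up for illustration.

```python
import csv
import io


def stitch_pages(fragments, key_index=0):
    """Merge per-page table fragments into one table.

    fragments: list of (header, rows) pairs, one per page, in page order.
    Rows are matched by the value in column `key_index`, so a page that
    repeats the header simply adds new rows, while a page with a different
    header adds new columns to rows that were already seen.
    """
    merged_header = []
    merged = {}  # key value -> {column name: cell}, in first-seen order
    for header, rows in fragments:
        for name in header:
            if name not in merged_header:
                merged_header.append(name)
        for row in rows:
            record = merged.setdefault(row[key_index], {})
            record.update(zip(header, row))
    table = [
        [record.get(name, "") for name in merged_header]
        for record in merged.values()
    ]
    return merged_header, table


def to_csv(header, rows):
    """Render the stitched table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up fragments: pages 1-2 continue the rows, page 3 adds a column.
pages = [
    (["Precinct", "Registered", "Ballots"],
     [["P1", "100", "80"], ["P2", "200", "150"]]),
    (["Precinct", "Registered", "Ballots"],
     [["P3", "50", "40"]]),
    (["Precinct", "Candidate X"],
     [["P1", "30"], ["P2", "70"], ["P3", "15"]]),
]
header, rows = stitch_pages(pages)
print(to_csv(header, rows))
```

Matching on a key column rather than on row position means a page with a missing or extra row leaves a blank cell instead of silently shifting every value below it.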
G5W
  • Thank you. I will give this a shot, but it doesn't appear to work on all of my PDFs, so I will still need help scraping the others. – pkpto39 Dec 06 '20 at 20:03
  • Can you provide an example of one on which this process does NOT work? – G5W Dec 06 '20 at 20:12
  • Yes. In this case, it copies the PDF into Excel as an image. Am I doing something wrong? In this example, pages 2-5 should be one table. https://www.countyofglenn.net/sites/default/files/Statement%20of%20Vote%2011032020.pdf – pkpto39 Dec 06 '20 at 20:41
  • No, you are not doing anything wrong. That PDF contains an image - not text. There are free online services that will perform OCR to convert to text, but they often do a poor job of keeping the alignment for tables. However, you could try [OCR](https://www.onlineocr.net/) or google for others. – G5W Dec 06 '20 at 21:18
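
When an OCR service returns plain text, the alignment problem mentioned in the last comment can at least be detected automatically. The sketch below is a heuristic only. It assumes the OCR output separates columns with runs of two or more spaces, which will not hold for every service or every PDF. The names `split_ocr_line` and `parse_ocr_table` and the sample text are hypothetical.

```python
import re


def split_ocr_line(line, min_gap=2):
    """Split one line of OCR text into cells on runs of >= min_gap whitespace."""
    return re.split(r"\s{%d,}" % min_gap, line.strip())


def parse_ocr_table(text, min_gap=2):
    """Return (header, good_rows, suspect_lines) for a block of OCR text.

    A row is 'suspect' when its cell count differs from the header's, which
    usually means the OCR step merged or split a column. Line numbers count
    non-blank lines only, starting at 1 for the header.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = split_ocr_line(lines[0], min_gap)
    good, suspect = [], []
    for number, line in enumerate(lines[1:], start=2):
        cells = split_ocr_line(line, min_gap)
        if len(cells) == len(header):
            good.append(cells)
        else:
            suspect.append((number, line))
    return header, good, suspect


# Made-up OCR output: the last row has lost the gap between two numbers.
sample = (
    "Precinct   Registered   Ballots\n"
    "Precinct 1   100   80\n"
    "Precinct 2   200 150\n"
)
header, good, suspect = parse_ocr_table(sample)
print(header)
print(good)
print(suspect)
```

The suspect lines can then be fixed by hand against the original PDF, rather than rechecking every row.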