Questions tagged [pdf-scraping]

The process of getting data out of a PDF. This involves opening, reading, and parsing the contents of the PDF to extract text, images, metadata, or attachments.

144 questions
0 votes
1 answer

Create single page PDF from multi page PDF WITHOUT external libraries

I've seen the following question on SO: Create Multi-Page PDF from other PDFs. But it didn't answer what I need. Consider that I have a PDF with 20 pages. So far so good. From the same place, I can have a PDF with only one page. This one will be used…
paboobhzx
0 votes
2 answers

How to convert a scanned PDF file to an editable PDF file with Python?

I just need to know if we can convert a scanned PDF file to an editable PDF file using Python. I know of a couple of libraries out there, like pytesseract and pyocr. Guidance in this regard will be highly appreciated. Thanks
Umar Aftab
0 votes
0 answers

I need to scrape data from 100 Microsoft Word documents and create a table in a CSV file

I have 100s of Microsoft Word documents. Each document has the same headers. I need to be able to read the data present in those documents and create a table. Output in the form of a CSV file. I tried to use Scrapy. But I am new, and I don’t know…
0 votes
0 answers

Data Scraping from PDF

I am trying to collect data from a PDF using the R tabulizer package. However, I get an error when I try to convert the data to a data frame and export it to CSV. My code is below. Could someone help me with this? # Library packages if…
0 votes
1 answer

Extract PDF data with varying white space as separation

I'm looking at getting data from these PDFs. I'm running into a problem where location names with multiple words ("Northern Island", for example) are being put into different columns. The "sep" argument within "read.table" seems to only be able to…
0 votes
0 answers

Copy website data with SendKeys ("^a") and Paste in Excel

I have VBA code that can browse to a webpage, but SendKeys "^a" and "^c" are not working. I have tried multiple times with no luck. Please suggest a fix. I am using this code: Set DestinationWorksheet = ActiveSheet Dim myURL As String Dim HTMLdoc As HTMLDocument …
0 votes
2 answers

Is it possible for a PDF data parser to read PowerPoint PDFs?

I am currently developing a proprietary PDF parser that can read multiple types of documents with various types of data. Before starting, I was wondering whether reading PowerPoint slides is possible. My employer uses presentation guidelines that…
0 votes
1 answer

How to read a PDF up to a certain end line?

I am running a for loop over many research papers. From each document I read, I want to extract some content. How can I make R read only up to the last line, the one containing many dots, and treat it as the end line? Like in the picture below: [Numbers]…
0 votes
0 answers

pdftools - Helvetica (?) font distorts text import

I am struggling to properly read PDFs that contain the Helvetica font with the pdftools package. I am trying to extract info from about 1,000 voting records. Overall, pdftools works as intended. However, there are one or two hundred PDFs where the…
zoowalk
0 votes
1 answer

Cleaning up text data extracted from scanned .pdf

I am creating a script to extract text from a scanned PDF and build a JSON dictionary for later import into MongoDB. The issue I have run into is that using tesseract-ocr via the Textract module successfully extracted all the text, but it is…
0 votes
1 answer

No text is returned when PyPDF2 is used to scrape a one-page PDF

I have downloaded a bunch of PDFs from this source: 'http://ec.europa.eu/growth/tools-databases/cosing/index.cfm?fuseaction=search.detailsPDF_v2&id=28157'. Now I want to scrape the PDFs using PyPDF2; however, no text is returned. I tested the code…
0 votes
1 answer

Trying to scrape a PDF in R, my code will only scrape 6 out of 9 pages and I'm not sure why. Am I missing something in my code?

I'm trying to scrape a couple of PDFs in R. PDF1 has 9 pages and PDF2 has 12 pages. When I run the code below, it scrapes both PDFs but only up to page 6 and nothing after that. Is there a reason for this? Is something missing in my code? library(tm) read…
Jlingz14
0 votes
1 answer

Converting a PDF file to a nice table

I have this PDF file which is arranged in 5 columns. I have looked and looked through Stack Overflow (and Googled crazily) and tried all the solutions (including the last resort of trying Adobe Acrobat itself). However, for some reason I cannot…
econclicks
0 votes
0 answers

Extract text from PDF section keeping strings in one line

I have a bunch of PDF files and I need to extract some information from them. The section has the text "Referências" and looks like the picture below: I tried a lot of text extraction tools to accomplish this task, but the problem is that I need…
Wolgan Ens
0 votes
1 answer

Is there an easy way to find specific text in a PDF, highlight it and print OR save to new file?

So what I'm hoping to do is automate the process of mapping out desk locations in a building layout map that is in PDF format. I work with a deployment team that handles IT equipment requests, and basically we get requests with a list of user names and…
Vvega