Highest Voted 'pdf-scraping' Questions

0

votes

1 answer

Create single page PDF from multi page PDF WITHOUT external libraries

I saw the following question on SO: Create Multi-Page PDF from other PDFs. But it didn't answer what I need. Suppose I have a PDF with 20 pages. So far so good. From the same place, I can have a PDF with only one page. This one will be used…

c# pdf binaryfiles pdf-scraping

asked Dec 17 '19 at 04:19

paboobhzx

109
10

0

votes

2 answers

How to convert a scanned PDF file to Editable PDF file with python?

I just need to know whether a scanned PDF file can be converted to an editable PDF file using Python. I know of a couple of libraries, such as pytesseract and pyocr. Any guidance would be highly appreciated. Thanks

python-3.x pdf-scraping

asked Nov 03 '19 at 20:03

Umar Aftab

147
4
15

0

votes

0 answers

I need to scrape data from 100 Microsoft Word documents and create a table in a CSV file

I have hundreds of Microsoft Word documents. Each document has the same headers. I need to read the data in those documents and build a table, output as a CSV file. I tried to use Scrapy. But I am new, and I don’t know…

web-scraping pdf-scraping

asked Oct 01 '19 at 23:32

Learning_quick

1

0

votes

0 answers

Data Scraping from PDF

I am trying to collect data from a PDF using the R tabulizer package. However, I get an error when I try to convert the data to a data frame and export it to CSV. My code is below. Could someone help me with this? # Library packages if…

r pdf-scraping tabulizer

asked Aug 19 '19 at 01:21

Caíque Melo

1
1

0

votes

1 answer

Extract PDF data with varying white space as separation

I'm trying to get data from these PDFs. I'm running into a problem where location names with multiple words ("Northern Island" for example) are being put into different columns. The "sep" argument within "read.table" seems to only be able to…

r pdf pdf-scraping

asked Jul 17 '19 at 09:22

deetseeker

3
2

0

votes

0 answers

Copy website data with SendKeys ("^a") and Paste in Excel

I have VBA code that can browse a webpage, but SendKeys "^a" and "^c" are not working. I have tried multiple times with no luck. Please suggest a fix. I am using this code: Set DestinationWorksheet = ActiveSheet Dim myURL As String Dim HTMLdoc As HTMLDocument …

excel vba pdf sendkeys pdf-scraping

asked Jul 14 '19 at 10:55

Mukunda Adhikari

11
3

0

votes

2 answers

Is it possible for a PDF data parser to read PowerPoint PDFs?

I am currently developing a proprietary PDF parser that can read multiple types of documents with various types of data. Before starting, I was wondering whether reading PowerPoint slides is possible. My employer uses presentation guidelines that…

python parsing pdf pdf-scraping

asked Jul 10 '19 at 16:55

Mashiyath Haque

9
1

0

votes

1 answer

How to read a PDF up to a certain end line?

I am running a for loop over many research papers. From each document I read, I want to extract some content. How can I make R read only up to the last line, the one with many dots, and treat it as the end line? Like in the picture below: [Numbers]…

r regex pdf-scraping

asked Apr 11 '19 at 10:09

Bakai Baiazbekov

61
4

0

votes

0 answers

pdftools - Helvetica (?) font distorts text import

I am struggling to properly read PDFs that contain the Helvetica font with the pdftools package. I am trying to extract info from about 1,000 voting records. Overall, pdftools works as intended. However, there are one or two hundred PDFs where the…

r pdf web-scraping pdf-scraping

asked Apr 11 '19 at 07:24

zoowalk

2,018
20
33

0

votes

1 answer

Cleaning up text data extracted from scanned .pdf

I am creating a script to extract text from a scanned PDF and build a JSON dictionary to load into MongoDB later. The issue I have run into is that using tesseract-ocr via the Textract module successfully extracted all the text but it is…

python pdf-scraping

asked Mar 27 '19 at 05:38

Brett Plemons

65
9

0

votes

1 answer

No text is returned when PyPDF2 is used to scrape a one-page PDF

I have downloaded a bunch of PDFs from this source: http://ec.europa.eu/growth/tools-databases/cosing/index.cfm?fuseaction=search.detailsPDF_v2&id=28157 Now I want to scrape the PDFs using PyPDF2; however, no text is returned. I tested the code…

python pdf-scraping

asked Jan 25 '19 at 12:57

Mr Anderson

7
3

0

votes

1 answer

Trying to scrape a PDF in R, my code will only scrape 6 out of 9 pages and I'm not sure why. Am I missing something in my code?

I'm trying to scrape a couple of PDFs in R. PDF1 has 9 pages and PDF2 has 12 pages. When I run the code below, it scrapes both PDFs but only up to page 6 and nothing after that. Is there a reason for this? Is something missing in my code? library(tm) read…

r pdf tm pdf-scraping xpdf

asked Jan 04 '19 at 11:28

Jlingz14

47
6

0

votes

1 answer

Converting a PDF file to a nice table

I have this PDF file which is arranged in 5 columns. I have looked and looked through Stack Overflow (and Googled crazily) and tried all the solutions (including the last resort of trying Adobe Acrobat itself). However, for some reason I cannot…

pdf text pdf-scraping

asked Mar 21 '11 at 12:23

econclicks

327
1
5
11

0

votes

0 answers

Extract text from PDF section keeping strings in one line

I have a bunch of PDF files and I need to extract some information from them. The section has the text "Referências" and looks like the picture below: I tried a lot of text extraction tools to accomplish this task, but the problem is that I need…

parsing pdf extractor pdf-scraping

asked Sep 15 '18 at 22:21

Wolgan Ens

385
1
11

0

votes

1 answer

Is there an easy way to find specific text in a PDF, highlight it and print OR save to new file?

What I'm hoping to do is automate the process of mapping out desk locations on a building layout map that is in PDF format. I work with a deployment team that handles IT equipment requests, and basically we get requests with a list of user names and…

python-3.x pdf-scraping

asked Sep 02 '18 at 04:45

Vvega

37
6
