Questions tagged [pdfplumber]

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

95 questions
0
votes
1 answer

Python - Reset BytesIO So Next File Isn't Appended

I'm having a problem with the BytesIO class in Python. I want to take a PDF file that I have retrieved from an S3 bucket and convert it into a DataFrame using a custom function, convert_bytes_to_df. The first PDF file converts to a CSV fine,…
clattenburg cake
  • 1,096
  • 3
  • 19
  • 40
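One common cause of this symptom is a single BytesIO object being reused across files, so each new write lands after the previous content. The following is a minimal sketch under that assumption, not the asker's actual code: `buffers_for` and `reset_buffer` are hypothetical helper names, and the byte strings stand in for the S3 payloads.

```python
import io


def buffers_for(payloads):
    """Yield a brand-new BytesIO per payload so nothing carries over."""
    for data in payloads:
        yield io.BytesIO(data)


def reset_buffer(buf):
    """Empty an existing BytesIO in place (hypothetical helper)."""
    buf.seek(0)       # move the cursor back to the start
    buf.truncate(0)   # drop everything that was written before
    return buf


# Reusing one buffer: without the reset, the second write would be appended.
shared = io.BytesIO()
shared.write(b"first file")
reset_buffer(shared)
shared.write(b"second")
shared_contents = shared.getvalue()
```

Creating a fresh buffer per file is usually the simpler option; resetting in place is only worth it when the same object has to be passed around.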
0
votes
1 answer

pdfplumber extract table data works when the table has borders, doesn't work when the table has no borders

Using reportlab I made two one-page PDFs, each with one table. The data in the table is this: data1 = [['00', '', '02', '', '04'], ['', '11', '', '13', ''], ['20', '', '22', '23', '24'], ['30', '31', '32', '', '34']] The point is, to get the rows…
Pedroski
  • 433
  • 1
  • 7
  • 16
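For tables without ruling lines, pdfplumber's table finder can be told to infer cell boundaries from text alignment instead of drawn lines, using the `"text"` strategy in `table_settings`. A minimal sketch, assuming that strategy suits the layout: `extract_borderless_tables` and `TEXT_TABLE_SETTINGS` are hypothetical names, and the result will often still need tolerance tuning.

```python
# Settings that ask pdfplumber to infer rows and columns from text positions.
TEXT_TABLE_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_borderless_tables(pdf_path, settings=TEXT_TABLE_SETTINGS):
    """Return one extracted table (or None) per page of the PDF."""
    # Imported lazily so the settings can be inspected without pdfplumber.
    import pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_table returns a list of rows, or None if nothing is found.
            tables.append(page.extract_table(table_settings=settings))
    return tables
```

The default strategy is line-based, which is why a bordered table can extract cleanly while the same data without borders does not.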
0
votes
1 answer

extracting data into columns using pdfplumber

I have a PDF with data in a tabular format with 6 columns, but the columns are not separated by boundaries. When I extract the data using pdfplumber, all of it comes out in one cell, and I want it in separate cells. How could I do that? For…
arvin
  • 9
  • 4
0
votes
1 answer

pdfplumber - Extract table row split across multiple pages

Given a PDF (attached) with a table row split across multiple pages by a page break, I am trying to extract the tabular data into a CSV using pdfplumber, but I am getting this data as separate rows in the CSV. Basically I would like…
jsanjayce
  • 272
  • 5
  • 15
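One way to stitch such rows back together is to post-process the extracted rows before writing the CSV. The sketch below rests on an assumption that may not hold for the asker's file: a continuation row is recognizable because its first cell is empty. `merge_continuation_rows` is a hypothetical helper, and the sample rows are made up.

```python
def merge_continuation_rows(rows):
    """Fold rows whose first cell is empty into the row before them."""
    merged = []
    for row in rows:
        # pdfplumber reports empty cells as None, so normalize to strings.
        cells = [(cell or "").strip() for cell in row]
        if merged and cells and cells[0] == "":
            previous = merged[-1]
            # Append each non-empty continuation cell to the matching cell.
            merged[-1] = [
                (old + " " + new).strip() if new else old
                for old, new in zip(previous, cells)
            ]
        else:
            merged.append(cells)
    return merged


# Rows as they might come back from two consecutive pages.
page1_rows = [["1", "Widget", "10"], ["2", "Long description that", "5"]]
page2_rows = [[None, "continues here", None], ["3", "Gadget", "7"]]
merged_rows = merge_continuation_rows(page1_rows + page2_rows)
```

The helper assumes every row has the same number of columns; rows of differing width would need padding first.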
0
votes
2 answers

How to Convert PDF file into CSV file using Python Pandas

I have a PDF file that I need to convert into a CSV file. An example of my PDF file is at this link: https://online.flippingbook.com/view/352975479/ The code used is: import re import parse import pdfplumber import pandas as pd from collections import…
0
votes
1 answer

How to correctly format this pdfplumber extract_table() output to DataFrame?

I have searched Stack Overflow on how to extract table information from a PDF without horizontal lines, and I am almost successful. However, this brings me to my next problem: how to correctly output the data for use in a DataFrame. The pdf tables in…
GT1992
  • 79
  • 6
0
votes
2 answers

PYTHON - extract list element using keyword

My goal is to extract an element from many lists similar to this one, taking the element that is food. test_list = ['Tools: Pen', 'Food: Sandwich', 'Fruit: Apple' ] The final result would be "Sandwich", found by looking for the list element with the word "Food:"…
Hay Team
  • 3
  • 1
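A minimal sketch of one way to do this, assuming every element has the form "Label: value" as in the example list. `value_for_keyword` is a hypothetical helper name.

```python
def value_for_keyword(items, keyword):
    """Return the text after 'keyword:' in the first matching item, else None."""
    prefix = keyword + ":"
    for item in items:
        if item.startswith(prefix):
            # Drop the label and any whitespace around the value.
            return item[len(prefix):].strip()
    return None


test_list = ['Tools: Pen', 'Food: Sandwich', 'Fruit: Apple']
result = value_for_keyword(test_list, "Food")
```

Matching on the prefix rather than on a substring avoids picking up an element that merely mentions the word "Food" somewhere in its value.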
0
votes
1 answer

how to recognize a graph in pdf using python?

New to PDF parsing. I want to recognize a graph in a PDF file so I can skip it and not extract this type of text. All I know about the PDF is that it is generated from Word (not scanned). Input - pdf with a graph such as this one. output should…
0
votes
0 answers

Hi, I need some information on how to create a DataFrame from a PDF file

I have a table in PDF format and I need to create a DataFrame from it. I use the pdfplumber module, and when I try to create the DataFrame I get: 0 1 2 3 \ 0 Oil Company None None …
Trepetaky
  • 45
  • 3
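Output with integer column labels and None cells is typical when the raw list of rows from `extract_table()` is used directly. A minimal sketch of one cleanup step, assuming the first row of the table holds the column names: `table_to_records` is a hypothetical helper, and the sample table is made up.

```python
def table_to_records(table):
    """Turn [header_row, *data_rows] into a list of dicts keyed by header."""
    if not table:
        return []
    # Empty cells come back as None, so replace them with empty strings.
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


sample_table = [
    ["Company", "Region", "Output"],
    ["Oil Company", None, "120"],
    ["Gas Company", "North", None],
]
records = table_to_records(sample_table)
```

A list of dicts like this can be passed straight to `pandas.DataFrame(records)`, which uses the dict keys as column names.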
0
votes
2 answers

Is there a way in python to extract only the CORE TEXT (without boxes, footer etc.) from a pdf?

I am trying to extract only the core text from a "rich" PDF document, meaning that it has a lot of tables, graphs, boxes, footers etc. in which I am not interested. I tried with some common python packages like PyPDF2, pdfplumber or…
a-caputo
  • 13
  • 4
0
votes
1 answer

how to extract only main text with pdfplumber and ignore image text and tables?

I am trying to parse any non-scanned PDF and extract only the text, without tables and their captions or pictures and their captions: just the main text of the PDF, if such text exists. I tried pdfplumber. When I try this piece of code it extracts all the text, …
0
votes
0 answers

pdfplumber memory hogging with discord bot

I was using a command to fetch a pdf and format it asynchronously. This is the command: async def ext_command(self, ctx:interactions.CommandContext, page: int = None): await ctx.defer(ephemeral=False) loop = asyncio.get_running_loop() async with…
Parth
  • 39
  • 10
0
votes
3 answers

Regular expressions python - get only the description

I am a newbie in Python, and I am trying to use regular expressions to transform some PDFs into a DataFrame. So, for now I have a list with this information: list = ['9076968 ADT 10mg 60comp 22CN014A T E1 059366 5 2,72 1,97 1,56 0,0 0,01 6 1,57 7,85', '9076943 ADT 25mg 60comp…
foliveir
  • 59
  • 5
0
votes
0 answers

Error pdfplumber cluster_objects 'str' object is not callable

I need to get all the information from a PDF into lists or arrays, but this library raises this error and I cannot find a way to solve it. with pdfplumber.open(file) as temp: def check_bboxes(word, table_bbox): """ Check whether word is…
0
votes
0 answers

pdfplumber - How to extract table with no horizontal lines?

So I have a table like this one, with an unknown number of description lines. A row can have zero, 1, 2, 5, or more lines: (I removed all sensitive information.) and I use: with pdfplumber.open("invoice.pdf") as pdf: pages = pdf.pages …
Cristian F.
  • 328
  • 2
  • 12