Scrape data from PDF with python but not from a table or a normal te

Question

Hello guys, and thank you in advance for helping me.

So basically, I am trying to scrape data from a PDF.

This is the PDF data:

What I want to do is extract data from it like this:

I tried to do it with tabula, but it gave me this:

I also tried regular expressions, but got nothing.

Can you please help me?

import tabula
import pandas as pd
import numpy as np


df = (pd.concat(
         tabula.read_pdf(
              "/content/drive/MyDrive/Stage/word.pdf", pages="all", pandas_options={"header": None}))
         .squeeze().str.extract(r"\)\s*([^\s]+)\s*([a-z\s,]+)?\s*([A-Z\s]+)?\s*(\w\d+)")
         .stack(dropna=False).str.strip().unstack()
         .set_axis(["word", "type", "comment", "suffix"], axis=1)
     [["word", "type"]] #uncomment this line to match your expected output
     )

df.to_excel("table.xlsx", index=False) #uncomment this line to make a spreadsheet
print(df)

@tous, why is *preposition* not a type of the word *aboard* in your expected output? — Timeless, Apr 15 '23 at 23:22

score 3 · Answer 1 · answered Apr 15 '23 at 23:53

You can try something like this with tabula-py & pandas :

import pandas as pd
import tabula

# Read every page without a header row and concatenate the per-page tables.
# squeeze() turns the result into a Series only if it has a single column.
df = (pd.concat(
         tabula.read_pdf(
              "file.pdf", pages="all", pandas_options={"header": None}))
         .squeeze().str.extract(r"\)\s*([^\s]+)\s*([a-z\s,]+)?\s*([A-Z\s]+)?\s*(\w\d+)")
         .stack(dropna=False).str.strip().unstack()
         .set_axis(["word", "type", "comment", "suffix"], axis=1)
     #[["word", "type"]] #uncomment this line to match your expected output
     )

#df.to_excel("table.xlsx", index=False) #uncomment this line to make a spreadsheet

Output :

print(df)

           word                 type       comment suffix
0       abandon                 verb    STOP DOING     C1
1      abnormal            adjective           NaN     C1
2        aboard  adverb, preposition           NaN     C1
3      abortion                 noun           NaN     C1
4   absolutely!                  NaN           NaN     C1
5        absorb                 verb      REMEMBER     C1
6         abuse                 noun  WRONG ACTION     C1
7    accelerate                 verb        HAPPEN     C1
8    acceptable            adjective       ALLOWED     C1
9    acceptance                 noun           NaN     C1
10     accepted            adjective           NaN     C1
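
To see what each capture group in the pattern picks up, here is a small sketch that runs the same regular expression on a few hand-written lines. The sample strings are only an assumption about what tabula returns for each row (the real extracted text is not shown here), and the per-column strip is used instead of the stack/unstack step purely for illustration.

```python
import pandas as pd

# Hypothetical raw rows; the real text extracted by tabula may differ.
raw = pd.Series([
    "1) abandon verb STOP DOING C1",
    "2) aboard adverb, preposition C1",
    "3) absolutely! C1",
])

# Same pattern as above: word, lowercase type(s), uppercase comment, level.
pattern = r"\)\s*([^\s]+)\s*([a-z\s,]+)?\s*([A-Z\s]+)?\s*(\w\d+)"

parsed = raw.str.extract(pattern)
parsed.columns = ["word", "type", "comment", "suffix"]

# The optional groups capture trailing spaces, so strip every column.
parsed = parsed.apply(lambda col: col.str.strip())

print(parsed)
```

The two optional groups are what produce NaN for rows without a type or a comment, such as "absolutely!".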

PDF used :

It gave me this error: AttributeError: 'DataFrame' object has no attribute 'str' — tous, Apr 16 '23 at 01:10
With a PDF that matches your example, the code actually works. Either you failed to describe your dataset or you're doing something wrong. Either way, can you make a reproducible example that triggers the error? Also, can you include your code in the question? — Timeless, Apr 16 '23 at 10:42
Cool, AWK now, I'll see if I can find it as well and then re-adapt my answer later. Thanks K J ;) — Timeless, Apr 16 '23 at 17:11
It didn't work even with the file that Mr K J told me about :( — tous, Apr 17 '23 at 00:58
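
One possible explanation for the AttributeError reported in the comments, offered only as an assumption since the asker's PDF is not available: DataFrame.squeeze() returns a Series only when the frame has exactly one column. If tabula splits the page into several columns, the result stays a DataFrame, which has no .str accessor. The sketch below reproduces that with made-up data and joins the columns back into one string per row; the helper name to_single_series is hypothetical.

```python
import pandas as pd

def to_single_series(frame):
    # Hypothetical helper: join the non-empty cells of each row into one string.
    return frame.fillna("").astype(str).apply(
        lambda row: " ".join(cell for cell in row if cell), axis=1
    )

# Made-up frames standing in for what tabula might return.
one_col = pd.DataFrame({0: ["1) abandon verb C1", "2) abnormal adjective C1"]})
two_cols = pd.DataFrame({0: ["1) abandon", "2) abnormal"],
                         1: ["verb C1", "adjective C1"]})

print(type(one_col.squeeze()).__name__)   # Series: .str is available
print(type(two_cols.squeeze()).__name__)  # DataFrame: .str raises AttributeError

# Rebuild one text line per row, which can then be passed to .str.extract(...).
lines = to_single_series(two_cols)
print(lines.tolist())
```

Checking the shape of the concatenated frame before calling .squeeze() would show whether this is what happens with a given file.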
