Based on the need, we need to tweak the parameters on tabula
so that data import makes sense. The parameters I suggested in the comments were just an example. To get columns starting x-axis either we need to use acrobat's paid version or with some trails.
so code would be like
Import and setup
import tabula
import pandas as pd
pdf_file='file1.pdf'
column_names=['Product','Batch No','Machin No','Time','Date','Drum/Bag No','Tare Wt.kg','Gross Wt.kg',
'Net Wt.kg','Blender','Remarks','Operator']
df_results=[] # store results in a list
as pages are not in the same format, we need to process them separately. And some clean up like, remove the column which is not needed or data after certain value(refer in page 2 processing)
# Page 1 processing
try:
df1 = tabula.read_pdf(pdf_file, pages=1,area=(95,20, 800, 840),columns=[93,180,220,252,310,315,333,367,
410,450,480,520]
,pandas_options={'header': None}) #(top,left,bottom,right)
df1[0]=df1[0].drop(columns=5)
df1[0].columns=column_names
df_results.append(df1[0])
df1[0].head(2)
except Exception as e:
print(f"Exception page not found {e}")
# Page 2 processing
try:
df2 = tabula.read_pdf(pdf_file, pages=3,area=(10,20, 800, 840),columns=[93,180,220,252,310,315,330,370,
410,450,480,520]
,pandas_options={'header': None}) #(top,left,bottom,right)
row_with_Sta = df2[0][df2[0][0] == 'Sta'].index.tolist()[0]
df2[0] = df2[0].iloc[:row_with_Sta]
df2[0]=df2[0].drop(columns=5)
df2[0].columns=column_names
df_results.append(df2[0])
df2[0].head(2)
except Exception as e:
print(f"Exception page not found {e}")
#result = pd.concat([df1[0],df2[0]]) # concate both the pages and then write to CSV
result = pd.concat(df_results) # concate list of pages and then write to CSV
result.to_csv("result.csv")
pls test the code, as I have some level of verification only :)