I have hundreds of PDF files on my Google Drive, and I want to extract page 6 from every one of them, with each output file named after its original PDF, using a Jupyter notebook on Google Colab.
I used the code below to extract a page without changing the original file name, and it worked just fine:
from PyPDF2 import PdfFileReader, PdfFileWriter
path = '/content/drive/Shareddrives/2022 | ICT 4014 | Group M2N2/datasets/frazer/1.pdf'
file_ext = path.replace('.pdf', '')
pdf = PdfFileReader(path)
pdfpage = [6]
PdfWriter = PdfFileWriter() #Creating pdfWriter instance
for page_num in pdfpage:
    PdfWriter.addPage(pdf.getPage(page_num))
with open('{0}_1.pdf'.format(file_ext), 'wb') as a:
    PdfWriter.write(a)
    a.close()
Output:
1_1.pdf
I then tried to add a loop so that I could extract page 6 from every PDF in the specified directory, but I got an error:
import PyPDF2
import os
import re
import sys
import glob
import PyPDF2 as pdf
from PyPDF2 import PdfFileReader, PdfFileWriter
path = glob.glob(os.path.join('/content/drive/Shareddrives/2022 | ICT 4014 | Group M2N2/datasets/unza_etd_pdfs','*.pdf'))
for pdf_files in path:
    file_ext = path.replace('.pdf', '')
    pdf = PdfFileReader(path)
    pdfpage = [6]
    PdfWriter = PdfFileWriter() #Creating pdfWriter instance
    for page_num in pdfpage:
        PdfWriter.addPage(pdf.getPage(page_num))
    with open('{0}_1.pdf'.format(file_ext), 'wb') as a:
        PdfWriter.write(a)
        a.close()
Output Error:
AttributeError: 'list' object has no attribute 'replace'
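
One likely reading of the traceback: `glob.glob` returns a list, and inside the loop the code calls `path.replace(...)` and `PdfFileReader(path)` on that whole list instead of on the loop variable `pdf_files`. Below is a minimal sketch of what the loop may have been meant to do. It assumes the same older PyPDF2 calls used above (`PdfFileReader`, `PdfFileWriter`, `getPage`, `addPage`), which newer PyPDF2 releases rename. The helper names `output_path_for`, `extract_page` and `extract_from_folder` are hypothetical, introduced only for illustration.

```python
import glob
import os


def output_path_for(pdf_path, suffix="_1"):
    """Return the output path: same folder and base name, plus a suffix."""
    base, ext = os.path.splitext(pdf_path)
    return "{0}{1}{2}".format(base, suffix, ext)


def extract_page(pdf_path, page_index=6):
    """Write one page of pdf_path to a sibling file; return the new path or None."""
    # Imported here so the path helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader(pdf_path)
    if page_index >= reader.getNumPages():
        return None  # this file is too short to contain the requested page
    writer = PdfFileWriter()
    writer.addPage(reader.getPage(page_index))  # page_index is zero-based
    out_path = output_path_for(pdf_path)
    with open(out_path, "wb") as out_file:
        writer.write(out_file)
    return out_path


def extract_from_folder(folder, page_index=6):
    """Run extract_page on every .pdf directly inside folder."""
    written = []
    for pdf_path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        out_path = extract_page(pdf_path, page_index)
        if out_path is not None:
            written.append(out_path)
    return written
```

Calling `extract_from_folder('/content/drive/Shareddrives/2022 | ICT 4014 | Group M2N2/datasets/unza_etd_pdfs')` would then write one `<name>_1.pdf` next to each input. Two caveats: page indices in `getPage` are zero-based, so index 6 selects the seventh page (pass 5 if the sixth page is meant), and a second run over the same folder would also pick up the `_1.pdf` files written by the first run.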