I have hundreds of PDF files on my Google Drive, and I want to extract page 6 from every one of them in a Jupyter notebook on Google Colab, with each output file named after its original PDF.

I used the code below to extract a page without changing the original file name, and it worked just fine:

from PyPDF2 import PdfFileReader, PdfFileWriter

path = '/content/drive/Shareddrives/2022 | ICT 4014 | Group M2N2/datasets/frazer/1.pdf'

file_ext = path.replace('.pdf', '')
  
pdf = PdfFileReader(path)
  
pdfpage = [6]  # PyPDF2 page indices are zero-based, so 6 selects the seventh page

PdfWriter = PdfFileWriter() #Creating pdfWriter instance

for page_num in pdfpage:
  PdfWriter.addPage(pdf.getPage(page_num))

with open('{0}_1.pdf'.format(file_ext), 'wb') as a:
  PdfWriter.write(a)  # the with block closes the file, so no explicit close() is needed

Output:

1_1.pdf

I then tried to wrap this in a loop so that I could extract page 6 from every PDF in the specified directory, but I got an error:

import PyPDF2
import os 
import re
import sys
import glob 
import PyPDF2 as pdf
from PyPDF2 import PdfFileReader, PdfFileWriter

path = glob.glob(os.path.join('/content/drive/Shareddrives/2022 | ICT 4014 | Group M2N2/datasets/unza_etd_pdfs','*.pdf'))

for pdf_files in path:

  file_ext = path.replace('.pdf', '')

  pdf = PdfFileReader(path)

  pdfpage = [6]

  PdfWriter = PdfFileWriter() #Creating pdfWriter instance

  for page_num in pdfpage:
    PdfWriter.addPage(pdf.getPage(page_num))

  with open('{0}_1.pdf'.format(file_ext), 'wb') as a:
    PdfWriter.write(a)
    a.close()

Output Error:

AttributeError: 'list' object has no attribute 'replace'
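
The error arises because `glob.glob` returns a list of paths, so `path` is a list and has no `replace` method. Inside the loop, the loop variable (`pdf_files`) is the string that should be passed to `replace` and `PdfFileReader`. Below is a minimal sketch of a corrected loop, using the same PyPDF2 classes as the code above. The helper names `output_path_for`, `extract_page`, and `extract_from_folder` are made up for illustration. Index 6 is kept from the original code even though PyPDF2 page indices start at 0 (pass 5 if the sixth page is wanted). The sketch also assumes a PyPDF2 release that still provides `PdfFileReader` and `PdfFileWriter`; PyPDF2 3.x renamed them to `PdfReader` and `PdfWriter`.

```python
import glob
import os


def output_path_for(pdf_path, suffix="_1"):
    """Build '<name><suffix>.pdf' next to the input file (hypothetical helper)."""
    root, ext = os.path.splitext(pdf_path)
    return "{0}{1}{2}".format(root, suffix, ext)


def extract_page(pdf_path, page_index=6):
    """Write one page of pdf_path to a new file and return the new path.

    Returns None when the document has too few pages.
    """
    # Assumption: a PyPDF2 version that still ships the classes used in the
    # question. Imported here so the path helper above works without PyPDF2.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader(pdf_path)
    if page_index >= reader.getNumPages():
        return None  # skip short documents instead of raising an error

    writer = PdfFileWriter()
    writer.addPage(reader.getPage(page_index))  # zero-based page index

    out_path = output_path_for(pdf_path)
    with open(out_path, "wb") as out_file:
        writer.write(out_file)
    return out_path


def extract_from_folder(folder, page_index=6):
    """Run extract_page on every *.pdf in folder; return the files written."""
    written = []
    # glob.glob returns a list, so each element (a string) is handled in turn.
    for pdf_path in glob.glob(os.path.join(folder, "*.pdf")):
        out_path = extract_page(pdf_path, page_index)
        if out_path is not None:
            written.append(out_path)
    return written


if __name__ == "__main__":
    folder = "/content/drive/Shareddrives/2022 | ICT 4014 | Group M2N2/datasets/unza_etd_pdfs"
    for name in extract_from_folder(folder):
        print(name)
```

Because the outputs are written to the same folder as the inputs, running this a second time would also pick up the `_1.pdf` files from the first run. Writing the results to a separate output folder is one way to avoid that.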
