
I am unable to extract the MCC details from a PDF, although my code extracts other data from the same file correctly.

from tabula.io import read_pdf

pdf_path = "IR21_SVNMT_Telekom Slovenije d.d._20210506142456.pdf"
# Assumed call: read the tables from every page into a list of DataFrames
df_list = read_pdf(pdf_path, pages="all", multiple_tables=True)
for df in df_list:
    if 'MSRN Number Range(s)' in df.columns:
        # Drop the first row, then clean up and rename the column headers
        df = df.drop(df.index[0])
        df.columns = df.columns.str.replace('\r', '')
        df.columns = df.columns.str.replace(' ', '')
        df.columns = df.columns.str.replace('Unnamed:0', 'CountryCode(CC)')
        df.columns = df.columns.str.replace('Unnamed:1', 'NationalDestinationCode(NDC)')
        df.columns = df.columns.str.replace('Unnamed:2', 'SNRangeStart')
        df.columns = df.columns.str.replace('Unnamed:3', 'SNRangeStop')
        break
msrn_table = df[['CountryCode(CC)', 'NationalDestinationCode(NDC)', 'SNRangeStart', 'SNRangeStop']]
print(msrn_table)

I am trying to retrieve the "Mobile Country Code (MCC)" details with the same logic, but the pandas DataFrame shows different data from what is actually in the PDF.

for df in df_list:
    # Stop at the first table whose header contains the MCC column
    if 'Mobile Country Code (MCC)' in df.columns:
        break
print(df)

The pandas output is shown here: pandas_output

The actual content in pdf file is: actual_pdf

n1colas.m
user1107731

1 Answer


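This code works. It uses pdfplumber to extract the text of every page, then a regex to capture the line that follows the "Mobile Network Code (MNC)" header, which is split into the MCC and MNC values: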

import re
import pdfplumber

pdf_path = "IR21_SVNMT_Telekom Slovenije d.d._20210506142456.pdf"
# Capture the line that follows the "Mobile Network Code (MNC)" header
pattern = re.compile(r'Mobile Network Code \(MNC\)[\r\n]+([^\r\n]+)')

with pdfplumber.open(pdf_path) as pdf:
    # extract_text() can return None for a page without text, so fall back to ""
    final = "\n".join(page.extract_text() or "" for page in pdf.pages)

matches = pattern.findall(final)
mcc_mnc = " ".join(matches)
# The captured line is expected to hold the MCC first, then the MNC
mcc = mcc_mnc.split(" ")
actual_mcc = mcc[0]
actual_mnc = mcc[1]
print(actual_mcc)
print(actual_mnc)
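
The lookup above raises an IndexError if the header is not found or the following line has fewer than two values. A minimal sketch of a safer variant is below. The function name extract_mcc_mnc and the sample text are made up for illustration; the sample stands in for what extract_text() might return and is not taken from the actual PDF.

```python
import re

# Same header pattern as above; the capture group is the line after the header.
HEADER_PATTERN = re.compile(r"Mobile Network Code \(MNC\)[\r\n]+([^\r\n]+)")


def extract_mcc_mnc(text):
    """Return (mcc, mnc) from the line after the header, or None if not found."""
    match = HEADER_PATTERN.search(text)
    if match is None:
        return None
    # split() with no argument tolerates repeated spaces between the values
    parts = match.group(1).split()
    if len(parts) < 2:
        return None
    return parts[0], parts[1]


# Made-up sample text (hypothetical values, not from the real document)
sample_text = (
    "Mobile Country Code (MCC) Mobile Network Code (MNC)\n"
    "999 99\n"
    "Other section"
)
print(extract_mcc_mnc(sample_text))
print(extract_mcc_mnc("no header here"))
```

With the real file, the text passed in would be the joined page text built from pdfplumber, and a None result means the header pattern needs adjusting for that document.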
user1107731