How to distinguish uploaded PDFs to extract data through regular expressions in Python Django

Question

I upload PDFs and convert each one to text, then use regular expressions to pull specific data out of that text. There are several kinds of PDFs, and each kind needs its own set of regular expressions. My problem is telling the PDFs apart in the if conditions below: the code only ever enters the first if branch. How can I route each PDF to the regular expressions I wrote for it? Or is there another way to do this? Mainly, I just want to build a PDF extractor for some specific data.

import os

def upload(request):
    if request.method == 'POST':
        form = PoForm(request.POST, request.FILES)
        if form.is_valid():
            form.save()
            file_name = form.cleaned_data['pdf'].name
            print(file_name)
            text = convert_pdf_to_txt(file_name)

            text = text.replace('\n', '')
            print(text)
            path = 'media/pos/pdfs/{}'.format(file_name)
            print(path)
            basename = os.path.basename(path)

            if file_name == basename:
                print(basename)
                print(file_name)
                regex_Quantity = r'Quantity:\s?([0-9]+)'
                regex_style_no = r'No:\s\s\s\s?([0-9]+)'

            elif file_name == basename:
                print("print2")
                print(basename)
                regex_Quantity = r'Total Units\s?([0-9\,]+)'
                regex_style_no = r'Number:\s?([0-9]+)'

            elif file_name == basename:
                print(basename)
                print("print3")
                regex_Quantity = r'PO\s?([0-9\.]+)'
                regex_style_no = r'Article-No.:\s?([0-9]+)'

Well, first of all, I have no idea what you wanted to achieve by using exactly the same expression in every if/elif. This is more of an if/elif statement question. If the expression in the `if` is true, the rest of the chain is skipped, because Python runs the code under the first true expression. When the `if` expression is false, it moves on to the `elif`, checks whether that expression is true, and the process repeats. – quqa123, Apr 17 '20 at 22:45
Well, actually I want to automate collecting specific data from a PDF through regular expressions as soon as I upload it to my Django project. But it only works for one PDF, I mean it only enters the first condition. My regular expressions are different for each kind of PDF, and I cannot figure out how to get into the elif conditions. There are 3 conditions here for 3 categories of PDF, and each condition has its own regular expressions to extract specific values from that PDF. There can be more PDFs and more conditions. I hope you get my point of view. – zenvar, Apr 18 '20 at 14:17
Before continuing your app's development, please read [this](https://www.programiz.com/python-programming/if-elif-else). It's clear that you're just starting your programming journey, so it's better to get familiar with the basics. Good luck! – quqa123, Apr 18 '20 at 19:17
@quqa123 Hey, you don't get my point of view, or you don't understand my problem, actually. Thank you. Say you made a project that accepts any kind of PDF, and you have some regexes for each kind. When you upload a PDF, it is stored in the project directory; from there you grab it, run a function to convert it to text, and then apply the regular expressions to extract the desired data. Now tell me how you route each PDF to its own specific regular expressions. – zenvar, Apr 18 '20 at 21:01
Do you mean something like this pseudo-code: `if pdf contains this regex do something elif pdf contains other regex do something else`? – quqa123, Apr 18 '20 at 21:06
An automated data extractor. You just upload a PDF and it extracts the values and stores them in the DB. Every kind of PDF is different, so every kind has different regexes. So how do I distinguish which PDF goes under which condition? I tried comparing against the file name from its path, but I was wrong and it didn't work, because it only ever enters the first condition. That was my attempt, and that's why I am here to get help from other members. – zenvar, Apr 18 '20 at 21:16
OK, I think I know what you mean and will post an answer in a sec. But if you thought your if/elif code would work, please read the link I gave you before; it will show you why it only went into the first condition. – quqa123, Apr 18 '20 at 21:20
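
The behaviour discussed in these comments can be reproduced in isolation. The sketch below is only an illustration: `pick_branch` and the file name `order_a.pdf` are hypothetical, and the three conditions are copied from the question. Because `basename` is derived from a path that ends in `file_name`, the comparison is always true, so the first branch always wins.

```python
import os

# Hypothetical upload name; any file name behaves the same way.
file_name = "order_a.pdf"
path = "media/pos/pdfs/{}".format(file_name)
basename = os.path.basename(path)  # strips the directory part again


def pick_branch(file_name, basename):
    """Mirror the question's chain: three identical conditions."""
    if file_name == basename:
        return "first"
    elif file_name == basename:  # never reached: the first test already succeeded
        return "second"
    elif file_name == basename:  # never reached either
        return "third"
    return "none"


branch = pick_branch(file_name, basename)
print(branch)
```

Since the file name carries no information about the PDF's layout here, the conditions need to test something that actually differs between the PDF types, such as their text content.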

score 0 · Answer 1 · answered Apr 18 '20 at 21:33

To choose a branch based on a piece of the PDF's content, use `search` from the `re` module, like this:

from re import search

content = get_your_pdf_content_or_particular_string()  # placeholder: the extracted PDF text
if search('your_regex', content):
    do_something()
elif search('your_other_regex', content):
    do_something_else()
# ... further elif branches, one per PDF type

`search` returns `None` if the regex doesn't match any part of the content. If it does match, it returns a match object (`re.Match`), from which you can get the text the regex matched in the content via `.group(0)` and use it in your code if you like.
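
As a rough sketch of this idea with concrete patterns: the layout names (`layout_a` and so on), the `detect_layout` helper, and the sample text are assumptions made for illustration, and the marker patterns are adapted from the question on the assumption that each one appears in only one kind of PDF.

```python
import re

# Hypothetical marker patterns, adapted from the question. Each one is
# assumed to match text that occurs in only one PDF layout.
LAYOUT_MARKERS = [
    ("layout_a", re.compile(r"Quantity:\s?[0-9]+")),
    ("layout_b", re.compile(r"Total Units\s?[0-9,]+")),
    ("layout_c", re.compile(r"Article-No\.:\s?[0-9]+")),
]


def detect_layout(text):
    """Return (layout_name, match) for the first marker found, else (None, None)."""
    for name, pattern in LAYOUT_MARKERS:
        match = pattern.search(text)
        if match is not None:
            return name, match
    return None, None


# Made-up text standing in for the converted PDF content.
sample = "Purchase order Total Units 1,250 Number: 4471"
layout, match = detect_layout(sample)
print(layout, match.group(0))
```

Keeping the markers in a list means supporting another PDF type is one more entry rather than one more `elif` branch.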

quqa123

Each PDF has a lot of regexes, but the search function takes only one regex, am I right? I want to combine each PDF's regexes so they all run together. Is there any other way to do it? Btw, thank you for all your effort. I really appreciate it. – zenvar Apr 18 '20 at 22:03
Yes, the search function finds a match for the regex supplied as its argument. I don't know what you mean by `combine my each pdf's regex to run simultaneously`. I'd suggest you watch a `python re` crash course; it can give you the answers to your problems. If you find this helpful, just hit the up arrow :) – quqa123 Apr 18 '20 at 22:32
The search function can take only a single regex, but I have a lot of regexes for a single PDF. So how do I handle that? Do you have any other way to do it in mind? Sure, I will upvote if it resolves my issue. – zenvar Apr 18 '20 at 22:41
Well, just call the search function once for each regex. Btw, this is not a "do my homework" stack; if someone helps you in the comments, hit the like button. Don't expect a full solution for a problem you created. I can just give you the resources; the rest depends on your skill. – quqa123 Apr 18 '20 at 23:10
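
One way to read "call the search function once for each regex" is to keep every PDF type's field patterns in a dictionary and loop over them. This is a minimal sketch, not the answerer's code: the type names, the `extract_fields` helper, and the sample text are hypothetical, and the patterns are adapted from the question (with the dot in `Article-No.` escaped and the repeated `\s` written as `\s{3,4}`).

```python
import re

# Field patterns per PDF type, adapted from the question.
# The type names ("type_a", ...) are hypothetical labels.
FIELD_PATTERNS = {
    "type_a": {
        "quantity": r"Quantity:\s?([0-9]+)",
        "style_no": r"No:\s{3,4}([0-9]+)",
    },
    "type_b": {
        "quantity": r"Total Units\s?([0-9,]+)",
        "style_no": r"Number:\s?([0-9]+)",
    },
    "type_c": {
        "quantity": r"PO\s?([0-9.]+)",
        "style_no": r"Article-No\.:\s?([0-9]+)",
    },
}


def extract_fields(text, pdf_type):
    """Run every regex registered for pdf_type; unmatched fields map to None."""
    results = {}
    for field, pattern in FIELD_PATTERNS[pdf_type].items():
        found = re.search(pattern, text)
        results[field] = found.group(1) if found else None
    return results


# Made-up text standing in for the converted PDF content.
sample = "Total Units 1,250 Number: 4471"
fields = extract_fields(sample, "type_b")
print(fields)
```

The returned dictionary could then be saved to the database; a `None` value signals that a pattern did not match and the PDF may belong to a different type.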
