I'm working on my first "big" project, and I need to deal with a lot of phone numbers: extracting them from a file (already done), formatting them to the same format (this is where the problem is), and finally storing them in a database (also already done).
The problem with formatting is that I have no control over the data source. The format is not consistent, and national and international numbers are mixed together: some have the country code with the plus sign, others do not; some have parentheses, hyphens, a leading 0, etc., and some don't.
I'm trying to use the phonenumbers library to separate national and international numbers. My country is Brazil and the overwhelming majority of the numbers are Brazilian. So I start by removing all unnecessary characters such as parentheses, hyphens, spaces, the plus symbol and leading zeros:

df['Mobile Phone'] = df['Mobile Phone'].str.replace(r'\(|\)|\-|\+|\s', '', regex=True)

df['Mobile Phone'] = df['Mobile Phone'].str[:1].str.replace('0', '') + df['Mobile Phone'].str[1:]

The next step is to separate the national numbers from the international ones, which is where the library comes in. So far I've tried two approaches, but both raise an exception. In the first attempt, I expected to fill the Origin column with the name of each number's country of origin, so I could separate the Brazilian numbers from the others. However, phonenumbers.parse() needs to be told the number's region when the number has no leading plus sign, and I have no way of knowing it, so I get the error below:

df['Origin'] = df['Mobile Phone'].apply(lambda x: geocoder.description_for_number(phonenumbers.parse(x), 'en'))

NumberParseException: (0) Missing or invalid default region.

So I tried passing Brazil (BR) as the region, but that also raises an error, because at some point the number passed to phonenumbers.parse() is an international one, which is not recognized as a valid number. The code and error are below:

df['Origin'] = df['Mobile Phone'].apply(lambda x: geocoder.description_for_number(phonenumbers.parse(x, 'BR'), 'en'))

NumberParseException: (1) The string supplied did not seem to be a phone number.

I also tried using phonenumbers.is_valid_number() to fill a 'Valid' column with True or False depending on whether the number is valid for Brazil. However, the error is the same, because an international number passed to phonenumbers.parse() is not recognized and the exception is raised:

df['Valid'] = df['Mobile Phone'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x, 'BR')))

NumberParseException: (1) The string supplied did not seem to be a phone number.

Is there any way to avoid or ignore these exceptions so that the rest of the checks are done? Or some way to return another value for the column when the exception is raised, indicating that the number was not recognized? Or is there a way to pass a list of all existing countries to phonenumbers.parse(), something like this:

df['Valid'] = df['Mobile Phone'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x, list_of_countries)))

or

df['Valid'] = df['Mobile Phone'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x, ['EN', 'GB', 'BR'])))

Here is a sample of numbers from one of the files I'm working on, before any treatment. The first 4 numbers are Brazilian, the rest are international:

+55 34 98400-xxxx
34 99658-xxxx
+349798xxxx
9685-xxxx
549215xxxx
+598 91 xxx xxx
+81 80-4250-xxxx
+81 90-4262-xxxx
+971 50 147 xxxx
+972 53-881-xxxx

And this is how they look after I clean out the useless characters:

553498400xxxx
3499658xxxx
349798xxxx
9685xxxx
549215xxxx
59891xxxxxx
81804250xxxx
81904262xxxx
97150147xxxx
97253881xxxx

The complete Brazilian number follows the format +55 XX XXXXX-XXXX, but the data contains incomplete numbers that are missing some information, such as the country code.

I do not intend to format the international numbers, as they come from several different countries and each has its own format. I just need to remove them from the dataframe somehow so that I can format the Brazilian numbers, and afterwards I will put the international numbers back into the dataframe. As I said, I have already written the code that formats the Brazilian numbers and fills in the missing information. My difficulty is in how to separate the international numbers from the Brazilian ones, using the phonenumbers library or otherwise.

  • Please make it clear where the problem lies. I believe you are having issues only with [tag:python-phonenumber], and you have no problems with [tag:pandas] or [tag:google-geocoder]. The question would be much clearer if you had a [example], and provided us with several examples of phone numbers (as most of us are not Brazilian) and how `phonenumber` fails to meet your expectations (of course, you can anonymise them, for example by replacing last 6 digits or so with `#` or something). If indeed you do have problems with dataframes, then please explain how they are relevant to your question. – Amadan May 11 '22 at 01:33
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. – Community May 11 '22 at 02:12
  • @Amadan I'm trying to edit to add some more information like tables representing the dataframe with some examples of numbers, but I'm getting the error " Your post appears to contain code that is not properly formatted as code. Please indent all code by 4 spaces using the code toolbar button or the CTRL+K keyboard shortcut." – MadSweeney May 11 '22 at 02:13
  • In the worst case, you can just write the numbers as text, someone can edit it for you. – Amadan May 11 '22 at 02:17
  • @Amadan if you need more information just ask, in the meantime I'll keep trying to find a solution – MadSweeney May 11 '22 at 02:34
  • As it seems all of your numbers are international (with a country code); using `phonenumbers.parse('+' + clean_number, None)` should work. If you have numbers that do not have a country code but are local brazilian numbers, you have not provided them. – Amadan May 11 '22 at 02:55
  • @Amadan yes I also have Brazilian numbers that do not have the country code or the plus symbol, and also those that have the plus symbol but do not have the country code, I'll make an edit to make it clearer, if all Brazilian numbers had the country code I probably wouldn't be facing any problems, but some don't, and so that I can add the country code to them I need to remove the numbers from the other countries, then I enter a dead end. – MadSweeney May 11 '22 at 03:02

2 Answers

If you don't know which numbers are international and which are local, you'll just have to try both:

import phonenumbers

def guess_phonenumber(clean, loc):
    # Try national first; if that fails, add + and try international
    for text, region in ((clean, loc), ("+" + clean, None)):
        try:
            pn = phonenumbers.parse(text, region)
        except phonenumbers.NumberParseException:
            continue  # Could not be parsed this way; try the next form
        if phonenumbers.is_valid_number(pn):
            return pn
    # Neither national nor international
    return None

guess_phonenumber(clean_phone_number, "BR")
# => PhoneNumber or None

If the phone cannot be recognised, it is likely either invalid altogether, or it is missing too much information to be able to be reconstructed (e.g. a local number, when you do not know which area it is local to).

Amadan
  • thanks to your solution I managed to get the answer, I just had to make some changes and now it's working perfectly, thank you very much for the idea. – MadSweeney May 14 '22 at 16:06

The 7.0.0 version of Django's phonenumber field addresses this issue, and should be able to handle international numbers without Amadan's answer

merhoo