This is what I'm trying to do:
- Scan the CSV using a Polars lazy DataFrame
- Format the phone numbers using a function
- Remove nulls and duplicates
- Write the result to a new CSV file
Here is my code:
import sys
import json
import polars as pl
import phonenumbers
# parse the JSON-encoded arguments passed as the first command-line argument
args = json.loads(sys.argv[1])
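# for illustration, an assumed invocation (script name and file paths are made up;
# the keys match the ones used below, and list_id 100 matches the comment further down):
#   python format_phones.py '{"path": "in.csv", "delimiter": ",", "column": "phone",
#                             "list_id": 100, "saved_path": "out.csv"}'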
# format a phone number as E164; return None when it cannot be parsed
def parse_phone_number(phone_number):
    try:
        return phonenumbers.format_number(
            phonenumbers.parse(phone_number, "US"),
            phonenumbers.PhoneNumberFormat.E164,
        )
    except phonenumbers.NumberParseException:
        return None
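# example behaviour (input values assumed for illustration, not from my data):
#   parse_phone_number("(212) 555-0123")  -> "+12125550123"
#   parse_phone_number("not a number")    -> None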
# scan the CSV, keep only the phone column, clean it, then write the output to a new CSV file
(
    pl.scan_csv(args['path'], sep=args['delimiter'])
    .select([args['column']])
    .with_columns(
        [
            # cast the integer phone number to a string and apply parse_phone_number
            pl.col(args['column']).cast(pl.Utf8).apply(parse_phone_number).alias(args['column']),
            # add another column list_id with value 100
            pl.lit(args['list_id']).alias("list_id"),
        ]
    )
    # drop rows where the phone number could not be parsed
    .filter(pl.col(args['column']).is_not_null())
    .unique(keep="last")
    .collect()
    .write_csv(args['saved_path'], sep=",")
)
I tested it on a file with 800k rows and 23 columns (150 MB); it takes around 20 seconds and more than 500 MB of RAM to complete.
Is this normal? Can I optimize the performance, or at least the memory usage?
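One idea I've been reading about (this is an untested sketch, and I'm not sure the streaming engine even supports Python UDFs like apply, or unique) is to replace collect() + write_csv with sink_csv, so Polars streams the result to disk instead of materializing the whole frame in memory:

# untested sketch: same pipeline, but streamed to disk with sink_csv
# (sink_csv is a LazyFrame method in recent Polars; behaviour with apply/unique may vary by version)
(
    pl.scan_csv(args['path'], sep=args['delimiter'])
    .select([args['column']])
    .with_columns(
        [
            pl.col(args['column']).cast(pl.Utf8).apply(parse_phone_number).alias(args['column']),
            pl.lit(args['list_id']).alias("list_id"),
        ]
    )
    .filter(pl.col(args['column']).is_not_null())
    .unique(keep="last")
    .sink_csv(args['saved_path'])
)

Would that be the right direction?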
I'm really new to Polars, I come from PHP, and I'm still pretty much a Python noob too, so sorry if my code looks a bit dumb haha.