1

This is what I'm trying to do.

  1. Scan the csv using Polars lazy dataframe
  2. Format the phone number using a function
  3. Remove nulls and duplicates
  4. Write the csv in a new file

Here is my code

import sys
import json
import polars as pl
import phonenumbers

#define the variable and parse the encoded json
args = json.loads(sys.argv[1])

#format phone number as E164
def parse_phone_number(phone_number):
    try:
        return phonenumbers.format_number(phonenumbers.parse(phone_number, "US"), phonenumbers.PhoneNumberFormat.E164)
    except phonenumbers.NumberParseException:
        pass
    return None

#scan the csv file do some filter and modify the data and then write the output to a new csv file
pl.scan_csv(args['path'], sep=args['delimiter']).select(
    [args['column']]
).with_columns(
    #convert the int phne number as string and apply the parse_phone_number function
    [pl.col(args['column']).cast(pl.Utf8).apply(parse_phone_number).alias(args['column']),
    #add another column list_id with value 100
    pl.lit(args['list_id']).alias("list_id")
    ]
).filter(
    #filter nulls
    pl.col(args['column']).is_not_null()
).unique(keep="last").collect().write_csv(args['saved_path'], sep=",")

I tested a file with 800k rows and 23 columns (150mb) and it takes around 20 seconds and more than 500mb ram then it completes the task.

Is this normal? Can I optimize the performance (the memory usage at least)?

I'm really new with Polars and I work with PHP and I'm very noob at python too, so sorry if my code looks bit dumb haha.

Sarwar Alam
  • 59
  • 2
  • 9
  • 1
    What does this have to do with dask? If nothing, please remove this tag. – Michael Delgado Feb 05 '23 at 06:26
  • 1
    You shouldn't expect calling `apply` to run any faster with polars then pandas or anything else. See the doc: `Implementing logic using a Python function is almost always _significantly_ slower and more memory intensive than implementing the same logic using the native expression API` https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.apply.html#polars.Expr.apply – 0x26res Feb 05 '23 at 10:08
  • @0x26res Can you give me a example how to implement the expression in my code? I just want to replace the phone number with formatted phone number – Sarwar Alam Feb 05 '23 at 12:04

2 Answers2

4

You are using an apply, which means you are effectively writing a python for loop. This often is 10-100x slower than using expressions.

Try to avoid apply. And if you do use apply, don't expect it to be fast.

P.S. you can reduce memory usage by not casting the whole column to Utf8, but instead cast inside your apply function. Though I don't think using 500MB is that high. Ideally polars uses as much RAM as available without going OOM. Unused RAM might be wasted potential.

ritchie46
  • 10,405
  • 1
  • 24
  • 43
0

Presumably you have something that looks like this...

df=pl.DataFrame({'column':[9345551234, 9945554321, 8005559876]})

and you want to end up with something that looks like

shape: (3, 1)
┌────────────────┐
│ phnum          │
│ ---            │
│ str            │
╞════════════════╡
│ (934) 555-1234 │
│ (994) 555-4321 │
│ (800) 555-9876 │
└────────────────┘

You can get this using the str.slice method

df.select(pl.col('column').cast(pl.Utf8())) \
    .select((pl.lit("(") + pl.col('column').str.slice(0,3) + 
            pl.lit(") ") + pl.col('column').str.slice(3,3) + 
            pl.lit("-")+pl.col('column').str.slice(6,4)).alias('phnum'))
Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
  • The phone number may be of another country, and may also be like +123456789 or +1 (934) 555-1234 or +1 934 555-1234 I want to format it automatically – Sarwar Alam Feb 06 '23 at 19:37
  • Well polars supports regex so you can expand the principle to whatever you need. You can go look at the package you're using's source and find the regex they're using and put it into polars expressions. – Dean MacGregor Feb 06 '23 at 20:24