1

Hi have a dataframe where I have a column called tags which is a json string.

I want to filter this dataframe on the tags column so it only contains rows where a certain tag key is present or where a tag has a particular value.

I guess I could do a string contains match but think it may be more robust to have the json convert into a dict first and using has_key etc ?

What would be the recommended way to do this in python polars ?

Thanks

Glenn Pierce
  • 720
  • 1
  • 6
  • 18

1 Answers1

5

Polars does not have a generic dictionary type. Instead, dictionaries are imported/mapped as structs. Each dictionary key is mapped to a struct 'field name', and the corresponding dictionary value becomes the value of this field.

However, there are some constraints for creating a Series of type struct. Two of them are:

  • all structs must have the same field names.
  • the field names must be listed in the same order.

In your description, you mention has_key, which indicates that the dictionaries will not have the same keys. As such, creating a column of struct from your dictionaries will not work. (For more information, you can see this Stack Overflow response.)

json_path_match

I suggest using json_path_match, which extracts values based on some simple JSONPath syntax. Using JSONPath syntax, you should be able to query whether a key exists, and retrieve it's value. (For simple unnested dictionaries, these are the same query.)

For example, let's start with this data:

import polars as pl

json_list = [
    """{"name": "Maria",
        "position": "developer",
        "office": "Seattle"}""",
    """{"name": "Josh",
        "position": "analyst",
        "termination_date": "2020-01-01"}""",
    """{"name": "Jorge",
        "position": "architect",
        "office": "",
        "manager_st_dt": "2020-01-01"}""",
]

df = pl.DataFrame(
    {
        "tags": json_list,
    }
).with_row_count("id", 1)
df
shape: (3, 2)
┌─────┬────────────────────┐
│ id  ┆ tags               │
│ --- ┆ ---                │
│ u32 ┆ str                │
╞═════╪════════════════════╡
│ 1   ┆ {"name": "Maria",  │
│     ┆         "posit...  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ {"name": "Josh",   │
│     ┆         "positi... │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ {"name": "Jorge",  │
│     ┆         "posit...  │
└─────┴────────────────────┘

To query for values:

df.with_columns([
    pl.col('tags').str.json_path_match(r"$.name").alias('name'),
    pl.col('tags').str.json_path_match(r"$.office").alias('location'),
    pl.col('tags').str.json_path_match(r"$.manager_st_dt").alias('manager start date'),
])
shape: (3, 5)
┌─────┬────────────────────┬───────┬──────────┬────────────────────┐
│ id  ┆ tags               ┆ name  ┆ location ┆ manager start date │
│ --- ┆ ---                ┆ ---   ┆ ---      ┆ ---                │
│ u32 ┆ str                ┆ str   ┆ str      ┆ str                │
╞═════╪════════════════════╪═══════╪══════════╪════════════════════╡
│ 1   ┆ {"name": "Maria",  ┆ Maria ┆ Seattle  ┆ null               │
│     ┆         "posit...  ┆       ┆          ┆                    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ {"name": "Josh",   ┆ Josh  ┆ null     ┆ null               │
│     ┆         "positi... ┆       ┆          ┆                    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ {"name": "Jorge",  ┆ Jorge ┆          ┆ 2020-01-01         │
│     ┆         "posit...  ┆       ┆          ┆                    │
└─────┴────────────────────┴───────┴──────────┴────────────────────┘

Notice the null values. This is the return value when a key is not found. We'll use this fact for the has_key functionality you mentioned.

Also, if we look at the "location" column, you'll see that json_path_match does distinguish between an empty string "office":"" and a key not found..

To filter for the presence of a key, we simply filter for null values.

df.filter(
    pl.col('tags').str.json_path_match(r"$.manager_st_dt").is_not_null()
)
shape: (1, 2)
┌─────┬───────────────────┐
│ id  ┆ tags              │
│ --- ┆ ---               │
│ u32 ┆ str               │
╞═════╪═══════════════════╡
│ 3   ┆ {"name": "Jorge", │
│     ┆         "posit... │
└─────┴───────────────────┘

The json_path_match will also work with nested structures. (See the Syntax page for details.)

One limitation, however: json_path_match will only return the first match for a query, rather than a list of matches. If your JSON strings are not lists or nested dictionaries, this won't be a problem.

  • I should mention: if your dictionaries have mostly all the same keys, then you might want to try this workaround using Pandas: https://stackoverflow.com/a/72385859/18559875. But if the keys significantly different between dictionaries, this may create a very large struct. –  Jul 26 '22 at 20:16