Just an update: I was finally able to come up with a fully working non-regex approach to this problem. It took me this long because it required some careful thought: about two days of intermittent work to piece it all together and to fully wrap my head around what I was trying to accomplish.
The regex solution put forth by @Wiktor is the accepted answer for now, and it works really well in general. Going back later, I found only a few edge cases that it wasn't able to handle, which I go over below. However, there were a couple of reasons I wondered whether a non-regex solution might be the better choice:
My actual use case is that I'm building a library (package), so I want to cut down on dependencies where possible. The big drawback is that the regex module is an external dependency, and not a negligible one in size either; in my case, I would probably need to add it as an optional extra for my library.
The regex matching seems to be not as fast or efficient as I had hoped. Don't get me wrong, it's still incredibly fast for matching the complex use cases mentioned in the post (about 1-3ms on average), but given a lot of annotations for a class, I could see this quickly adding up. I therefore suspected that a non-regex approach would be faster, and was curious to test that out.
Therefore, I am posting below the non-regex implementation that I was able to cobble together. It solves my original problem of converting union type annotations such as X|Y into annotations like Union[X, Y], and also supports more complex use cases that the regex implementation does not account for. I still prefer the regex version because it is vastly simpler than this, and I believe that for the majority of cases it will work without issue.
Note, however, that this is the first and only non-regex implementation I have put together for this specific problem. Without further ado, here goes:
from typing import Iterable, Dict, List


# Constants
OPEN_BRACKET = '['
CLOSE_BRACKET = ']'
COMMA = ','
OR = '|'


def repl_or_with_union(s: str):
    """
    Replace all occurrences of PEP 604-style annotations (i.e. like `X | Y`)
    with the Union type from the `typing` module, i.e. like `Union[X, Y]`.

    This is a recursive function that splits a complex annotation in order to
    traverse and parse it, i.e. one that is declared as follows:

      dict[str | Optional[int], list[list[str] | tuple[int | bool] | None]]
    """
    return _repl_or_with_union_inner(s.replace(' ', ''))


def _repl_or_with_union_inner(s: str):

    # If there is no '|' character in the annotation part, we just return it.
    if OR not in s:
        return s

    # Checking for brackets like `List[int | str]`.
    if OPEN_BRACKET in s:

        # Get any indices of COMMA or OR outside a bracketed expression.
        indices = _outer_comma_and_pipe_indices(s)

        outer_commas = indices[COMMA]
        outer_pipes = indices[OR]

        # We need to check if there are any commas *outside* a bracketed
        # expression. For example, the following cases are what we're looking
        # for here:
        #     value[test], dict[str | int, tuple[bool, str]]
        #     dict[str | int, str], value[test]
        # But we want to ignore cases like these, where all commas are nested
        # within a bracketed expression:
        #     dict[str | int, Union[int, str]]
        if outer_commas:
            return COMMA.join(
                [_repl_or_with_union_inner(i)
                 for i in _sub_strings(s, outer_commas)])

        # We need to check if there are any pipes *outside* a bracketed
        # expression. For example:
        #     value | dict[str | int, list[int | str]]
        #     dict[str, tuple[int | str]] | value
        # But we want to ignore cases like these, where all pipes are
        # nested within a bracketed expression:
        #     dict[str | int, list[int | str]]
        if outer_pipes:
            or_parts = [_repl_or_with_union_inner(i)
                        for i in _sub_strings(s, outer_pipes)]

            return f'Union{OPEN_BRACKET}{COMMA.join(or_parts)}{CLOSE_BRACKET}'

        # At this point, we know that the annotation does not have an outer
        # COMMA or PIPE expression. We also know that the following syntax
        # is invalid: `SomeType[str][bool]`. Therefore, knowing this, we can
        # assume there is only one outer start and end bracket. For example,
        # like `SomeType[str | int, list[dict[str, int | bool]]]`.
        first_start_bracket = s.index(OPEN_BRACKET)
        last_end_bracket = s.rindex(CLOSE_BRACKET)

        # Replace the value enclosed in the outermost brackets
        bracketed_val = _repl_or_with_union_inner(
            s[first_start_bracket + 1:last_end_bracket])

        start_val = s[:first_start_bracket]
        end_val = s[last_end_bracket + 1:]

        return f'{start_val}{OPEN_BRACKET}{bracketed_val}{CLOSE_BRACKET}{end_val}'

    elif COMMA in s:
        # We are dealing with a string like `int | str, float | None`
        return COMMA.join([_repl_or_with_union_inner(i)
                           for i in s.split(COMMA)])

    # We are dealing with a string like `int | str`
    return f'Union{OPEN_BRACKET}{s.replace(OR, COMMA)}{CLOSE_BRACKET}'


def _sub_strings(s: str, split_indices: Iterable[int]):
    """Split a string on the specified indices, and return the split parts."""
    prev = -1

    for idx in split_indices:
        yield s[prev+1:idx]
        prev = idx

    yield s[prev+1:]


def _outer_comma_and_pipe_indices(s: str) -> Dict[str, List[int]]:
    """Return any indices of ',' and '|' that are outside of brackets."""
    indices = {OR: [], COMMA: []}
    brace_dict = {OPEN_BRACKET: 1, CLOSE_BRACKET: -1}
    brace_count = 0

    for i, char in enumerate(s):
        if char in brace_dict:
            brace_count += brace_dict[char]
        elif not brace_count and char in indices:
            indices[char].append(i)

    return indices
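The piece that does most of the work here is the bracket-depth counter: a separator only counts when the depth is zero. Below is a stripped-down sketch of just that idea, for illustration. The helper name `split_top_level` is hypothetical and is not part of my implementation; it simply combines the index search and the splitting into one pass.

```python
from typing import List


def split_top_level(s: str, sep: str) -> List[str]:
    """Split `s` on `sep`, ignoring any separators nested inside [...]."""
    parts = []
    depth = 0   # how many '[' are currently open
    start = 0   # start index of the part being collected

    for i, ch in enumerate(s):
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
        elif ch == sep and depth == 0:
            # Only a separator at depth zero ends the current part.
            parts.append(s[start:i])
            start = i + 1

    parts.append(s[start:])
    return parts


print(split_top_level('dict[str|int,str]|value[test]', '|'))
print(split_top_level('dict[str|int,Union[int,str]]', ','))
```

The first call splits into two parts at the single outer pipe, while the second returns the input as one part, because every comma in it sits inside brackets.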
I've tested this implementation against the common use cases listed in the question above, as well as more complex use cases that even the regex implementation seemed to struggle with.
For example, given these sample test cases:
test_cases = """
str|int|bool
Optional[int|tuple[str|int]]
dict[str | int, list[B | C | Optional[D]]]
dict[str | Optional[int], list[list[str] | tuple[int | bool] | None]]
tuple[str|OtherType[a,b|c,d], ...] | SomeType[str | int, list[dict[str, int | bool]]] | dict[str | int, str]
"""
for line in test_cases.strip().split('\n'):
    print(repl_or_with_union(line).replace(',', ', '))
Then the result is as below (note that I've replaced each ',' with ', ' so it's a bit easier to read):
Union[str, int, bool]
Optional[Union[int, tuple[Union[str, int]]]]
dict[Union[str, int], list[Union[B, C, Optional[D]]]]
dict[Union[str, Optional[int]], list[Union[list[str], tuple[Union[int, bool]], None]]]
Union[tuple[Union[str, OtherType[a, Union[b, c], d]], ...], SomeType[Union[str, int], list[dict[str, Union[int, bool]]]], dict[Union[str, int], str]]
Now the only ones that the regex implementation wasn't able to parse correctly were the last two cases, which are arguably pretty complex to begin with. Here is the regex solution's output for those two, which unfortunately isn't what we'd want (again, I've added a space after each comma so it's a bit easier to read):
dict[Union[str, Optional][int], list[Union[list[str], tuple[Union[int, bool]], None]]]
tuple[Union[str, OtherType][a, Union[b, c], d], ...] | SomeType[Union[str, int], list[dict[str, Union[int, bool]]]] | dict[Union[str, int], str]
It's worth going over why those cases weren't handled as expected by the regex version. My suspicion, which I confirmed after testing, is that any operand of a | expression that itself contains brackets [] does not parse correctly. For example, str | Optional[int] currently parses as Union[str,Optional][int], but ideally it would come out as Union[str,Optional[int]].
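To make that failure mode concrete, here is a small sketch that applies only the last substitution step of the regex solution. Two assumptions on my part: it uses the standard-library `re` module rather than `regex` (this particular pattern uses no `regex`-only features), and the function name `naive_final_pass` is hypothetical. Since `\w+` cannot match brackets, the match stops at `Optional` and leaves `[int]` dangling outside the `Union`.

```python
import re


def naive_final_pass(text: str) -> str:
    """Wrap each run of `\\w+ | \\w+ ...` in Union[...], ignoring brackets."""
    return re.sub(
        r'\w+(?:\s*\|\s*\w+)+',
        lambda m: 'Union[{}]'.format(re.sub(r'\s*\|\s*', ',', m.group())),
        text,
    )


print(naive_final_pass('str | int'))            # Union[str,int]
print(naive_final_pass('str | Optional[int]'))  # Union[str,Optional][int]
```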
I've boiled the two test cases above down to the abbreviated forms below, and confirmed that the regex doesn't handle them as expected:
str | Optional[int]
tuple[str|OtherType[a,b|c,d], ...] | SomeType[str]
When parsing via the regex implementation, these are the current results. Note that in one of the results the | character still appears; ideally we would strip that out, since Python versions earlier than 3.10 can't evaluate a pipe | expression against builtin types.
Union[str,Optional][int]
tuple[Union[str,OtherType][a,Union[b,c],d], ...] | SomeType[str]
The desired end result (which the non-regex approach produces as expected, after I fixed it to handle such cases during testing) is as follows:
Union[str, Optional[int]]
Union[tuple[Union[str,OtherType[a,Union[b,c],d]], ...], SomeType[str]]
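On the point about older Python versions: once the pipes are gone, the converted string can be resolved with nothing but the `typing` module. The sketch below assumes the annotation is available as a string and that evaluating it with `eval` is acceptable for your use case; the `namespace` dict is a hypothetical stand-in for however your code looks up names.

```python
from typing import Optional, Union

# Names the annotation string may reference (an assumption; a real library
# would more likely resolve these from the defining module's globals).
namespace = {'Union': Union, 'Optional': Optional}

converted = 'Union[str,Optional[int]]'
resolved = eval(converted, namespace)

# typing flattens nested unions, so this is the same as Union[str, int, None].
print(resolved == Union[str, int, None])
```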
Lastly, I've also timed it against the regex approach above. I was curious how this solution would fare against the regex version, which is arguably much simpler and easier to understand.
The code I tested with is given below:
from timeit import timeit

import regex  # third-party module: pip install regex


def regex_repl_or_with_union(text):
    rx = r"(\w+\[)(\w+(\[(?:[^][|]++|(?3))*])?(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+)]"
    n = 1
    res = text
    while n != 0:
        res, n = regex.subn(rx, lambda x: "{}Union[{}]]".format(x.group(1), regex.sub(r'\s*\|\s*', ',', x.group(2))),
                            res)

    return regex.sub(r'\w+(?:\s*\|\s*\w+)+', lambda z: "Union[{}]".format(regex.sub(r'\s*\|\s*', ',', z.group())), res)


test_cases = """
str|int|bool
Optional[int|tuple[str|int]]
dict[str | int, list[B | C | Optional[D]]]
"""


def non_regex_solution():
    for line in test_cases.strip().split('\n'):
        _ = repl_or_with_union(line)


def regex_solution():
    for line in test_cases.strip().split('\n'):
        _ = regex_repl_or_with_union(line)


n = 100_000

print('Non-regex:  ', timeit('non_regex_solution()', globals=globals(), number=n))
print('Regex:      ', timeit('regex_solution()', globals=globals(), number=n))
The results, run on an Alienware PC (AMD Ryzen 7 3700X 8-core processor with 16 GB of memory):
Non-regex: 2.0510589000186883
Regex: 31.39290289999917
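For reference, here is the arithmetic behind the comparison, using only the two totals printed above, the run count and the three timed test cases. The variable names are just for illustration.

```python
non_regex_total = 2.0510589000186883  # seconds, from the run above
regex_total = 31.39290289999917       # seconds, from the run above
runs = 100_000                        # the `number=n` passed to timeit
cases_per_run = 3                     # lines in the timed `test_cases`

speedup = regex_total / non_regex_total
calls = runs * cases_per_run          # total annotations converted per run

print(f'speedup:   {speedup:.1f}x')
print(f'non-regex: {non_regex_total / calls * 1e6:.1f} us per annotation')
print(f'regex:     {regex_total / calls * 1e6:.1f} us per annotation')
```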
So, the non-regex implementation I came up with turned out to be about 15x faster on average than the regex implementation, which was hard to believe. The best news for me is that it doesn't involve additional dependencies. I will likely move forward with the non-regex solution for now, mainly because I would like to cut down on project dependencies where possible. Many thanks again to @Wiktor and all those who helped out with this problem and helped steer me towards a solution!