I am trying to refactor some Python modules that contain complex list comprehensions, which can be single-line or multiline. An example of such a list comprehension is:

some_list = [y(x) for x in some_complex_expression if x != 2]

I tried to use the following regex pattern in PyCharm but this matches simple lists as well:

\[.+\]

Is there a way to not match simple lists and perhaps also match list comprehensions that are multiline? I am okay with solutions other than regex as well.

PythonForEver
  • Unsure but does a list-comprehension not always contain both keywords ' for ' and ' in ' in this order? If so, maybe start exploiting this? – JvdV Aug 17 '22 at 07:04
  • Without knowing what sort of content your files may or may not have, this seems too open ended question. E.g. could your file contain `some_annoying_string = "[y(x) for x in items if x != 2]"` which you don't want to detect? – Julien Aug 17 '22 at 07:08
  • Above all this sounds like an XY problem. What is your *real* end goal? – Julien Aug 17 '22 at 07:11
  • @Julien It doesn't contain such cases. – PythonForEver Aug 17 '22 at 07:16
  • @JvdV thanks! I think i understand how to do it now. – PythonForEver Aug 17 '22 at 07:17
  • @Julien my goal is to find complex list comprehensions and refactor them to something simple. – PythonForEver Aug 17 '22 at 07:22
  • On top of the scenario suggested by @Julien , you should also consider taking into account list comprehensions in an f-string, e.g. `f'{[i for i in range(3)]}'`, which you likely want to include. – blhsing Aug 17 '22 at 07:23
  • You should also take into account list comprehensions that span over multiple lines, since the ending bracket does not have to be on the same line as the opening bracket. – blhsing Aug 17 '22 at 07:25
  • "find complex list comprehensions and refactor them to something simple" is too vague. What defines "complex" and "simple". At least give a [mre]. E.g. the example of list comprehension you give is far from "complex" by any standard... If you want to refactor *all* list comprehensions, then why using python in the first place? – Julien Aug 17 '22 at 07:31
  • @blhsing matching multiline comprehensions looks too hard. I dont know how to do it. – PythonForEver Aug 17 '22 at 07:40
  • @Julien I intentionally left my example simple, but imagine more complex logic inside it. In other words, I want to make code more readable. Do you think I should change my minimal example to reflect this comment's details? – PythonForEver Aug 17 '22 at 07:42
  • Regex is simply not the right tool to handle the full-blown syntax of a programming language. You should use a Python parser instead. I've added an answer to demonstrate that. – blhsing Aug 17 '22 at 08:33
  • This question is being discussed on [meta](https://meta.stackoverflow.com/questions/420050/how-make-my-question-more-clear#420050) – Thom A Aug 25 '22 at 14:44
  • This is already closed but here's a duplicate target: [How to find list comprehension in python code](https://stackoverflow.com/questions/35149906/how-to-find-list-comprehension-in-python-code) – Abdul Aziz Barkat Aug 25 '22 at 17:13

2 Answers

Regex is not designed to handle the structured syntax of a programming language. However carefully you write the pattern, there will almost certainly be corner cases it cannot handle, as the comments above suggest.

A proper Python parser should be used instead to identify list comprehensions according to the language specification. Fortunately, Python's standard library includes a set of modules for parsing and navigating Python code in various ways.

In your case, you can use the ast module to parse the code into an abstract syntax tree, walk through the AST with ast.walk, identify list comprehensions by the ListComp nodes, and output the lines of those nodes along with their line numbers.
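As a minimal sketch of that idea, assuming the source is already available as a string: the names `source` and `find_list_comps` below are hypothetical and exist only for illustration, and `ast.get_source_segment` requires Python 3.8 or later.

```python
import ast

# Hypothetical sample source, used only for illustration
source = '''\
squares = [n * n for n in range(5)]
plain = [1, 2, 3]
text = "[n for n in range(5)]"
'''

def find_list_comps(code):
    """Return (line number, source text) for every ListComp node in code."""
    tree = ast.parse(code)
    found = [
        (node.lineno, ast.get_source_segment(code, node))
        for node in ast.walk(tree)
        if isinstance(node, ast.ListComp)
    ]
    # ast.walk does not promise any particular order, so sort by line
    return sorted(found)

for lineno, segment in find_list_comps(source):
    print(lineno, segment)
```

Only the first assignment is reported here: the plain list is an `ast.List` node rather than an `ast.ListComp`, and the string literal is never parsed as code.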

Since list comprehensions can be nested, you'd want to avoid outputting the inner list comprehensions when the outer ones are already printed. This can be done by keeping track of the last line number sent to the output and only printing line numbers greater than the last line number.

For example, with the following code:

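import ast

with open('file.py') as file:
    lines = file.readlines()

# ast.walk yields nodes in no specified order, so sort the list
# comprehensions by position before comparing line numbers
comps = sorted(
    (node for node in ast.walk(ast.parse(''.join(lines)))
     if isinstance(node, ast.ListComp)),
    key=lambda node: (node.lineno, node.col_offset),
)

last_lineno = 0  # last line number already printed
for node in comps:
    # end_lineno is available on Python 3.8+
    for lineno in range(node.lineno, node.end_lineno + 1):
        # skip lines already printed for an enclosing comprehension
        if lineno > last_lineno:
            print(lineno, lines[lineno - 1], sep='\t', end='')
            last_lineno = lineno
    print()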

and the following content of file.py:

a = [(i + 1) * 2 for i in range(3)]
b = '[(i + 1) * 2 for i in range(3)]'
c = [
    i * 2
    for i in range(3)
    if i
]
# d = [(i + 1) * 2 for i in range(3)]
e = [
    [(i + 1) * 2 for i in range(j)]
    for j in range(3)
]

the code would output:

1   a = [(i + 1) * 2 for i in range(3)]

3   c = [
4       i * 2
5       for i in range(3)
6       if i
7   ]

9   e = [
10      [(i + 1) * 2 for i in range(j)]
11      for j in range(3)
12  ]

because b is assigned a string, and the assignment of d is commented out.

Demo: https://replit.com/@blhsing/StimulatingCrimsonProgramminglanguage#main.py
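As an alternative to tracking the last printed line number, nested comprehensions can be skipped by simply not descending into a `ListComp` node once it has been found. The following is only a sketch of that approach, not part of the demo above; `OuterListCompFinder`, `outer_list_comp_ranges`, and `sample` are hypothetical names, and `end_lineno` assumes Python 3.8 or later.

```python
import ast

class OuterListCompFinder(ast.NodeVisitor):
    """Collect (first line, last line) of outermost list comprehensions."""

    def __init__(self):
        self.ranges = []

    def visit_ListComp(self, node):
        self.ranges.append((node.lineno, node.end_lineno))
        # generic_visit is deliberately not called here, so any
        # comprehension nested inside this one is never visited.

def outer_list_comp_ranges(code):
    finder = OuterListCompFinder()
    finder.visit(ast.parse(code))
    return sorted(finder.ranges)

# Hypothetical sample, shortened from the file.py shown earlier
sample = '''\
a = [(i + 1) * 2 for i in range(3)]
b = '[(i + 1) * 2 for i in range(3)]'
e = [
    [(i + 1) * 2 for i in range(j)]
    for j in range(3)
]
'''

for first, last in outer_list_comp_ranges(sample):
    print(f'lines {first}-{last}')
```

Returning line ranges instead of printing directly also makes it easy to filter afterwards, for instance keeping only comprehensions that span more than one line.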

blhsing

To match the above example without matching simple lists you can use:

\[.+ for .+ in .+\]
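A quick way to see what this pattern does and does not catch is to run it with Python's `re` module. In the sketch below, `single_line` is the pattern above, while `multi_line` is an assumed variant (not from this answer) that uses `\s` and a bracket-excluding character class so a match can cross line breaks; the `code` sample is hypothetical.

```python
import re

# Pattern from this answer; '.' does not match newlines by default
single_line = re.compile(r'\[.+ for .+ in .+\]')

# Assumed multiline variant: [^\[\]] matches any character except
# brackets (newlines included), and \s allows a line break around
# the keywords. It misses comprehensions that contain brackets.
multi_line = re.compile(r'\[[^\[\]]+?\sfor\s[^\[\]]+?\sin\s[^\[\]]+?\]')

code = '''\
plain = [1, 2, 3]
comp = [y(x) for x in items if x != 2]
text = "[y(x) for x in items]"
multi = [
    y(x)
    for x in items
]
'''

for name, pattern in (('single_line', single_line), ('multi_line', multi_line)):
    matches = pattern.findall(code)
    print(name, len(matches), 'match(es)')
    for match in matches:
        print(repr(match))
```

Both patterns skip the plain list, but both also match the comprehension-like text inside the string literal, and the multiline variant cannot handle something like `[a[i] for i in r]`. That is acceptable only if, as stated in the comments, the code base contains no such cases.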

Thanks, JvdV! (this answer is based on his tips)

PythonForEver