I am seeing some very weird behavior: I can't step into, or set a breakpoint in, some of the ElementTree classes. I started with the code below:
from xml.etree import ElementTree as ET
print(f"{ET.__file__}")
et = ET.parse("/tmp/pom.xml")
print(et)
I got the below output:
/usr/local/Cellar/python@3.9/3.9.1_3/Frameworks/Python.framework/Versions/3.9/lib/python3.9/xml/etree/ElementTree.py
<xml.etree.ElementTree.ElementTree object at 0x10fef9a30>
So, I opened the ElementTree.py file and put a breakpoint in XMLParser.__init__ (here), but the breakpoint didn't get hit.
class XMLParser:
    ...
    def __init__(self, *, target=None, encoding=None):
        import pdb; pdb.set_trace()
        try:
            from xml.parsers import expat
        except ImportError:
Then I added a breakpoint in ElementTree.parse (here):
def parse(self, source, parser=None):
    ...
    close_source = False
    if not hasattr(source, "read"):
        source = open(source, "rb")
        close_source = True
    try:
        if parser is None:
            # If no parser was specified, create a default XMLParser
            import pdb; pdb.set_trace()
            parser = XMLParser()
            if hasattr(parser, '_parse_whole'):
I did get a pdb prompt, but when I tried to step into XMLParser, it went straight to the next line. I even checked that it refers to the same local class (not some native implementation):
(Pdb) import inspect
(Pdb) inspect.getmodule(XMLParser)
<module 'xml.etree.ElementTree' from '/usr/local/Cellar/python@3.9/3.9.1_3/Frameworks/Python.framework/Versions/3.9/lib/python3.9/xml/etree/ElementTree.py'>
The reason I am doing this is to figure out why the overridden _start and _end methods are not getting invoked for my custom class that extends XMLParser. I am instantiating the parser something like this (derived from here, with _start_list changed to _start):
class LineNumberingParser(ET.XMLParser):
    def _start(self, *args, **kwargs):
        element = super()._start(*args, **kwargs)
        element._start_line_number = self.parser.CurrentLineNumber
        print(f"----- {element.tag} {element._start_line_number}")
        return element

    def _end(self, *args, **kwargs):
        element = super()._end(*args, **kwargs)
        element._end_line_number = self.parser.CurrentLineNumber
        print(f"----- {element.tag} {element._end_line_number}")
        return element

parser = LineNumberingParser(target=ET.TreeBuilder(insert_comments=True))
et = ET.parse("/tmp/pom.xml", parser)
I even tried adding a constructor to LineNumberingParser and stepping into the super constructor, but I got the same behavior as before, though I can see that the self instance gets initialized properly (e.g., self.target is None before the super.__init__ call but initialized after).
What am I missing here?
Update 1: I put some print statements in very obvious places in XMLParser (like __init__ and _start) and got no output, so it seems like it is using a different implementation, though inspect.getmodule says otherwise.
Update 2: I just noticed the below at the end of the module:
# Import the C accelerators
try:
    # Element is going to be shadowed by the C implementation. We need to keep
    # the Python version of it accessible for some "creative" by external code
    # (see tests)
    _Element_Py = Element

    # Element, SubElement, ParseError, TreeBuilder, XMLParser, _set_factories
    from _elementtree import *
    from _elementtree import _set_factories
except ImportError:
    pass
else:
    _set_factories(Comment, ProcessingInstruction)
So I guess it was indeed a native C implementation, and that is why pdb wasn't stepping in and my breakpoints and print statements in ElementTree.py had no effect (the answer to my original question). Now I am back to square one on finding a solution for line numbers.
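For anyone hitting the same thing, here is one way to confirm which implementation is actually in use. This is only a minimal sketch, assuming CPython, where the accelerator module is named `_elementtree` as in the import above; `is_c_accelerated` is a hypothetical helper name of my own, not something from the standard library. The C class seems to report `xml.etree.ElementTree` as its module, which would be consistent with inspect.getmodule pointing at ElementTree.py, so comparing object identity against `_elementtree` looks like a more reliable check.

```python
import types
from xml.etree import ElementTree as ET


def is_c_accelerated(cls):
    """Return True if cls is the object exported by the _elementtree C extension.

    Hypothetical helper for illustration; not part of the standard library.
    """
    try:
        import _elementtree  # C accelerator (may be missing on some builds)
    except ImportError:
        return False
    return getattr(_elementtree, cls.__name__, None) is cls


# True when XMLParser has been shadowed by the C implementation.
print(is_c_accelerated(ET.XMLParser))

# Cross-check: a pure-Python __init__ is a plain function object,
# while a C type exposes a slot wrapper instead.
print(isinstance(ET.XMLParser.__init__, types.FunctionType))
```

If the first value is True, breakpoints and print statements placed in the Python source of XMLParser are never reached, because that class definition has been replaced at import time.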
Update 3: I found the code used in the test module to skip the native module, and _start and _end do get called with the Python implementations, but there are some significant differences in the write code path.
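Roughly, the idea can be sketched without the test helpers as follows. This is my own approximation under stated assumptions, not the code from the test module: it blocks `_elementtree` by putting `None` in `sys.modules` (which makes the import raise ImportError), re-imports the module so the pure-Python classes survive, and then restores the normal module. `import_pure_python_elementtree` is a hypothetical name, and `_start`, `_end` and `self.parser` are private details of the pure-Python XMLParser that may change between versions.

```python
import importlib
import sys
import xml.etree
import xml.etree.ElementTree as default_ET  # make sure the normal module is cached


def import_pure_python_elementtree():
    """Return a fresh copy of xml.etree.ElementTree without the C accelerator."""
    name = "xml.etree.ElementTree"
    saved_accel = sys.modules.get("_elementtree")
    sys.modules.pop(name, None)
    sys.modules["_elementtree"] = None  # `import _elementtree` now raises ImportError
    try:
        return importlib.import_module(name)
    finally:
        # Put everything back so the rest of the program is unaffected.
        sys.modules[name] = default_ET
        xml.etree.ElementTree = default_ET
        if saved_accel is None:
            del sys.modules["_elementtree"]
        else:
            sys.modules["_elementtree"] = saved_accel


pure_ET = import_pure_python_elementtree()


class LineNumberingParser(pure_ET.XMLParser):
    # These overrides are only called by the pure-Python XMLParser.
    def _start(self, *args, **kwargs):
        element = super()._start(*args, **kwargs)
        element._start_line_number = self.parser.CurrentLineNumber
        return element

    def _end(self, *args, **kwargs):
        element = super()._end(*args, **kwargs)
        element._end_line_number = self.parser.CurrentLineNumber
        return element


root = pure_ET.fromstring("<a>\n  <b/>\n</a>", parser=LineNumberingParser())
lines = [(el.tag, el._start_line_number, el._end_line_number) for el in root.iter()]
print(lines)
```

The elements produced this way are the pure-Python Element class, so they accept the extra attributes, but they come from a separate module object than the default `xml.etree.ElementTree`, and mixing the two is probably best avoided.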