I am interested in parsing the typical output of a website crawler using Lark. Here is some sample output based on my own GitHub website:

--------------------------------------------------------------------
All found URLs:
https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
https://awa5114.github.io/2021/01/12/mypy-pycharm.html
https://awa5114.github.io/2021/01/12/#step-6-test-the-template-on-a-script
--------------------------------------------------------------------
All local URLs:
https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
https://awa5114.github.io/2021/01/12/mypy-pycharm.html
--------------------------------------------------------------------
All foreign URLs:
https://github.com/awa5114
https://github.com/jekyll/jekyll
https://github.com/jekyll/minima
--------------------------------------------------------------------
All broken URLs:

I am using the following grammar:

start: section~4
section: (bar  "All " descriptor " URLs:"  link_list)
link_list: (url)*
descriptor: "found" | "local" | "foreign" | "broken"
url: /.+/
bar: /-{68}/

%import common.NEWLINE
%ignore NEWLINE

Calling pretty on the resulting tree results in the following:

start
  section
    bar --------------------------------------------------------------------
    descriptor
    link_list
      url   https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
      url   https://awa5114.github.io/2021/01/12/mypy-pycharm.html
      url   https://awa5114.github.io/2021/01/12/#step-6-test-the-template-on-a-script
  section
    bar --------------------------------------------------------------------
    descriptor
    link_list
      url   https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
      url   https://awa5114.github.io/2021/01/12/mypy-pycharm.html
  section
    bar --------------------------------------------------------------------
    descriptor
    link_list
      url   https://github.com/awa5114
      url   https://github.com/jekyll/jekyll
      url   https://github.com/jekyll/minima
  section
    bar --------------------------------------------------------------------
    descriptor
    link_list

This looks all right, but I would like to keep the bar terminal out of my tree. How can I achieve this? I looked through the docs and tried prefixing bar with an underscore and/or a question mark, but that does not help.

user32882

1 Answer

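I actually found it just now. Prefixing bar with an underscore is not enough on its own: the name also has to be uppercase, so that it defines a terminal rather than a rule. Lark filters terminals whose names start with an underscore out of the tree, whereas a rule whose name starts with an underscore is only inlined into its parent, so the matched token still shows up. The grammar becomes: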

start: section~4
section: (_BAR  "All " descriptor " URLs:"  link_list)
link_list: (url)*
descriptor: "found" | "local" | "foreign" | "broken"
url: /.+/
_BAR: /-{68}/

%import common.NEWLINE
%ignore NEWLINE

Which results in the following tree:

start
  section
    descriptor
    link_list
      url   https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
      url   https://awa5114.github.io/2021/01/12/mypy-pycharm.html
      url   https://awa5114.github.io/2021/01/12/#step-6-test-the-template-on-a-script
  section
    descriptor
    link_list
      url   https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
      url   https://awa5114.github.io/2021/01/12/mypy-pycharm.html
  section
    descriptor
    link_list
      url   https://github.com/awa5114
      url   https://github.com/jekyll/jekyll
      url   https://github.com/jekyll/minima
  section
    descriptor
    link_list

It would be nice if this were made clear in the lark-parser docs...

user32882