I am trying to generate a list of short descriptions of RFCs by parsing the IETF RFC index. I am hoping for a command along the lines of curl https://www.ietf.org/download/rfc-index.txt | sed 'magic' | awk 'more magic' | cut -f ?

The unparsed output of curl https://www.ietf.org/download/rfc-index.txt looks like:

6708 Application-Layer Traffic Optimization (ALTO) Requirements. S.
      Kiesel, Ed., S. Previdi, M. Stiemerling, R. Woundy, Y. Yang.
      September 2012. (Format: TXT, HTML) (Status: INFORMATIONAL) (DOI:
      10.17487/RFC6708) 

6709 Design Considerations for Protocol Extensions. B. Carpenter, B.
     Aboba, Ed., S. Cheshire. September 2012. (Format: TXT, HTML)
     (Status: INFORMATIONAL) (DOI: 10.17487/RFC6709) 

6710 Simple Mail Transfer Protocol Extension for Message Transfer
     Priorities. A. Melnikov, K. Carlberg. August 2012. (Format: TXT,
     HTML) (Status: PROPOSED STANDARD) (DOI: 10.17487/RFC6710) 

6711 An IANA Registry for Level of Assurance (LoA) Profiles. L.
     Johansson. August 2012. (Format: TXT, HTML) (Status: INFORMATIONAL)
     (DOI: 10.17487/RFC6711) 

I would like output that cuts each entry off at the "Month Year" date, dropping the date and all the notes after it:

6708 Application-Layer Traffic Optimization (ALTO) Requirements. S.
      Kiesel, Ed., S. Previdi, M. Stiemerling, R. Woundy, Y. Yang.

6709 Design Considerations for Protocol Extensions. B. Carpenter, B.
     Aboba, Ed., S. Cheshire.

6710 Simple Mail Transfer Protocol Extension for Message Transfer
     Priorities. A. Melnikov, K. Carlberg. 

6711 An IANA Registry for Level of Assurance (LoA) Profiles. L.
     Johansson.
Lenna
  • Welcome to Stack Overflow. SO is a question and answer page for professional and enthusiastic programmers. Add your own code to your question. You are expected to show at least the amount of research you have put into solving this question yourself. – Cyrus Apr 19 '20 at 20:16
  • I recommend you use structured data instead of text. There is an [XML file](https://www.rfc-editor.org/rfc-index.xml) containing all the RFC metadata, and you could use something like [XMLStarlet](http://xmlstar.sourceforge.net/) to extract the data you want. – Benjamin W. Apr 19 '20 at 21:00

2 Answers

If the structure of all the entries is as consistent as you show, you don't even need to explicitly match year or month, but you can rely on how all the parts that you want to remove are delimited.

The following command works on your input:

sed -zE 's/[^.]+\.[ \n]+\([^)]+\)[ \n]+\([^)]+\)[ \n]+\([^)]+\)//g' yourfile

Essentially, it matches the last (and only) three parenthesized texts (\([^)]+\)), together with the last period-terminated string ([^.]+\.) that precedes them. It allows these constituents to be separated by spaces and/or newlines ([ \n]+).

In addition, the -z option makes sed split its input on NUL characters instead of newlines, so a text file without NULs is treated as a single line and the pattern can match across line breaks. -E enables extended regular expressions, so + can be written instead of \+ to mean "one or more" (at the price of having to write \( and \) to match literal parentheses).
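
To make the behaviour concrete, here is a minimal sketch that runs the command above on two entries from the question (trailing spaces omitted). It assumes GNU sed, since -z is a GNU option; the sample variable and the strip_notes function name are only illustrative.

```shell
# Sample input: two entries taken from the question.
sample='6709 Design Considerations for Protocol Extensions. B. Carpenter, B.
     Aboba, Ed., S. Cheshire. September 2012. (Format: TXT, HTML)
     (Status: INFORMATIONAL) (DOI: 10.17487/RFC6709)

6711 An IANA Registry for Level of Assurance (LoA) Profiles. L.
     Johansson. August 2012. (Format: TXT, HTML) (Status: INFORMATIONAL)
     (DOI: 10.17487/RFC6711)
'

# strip_notes is a hypothetical helper name wrapping the command above.
# It removes the last period-terminated string (the date) and the three
# parenthesized groups that follow it, across line breaks.
strip_notes() {
  sed -zE 's/[^.]+\.[ \n]+\([^)]+\)[ \n]+\([^)]+\)[ \n]+\([^)]+\)//g'
}

printf '%s' "$sample" | strip_notes
```

On this sample the first entry ends at "S. Cheshire." and the second at "Johansson.". An entry with a different number of parenthesized groups would not match this pattern and would be left as it is.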

Enlico

This uses a single sed command:

sed -r 's/^(.*)(January|February|March|April|May|June|July|August|September|October|November|December) [[:digit:]]{4}(.*)$/\1/'

Just pipe the output of curl into it.

Some details:

  • -r: use extended regular expressions.
  • Capture the text before "$month $year" in the first group (groups are delimited by parentheses).
  • Capture "$month $year" in the second group.
  • Capture the rest of the line in the third group.
  • Output only the first group (\1).
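
As a small illustration of the groups described above, the sketch below feeds one entry from the question through the same substitution. The entry variable and the cut_at_date function name are hypothetical; GNU sed is assumed for -r.

```shell
# One entry from the question, used as sample input.
entry='6709 Design Considerations for Protocol Extensions. B. Carpenter, B.
     Aboba, Ed., S. Cheshire. September 2012. (Format: TXT, HTML)
     (Status: INFORMATIONAL) (DOI: 10.17487/RFC6709)'

# cut_at_date is a hypothetical wrapper around the command above.
# On every line that contains "Month YYYY", it keeps only group 1,
# i.e. the text before the month name.
cut_at_date() {
  sed -r 's/^(.*)(January|February|March|April|May|June|July|August|September|October|November|December) [[:digit:]]{4}(.*)$/\1/'
}

printf '%s\n' "$entry" | cut_at_date
```

Because sed works on one line at a time here, only the second line is shortened (it now ends after "S. Cheshire." plus a trailing space). The first and third lines contain no date, so they pass through unchanged.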

Here's the part about sed from the classic series on command-line tools by Bruce Barnett.

dav23r
  • That'll only modify the lines containing the date, but not the line after it. – Benjamin W. Apr 19 '20 at 21:00
  • This was the sed trick I was looking for: being able to match any of several words with a regex. This example code does not produce the exact output I was looking for, but it will when combined with more sed'ing. – Lenna Apr 20 '20 at 15:04
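
One possible way to combine the month alternation with multi-line matching, as the comments above suggest, is sketched here. This is not taken from either answer. It assumes GNU sed (for -z and for \n inside bracket expressions), that continuation lines are indented, and that entries are separated by empty lines. The sample variable and the trim_entries name are hypothetical.

```shell
# Sample input: two entries from the question (trailing spaces omitted).
sample='6708 Application-Layer Traffic Optimization (ALTO) Requirements. S.
      Kiesel, Ed., S. Previdi, M. Stiemerling, R. Woundy, Y. Yang.
      September 2012. (Format: TXT, HTML) (Status: INFORMATIONAL) (DOI:
      10.17487/RFC6708)

6710 Simple Mail Transfer Protocol Extension for Message Transfer
     Priorities. A. Melnikov, K. Carlberg. August 2012. (Format: TXT,
     HTML) (Status: PROPOSED STANDARD) (DOI: 10.17487/RFC6710)
'

# trim_entries is a hypothetical helper. It deletes the whitespace before
# "Month YYYY.", the rest of that line, and any following lines that start
# with spaces (the indented continuation lines of the same entry).
trim_entries() {
  sed -zE 's/[ \n]+(January|February|March|April|May|June|July|August|September|October|November|December) [0-9]{4}\.[^\n]*(\n +[^\n]*)*//g'
}

printf '%s' "$sample" | trim_entries
```

On this sample each entry ends right after its author list. A title that itself contains a "Month YYYY." string would be cut too early, so treat this as a starting point rather than a robust parser.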