
I have a string of HTML with both absolute and relative URLs and I'm trying to retrieve only the relative URLs. I tried using the get-urls package but this only retrieves absolute URLs.

An example of the HTML string received:

<!DOCTYPE>
<html>
<head>

<title>Our first HTML page</title>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

</head>
<body>

<h2>Welcome to the web site: this is a heading inside of the heading tags.</h2>

<p>This is a paragraph of text inside the paragraph HTML tags. We can just keep writing ...
</p>

<h3>Now we have an image:</h3>

<div><img src="/images/plantTracing.gif" alt="Graphic of a Mouse Pad"></div>

<h3>
This is another heading inside of another set of headings tags; this time the tag is an 'h3' instead of an 'h2' , that means it is a less important heading.
</h3>

<h4>Yet another heading - right after this we have an HTML list:</h4>

<ol>
<li><a href="https://github.com/">First item in the list</a></li>
<li><a href="/modules/example.md"> Second item in the list</a></li>
<li>Third item in the list</li>
</ol>

<p>You will notice in the above HTML list, the HTML automatically creates the numbers in the list.</p>

<h3>About the list tags</h3>
</body>
</html>

This is what I'm currently doing:

getUrls(html) // html is the string of HTML shown above

It only returns {https://github.com/}.

I want it to return {https://github.com/, /modules/example.md}.

  • What does "a text" mean in "I have a text". Show us exactly what you have. Do you have a string of HTML? Do you have a URL that you're fetching the HTML from? Do you just have a plain piece of text and you don't know what the format is? What are you starting with? Also, questions about code here should nearly always show us the code you already have. Please, think about what it takes to communicate a clear question here. Clear questions get quick answers here. Unclear questions either never get answers or they get downvotes or get closed. – jfriend00 Dec 16 '19 at 00:52
  • Thanks @jfriend00. I've modified my question – user2998991 Dec 16 '19 at 01:08
  • It doesn't look to me like the get-urls package is an HTML parser. I think you will probably want/need an HTML parser. Keep in mind that `/images/plantTracing.gif` by itself in a piece of text is not necessarily a URL. It could just as well be a path. To know that's a relative URL, one would have to understand the context which requires parsing the HTML. There are many HTML parsers such as [cheerio](https://www.npmjs.com/package/cheerio) that you can use from node.js. – jfriend00 Dec 16 '19 at 01:11
  • Note, the `get-urls` package contains this note in the documentation: ***Require URLs to have a scheme or leading www. to be considered an URL.*** So, that package will not do what you want. – jfriend00 Dec 16 '19 at 01:14
  • @user2998991 `get-urls` leverages [url-regex](https://www.npmjs.com/package/url-regex) to determine what counts as a URL, and will validate against a TLD (absolute paths won't qualify in that scenario), so the result you see is correct. As already suggested, use an HTML parser like `cheerio` and extract the hrefs manually; it's fairly trivial – James Dec 16 '19 at 01:15

1 Answer


The get-urls package requires each URL to either start with a scheme such as http:// or to contain a hostname with a known top-level domain.

In fact, the documentation even contains this note: "Require URLs to have a scheme or leading www. to be considered an URL."

Since you're looking for relative paths that have neither of those, that package will not do what you want.

You will probably be best served by an actual HTML parser such as cheerio, which finds URLs in HTML attributes based on the HTML context rather than on text-matching tricks. That approach will find all the paths that are relative URLs.
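
Once a parser has handed you the raw attribute values (with cheerio, that would be something along the lines of selecting `$('a[href]')` and reading `.attr('href')`), separating relative from absolute URLs is a small step. Here is a minimal sketch of that step only, using the `URL` class built into Node.js. The helper names `isAbsoluteUrl` and `partitionUrls` are made up for illustration, and the input array is assumed to be what a parser would return for the HTML in the question.

```javascript
// Returns true when `value` parses as a URL on its own, i.e. it has a scheme.
// Uses the WHATWG URL class, which is a global in Node.js.
function isAbsoluteUrl(value) {
  try {
    new URL(value); // throws a TypeError for input with no scheme and no base
    return true;
  } catch (err) {
    return false;
  }
}

// Hypothetical helper: split attribute values into absolute and relative URLs.
function partitionUrls(values) {
  const result = { absolute: [], relative: [] };
  for (const value of values) {
    const trimmed = value.trim();
    if (trimmed === '') continue; // skip empty attributes
    (isAbsoluteUrl(trimmed) ? result.absolute : result.relative).push(trimmed);
  }
  return result;
}

// Assumed input: the src/href values from the HTML in the question,
// as an HTML parser would extract them.
const attributeValues = [
  '/images/plantTracing.gif',
  'https://github.com/',
  '/modules/example.md',
];

const urls = partitionUrls(attributeValues);
console.log(urls.relative); // ['/images/plantTracing.gif', '/modules/example.md']
console.log(urls.absolute); // ['https://github.com/']
```

Note that this treats anything without a scheme as relative, including scheme-relative values such as `//example.com/x`, so adjust the check if you need to tell those apart.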

jfriend00