3

I am looking to create a very simple nested table of contents in PHP that collects all the h1 to h6 headings and indents them appropriately. This means that if I have something like:

<h1>content</h1>
<h2>more content</h2>

I should get:

content
    more content.

I know CSS will create the indents, and that's fine, but how do I create a table of contents with working links to the content on the page?

Apparently it's hard to grasp what I am asking for...

I am asking for a function that reads an HTML document, pulls out all the h1 to h6 headings, and builds a table of contents.

psubsee2003
TheWebs

4 Answers

6

I used this package; it's easy and straightforward to use.

https://github.com/caseyamcl/toc

Install via Composer by including the following in your composer.json file:

{
    "require": {
        "caseyamcl/toc": "^3.0"
    }
}

Or, drop the src folder into your application and use a PSR-4 autoloader to include the files.

Usage

This package contains two main classes:

  • TOC\MarkupFixer: Adds id anchor attributes to any H1...H6 tags that do not already have any (you can specify which header tag levels to use at runtime)
  • TOC\TocGenerator: Generates a Table of Contents from HTML markup

Basic Example:

$myHtmlContent = <<<END
    <h1>This is a header tag with no anchor id</h1>
    <p>Lorum ipsum doler sit amet</p>
    <h2 id='foo'>This is a header tag with an anchor id</h2>
    <p>Stuff here</p>
    <h3 id='bar'>This is a header tag with an anchor id</h3>
END;

$markupFixer  = new TOC\MarkupFixer();
$tocGenerator = new TOC\TocGenerator();

// This ensures that all header tags have `id` attributes so they can be used as anchor links
$htmlOut  = "<div class='content'>" . $markupFixer->fix($myHtmlContent) . "</div>";

// This generates the Table of Contents in HTML
$htmlOut .= "<div class='toc'>" . $tocGenerator->getHtmlMenu($myHtmlContent) . "</div>";

echo $htmlOut;

This produces the following output:

<div class='content'>
    <h1 id="this-is-a-header-tag-with-no-anchor-id">This is a header tag with no anchor id</h1>
    <p>Lorum ipsum doler sit amet</p>
    <h2 id="foo">This is a header tag with an anchor id</h2>
    <p>Stuff here</p>
    <h3 id="bar">This is a header tag with an anchor id</h3>
</div>
<div class='toc'>
    <ul>
        <li class="first last">
        <span></span>
            <ul class="menu_level_1">
                <li class="first last">
                    <a href="#foo">This is a header tag with an anchor id</a>
                    <ul class="menu_level_2">
                        <li class="first last">
                            <a href="#bar">This is a header tag with an anchor id</a>
                        </li>
                    </ul>
                </li>
            </ul>
        </li>
    </ul>
</div>
Yuseferi
2

For this you just have to search for the heading tags in the HTML code.

I wrote two functions (PHP 5.4.x).

The first one returns an array that contains the data for the table of contents. Each entry holds only the headline itself, the id of the tag (if you want to use anchors), and a sub-table of contents.

function get_headlines($html, $depth = 1)
{
    if($depth > 6)      // h6 is the deepest heading level
        return [];

    $headlines = explode('<h' . $depth, $html);

    unset($headlines[0]);       // contains only text before the first headline

    if(count($headlines) == 0)
        return [];

    $toc = [];      // will contain the (sub-) toc

    foreach($headlines as $headline)
    {
        list($hl_info, $temp) = explode('>', $headline, 2);
        // $hl_info contains attributes of <hi ... > like the id.
        list($hl_text, $sub_content) = explode('</h' . $depth . '>', $temp, 2);
        // $hl_text contains the headline
        // $sub_content contains maybe other <hi>-tags
        $id = '';
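        if(strlen($hl_info) > 0 && ($id_tag_pos = stripos($hl_info,'id')) !== false)
        {
            // start one character past the opening quote; searching from the
            // quote itself would find the same quote again and yield an empty id
            $quote_pos = strpos($hl_info, '"', $id_tag_pos);
            if($quote_pos !== false)
            {
                $id_start_pos = $quote_pos + 1;
                $id_end_pos = strpos($hl_info, '"', $id_start_pos);
                if($id_end_pos !== false)
                    $id = substr($hl_info, $id_start_pos, $id_end_pos-$id_start_pos);
            }
        }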

        $toc[] = [  'id' => $id,
                    'text' => $hl_text,
                    'sub_toc' => get_headlines($sub_content, $depth + 1)
                ];

    }

    return $toc;
}

The second returns a string that formats the toc with HTML.

function print_toc($toc, $link_to_htmlpage = '', $depth = 1)
{
    if(count($toc) == 0)
        return '';

    $toc_str = '';

    if($depth == 1)
        $toc_str .= '<h1>Table of Contents</h1>';

    foreach($toc as $headline)
    {
        $toc_str .= '<p class="headline' . $depth . '">';
        if($headline['id'] != '')
            $toc_str .= '<a href="' . $link_to_htmlpage . '#' . $headline['id'] . '">';

        $toc_str .= $headline['text'];
        $toc_str .= ($headline['id'] != '') ? '</a>' : '';
        $toc_str .= '</p>';

        $toc_str .= print_toc($headline['sub_toc'], $link_to_htmlpage, $depth+1);
    }

    return $toc_str;
}

Both functions are far from perfect, but they work fine in my tests. Feel free to improve them.

Note: get_headlines is not a parser, so it does not work on broken HTML code and just crashes. It also only works with lowercase <hi> tags.
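The parser limitation noted above can be avoided with PHP's bundled DOM extension. The following is only a minimal sketch, not part of the functions above: `build_toc()`, the `toc-wrapper` id, and the `toc-N` id scheme are hypothetical names chosen for illustration, and the input is assumed to be an HTML fragment rather than a full document.

```php
<?php
// Sketch only: build_toc() is a hypothetical helper, not part of the answer above.
// Returns [content HTML with ids added to headings, nested <ul> table of contents].
function build_toc(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // tolerate imperfect markup
    // The XML prolog is a common hint that makes libxml read the input as UTF-8.
    // The wrapper <div> (assumed id) lets us serialise only the fragment later.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="toc-wrapper">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]';

    $toc  = '';
    $open = [];   // heading levels that currently have an open <ul><li>
    $n    = 0;
    foreach ($xpath->query($query) as $node) {
        $level = (int) substr($node->nodeName, 1);   // "h2" -> 2
        if (!$node->hasAttribute('id')) {
            // Assumed id scheme; it does not check for clashes with existing ids.
            $node->setAttribute('id', 'toc-' . ++$n);
        }
        // Close lists that are deeper than this heading.
        while ($open && end($open) > $level) {
            $toc .= '</li></ul>';
            array_pop($open);
        }
        if ($open && end($open) === $level) {
            $toc .= '</li>';        // sibling at the same level
        } else {
            $toc .= '<ul>';         // first heading at a deeper level
            $open[] = $level;
        }
        $toc .= '<li><a href="#' . htmlspecialchars($node->getAttribute('id')) . '">'
              . htmlspecialchars(trim($node->textContent)) . '</a>';
    }
    while ($open) {                 // close whatever is still open
        $toc .= '</li></ul>';
        array_pop($open);
    }

    // Serialise the children of the wrapper, i.e. the original fragment plus ids.
    $content = '';
    $wrapper = $xpath->query('//div[@id="toc-wrapper"]')->item(0);
    foreach ($wrapper->childNodes as $child) {
        $content .= $doc->saveHTML($child);
    }
    return [$content, $toc];
}
```

Calling `[$content, $toc] = build_toc($html);` and echoing both strings gives a nested list whose links point at the heading ids; the indentation itself is left to CSS on the nested `ul` elements.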

AbcAeffchen
  • The notice that this isn't (using) a real parser is important. It may work on various nicely formed HTML, but constructing edge cases that break its assumptions is very easy, so I would not recommend using this function. I wrote a [similar warning here.](https://alanhogan.com/html-myths#regex-html) – Alan H. Apr 27 '22 at 07:03
-1

How about this (although it can only do one H level) ...

function getTOC(string $html, int $level=1) {
    $toc="";
    $x=0;
    $n=0;
    $html1="";

    $safety=1000;
    while ( $x>-1 and $safety-->0 ) {

        $html0=strtolower($html);
        $x=strpos($html0, "<h$level");

        if ( $x>-1 ) {
            $y=strpos($html0, "</h$level>");
            $part=strip_tags(substr($html, $x, $y-$x));
        
            $toc  .="<a href='#head$n'>$part</a>\n";
            $html1.=substr($html,0,$x)."<a name='head$n'></a>".substr($html, $x, $y-$x+5)."\n";
            $html=substr($html, $y+5);
            $n++;
        }

    }
    $html1.=$html;
    $html=$toc."\n<HR>\n".$html1;
    return $html;
}

This will create a basic list of links

$html="<html><body>";
$html.="<h1>Heading 1a</h1>One Two Three";
$html.="<h2>heading 2a</h2>Four Five Six";
$html.="<h1 class='something'>Heading 1b</h1>Seven Eight Nine";
$html.="<h2>heading 2b</h2>Ten Eleven Twelve";
$html.="</body></html>";


echo getTOC($html, 1);

gives...

<a href='#head0'>Heading 1a</a>
<a href='#head1'>Heading 1b</a>

<HR>
<html><body><a name='head0'></a><h1>Heading 1a</h1>
One Two Three<h2>heading 2a</h2>Four Five Six<a name='head1'></a><h1 class='something'>Heading 1b</h1>
Seven Eight Nine<h2>heading 2b</h2>Ten Eleven Twelve</body></html>

See https://onlinephp.io/c/fceb0 for a running example

user1432181
  • Using string pattern matching is [absolutely not](https://alanhogan.com/html-myths#regex-html) a robust way to handle HTML input! Please do not use this code – Alan H. Jan 07 '23 at 03:50
-2

This function returns the string with a table of contents appended, built only from h2 tags. 100% tested code.

function toc($str){

    // Assumption: the pattern and the wrapper markup were partly lost in extraction.
    // '<img[^>]+\>' matches the "first image" mentioned in the comment below, and a
    // <section> wrapper is assumed because the list is later appended to the first <section>.
    $html = preg_replace('/<img[^>]+\>/i', '$0 <section><h3>In This Article</h3></section>', $str, 1); // toc just after first image in content
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // create document fragment
    $frag = $doc->createDocumentFragment();
    // create initial list
    $frag->appendChild($doc->createElement('ul'));
    $head = &$frag->firstChild;
    $xpath = new DOMXPath($doc);
    $last = 1;

    // get the headline elements (this query only selects H2)
    $tagChek = array();
    foreach ($xpath->query('//*[self::h2]') as $headline) {
        // get level of current headline
        sscanf($headline->tagName, 'h%u', $curr);
        array_push($tagChek, $headline->tagName);

        // move head reference if necessary
        if ($curr < $last) {
            // move upwards
            for ($i=$curr; $i<$last; $i++) {
                $head = &$head->parentNode->parentNode;
            }
        } elseif ($curr > $last && $head->lastChild) {
            // move downwards and create new lists
            for ($i=$last; $i<$curr; $i++) {
                $head->lastChild->appendChild($doc->createElement('ul'));
                $head = &$head->lastChild->lastChild;
            }
        }
        $last = $curr;

        // add list item
        $li = $doc->createElement('li');
        $head->appendChild($li);
        $a = $doc->createElement('a', $headline->textContent);
        $head->lastChild->appendChild($a);

        // build ID
        $levels = array();
        $tmp = &$head;
        // walk subtree up to fragment root node of this subtree
        while (!is_null($tmp) && $tmp != $frag) {
            $levels[] = $tmp->childNodes->length;
            $tmp = &$tmp->parentNode->parentNode;
        }
        $id = 'sect'.implode('.', array_reverse($levels));
        // set destination
        $a->setAttribute('href', '#'.$id);
        // add anchor to headline
        $a = $doc->createElement('a');
        $a->setAttribute('name', $id);
        $a->setAttribute('id', $id);
        $headline->insertBefore($a, $headline->firstChild);
    }

    // echo $frag;
    // append fragment to document
    if(!empty($tagChek)):
        $doc->getElementsByTagName('section')->item(0)->appendChild($frag);
        return $doc->saveHTML();
    else:
        return $str;
    endif;
}
  • Using an actual HTML parser here is very good! Using a [regex](https://alanhogan.com/html-myths#regex-html) to find a character sequence that may or may not be the end of an image tag is not! – Alan H. Apr 27 '22 at 07:07
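The image lookup criticised in the comment above can also go through the parser instead of a regex. This is only a rough sketch under stated assumptions: `insert_after_first_image()` is a hypothetical helper, the `section`/`h3` placeholder markup is assumed rather than taken from the answer, and the input is treated as an HTML fragment.

```php
<?php
// Sketch only: insert_after_first_image() is a hypothetical helper.
// It places an empty TOC container right after the first <img> found by the parser.
function insert_after_first_image(string $html, string $title = 'In This Article'): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // tolerate imperfect markup
    // XML prolog: common hint so libxml reads the input as UTF-8.
    // The wrapper <div> lets us serialise only the fragment afterwards.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $img = $doc->getElementsByTagName('img')->item(0);
    if ($img === null) {
        return $html;   // no image: leave the input untouched
    }

    // Assumed placeholder markup: <section><h3>title</h3></section>
    $section = $doc->createElement('section');
    $heading = $doc->createElement('h3');
    $heading->appendChild($doc->createTextNode($title));
    $section->appendChild($heading);

    // insertBefore() with a null reference node appends, so this also
    // works when the image is the last child of its parent.
    $img->parentNode->insertBefore($section, $img->nextSibling);

    // The wrapper is the first <div> in document order.
    $out = '';
    foreach ($doc->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

As with the regex version, the container lands wherever the image sits, so an image nested inside a paragraph would put the section inside that paragraph.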