Hi everybody - long time listener, first time caller.

I've been playing around with libtidy in C on macOS 10.13. I started with the sample code here and modified it to read a local html file instead of using curl. Everything seems to work okay except for text. It will find and output every tag in my test file, but does not seem able to get the text at all, and it's driving me nuts.

The code in question occurs in the DumpNode tree-walking function. My hacked-up version:

#include <stdio.h>
#include <tidy.h>
#include <tidybuffio.h>

/* Wrapper functions for file i/o */
int w_getc(void* ptr)
{
  return getc((FILE *)ptr);
}
void w_ungetc(void *ptr, unsigned char bv)
{
  ungetc((int)bv, (FILE *)ptr);
}
Bool w_feof(void *ptr)
{
  return (Bool)feof((FILE *)ptr);
}

/* Traverse the document tree */
void dumpNode(TidyDoc doc, TidyNode tnod, int indent)
{
  TidyNode child;
  for(child = tidyGetChild(tnod); child; child = tidyGetNext(child) ) {
    ctmbstr name = tidyNodeGetName(child);
    if (!name) {
      /* if it doesn't have a name, then it's probably text, cdata, etc... */
      TidyBuffer buf;
      tidyBufInit(&buf);
      if (tidyNodeHasText(doc, child) && tidyNodeGetText(doc, child, &buf)) {
        printf("%u, %u, %u\n", buf.size, buf.allocated, buf.next);
        printf("%*.*s\n", indent, indent, (buf.bp && buf.size > 0)?(char *)buf.bp:"");
      }
      tidyBufFree(&buf);
    }
    dumpNode(doc, child, indent + 4); /* recursive */
  }
}

int main(int argc, char **argv)
{
  if(argc == 2) {
    TidyDoc tdoc;
    int err = -1; /* stays -1 if the input source cannot be initialised */
    FILE *fp;
    TidyInputSource insrc;

    tdoc = tidyCreate();

    fp = fopen(argv[1], "r");
    if (!fp) { tidyRelease(tdoc); return -1; } /* don't leak the document */
    
    if (tidyInitSource(&insrc, fp, &w_getc, &w_ungetc, &w_feof)) {
      err = tidyParseSource(tdoc, &insrc); /* parse the input */
      if(err >= 0) dumpNode(tdoc, tidyGetRoot(tdoc), 0); /* walk the tree */      
    }

    /* clean-up */
    fclose(fp);
    tidyRelease(tdoc);
    return err;

  }
  return 0;
}

And my compiler string: gcc -o TidyExample tidyexample.c -ltidy -DENABLE_DEBUG_LOG -DDEBUG_PPRINT -DDEBUG_INDENT

Here's what I've deduced so far:

  • When encountering a text node, tidyNodeHasText(doc, child); and tidyNodeGetText(doc, child, &buf); both return yes.
  • That tells me that tidyNodeGetText is calling the pretty print functions like it should. I've verified that neither TidyXmlOut nor TidyXhtmlOut is set for doc, so TY_(PPrintTree) should fire.
  • Since I use -DENABLE_DEBUG_LOG and -DDEBUG_PPRINT, TY_(PPrintTree) should call dbg_show_node, but it doesn't appear to. This could be because the node == NULL condition prompted an immediate return, but I asserted the existence of the node, and tidyNodeGetText would return no if it didn't exist, so that can't be what's happening.
  • I also tried to set a progress callback to get a better view of what was going on in there, but weirdly, the linker didn't recognize the symbol _tidySetPrettyPrinterCallback.
    • ETA: I figured this out; there's a second linked library needed for this to work: -ltidys. I can now get pretty printer progress.
  • The only output the above snippet generates is 0, 0, 0\n from the first printf statement, and \n from the second.
  • Incidentally, that curl sample code I ripped off? It has the same problem. If you run it, you get a segfault as soon as it hits any text, because it doesn't check whether there's anything in the buffer before it calls printf.
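On the printing side of that last point, here is a minimal sketch of a guarded formatter. It is an illustration only: format_text_node is a hypothetical helper, not part of libtidy, and it takes a plain pointer and length so it can be tried without the library. It refuses to touch an empty or NULL buffer, and it uses the byte count as the precision. Note that the "%*.*s" format in dumpNode passes indent as both the width and the precision, so it prints at most indent characters of the text (none at all at indent 0), even when the buffer is full.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of libtidy: writes `indent` spaces followed
   by `size` bytes from `bp` into `out`. An empty or NULL buffer produces an
   empty string instead of being dereferenced. Returns snprintf's result. */
int format_text_node(char *out, size_t cap,
                     const unsigned char *bp, unsigned size, int indent)
{
  assert(out != NULL && cap > 0 && indent >= 0);
  if (!bp || size == 0) {
    out[0] = '\0';
    return 0;
  }
  /* "%*s" with an empty string pads with `indent` spaces; "%.*s" prints at
     most `size` bytes, so the data does not have to be NUL-terminated. */
  return snprintf(out, cap, "%*s%.*s", indent, "", (int)size, (const char *)bp);
}
```

Inside dumpNode this would presumably be called with buf.bp, buf.size and the current indent, and the result handed to puts or printf.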

I am out of ideas. Either I'm doing something majorly wrong (likely), or there's a good-sized bug in libtidy (less likely, but possible).

ETA: Here's a teeny-tiny HTML file. Invoking TidyExample minimal.html on it results in the usual empty buffers:

<!DOCTYPE html>
<html>
<head>
<title>Test</title>
<p>This is text.</p>
</body>
</html>
3R1N
  • MCVE: Done. Libtidy dev forums: still looking. – 3R1N Nov 23 '20 at 04:26
  • Is this really/also a web-scraping problem? If so, add the `[web-scraping]` tag. It looks like your program takes 2 arguments. Can you add a sample command line invocation AND include a minimal `doc` that will demonstrate the problem? Good luck. – shellter Nov 26 '20 at 16:53
  • Tested on Linux with libtidy 5.7.16: the original example and your code both work. Perhaps a bug in the OS X version? Could you also attach a test file with which it fails? – vmt Nov 27 '20 at 00:51
  • Okay, here's a teeny-tiny HTML file with two text nodes, which when passed as `TidyExample minimal.html` results in the usual empty buffers: `Test This is text.` – 3R1N Nov 28 '20 at 08:22
  • For completeness' sake (not of much help, as I don't have access to OS X), here's the output with said test file (and program): http://ix.io/2FTc – vmt Nov 29 '20 at 08:14

2 Answers

Okay, I've found a "solution" of sorts. It gets the job done, but I have no idea why.

So, after discovering -ltidys, I was playing around with setting pretty print callbacks, and I discovered that if I set one, the output would be what I expected... even if I didn't actually set a callback!

Seriously, all I have to do is insert the line tidySetPrettyPrinterCallback(tdoc, NULL);, and the buffers fill up and print like they should. Comment it out, and it stops working.

I've investigated a few of the other functions that link with libtidys.a, and they seem to have the same effect. I haven't done any rigorous experimentation, though.

If anyone has any insight into what might be causing this, I'd be interested to know, for my own knowledge's sake. But I'm not going to probe any further. Since I've found a practical workaround for my problem, I'm going to let this be for now, and work on the actual project I was trying to use Tidy for.

3R1N

On macOS, -ltidy links against a dynamic version of Tidy rather than the library you built yourself. Compile with ... -o TidyExample tidyexample.c ./libtidy.a to ensure that your more recent, static library is used.
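
To illustrate why mixing newer headers with an older library could plausibly produce "0, 0, 0", here is a sketch under stated assumptions. OldBuf and NewBuf are hypothetical stand-ins, not copied from any Tidy header; the assumption is that the newer buffer struct has one extra pointer field in front of the fields the older library knows about. I have not confirmed that this is what happens with the library shipped on macOS 10.13.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins, NOT copied from any Tidy header. */
typedef struct {             /* layout assumed for the older library */
  unsigned char *bp;
  unsigned size, allocated, next;
} OldBuf;

typedef struct {             /* layout assumed for the newer headers */
  void *allocator;           /* assumed extra leading field */
  unsigned char *bp;
  unsigned size, allocated, next;
} NewBuf;

/* Simulates an "old" library filling a buffer that the caller declared with
   the "new" layout, and returns the size the caller reads afterwards. */
unsigned size_seen_by_caller(unsigned bytes_written)
{
  static unsigned char storage[32];
  OldBuf lib_view;
  NewBuf caller_view;
  size_t used = offsetof(OldBuf, next) + sizeof lib_view.next;

  assert(sizeof(NewBuf) >= used);
  memset(&caller_view, 0, sizeof caller_view);  /* caller starts zeroed */

  lib_view.bp = storage;
  lib_view.size = bytes_written;
  lib_view.allocated = sizeof storage;
  lib_view.next = 0;

  /* The library writes its fields at the old offsets. */
  memcpy(&caller_view, &lib_view, used);

  /* With 8-byte pointers, the caller's `size` sits where the library put
     `next`, and the caller's `bp` now holds size/allocated bits. */
  return caller_view.size;
}
```

On a platform with 8-byte pointers the caller reads a size of 0 no matter how many bytes the library wrote, while its bp field ends up holding a value that is not a valid pointer. That would be consistent with both the empty output and the segfault in the unguarded curl sample, but treat it as a guess at the mechanism rather than a diagnosis.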

andeha