Hi everybody - long time listener, first time caller.
I've been playing around with libtidy in C on macOS 10.13. I started with the sample code here and modified it to read a local html file instead of using curl. Everything seems to work okay except for text. It will find and output every tag in my test file, but does not seem able to get the text at all, and it's driving me nuts.
The code in question occurs in the `dumpNode` tree-walking function. My hacked-up version:

    #include <stdio.h>
    #include <tidy.h>
    #include <tidybuffio.h>

    /* Wrapper functions for file i/o */
    int w_getc(void* ptr)
    {
        return getc((FILE *)ptr);
    }

    void w_ungetc(void *ptr, unsigned char bv)
    {
        ungetc((int)bv, (FILE *)ptr);
    }

    Bool w_feof(void *ptr)
    {
        return (Bool)feof((FILE *)ptr);
    }

    /* Traverse the document tree */
    void dumpNode(TidyDoc doc, TidyNode tnod, int indent)
    {
        TidyNode child;
        for(child = tidyGetChild(tnod); child; child = tidyGetNext(child) ) {
            ctmbstr name = tidyNodeGetName(child);
            if (!name) {
                /* if it doesn't have a name, then it's probably text, cdata, etc... */
                TidyBuffer buf;
                tidyBufInit(&buf);
                if (tidyNodeHasText(doc, child) && tidyNodeGetText(doc, child, &buf)) {
                    printf("%u, %u, %u\n", buf.size, buf.allocated, buf.next);
                    printf("%*.*s\n", indent, indent, (buf.bp && buf.size > 0)?(char *)buf.bp:"");
                }
                tidyBufFree(&buf);
            }
            dumpNode(doc, child, indent + 4); /* recursive */
        }
    }

    int main(int argc, char **argv)
    {
        if(argc == 2) {
            TidyDoc tdoc;
            int err;
            FILE *fp;
            TidyInputSource insrc;
            tdoc = tidyCreate();
            fp = fopen(argv[1], "r");
            if (!fp) return -1;
            if (tidyInitSource(&insrc, fp, &w_getc, &w_ungetc, &w_feof)) {
                err = tidyParseSource(tdoc, &insrc); /* parse the input */
                if(err >= 0) dumpNode(tdoc, tidyGetRoot(tdoc), 0); /* walk the tree */
            }
            /* clean-up */
            fclose(fp);
            tidyRelease(tdoc);
            return err;
        }
        return 0;
    }

And my compiler string: gcc -o TidyExample tidyexample.c -ltidy -DENABLE_DEBUG_LOG -DDEBUG_PPRINT -DDEBUG_INDENT
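One way to rule out the three file wrappers as the source of the problem is to exercise them on their own, without libtidy. The following is only a sketch under stated assumptions: `count_bytes_via_wrappers` is a hypothetical helper that is not part of libtidy or of the code above, and plain `int` and `unsigned char` stand in for libtidy's `Bool` and `byte` types so that the block compiles without the tidy headers.

```c
#include <assert.h>
#include <stdio.h>

/* Same shape as the wrappers in the question; int replaces libtidy's Bool
   (an assumption made so this compiles without tidy.h). */
static int w_getc(void *ptr)
{
    return getc((FILE *)ptr);
}

static void w_ungetc(void *ptr, unsigned char bv)
{
    ungetc((int)bv, (FILE *)ptr);
}

static int w_feof(void *ptr)
{
    return feof((FILE *)ptr) != 0;
}

/* Hypothetical helper: reads the whole stream through the wrappers, pushing
   the first byte back once so the unget path is exercised too.
   Returns the number of bytes seen, or -1 if reading stopped for a reason
   other than end-of-file. */
static long count_bytes_via_wrappers(FILE *fp)
{
    long n = 0;
    int c;

    assert(fp != NULL);

    c = w_getc(fp);
    if (c != EOF) {
        w_ungetc(fp, (unsigned char)c);   /* byte should be returned again below */
    }
    while ((c = w_getc(fp)) != EOF) {
        n++;
    }
    return w_feof(fp) ? n : -1;
}
```

If the count matches the size of the input file, the wrappers are probably delivering every byte, which points the search back at the parsing or text-extraction stage.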
Here's what I've deduced so far:
- When encountering a text node, `tidyNodeHasText(doc, child)` and `tidyNodeGetText(doc, child, &buf)` both return `yes`.
- That tells me that `tidyNodeGetText` is calling the pretty-print functions like it should. I've verified that neither `TidyXmlOut` nor `TidyXhtmlOut` is set for `doc`, so `TY_(PPrintTree)` should fire.
- Since I use `-DENABLE_DEBUG_LOG` and `-DDEBUG_PPRINT`, `TY_(PPrintTree)` should call `dbg_show_node`, but it doesn't appear to. This could be because the `node == NULL` condition prompted an immediate return, but I asserted the existence of the node, and `tidyNodeGetText` would return `no` if it didn't exist, so that can't be what's happening.
- I also tried to set a progress callback to get a better view of what was going on in there, but weirdly, the linker didn't recognize the symbol `_tidySetPrettyPrinterCallback`.
    - ETA: I figured this out; there's a second linked library needed for this to work: `-ltidys`. I can now get pretty printer progress.
- The only output the above snippet generates is `0, 0, 0\n` from the first `printf` statement, and `\n` from the second.
- Incidentally, that curl sample code I ripped off? It has the same problem. If you run it, you get a segfault as soon as it hits any text, because it doesn't check whether there's anything in the buffer before it calls `printf`.
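The "check the buffer before printing" point in the last bullet can be sketched as a small guarded print helper. This is an illustration only: `print_text` is a hypothetical name, not a libtidy function, and it takes a raw pointer and size rather than a `TidyBuffer` so that it compiles without the tidy headers. It also separates the indent from the text, because in a `%*.*s` conversion the precision caps the number of bytes printed, so passing `indent` as the precision would limit the output to `indent` characters even when the buffer is full.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes `indent` spaces, then exactly `size` bytes
   from `bp`, then a newline. An empty or NULL buffer is reported as
   "(empty)" instead of being dereferenced.
   Returns the number of text bytes written (0 for an empty buffer). */
static size_t print_text(FILE *out, const unsigned char *bp, unsigned size, int indent)
{
    size_t written;

    assert(out != NULL);
    if (indent < 0) {
        indent = 0;
    }

    if (bp == NULL || size == 0) {
        fprintf(out, "%*s(empty)\n", indent, "");
        return 0;
    }

    fprintf(out, "%*s", indent, "");    /* indentation only */
    written = fwrite(bp, 1, size, out); /* no reliance on a terminating NUL */
    fputc('\n', out);
    return written;
}
```

Inside the tree walker it would presumably be called as `print_text(stdout, buf.bp, buf.size, indent)` after `tidyNodeGetText` returns, using the `bp` and `size` fields already printed by the code in the question.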
I am out of ideas. Either I'm doing something majorly wrong (likely), or there's a good-sized bug in libtidy (less likely, but possible).
ETA: Here's a teeny-tiny HTML file that, when you invoke `TidyExample minimal.html`, results in the usual empty buffers:

    <!DOCTYPE html>
    <html>
    <head>
    <title>Test</title>
    </head>
    <body>
    <p>This is text.</p>
    </body>
    </html>

This is text.
` – 3R1N Nov 28 '20 at 08:22