I have a blog dataset which has a huge number of blog pages, with blog posts, comments and all blog features. I need to extract only blog post from this collection and store it in a .txt file. I need to modify this program as this program should collect blogposts tag starts with <p> and ends with </p> and avoiding other tags.
Currently I use HTMLParser to do the job, here is what I have so far:
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;
public class HTMLParserTest {
public static void main(String... args) {
Parser parser = new Parser();
HasAttributeFilter filter = new HasAttributeFilter("P");
try {
parser.setResource("d://Blogs/asample.txt");
NodeList list = parser.parse(filter);
Node node = list.elementAt(0);
if (node instanceof MetaTag) {
MetaTag meta = (MetaTag) node;
String description = meta.getAttribute("content");
System.out.println(description);
}
} catch (ParserException e) {
e.printStackTrace();
}
}
}
thanks in advance