I am using crawler4j to crawl some websites and it is working fine. I am able to download all the files present on a website, and now I have a new task ahead of me: I need to extract iframes, base64 data and, if possible, other embedded code as well.
So far, this is what I am doing in my visit method:
    String place = "<iframe";
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String text = htmlParseData.getText();
        String html = htmlParseData.getHtml();
        List<WebURL> links = htmlParseData.getOutgoingUrls();
        // System.out.println("html source code:- " + html);

        // Split the raw HTML on whitespace and look for tokens equal to "<iframe".
        int number = html.length();
        String[] result = html.split("\\s");
        // Debug output; guarded so that short pages do not throw.
        if (result.length > 12500) {
            System.out.println("print random word " + result[12500] + " " + number);
        }
        // Iterate over the tokens (result.length), not over the character count.
        for (int i = 0; i < result.length; i++) {
            if (result[i].equals(place)) {
                System.out.println("iframe found " + i);
            }
        }
        System.out.println("Text length: " + text.length());
        System.out.println("Html length: " + html.length());
        System.out.println("Number of outgoing links: " + links.size());
    }
I added the if block above to find the iframes in the given HTML page. It works almost perfectly.
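To make the goal concrete, here is a minimal standalone sketch of the kind of output I am after. It does not use crawler4j at all: it runs two regular expressions over an HTML string, one for the `src` of `<iframe>` tags and one for base64 `data:` URIs. The class name `EmbeddedCodeFinder`, its method names and the sample HTML are made up for illustration, and the patterns are deliberately simple (for example, they skip unquoted attributes and `data:` URIs with extra parameters such as `charset`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j.
class EmbeddedCodeFinder {

    // Opening <iframe ...> tag with a quoted src attribute; group 1 is the src value.
    private static final Pattern IFRAME_SRC = Pattern.compile(
            "<iframe\\b[^>]*?\\ssrc\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Base64 data URIs of the form data:<type>/<subtype>;base64,<payload>.
    private static final Pattern BASE64_DATA_URI = Pattern.compile(
            "data:[\\w.+-]+/[\\w.+-]+;base64,[A-Za-z0-9+/=]+");

    // Returns the src value of every iframe tag that has a quoted src attribute.
    static List<String> findIframeSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IFRAME_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(1));
        }
        return sources;
    }

    // Returns every base64 data URI found anywhere in the HTML.
    static List<String> findBase64DataUris(String html) {
        List<String> uris = new ArrayList<>();
        Matcher m = BASE64_DATA_URI.matcher(html);
        while (m.find()) {
            uris.add(m.group());
        }
        return uris;
    }

    public static void main(String[] args) {
        String html = "<p>hi</p><iframe width=\"300\" src=\"http://example.com/embed\"></iframe>"
                + "<img src=\"data:image/png;base64,iVBORw0KGgo=\">";
        System.out.println(findIframeSources(html));
        System.out.println(findBase64DataUris(html));
    }
}
```

In the visit method this would be called as `EmbeddedCodeFinder.findIframeSources(html)` on the string returned by `htmlParseData.getHtml()`. Unlike the whitespace split, it also catches tags written as `<IFRAME` or with attributes before `src`, but it is still pattern matching on raw text rather than real HTML parsing.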
I know that splitting the HTML on whitespace is a bad way to extract iframes. I tried many other ways to extract iframes and other embedded code from HTML pages, but they all failed. After going through the source code I found a Java class, HtmlContentHandler, that looks like it can satisfy my requirement. As far as I can see, I have to call its startElement method with the necessary parameters in order to get the required code.
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    }
So, in my visit method, I created an HtmlContentHandler object and tried to call the startElement method mentioned above:

    HtmlContentHandler ecode = new HtmlContentHandler();
    ecode.startElement(url, localName, qName, attributes);
Now the problem is with the parameters of that method. I am passing the crawled URL as the first (uri) parameter, and I have no idea what values I have to pass for the rest of the parameters.
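For comparison, my understanding of SAX in general is that startElement is a callback: the parser invokes it for every opening tag and fills in uri, localName, qName and attributes itself. Whether crawler4j drives HtmlContentHandler the same way is an assumption on my part that I have not confirmed. A minimal standalone sketch of that callback style, using only the JDK's SAX parser on well-formed XHTML (`IframeSaxDemo` and `IframeHandler` are made-up names, and real-world HTML would need a more lenient parser):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it does not use crawler4j's HtmlContentHandler.
class IframeSaxDemo {

    // Collects the src attribute of every <iframe> element the parser reports.
    static class IframeHandler extends DefaultHandler {
        final List<String> sources = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            // The parser supplies all four arguments; nothing here is passed by hand.
            if ("iframe".equalsIgnoreCase(qName)) {
                String src = attributes.getValue("src");
                if (src != null) {
                    sources.add(src);
                }
            }
        }
    }

    // Parses a well-formed XHTML string and returns the iframe sources found.
    static List<String> iframeSources(String xhtml) {
        IframeHandler handler = new IframeHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return handler.sources;
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><iframe src=\"http://example.com/a\"></iframe></body></html>";
        System.out.println(iframeSources(xhtml));
    }
}
```

If that is also how the class is meant to be used in crawler4j, then calling startElement directly from visit is probably the wrong entry point, and the values would have to come from whatever parser feeds the handler.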
Can someone help me with this? One more thing: I know that many other tools could make this easier, but I want to do it in crawler4j.
Thank you!!