I’ve already talked about how Jsoup is strictly an HTML-only parser. You can make do with that for scraping content from static pages (like the one you’re on right now), but what about the rest of the Internet, where pages build their content with JavaScript? The answer is HTMLUnit.
HTMLUnit is a headless (GUI-less) browser written in Java and a powerful tool for page automation.
Environment setup
Assuming you’re on Eclipse:
- Download HTMLUnit
- Copy the lib folder into your project directory
- In Eclipse, right-click on your project folder > Build Path > Configure Build Path > Add External JARs, and add all the JARs you just moved into your project
And that’s it!
Code
Load URL
String URL = "http://github.com";
WebClient client = new WebClient();
// getPage throws IOException, so the enclosing method must declare or catch it
HtmlPage page = client.getPage(URL);
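For reference, the snippet above can be dropped into a complete class roughly like this. It is only a sketch: it assumes HTMLUnit 2.x, where the classes live under com.gargoylesoftware.htmlunit (3.x moved them to org.htmlunit), and a release recent enough that WebClient can be used in try-with-resources. The class name LoadPage and the helper titleOf are made up for illustration.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;

public class LoadPage {

    // Hypothetical helper: loads a URL and returns the page's <title> text.
    static String titleOf(String url) throws IOException {
        // try-with-resources closes the client (and its windows) when done
        try (WebClient client = new WebClient()) {
            HtmlPage page = client.getPage(url);
            return page.getTitleText();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(titleOf("http://github.com"));
    }
}
```

Script-heavy sites may need some tuning through client.getOptions() before they load cleanly, so treat this as a starting point rather than a finished scraper.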
Get Element By ID
String ID = "microsoft-callout";
// Throws ElementNotFoundException if no element on the page has this ID
HtmlDivision div = page.getHtmlElementById(ID);
What’s page automation if you can’t simulate a click, right?
// click() returns the page that results from the click
page = div.click();
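If you are not sure the element exists, a slightly more defensive variant of the lookup-and-click steps might look like the sketch below. It again assumes HTMLUnit 2.x package names, and the class ClickById and the helper clickIfPresent are hypothetical names, not part of the library. It uses getElementById, which returns null for a missing ID, instead of getHtmlElementById, which throws.

```java
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;

public class ClickById {

    // Hypothetical helper: clicks the element with the given id and returns
    // the resulting page, or the original page if no such element exists.
    static Page clickIfPresent(HtmlPage page, String id) throws IOException {
        DomElement element = page.getElementById(id); // null when the id is absent
        if (element == null) {
            return page;
        }
        return element.click();
    }

    public static void main(String[] args) throws IOException {
        try (WebClient client = new WebClient()) {
            HtmlPage page = client.getPage("http://github.com");
            Page result = clickIfPresent(page, "microsoft-callout");
            System.out.println(result.getUrl());
        }
    }
}
```

Returning the original page when the ID is missing keeps the calling code simple; throwing your own exception would be just as reasonable if a missing element should stop the run.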