Scraping With HTMLUnit


I’ve already talked about how Jsoup is strictly a HTML-only parser. You can make-do with that for scraping content from static pages (like the one you’re on right now), but what about the rest of the Internet? The answer is HTMLUnit.

HTMLUnit is a headless webkit for Java and a powerful tool for page automation.

Environment setup

Assuming you’re on Eclipse:

  1. Download HTMLUnit
  2. Copy lib folder into your project directory
  3. In Eclipse, right-click on your project folder - Build Path - Configure Build Path - Add External JARs and add all JARs you just moved into your project

And that’s it!

Code

Load URL

String URL = "http://github.com";
WebClient client = new WebClient();
HtmlPage page = client.getPage(URL);

Get Element By ID

String ID = "microsoft-callout";
HtmlDivision div = page.getHtmlElementById(ID);

What’s page automation if you can’t simulate a click, right?

page = div.click();