Jsoup Troubles

Lately I’ve been working on a project that scrapes content off a webpage. More specifically, it’s an Android app that fetches exam results from my university’s webpage.

Naturally, I turned to Jsoup. It’s an amazing library for HTML scraping, but its limitations quickly get in the way of anything more ambitious, because these days very little useful data sits in a page’s static HTML.

1 - JavaScript content

If the page you’re scraping loads its content dynamically through JavaScript, you’ve hit a dead end. Jsoup parses the HTML the server returns but does not execute scripts, so you’re better off working with Selenium or HtmlUnit. If you’d still like to try your luck, you could try faking the user agent, in case the server sends different markup to browsers:

Document doc = Jsoup.connect(URL).userAgent("Chrome/41.0.2228.0").get();
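
Before blaming the user agent, it helps to check whether the data is in the served HTML at all. Below is a minimal sketch of that check, using only the Java standard library on the raw HTML string. The class StaticHtmlCheck, its method names, and the sample markup are made up for illustration and are not part of Jsoup. If the text you see in the browser is missing from the raw source, it is most likely injected by a script, and changing the user agent will probably not bring it back.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: it only inspects a raw HTML string.
class StaticHtmlCheck {

    // Matches opening <script ...> tags, ignoring case.
    private static final Pattern SCRIPT_TAG =
            Pattern.compile("<script\\b", Pattern.CASE_INSENSITIVE);

    // True if the text you expect to scrape is already present in the served HTML.
    static boolean isInStaticHtml(String rawHtml, String expectedText) {
        return rawHtml.contains(expectedText);
    }

    // Counts opening <script> tags as a rough hint that content is built client-side.
    static int countScriptTags(String rawHtml) {
        Matcher matcher = SCRIPT_TAG.matcher(rawHtml);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed sample page: an empty placeholder that a script fills in later.
        String rawHtml = "<html><body><div id=\"result\"></div>"
                + "<script src=\"results.js\"></script></body></html>";

        System.out.println(isInStaticHtml(rawHtml, "Grade: A")); // false
        System.out.println(countScriptTags(rawHtml));            // 1
    }
}
```

In a real run the raw string could come from Jsoup.connect(URL).execute().body(), which returns the response body before any parsing.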

2 - Memory issues

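More often than not, Jsoup did not download the complete source of the webpage I wanted to scrape. Jsoup caps the size of the response body by default, and the maxBodySize method is meant to lift that cap (a value of 0 means no limit), but even raising it made no difference: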

Document doc = Jsoup.connect(URL).userAgent("Chrome/41.0.2228.0").maxBodySize(Integer.MAX_VALUE).get();
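
To confirm that a download really was cut short, a rough check on the raw response can help. Here is a minimal sketch, again using only the standard library. The class TruncationCheck and its method are hypothetical, and the heuristic assumes a well-formed page ends with a closing html tag, which not every site guarantees. It has to run on the raw body (for example from Jsoup.connect(URL).execute().body()) rather than on doc.html(), because the parser closes any open tags itself.

```java
import java.util.Locale;

// Hypothetical helper, not part of Jsoup: a heuristic check on the raw response body.
class TruncationCheck {

    // Assumes a complete HTML document ends with </html>, ignoring case and trailing whitespace.
    static boolean looksComplete(String rawHtml) {
        String normalized = rawHtml.trim().toLowerCase(Locale.ROOT);
        return normalized.endsWith("</html>");
    }

    public static void main(String[] args) {
        // Assumed sample bodies: one complete, one cut off mid-table.
        System.out.println(looksComplete("<html><body><p>ok</p></body></html>\n")); // true
        System.out.println(looksComplete("<html><body><table><tr><td>Roll no"));    // false
    }
}
```

If the check fails on the raw body, the response is arriving incomplete before parsing even starts, which narrows the problem down to the request rather than the parser.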

Hoping to find fixes!