Bozhidar Bozhanov

About Bozhidar Bozhanov

Senior Java developer, one of the top Stack Overflow users, fluent in Java and Java technology stacks: Spring, JPA, Java EE. Founder and creator of Computoser and Welshare. Worked on Ericsson projects, Bulgarian e-government projects and large-scale online recruitment platforms.

A Scraping Library

As part of a project I’m working on, I needed to get documents from state institutions. Instead of writing code specific to each site, I decided to try creating a “universal” document scraper. It can be found as a separate module within the main project: https://github.com/Glamdring/state-alerts/. The project is written in Scala and can be used in any JVM project (provided you add the Scala library jar as a dependency). It is meant for scraping documents rather than arbitrary data. It could probably be extended to do that, but for now I’d like it to be more (state) document / open-data oriented, rather than a tool for commercial scraping (which is often frowned upon).

It is now in a more or less stable form: I’ve already deployed the application and it works properly, so I’ll just share a short description of the functionality. The point is to be able to specify scraping purely via configuration. The class used to configure individual scraping instances is ExtractorDescriptor. There you specify a number of things:

  • Target URL, HTTP method, and body parameters (in case of POST). You can include a placeholder {x}, which will be used for paging
  • The type of document (PDF, doc, HTML) and the type of the scraping workflow – i.e. how the document is reached on the target page. There are 4 options, depending on whether there’s a separate details page, whether there’s only a table, and where the link to the document is located
  • XPath expressions for the elements containing the metadata and the links to the documents. A different expression is used depending on where the information is located – in a table or on a separate details page
  • Date format, for the date of the document; optionally a regex can be used, in case the date cannot be strictly located by XPath
  • Simple “heuristics” – if you know the URL structure of the document you are looking for, there’s no need to locate it via XPath
  • Other configurations, like JavaScript requirements, whether scraping should fail on error, etc.
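To make a few of these options more concrete, here is a minimal Java sketch of what they amount to: substituting the {x} paging placeholder, pulling metadata and a document link out of a table row with XPath, and using a regex to isolate a date that XPath alone cannot pin down. It deliberately uses none of the library's classes. The class name ScrapeConfigSketch, its method names, the example.org URL and the sample row are all made up for illustration, and it relies on the JDK's built-in XPath over a well-formed snippet rather than the HtmlUnit-based parsing the library depends on.

```java
import java.io.StringReader;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of the state-alerts scraper module.
public class ScrapeConfigSketch {

    // Replaces the {x} paging placeholder with a concrete page number.
    static String expandPage(String urlTemplate, int page) {
        return urlTemplate.replace("{x}", String.valueOf(page));
    }

    // Evaluates an XPath expression against well-formed markup and returns the trimmed text.
    static String xpathText(String markup, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(markup))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate " + expression, e);
        }
    }

    // If a regex is given, uses it to isolate the date inside the text; then parses with the date format.
    static LocalDate parseDate(String text, String dateFormat, Optional<String> dateRegex) {
        String candidate = text.trim();
        if (dateRegex.isPresent()) {
            Matcher matcher = Pattern.compile(dateRegex.get()).matcher(text);
            if (!matcher.find()) {
                throw new IllegalArgumentException("No date found in: " + text);
            }
            candidate = matcher.group();
        }
        return LocalDate.parse(candidate, DateTimeFormatter.ofPattern(dateFormat));
    }

    public static void main(String[] args) {
        // Made-up paging URL and table row, standing in for a real institution's page.
        String url = expandPage("http://example.org/documents?page={x}", 3);
        String row = "<tr><td>Decision 12 from 05.03.2014</td>"
                + "<td><a href=\"/files/12.pdf\">PDF</a></td></tr>";

        String metadata = xpathText(row, "/tr/td[1]");
        String link = xpathText(row, "/tr/td[2]/a/@href");
        LocalDate date = parseDate(metadata, "dd.MM.yyyy", Optional.of("\\d{2}\\.\\d{2}\\.\\d{4}"));

        System.out.println(url + " -> " + link + " dated " + date);
    }
}
```

The regex step matters in cases like the sample row, where the date is embedded in a longer cell text and an XPath expression can only select the whole cell.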

When you have an ExtractorDescriptor instance ready (for Java apps you can use the builder to create one), you can create a new Extractor(descriptor), and then (usually from a scheduled job) call extractor.extractDocuments(since).
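The scheduled-job part can be sketched in plain Java along the following lines. The DocumentSource interface is a hypothetical stand-in for the library's Extractor: the real extractDocuments signature, its parameter type and its return type may well differ, so treat this only as an outline of the polling pattern, in which each run asks for documents newer than the previous run and then advances the "since" marker.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not part of the state-alerts scraper module.
public class ScheduledScrapeSketch {

    // Stand-in for the library's Extractor (assumption: the real API may use other types).
    interface DocumentSource {
        List<String> extractDocuments(Instant since);
    }

    private final DocumentSource source;
    private final List<String> collected = new ArrayList<>();
    private Instant lastRun;

    ScheduledScrapeSketch(DocumentSource source, Instant startFrom) {
        this.source = source;
        this.lastRun = startFrom;
    }

    // One polling cycle: fetch everything newer than the previous run, then advance the marker.
    // If extraction throws, the marker is left untouched so the same period is retried later.
    synchronized List<String> runOnce(Instant now) {
        List<String> fresh = source.extractDocuments(lastRun);
        collected.addAll(fresh);
        lastRun = now;
        return fresh;
    }

    synchronized Instant lastRun() {
        return lastRun;
    }

    synchronized List<String> collected() {
        return new ArrayList<>(collected);
    }

    public static void main(String[] args) throws InterruptedException {
        // A fake source that just reports which "since" value it was asked for.
        DocumentSource fake = since -> Collections.singletonList("documents published after " + since);
        ScheduledScrapeSketch job = new ScheduledScrapeSketch(fake, Instant.EPOCH);

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Poll once per second for the demo; a real job would more likely run hourly or daily.
        scheduler.scheduleAtFixedRate(
                () -> System.out.println(job.runOnce(Instant.now())), 0, 1, TimeUnit.SECONDS);

        Thread.sleep(2500);
        scheduler.shutdownNow();
    }
}
```

Note that scheduleAtFixedRate stops rescheduling a task once it throws, so a production job would probably catch and log exceptions inside the scheduled lambda.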

The result is a list of documents (there are two methods – one returns a Scala list, and one returns a Java list).

The library depends on htmlunit, nekohtml, scala, xml-apis and some more, all visible in the pom. It doesn’t support multiple parsers. It also doesn’t handle distributed running of scraping tasks – you have to handle that yourself. No jar release or Maven dependency is published yet – if you need it, you have to check it out and build it. I hope it is useful, though. If not as code, then at least as an approach to getting data from web pages programmatically.
 

Reference: A Scraping Library from our JCG partner Bozhidar Bozhanov at Bozho’s tech blog.

Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy