Bozhidar Bozhanov

About Bozhidar Bozhanov

Senior Java developer, one of the top Stack Overflow users, fluent with Java and Java technology stacks such as Spring, JPA and Java EE. Founder and creator of Computoser and Welshare. Worked on Ericsson projects, Bulgarian e-government projects and large-scale online recruitment platforms.

A Scraping Library

As part of a project I’m working on, I needed to get documents from state institutions. Instead of writing site-specific code for each source, I decided to try creating a “universal” document scraper. It can be found as a separate module within the main project https://github.com/Glamdring/state-alerts/. The project is written in Scala and can be used in any JVM project (provided you add the Scala library as a dependency). It is meant for scraping documents rather than arbitrary data. It could probably be extended to do that, but for now I’d like it to stay (state-)document / open-data oriented, rather than become a tool for commercial scraping (which is often frowned upon).

The library is now in a more or less stable form; I’ve already deployed the application and it works properly, so I’ll share a short description of the functionality. The point is to be able to specify scraping only via configuration. The class used to configure individual scraping instances is ExtractorDescriptor. There you specify a number of things (a sketch follows the list):

  • Target URL, HTTP method and body parameters (in case of POST). You can put a placeholder {x} which will be used for paging
  • The type of document (PDF, doc, HTML) and the type of the scraping workflow, i.e. how the document is reached on the target page. There are four options, depending on whether there’s a separate details page, whether there’s only a table, and where the link to the document is located
  • XPath expressions for the elements containing metadata and the links to the documents. There’s a different expression depending on whether the information is located in a table or on a separate details page
  • Date format for the date of the document; optionally a regex can be used in case the date cannot be located precisely via XPath
  • Simple “heuristics” – if you know the URL structure of the documents you are looking for, there’s no need to locate them via XPath
  • Other configuration, like JavaScript requirements, whether scraping should fail on error, etc.
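
To make these options more concrete, here is a rough Java-side sketch of building a descriptor with the builder mentioned below. The builder method names (targetUrl, httpMethod, documentType, and so on) are hypothetical illustrations of the fields listed above, not the library’s actual API; check ExtractorDescriptor in the state-alerts repository for the real names.

    // Hypothetical sketch – the builder method names are illustrative only;
    // see ExtractorDescriptor in the state-alerts repository for the real API.
    ExtractorDescriptor descriptor = ExtractorDescriptor.builder()
            .targetUrl("http://institution.example.com/documents?page={x}") // {x} is the paging placeholder
            .httpMethod("GET")
            .documentType("PDF")                      // PDF, doc or HTML
            .workflowType("TABLE_WITH_LINK")          // one of the four workflow options
            .recordXPath("//table[@id='docs']//tr")   // rows containing the metadata
            .documentLinkXPath(".//a/@href")          // link to the document itself
            .dateXPath(".//td[3]")
            .dateFormat("dd.MM.yyyy")
            .javascriptRequired(false)
            .failOnError(false)
            .build();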

When you have an ExtractorDescriptor instance ready (for Java apps you can use the builder to create one), you can create a new Extractor(descriptor) and then (usually from a scheduled job) call extractor.extractDocuments(since).

The result is a list of documents (there are two methods: one returns a Scala list and one returns a Java list).
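
Putting it together, a scheduled job could look roughly like the sketch below. Only new Extractor(descriptor) and extractDocuments(since) come from the description above; the Document type and the extractDocumentsJava(...) name for the Java-list variant are assumptions used for illustration.

    import java.util.Date;
    import java.util.List;

    // Sketch of a periodic scraping job. new Extractor(descriptor) and
    // extractDocuments(since) are described above; the Document type and the
    // extractDocumentsJava(...) name for the Java-list variant are assumptions.
    public class ScrapingJob {

        private final Extractor extractor;
        private Date lastRun = new Date(0); // first run: fetch everything

        public ScrapingJob(ExtractorDescriptor descriptor) {
            this.extractor = new Extractor(descriptor);
        }

        // invoke this from a scheduler (cron, Quartz, Spring @Scheduled, ...)
        public void run() {
            List<Document> documents = extractor.extractDocumentsJava(lastRun);
            lastRun = new Date();
            for (Document document : documents) {
                store(document); // persist or index the scraped documents
            }
        }

        private void store(Document document) {
            // application-specific handling of each document
        }
    }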

The library depends on HtmlUnit, NekoHTML, Scala, xml-apis and a few more libraries, visible in the pom. It doesn’t support multiple parsers. It also doesn’t handle distributed running of scraping tasks – that you should handle yourself. No jar release or Maven dependency is published yet; if you need it, it has to be checked out and built. I hope it is useful, though – if not as code, then at least as an approach to getting data from web pages programmatically.
 

Reference: A Scraping Library from our JCG partner Bozhidar Bozhanov at the Bozho’s tech blog.


