
About Bozhidar Bozhanov

Senior Java developer, one of the top Stack Overflow users, fluent with Java and Java technology stacks - Spring, JPA, Java EE, as well as Android, Scala and any framework you throw at him. Creator of Computoser - an algorithmic music composer. Worked on telecom projects, e-government and large-scale online recruitment and navigation platforms.

The Low Quality of Scientific Code

Recently I’ve been trying to get a bit into music theory, machine learning and computational linguistics, so I ended up looking at libraries and tools written by the scientific community – examples include the Stanford CoreNLP library, GATE, Weka, jMusic, and several more.

The general feeling is that scientific libraries have mostly bad code. I will not point fingers, but there are too many freshman mistakes – not considering thread-safety, cryptic, ugly and/or stringly-typed APIs, lack of type-safety, poorly named variables and methods, choosing bad/slow serialization formats, writing debug messages to System.err (or out), lack of documentation, lack of tests.
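To make a couple of these mistakes concrete, here is a minimal sketch of what replacing a stringly-typed option with a type-safe one might look like. All names in it (`TypedApiDemo`, `TaggerConfig`, `Model`, `parseModel`) are hypothetical and invented for illustration; they are not taken from any of the libraries mentioned above. It assumes a library that accepts a model name as a plain string and shows one possible way to wrap it.

```java
import java.util.Locale;

// Hypothetical sketch: none of these names come from the libraries mentioned above.
class TypedApiDemo {

    /** A closed set of options that the compiler can check. */
    enum Model { FAST, ACCURATE }

    /** Immutable configuration, so an instance can be shared between threads. */
    static final class TaggerConfig {
        private final Model model;
        private final int maxSentenceLength;

        TaggerConfig(Model model, int maxSentenceLength) {
            if (model == null) {
                throw new IllegalArgumentException("model must not be null");
            }
            if (maxSentenceLength <= 0) {
                throw new IllegalArgumentException("maxSentenceLength must be positive");
            }
            this.model = model;
            this.maxSentenceLength = maxSentenceLength;
        }

        Model model() { return model; }

        int maxSentenceLength() { return maxSentenceLength; }
    }

    /**
     * Bridges a legacy string option into the typed API.
     * Unknown names throw IllegalArgumentException instead of being silently accepted.
     */
    static Model parseModel(String raw) {
        return Model.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Typed call: a misspelled constant would not compile.
        TaggerConfig config = new TaggerConfig(Model.ACCURATE, 200);
        System.out.println(config.model() + " " + config.maxSentenceLength());

        // String input (for example from a properties file) is validated once, at the boundary.
        System.out.println(parseModel(" fast "));
        try {
            parseModel("acurate"); // typo in the option name
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the design is that a typo in an option name fails at compile time, or at worst at the single place where strings are parsed, rather than somewhere deep inside an algorithm.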

 
Thus using these libraries becomes time-consuming and error-prone. Every 10 minutes you see some horribly written code that you don’t have the time to fix. And it’s not just one or two things that you would report in a normal open-source project – it’s an overall low level of quality. On the other hand, these libraries have a lot of value, because the low-level algorithms would take even more time, and especially know-how, to implement, so just reusing them is obviously the right approach. Some libraries are even original research, so you just can’t write them yourself without spending 3 years on a PhD thesis.

I cannot help but mention Heartbleed here – OpenSSL is written by people from the scientific community, and much has been written about how even OpenSSL does not meet modern software engineering standards.

But that’s only the surface. Scientists in general can’t write good code. They write code simply to achieve their immediate goal, and then either throw it away, or keep using it for themselves. They are not software engineers, and they don’t seem to be concerned with code quality, code coverage, API design. Not to mention scientific infrastructure, deployment on multiple servers, managing environment. These things are rarely done properly in the scientific community.

And that’s not only in computer science and related fields like computational linguistics – it’s everywhere, because every science now requires at least computer simulations. Biology, bioinformatics, astronomy, physics, chemistry, medicine, etc. – almost every scientist has to write code. And they aren’t good at it.

And that’s OK – we are software engineers and we dedicate our time and effort to these things; they are scientists, and they have vast knowledge in their domain. Scientists use programming the way software engineers use public transport – just as a means to get to what they have to do. And scientists should not be distracted from their domain by becoming software engineers.

But the problem is still there. Not only are there bad libraries, but the code scientists write may yield wrong results, run slowly, or crash regularly, which directly slows down or even invisibly hampers their work.

For the libraries, we software engineers can contribute, or companies using them can dedicate an engineer to improving the library. Refactor, clean up, document, test. The authors of the libraries will be more than glad to have someone prettify their hairy code.

The other problem is tougher – science needs funding for dedicated software engineers, and research teams prefer to use that funding for actual scientists. And maybe that’s a better investment, maybe not. I can say for myself that I’d be glad to join a research team and help with the software part, while at the same time gaining knowledge in the field. And that would be fascinating, and way more exciting than writing boring business software. Unfortunately that doesn’t happen too often now (I tried once, a couple of years ago, and got rejected because I lacked formal education in biology).

Maybe software engineers can help in the world of science. But money is a factor.

 

Reference: The Low Quality of Scientific Code from our JCG partner Bozhidar Bozhanov at the Bozho’s tech blog.


6 comments

  1. I’ve used a lot of scientific libraries and I’d have to agree that the code quality is not always as good as it could be. However, if we just criticise, we run the risk that some library of useful tools will go unshared because the author daren’t risk the ridicule.

    Maybe the way to go is to do a kind of Apache Commons thing, and take on the development of a library once it’s out there. In fact, a lot of the complicated maths I’ve done (including some of the stuff I used in my PhD), I learned by looking at someone else’s source code and trying to understand what it was doing. So the effort on improving libraries might work both ways…

  2. Good article, I agree with many of your points.

    I also think that a part of the problem is the model of funding for many science research communities (grants) and how it doesn’t often lend itself to paying big money for well designed software. Hence some scientists end up writing it themselves, or cheap outsourcing, and the quality is reduced in many cases (as you said, they are scientists, not software engineers).

    For a senior project in University I was on a team that implemented a motion-detection tool for biologists, and this was something our client was really excited about, because she previously had no way to get this tool developed, as grant money was very tight in her department.

    We have since released it as open source on Github, but another part of the problem (as you alluded to) is that scientific software is usually quite complex, and also sometimes limited to select use cases, and that makes good open source contributors sometimes hard to come by, and affects the features of the product as well as the quality.

  3. this is so true.

  4. Are you only referring to open source software? If so, what quality requirement do you expect, and would you be willing to contribute then yourself to open source developments?

  5. You make plenty of fair points about academic software. Here’s the thing — I would argue that this lack of engineering quality in academic software is a feature, not a bug. There is basically little to no incentive to produce high quality software for academics, and that is how it should be. Our currency is ideas and publications based on them, and those are obtained not by having wonderful software, but by having great results. We have limited time, and that time is best put into thinking about interesting models and careful evaluation and analysis. The code is there to support that, and is fine as long as it is correct.

    The truly important metric for me is whether the code supports replicability of results from the paper it supports. The code can be as ugly as you can possibly imagine as long as it does this. Unfortunately, a lot of academic software doesn’t make replication easy. Nonetheless, having the code open sourced makes it at least possible to hack with it to try to replicate previous results. In the last few years, I’ve personally put a lot of effort into having my work and my students’ work easy to replicate. I’m particularly proud of how I put code, data and documentation together for a paper I did on topic model evaluation:

    https://github.com/utcompling/topicmodel-eval

    That was a lot of work, but I’ve already benefited from it myself (in terms of being able to get the data and run my own code).

    Having said the above, I think it is really interesting to see how people who have made their code easy to use (though not always well-engineered) have benefited from doing so in the academic realm. A good example is word2vec and how the software that was released for it generated tons of interest in industry as well as academia and probably led to much wider dissemination of that work, and to more follow-on work. Academia itself doesn’t reward that directly, nor should it. That’s one reason you see it coming out of companies like Google, but it might be worth it to some researchers in some cases, especially PhD students who seek industry jobs after they defend their dissertation.

  6. Oh! Thank God there is not a single drop of bigotry in the whole developer community!
