The Low Quality of Scientific Code
Recently I’ve been trying to get a bit into music theory, machine learning, and computational linguistics, so I ended up looking at libraries and tools written by the scientific community – examples include the Stanford CoreNLP library, GATE, Weka, jMusic, and several more.
The general feeling is that scientific libraries have mostly bad code. I will not point fingers, but there are too many freshman mistakes – not considering thread-safety, cryptic, ugly and/or stringly-typed APIs, lack of type-safety, poorly named variables and methods, choosing bad/slow serialization formats, writing debug messages to System.err (or out), lack of documentation, lack of tests.
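To make two of these complaints concrete, here is a minimal Java sketch that contrasts a stringly-typed method with an enum-based one, and sends a debug message through java.util.logging instead of System.err. The class TaggerDemo, the enum PartOfSpeech, and both methods are hypothetical names invented for illustration; they are not taken from any of the libraries mentioned above.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical example: none of these names come from a real library.
class TaggerDemo {

    // A closed set of values the compiler can check.
    enum PartOfSpeech { NOUN, VERB, ADJECTIVE, DETERMINER }

    private static final Logger LOG = Logger.getLogger(TaggerDemo.class.getName());

    // Stringly-typed variant: a typo such as "nuon" compiles fine
    // and silently yields false at run time.
    static boolean isContentWordStringly(String tag) {
        return tag.equals("noun") || tag.equals("verb") || tag.equals("adjective");
    }

    // Type-safe variant: only values of PartOfSpeech are accepted,
    // so a misspelled tag is a compile error instead of a silent bug.
    static boolean isContentWord(PartOfSpeech tag) {
        // FINE messages are hidden by default and can be switched on through
        // logging configuration, unlike a hard-coded System.err.println call.
        LOG.log(Level.FINE, "Checking tag {0}", tag);
        return tag == PartOfSpeech.NOUN
                || tag == PartOfSpeech.VERB
                || tag == PartOfSpeech.ADJECTIVE;
    }

    public static void main(String[] args) {
        System.out.println(isContentWordStringly("nuon"));          // typo goes unnoticed
        System.out.println(isContentWord(PartOfSpeech.NOUN));       // checked by the compiler
        System.out.println(isContentWord(PartOfSpeech.DETERMINER));
    }
}
```

The enum version costs a few extra lines, but callers can no longer pass an arbitrary string, and the library stays quiet unless the user asks for debug output.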
Thus using these libraries becomes time-consuming and error-prone. Every 10 minutes you see some horribly written code that you don’t have the time to fix. And it’s not just one or two things that you would report in a normal open-source project – it’s an overall low level of quality. On the other hand, these libraries have a lot of value, because the low-level algorithms would take even more time, and especially know-how, to implement, so just reusing them is obviously the right approach. Some libraries are even original research, so you just can’t write them yourself without spending 3 years on a PhD thesis.
I cannot but mention Heartbleed here – OpenSSL is written by scientific people, and much has been written on the topic that even OpenSSL does not meet modern software engineering standards.
But that’s only the surface. Scientists in general can’t write good code. They write code simply to achieve their immediate goal, and then either throw it away, or keep using it for themselves. They are not software engineers, and they don’t seem to be concerned with code quality, code coverage, or API design. Not to mention scientific infrastructure, deployment on multiple servers, and managing environments. These things are rarely done properly in the scientific community.
And that’s not only in computer science and related fields like computational linguistics – it’s everywhere, because every science now requires at least computer simulations. Biology, bioinformatics, astronomy, physics, chemistry, medicine, etc. – almost every scientist has to write code. And they aren’t good at it.
And that’s OK – we are software engineers and we dedicate our time and effort to these things; they are scientists, and they have vast knowledge in their domain. Scientists use programming the way software engineers use public transport – just as a means to get to what they have to do. And scientists should not be distracted from their domain by becoming software engineers.
But the problem is still there. Not only are there bad libraries, but the code scientists write may yield wrong results, run slowly, or regularly crash, which directly slows down or even invisibly hampers their work.
For the libraries, we software engineers can contribute, or companies using them can dedicate an engineer to improving the library. Refactor, clean up, document, test. The authors of the libraries will be more than glad to have someone prettify their hairy code.
The other problem is tougher – science needs funding for dedicated software engineers, and research teams prefer to use that funding for actual scientists. And maybe that’s a better investment, maybe not. I can say for myself that I’d be glad to join a research team and help with the software part, while at the same time gaining knowledge in the field. That would be fascinating, and way more exciting than writing boring business software. Unfortunately that doesn’t happen too often now (I tried once, a couple of years ago, and got rejected because I lacked formal education in biology).
Maybe software engineers can help in the world of science. But money is a factor.
Reference: The Low Quality of Scientific Code from our JCG partner Bozhidar Bozhanov at the Bozho’s tech blog.
I’ve used a lot of scientific libraries and I’d have to agree that the code quality is not always as good as it could be. However, if we just criticise, we run the risk that some library of useful tools will go unshared because the author daren’t risk the ridicule. Maybe the way to go is to do a kind of Apache Commons thing, and take on the development of a library once it’s out there. In fact, a lot of the complicated maths I’ve done (including some of the stuff I used in my PhD), I learned by looking…
Good article, I agree with many of your points. I also think that a part of the problem is the model of funding for many science research communities (grants) and how it doesn’t often lend itself to paying big money for well designed software. Hence some scientists end up writing it themselves, or cheap outsourcing, and the quality is reduced in many cases (as you said, they are scientists, not software engineers). For a senior project in University I was on a team that implemented a motion-detection tool for biologists, and this was something our client was really excited about,…
this is so true.
Are you only referring to open source software? If so, what quality requirement do you expect, and would you be willing to contribute then yourself to open source developments?
You make plenty of fair points about academic software. Here’s the thing: I would argue that this lack of engineering quality in academic software is a feature, not a bug. There is basically little to no incentive to produce high quality software for academics, and that is how it should be. Our currency is ideas and publications based on them, and those are obtained not by having wonderful software, but by having great results. We have limited time, and that time is best put into thinking about interesting models and careful evaluation and analysis. The code is there to…
Oh! Thank God there is not a single drop of bigotry in all the developers community !