The Broken Scientific Publishing Model and My Attempt to Improve It

Bozhidar Bozhanov, October 12th, 2016 (Last Updated: October 12th, 2016)


I’ll begin this post with a rant about the state of scientific publishing, then review the technology “disruption” landscape and offer a partial improvement that I developed (source).

Scientific publishing is quite important – all of science is based on previously confirmed “science”, so knowing what the rest of the scientific community has done or is doing is essential to research. It is what allows scientists to “stand on the shoulders of giants”.

The web was basically invented to improve the sharing of scientific information – it was created at CERN and allowed linking from one (research) document to others.

However, scientific publishing at the moment is one of the few industries that haven’t benefited from the web. Well, the industry has – the community hasn’t, at least not as much as one would like.

Elsevier, Thomson-Reuters, Springer and other publishers make huge profits (e.g. 39% margin on a 2 billion revenue) for doing something that should basically be free in this century – they spread the knowledge that scientists have created. You can see here some facts about their operation, the most striking being that each university has to pay more than a million dollars to get the literature it needs.

It’s because they rely on a centuries-old process of submission to journals, accepting the submission, then printing and distributing to university libraries. Recently publishers have put publications online, but they are behind paywalls or accessible only after huge subscription fees have been paid.

I’m not a “raging socialist” but sadly, publishers don’t provide (sufficient) value. They simply gather the work of scientists that is already funded by public money, sometimes get the copyright on that, and disseminate it in a pre-Internet way.

They also do not pay for peer review of the submitted publications, they simply “organize it” – which often means “a friend of the editor is a professor and he made his postdocs write peer reviews”. Peer review is thus itself broken, as it is non-transparent and often of questionable quality. The funny side of the peer review process is caught at “shitsmyreviewerssay”.

Oh, and of course authors should themselves write their publication in a journal-preferred template (and each journal has its own preferences). So the only actual work that the journals do is typesetting and editorial filtering.

So, we have expensive scientific literature with no added value and a broken peer review system.

And at that point you may argue that if they do not add value, they can be easily replaced. Well, no. Because of the Impact Factor – the metric for determining the most cited journals, and by extension – the reputation of the authors that manage to get published in these journals. The impact factor is calculated based on a big database (Web of Science) and assigns a number to each journal. The higher the impact factor of a journal, the better the career opportunities of a scientist who manages to get accepted for publication in that journal.
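To make the metric concrete: the commonly cited two-year definition divides the citations a journal receives in a given year, to items it published in the previous two years, by the number of citable items it published in those two years. Below is a minimal Python sketch of that arithmetic. The numbers are entirely hypothetical, and the per-item input is a simplification; real values are derived from the Web of Science data, which is not reproduced here.

```python
from statistics import median


def two_year_impact_factor(citations_per_item):
    """Average citations per citable item in the two-year window.

    citations_per_item: for each citable item the journal published in the
    previous two years, the number of citations it received this year.
    """
    if not citations_per_item:
        raise ValueError("no citable items in the two-year window")
    return sum(citations_per_item) / len(citations_per_item)


# Hypothetical journal with 10 citable items (made-up numbers).
hypothetical_citations = [120, 30, 5, 3, 1, 1, 0, 0, 0, 0]

print(two_year_impact_factor(hypothetical_citations))  # 16.0
print(median(hypothetical_citations))                  # 1.0
```

In this made-up example two heavily cited papers pull the journal-level average to 16, while the typical (median) paper gets a single citation. The number describes the journal, not any particular article in it.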

You may think that the impact factor is objective – well, it isn’t. It is based on data that only publishers (Thomson-Reuters in particular) have and when others tried to reproduce the impact factor, it was nearly 40% off (citation needed, but I lost the link). Not only that, but it’s an impact factor of the journal, not the scientists themselves.

So the fact that publishers are the judge, jury and executioner, means they can make huge profits without adding much value (and yes, they allow searching through the entire collection they have, but full-text search on a corpus of text isn’t exactly rocket science these days). That means scientists don’t have access to everything they may need, and that poor universities won’t be able to keep up. Not to mention individual researchers who are just left out. In general, science suffers from the inefficient sharing and assessment of research.

The situation is even worse, actually – due to the lack of incentive for publishers to change their process (among other things), as a popular journal editor once said – “much of the scientific literature, perhaps half, may simply be untrue”. So the fact that you are published in a somewhat impactful journal doesn’t mean your publication has been thoroughly reviewed, nor that the reviewers bear any responsibility for their oversights.

Many discussions have been held about why disruption hasn’t yet happened in this apparently broken field. And it’s most likely because of the “chicken and egg problem” – scientists have an incentive to publish to journals because of the impact factor, and that way the impact factor is reinforced as a reputation metric.

Then comes open access – a movement that requires scientific publications to be publicly accessible. It comes in two forms:

“green open access”, or “preprints” (yup, “print” is still an important word) – you just push your work to an online repository – it’s not checked by editors or reviewers, it just stays there.
“gold open access” – the author/library/institution pays a processing fee to publish the publication and then it becomes public. Important journals that use this include PLOS, F1000 and others.

“Gold open access” solves almost nothing, as it just shifts the fees (maybe it reduces them, but again – a processing fee to get something published online, really?). “Green open access” doesn’t give you the reputation benefits – preprint repos don’t have an impact factor.

Then there’s Google Scholar, which has agreements with publishers to aggregate their content and provide search results (not the full publications). It also provides some citation metrics on top of that. It forms a researcher profile based on them, which can actually be used as a replacement for the impact factor.

Because of that, many attempts have been made to either “revolutionize” scientific publishing, or augment it with additional services that would have the potential to one day become prevalent and take over the process. I’ll try to summarize the various players:

preprint repositories – this is where scientists publish their works before submitting them to a journal. The major player is arXiv, but there are others as well (list, map)
scientific “social networks” – Academia.edu, ResearchGate offer a way to connect with fellow-researchers and share your publications, thus having a public researcher profile. Scientists get analytics about the number of reads their publications get and notifications about new research they might be interested in. It is similar to a preprint repo, as they try to get hold of a lot of publications.
services which try to completely replace the process of scientific publishing – they try to be THE service where you publish, get reviewed and get a “score”. These include SJS, The Winnower and possibly science.ai. Academia.edu and ResearchGate can also maybe fit in this category, as they offer some way of feedback (and plan or already have peer-review) and/or some score (RG score).
tools to support researchers – Mendeley (a personal collection of publications), Authorea (a tool for collaboratively editing publications), Figshare (a place for sharing auxiliary materials like figures, datasets, source code, etc.), Publons (a system to collect everyone’s peer reviews), labii.com and Open Science Framework (sets of tools for researchers), Altmetric (tool to track the activity around research), ScholarPedia and OpenWetWare (wikis)
impact calculation services – in addition to the RG score, there’s ImpactFactory
scientist identity – each of the social networks tries to be “the profile page” of a scientist. Additionally, there are identifiers such as ORCID, researcherId, and a few others by individual publishers. Maybe fortunately, all are converging towards ORCID at the moment.
search engines – Google Scholar, Microsoft Academic, Science Direct (by Elsevier), Papers, PubPeer, Base Search, CLOCKSS and of course Sci-Hub – which mostly rely on contracts with publishers (with the exception of Sci-Hub)
journals with a more modern, web-based workflow – F1000Research, Cureus, Frontiers, PLoS

Most of these services are great and created with the real desire to improve the situation. But unfortunately, many have problems. ResearchGate has been accused of too much spamming, and its RG score is questionable; Academia.edu is accused of having too many fake accounts for the sake of making investors happy; Publons is a place where peer review should be something you brag about, yet very few reviews are made public by the reviewers (which signifies a cultural problem). SJS and The Winnower have too few users, and the search engines are dependent on the publishers. Mendeley and others were acquired by the publishers so they no longer pose a threat to the existing broken model.

Special attention has to be paid to Sci-Hub, the “illegal” place where you can get the knowledge you want to find. Alexandra Elbakyan created Sci-Hub, which automatically collects publications through library and university networks using credentials donated by researchers. That way all of the content is public and searchable by DOI (the digital identifier of an article, which by the way is also a broken concept, because in order to give your article an identifier, you need to pay for a “range”). So Sci-Hub seems like a good solution, but it doesn’t actually fix the underlying workflow. It has been sued and its original domain(s) taken, so it’s something like The Pirate Bay for science – it takes effort and idealistic devotion in order to stay afloat.

The lawsuits against Sci-Hub, by the way, are an interesting thing – publishers want to sue someone for giving access to content that they have taken for free from scientists. Sounds fair and the publishers are totally not “evil”?

I have had discussions with many people, and read a lot of articles discussing the disruption of the publishing market (here, here, here, here, here, here, here). And even though some of the articles are from several years ago, the change isn’t yet here.

Approaches that are often discussed are the following, and I think neither of them is working:

have a single service that is a “mega-journal” – you submit, get reviewed, get searched, get listed in news sections about your area and/or sub-journals. “One service to rule them all”, i.e. a monopoly, is also not good in the long term, even if the intentions of its founders are good (initially)
have tools that augment the publishing process in the hope of getting more traction and thus gradually getting scientists to change their behaviour – I think the “augmenting” services begin with the premise that the current system cannot be easily disrupted, so they should at least provide some improvement on it and ease of use for the scientists.

On the plus side, it seems that some areas of research almost exclusively rely on preprints (green open access) now, so publishers have a diminishing influence. But that process is very slow. That’s why I wanted to do something to help make it faster and better.

So I created a WordPress plugin (source). Yes, it’s so trivial. I started with a bigger project in mind and even worked on it for a while, but it was about to end up in the first category above, the “mega-journal”, and that seems to have been tried already, hasn’t been particularly successful, and is risky long term.

Of course a WordPress plugin isn’t a new idea either. But all attempts that I’ve seen either haven’t been published, or provide just extras and tools, like reference management. My plugin has three important aspects:

JSON-LD – it provides semantic annotations for the scientific content, making it more easily discoverable and parseable
peer review – it provides a simple, post-publication peer review workflow (which is an overstatement for “comments with extra parameters”)
it can be deployed by anyone – both as a personal website of a scientist and as a library/university-provided infrastructure for scientists. Basically, you can have a WordPress installation + the plugin, and get green open access + basic peer review for your institution. For free.
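To give an idea of what such a semantic annotation looks like, here is a minimal Python sketch that builds a schema.org ScholarlyArticle description, with post-publication reviews attached, and wraps it in the script tag that JSON-LD consumers look for. It is an illustration only and not the plugin's actual output: the function names, the choice of properties (headline, url, datePublished, author, sameAs, review, reviewBody) and all sample values are assumptions made for this example.

```python
import json


def scholarly_article_jsonld(title, url, author_name, author_orcid,
                             date_published, reviews=()):
    """Build a schema.org ScholarlyArticle description as a plain dict.

    `reviews` is an iterable of dicts with hypothetical keys
    "reviewer" and "text".
    """
    doc = {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "headline": title,
        "url": url,
        "datePublished": date_published,
        "author": {
            "@type": "Person",
            "name": author_name,
            # Link the author to their ORCID record.
            "sameAs": f"https://orcid.org/{author_orcid}",
        },
    }
    if reviews:
        doc["review"] = [
            {
                "@type": "Review",
                "author": {"@type": "Person", "name": r["reviewer"]},
                "reviewBody": r["text"],
            }
            for r in reviews
        ]
    return doc


def as_script_tag(doc):
    """Serialize the dict for embedding in an HTML page."""
    # Escape "</" so text inside the JSON cannot close the script element.
    payload = json.dumps(doc, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


# Hypothetical article and review, for illustration only.
example = scholarly_article_jsonld(
    title="An Example Preprint",
    url="https://example.org/papers/an-example-preprint",
    author_name="Jane Researcher",
    author_orcid="0000-0002-1825-0097",
    date_published="2016-10-12",
    reviews=[{"reviewer": "John Reviewer", "text": "Sound methodology."}],
)
print(as_script_tag(example))
```

A crawler that understands JSON-LD can read the author, identifier and reviews from this block without scraping the page layout, which is the whole point of the annotation.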

What is the benefit of the semantic part? I myself have argued that the semantic web won’t succeed anytime soon because of a chicken-and-egg problem – there is no incentive to “semanticize” your page, as there is no service to make use of it; and there are no services, because there are no semantic pages. And also, there’s a lot of complexity in making something “semantic” (RDF and related standards are anything but webmaster-friendly). There are niche cases, however. The Open Graph protocol, for example, makes a web page “shareable on Facebook”, so webmasters have the incentive to add these tags.

I will soon contact Google Scholar, Microsoft Academic and other search engines to convince them to index semantically-enabled web-published research. The point is to have an incentive, just like with the Facebook example, to use the semantic options. I’ll also get in contact with ResearchGate/Academia.edu/arXiv/etc. to suggest the inclusion of semantic annotations and/or JSON-LD.

The general idea is to have green open access with online post-publication peer review, which in turn lets services make profile pages and calculate (partial) impact scores, without reliance on the publishers. It has to be easy, and it has to include libraries as the main contributor – they have the “power” to change the status quo. And supporting a WordPress installation is quite easy – a library, for example, can set one up for all of the researchers in the institution and let them publish there.

A few specifics of the plugin:

the name “scienation” comes from “science” and either “nation” or the “-ation” suffix.
it uses URLs as article identifiers (which is compatible with DOIs that can also be turned into URLs). There is an alternative identifier, which is the hash of the article (text-only) content – that way the identifier is permanent and doesn’t rely on one holding a given domain.
it uses ORCID as an identity provider (well, not fully, as the OAuth flow is not yet implemented – it requires a special registration which won’t be feasible). One has to enter his ORCID in a field and the system will assume it’s really him. This may be tricky and there may be attempts to publish a bad peer review on behalf of someone else.
the hierarchy of science branches is obtained from Wikipedia, combined with other small sources.
the JSON-LD properties in use are debatable (sample output). I’ve started a discussion on having additional, more appropriate properties in schema.org’s ScholarlyArticle. I’m aware of ScholarlyHTML (here, here and here – a bit confusing which is “correct”), codemeta definitions and the scholarly article ontology. They are very good, but their purpose is different – to represent the internal details of a scientific work in a structured way. There is probably no need for that if the purpose is to make the content searchable and to annotate it with metadata like authors, id, peer reviews and citations. Still, I reuse the ScholarlyArticle standard definition and will gladly accept anything else that is suitable for the use case.
I got the scienation.com domain (nothing to be seen there currently) and one can choose to add his website to a catalog that may be used in the future for easier discovery and indexing of semantically-enabled websites.
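Two of the specifics above, the content-hash identifier and the ORCID field, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the plugin's code: the post does not say which hash function or text normalization is used, so SHA-256 over whitespace-collapsed text is an assumption, and all function names here are hypothetical. The ORCID part uses the check digit scheme (ISO 7064 MOD 11-2) that ORCID documents for its iDs.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def content_identifier(article_html):
    """Hash of the text-only content (assumed: SHA-256, collapsed whitespace)."""
    parser = _TextExtractor()
    parser.feed(article_html)
    parser.close()
    # Collapse whitespace so markup or layout changes alone keep the same id.
    text = " ".join(" ".join(parser.parts).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def orcid_check_digit(first_15_digits):
    """ISO 7064 MOD 11-2 check character for the first 15 digits of an iD."""
    total = 0
    for ch in first_15_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid):
    """True if the iD has the right shape and a matching check digit.

    This only catches typos; it does NOT prove the person entering the
    iD owns it (that needs the OAuth flow mentioned above).
    """
    if not re.fullmatch(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]", orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


# Same text, different markup: the identifier stays the same.
a = content_identifier("<p>Hello <b>world</b></p>")
b = content_identifier("<div>Hello\n world</div>")
print(a == b)  # True

# 0000-0002-1825-0097 is the sample iD used in ORCID's own documentation.
print(is_well_formed_orcid("0000-0002-1825-0097"))  # True
print(is_well_formed_orcid("0000-0002-1825-0098"))  # False
```

A checksum like this does not address the impersonation risk described above, but it is a cheap first filter before any real identity verification is in place.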

The plugin is open source, licensed under GPL (as is required by WordPress), and contributions, discussions and suggestions are more than welcome.

I’m well aware that a simple WordPress plugin won’t fix the debacle that I’ve described in the first part of this article. But I think the right approach is to follow the principle of decentralization and reliance on libraries and individual researchers, rather than on (centralized) companies. The latter has so far proved inefficient and actually slows science down.

Reference:

The Broken Scientific Publishing Model and My Attempt to Improve It from our JCG partner Bozhidar Bozhanov at Bozho’s tech blog.