What is NoSQL?
NoSQL is a classification of database systems that do not conform to the relational database or SQL standard. Most often they are categorized according to the way they store the data and fall under categories such as key-value stores, BigTable implementations, document store databases, and graph databases. In general the term isn’t well enough defined to reduce it to a single supporting JSR or technology. So the only way to find suitable integration technologies is to dig through every single category.
Key/Value stores allow data storage in a schema-less way. It could be stored in a datatype of a programming language or an object. Because of this, there is no need for a fixed data model. This is obviously comparable to parts of JSR 338 (Java Persistence 2.1) and JSR 347 ( Data Grids for the Java Platform) and also to what is done with JSR 107( JCACHE – Java Temporary Caching API).
with native JPA2
Also primary aimed at caching is the JPA L2 Cache. The JPA Cache API is good for basic cache operations, while L2 cache shares the state of an entity — which is accessed with the help of the entity manager factory — across various persistence contexts. Level 2 cache underlies the persistence context, which is highly transparent to the application. When Level 2 cache is enabled, the persistence provider will look for the entities in the persistence context first. If it does not find them there, the persistence provider will look in the Level 2 cache next instead of sending a query to the database. The drawback here obviously is, that as of today this only works with NoSQL as some kind of “Cache”. And not as a replacement for the RDBMS data store. Given the scope of this spec it would be a good fit: But I strongly believe that JPA is designed to be an abstraction on RDBS and nothing else. If there has to be some kind of support for non relational databases we might end up having a more high level abstraction layer in place which tons of different persistence modes and features (maybe something like Spring Data). Generally mapping at the object level has many advantages including the ability to think object and let the underlying engine drive the de-normalization if needed. So reducing JPA to the caching features probably is the wrong decision.
JCache having a CacheManager that holds and controls a collection of Caches and every single Caches have it’s entries. The basic API can be thought of map-like with additional features (compare Greg’s blog). With JCache being designed as a “Cache” using it as a standardised interface against NoSQL data stores this isn’t a good fit on the first look. But given the nature of the use-cases for unstructured Key/Value based data with enterprise java this might be the right kind of integration. And the NoSQL concept also allows for the “Key-value cache in RAM” category which is an exact fit for both JCache and DataGrids.
This JSR proposes an API for interacting with in-memory and disk-based distributed data grids. The API aims to allow users to perform operations on the data grid (PUT, GET, REMOVE) in an asynchronous and non-blocking manner returning a java.util.concurrent.Futures rather than the actual return values. The process here is not really visible at the moment (at least to me). So there aren’t any examples or concepts for integration of a NoSQL Key/Value store available until today. Beside this the same reservations as for the JCache API are in place.
EclipseLink’s NoSQL support is based on previous EIS support offered since EclipseLink 1.0. EclipseLink’s EIS support allowed persisting objects to legacy and non-relational databases. EclipseLink’s EIS and NoSQL support uses the Java Connector Architecture (JCA) to access the data-source similar to how EclipseLink’s relational support uses JDBC. EclipseLink’s NoSQL support is extendable to other NoSQL databases, through the creation of an EclipseLink EISPlatform class and a JCA adapter. At the moment it supports MongoDB (Document Oriented) and Oracle NoSQL (BigData). It’s interesting to see, that Oracle doesn’t address the Key/Value DBs first. Might be because of the possible confusion with the Cache features (e.g. Coherence).
Column based DBs
Read and write is done using columns rather than rows. The best known examples are Google’s BigTable and the likes of HBase and Cassandra that were inspired by BigTable. The BigTable paper says that BigTable is a sparse, distributed, persistent, multidimensional sorted Map. GAE for example works only with BigTable. It offers variety of APIs: from “native” low-level API to “native” high-level ones ( JDO and JPA). With the older Datanucleus version used by Google there seem to be a lot of limitations in place which could be removed ( see comments) but still are in place.
The Document-oriented DBs are most obviously best addressed by JSR 170 (Content Repository for Java) and JSR 283 (Content Repository for Java Technology API Version 2.0). With JackRabbit as a reference implementation it’s a strong sign for that :) The support for other NoSQL document stores is non existent as of today. Even Apache’s CouchDB doesn’t provide a JSR 170/283 compliant way of accessing the documents. The only drawback is that both JSR’s aren’t sexy or bleeding edge. But for me this would be the right bucket to put support for document-oriented DBs.Flip side of the medal?The content repository API isn’t exactly a natural model for an application. Does an app really want to deal with Nodes and attributes in Java?The notion of a domain model works nicely for many apps and if there is no chance to use it, you probably would be better off going native and use the MondoDB driver directly.
Graph oriented DBs
This kind of databases are thought for data whose relations are well represented with a graph-style (elements interconnected with an undetermined number of relations between them). Aiming primarily at any kind of network topology the recently rejected JSR 357 (Social Media API) would have been a good place to put support. At least from a use-case point of view. If those graph-oriented DBs are considered as a data-store there are a couple of options. If the Java EE persistence is steering into the direction of a more general data abstraction layer the 338 or it’s successors would be the right place to put support. If you know a little bit about how Coherence works internally and what had to be done to put JPA on top of it you also could consider 347 a good fit for it. With all the drawbacks already mentioned. Another alternative would be to have a separate JSR for it. The most prominent representative of this category is Neo4J which itself has an easy API available to simply include everything you need directly into your project. There is additional stuff to consider if you need to control the Neo4J instance via the application server.
To sum it up: We already have a lot in place for the so-called “NoSQL” DBs. And the groundwork for integrating this into new Java EE standards is promising. Control of embedded NoSQL instances should be done via JSR 322 (Java EE Connector Architecture) with this being the only allowed place spawn threads and open files directly from a filesystem. I’m not a big supporter of having a more general data abstraction JSR for the platform comparable to what Spring is doing with Spring Data. To me the concepts of the different NoSQL categories are too different than to have a one-size-fits-all approach.The main pain point of NoSQL besides the lack of standard API is that users are forced to denormalize and maintain de-normalization by hand.
What I would like to see are some smaller changes to both the products to be more Java EE ready and also to the way the integration into the specs is done. Might be a good idea to simply define the different persistence types and generally define the JSRs which could be influenced by this and noSQLing those accordingly.
For users willing to facilitate domain model (ie a higher level of abstraction compared to the raw NoSQL API), JPA might be the best vehicle for that at the moment. The feedback from both EclipseLink and Hibernate OGM users is needed to value what is working and what not. From a political point of view it might also make sense to pursue 347. Especially since main big players are present here already.
The really hard part is querying.Should there be standardised query APIs for each family? With Java EE? Or would that better be placed within the NoSQL space? Would love to read your feedback on this!