Software Development

SASI Empowering Secondary Indexes

Back in 2010, a new feature was introduced for Cassandra 0.7: Secondary Indexes. Unlike a relational database index, it is an index associated to the values of a column which provides a key for all rows in a table. For example imagine the table “Users”. It has a primary index as “user_id”. Now imagine that you want to access the email of an user. You must search by ID and then collect the value that you are interested in (email, for example). This isn’t practical with any kind of performance. See the image below:
 
 
Screen-Shot-2016-04-03-at-9.35.28-PM

Humans can fetch the information we want in a single step by making the inverse operation. Given an email, recover the ID. In terms of computing, how could we achieve a similar process in our database? Secondary indexes solve these scenarios by using email, fetching the user ID, which acts as an inverse function of the query. This feature can be powerful and practical. Applying secondary indexes for small sets of data makes queries extremely efficient. It’s pretty easy to create a secondary index in Cassandra:

CREATE TABLE users (
   userID uuid,
   firstname text,
   lastname text,
   email text,
   zip int,
   PRIMARY KEY (userID)
 );
 
CREATE INDEX users_email
  ON users (email);

It is also possible to define several secondary indexes by table since we are not limited to just one.

Secondary Index Problems

Unfortunately, this is not silver bullet. Since secondary indexes are managed locally and not globally as the regular indexes, they may lead to some problems. Especially if we are dealing with large datasets containing sparse and very distinct information. For our user email example, if we have a huge amount of users. We may have an equally huge amount of emails and they will all be unique. To handle with a secondary index at ’email’ column, Cassandra would have to ‘join’ all the information spread across the rings because this index is local. This may be harmful in terms of performance. If you have five instances, you would need five readings for a single query including five disk operations. On the other hand, for a regular index, this would not be necessary since ‘user_id’ is globally handled. Due to this reason, some developers criticise the use of secondary indexes.

Several strategies for secondary indexes have been proposed

  • [1] (http://brianoneill.blogspot.com.br/2012/03/cassandra-indexing-good-bad-and-ugly.html) ,
  • [2] (http://blog.websudos.com/2014/08/23/a-series-on-cassandra-part-2-indexes-and-keys/),
  • [3] (http://www.slideshare.net/edanuff/indexing-in-cassandra) and
  • [4] (https://dzone.com/articles/cassandra-indexing-good-bad) are all good examples.

To enhance this set, Apple recently open sourced its own secondary index strategy: SSTableAttachedSecondaryIndex, or simply [SASI](https://github.com/xedin/sasi). This strategy has been shown extremely efficient and powerful. By presenting robust and flexible queries, SASI manages CPU, I/O and memory properly in order to save computational resources.

SASI by Example

Let’s look at a simple example to understand how SASI works. At the moment there is Thrift and CQL3 support. For an arbitrary ‘foo’ keyspace:

cqlsh> CREATE KEYSPACE foo WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '1'
  };
cqlsh> USE foo;

Then we create ‘bar’ table:

CREATE TABLE bar (
id uuid, 
fname text, 
lname text,
age int, 
created_at bigint, 
primary key (id)
);

We use “CREATE CUSTOM INDEX” command in order to create a secondary index with SASI:

CREATE CUSTOM INDEX ON bar (fname) USING 
'org.apache.cassandra.db.index.SSTableAttachedSecondaryIndex'
WITH OPTIONS = {
'analyzer_class':
'org.apache.cassandra.db.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'
};

Note the configuration ‘case_sensitive’: ‘false’. It may remind you of full text search tools. Take a look into the class [Analyzer](https://github.com/xedin/sasi/blob/master/src/java/org/apache/cassandra/db/index/sasi/analyzer/StandardA

nalyzer.java). SASI allows you to do some interesting customization for indexes. Take a look at the next example:

CREATE CUSTOM INDEX ON bar (lname) USING 
'org.apache.cassandra.db.index.SSTableAttachedSecondaryIndex'
WITH OPTIONS = {'mode': 'SUFFIX'};

Here we create an index with a suffix analyser. That evaluates data by suffix and not by the entire term.
There is also the ‘SPARSE’ mode:

CREATE CUSTOM INDEX ON bar (created_at) USING 
'org.apache.cassandra.db.index.SSTableAttachedSecondaryIndex'
WITH OPTIONS = {'mode': 'SPARSE'};

‘SPARSE’ mode allows us to make queries by timestamp intervals in an efficient way. If no mode is declared, ‘NORMAL’ is the default, providing exact case sensitive matches only.

Understanding SASI Basics

SASI is divided in two parts: Indexing and Search. Cassandra provides a distinction between memory and disk resources and SASI takes advantage of that, as well as write only, immutability and sorted data to perform properly. The built indexes are then flushed from a memtable to disk. SASI data structures are created in memory while SSTable begins the write process and flushed to disk before the process terminates. This is good for performance since writing the indexes requires only a sequential write on disk.

For each SSTable, SASI writes one index file for each indexed column. The file contents are built in memory by the class [OnDiskIndexBuilder](https://github.com/xedin/sasi/blob/master/src/java/org/apache/cassandra/db/index/sasi/disk/OnDiskIndexBuilder.java).

Once flushed to disk, the information will be read by [OnDiskIndex](https://github.com/xedin/sasi/blob/master/src/java/org/apache/cassandra/db/index/sasi/disk/OnDiskIndex.java), which contains an optimised search data structure. A key value mapping for the SSTable positions is created and then after the disk write, this information is stored in memory to speed up the index access.

When we perform a search, there is another group of classes that parses and interprets the indexes into a tree structure, optimises it and analyses what should be searched. To perform Unions and Intersections of data, SASI has a group of smart iterators, such as [RangeUnionIterator](https://github.com/xedin/sasi/blob/master/src/java/org/apache/cassandra/db/index/sasi/utils/RangeUnionIt

erator.java), that can perform union operations across hashes, reading only the minimum necessary of each set to satisfy query needs. There is also [RangeIntersectionIterator](https://github.com/xedin/sasi/blob/master/src/java/org/apache/cassandra/db/index/sasi/utils/RangeIntersec

tionIterator.java) for intersections.

Given this indexing and searching toolbox powered by optimised data structures, we can see that SASI is not magic. SASI juggles the data and performs smart operations taking advantage of Cassandra architecture and structure.

SASI has some limitations. For example, it requires that your cluster is configured to produce LongTokens (ex: Murmur3Partitioner). ByteOrderedPartitioner and RandomPartitioner do not work with SASI and only Cassandra on versions > 3.4 are supported.

Final Considerations

The main goal of this article was to provide a quick and general idea about Cassandra secondary indexes and present the basics from SASI. There is still work that needs to be done into this tool, of course. However, it has been proving itself as a robust and powerful alternative to the previous Secondary Index implementation. Visit the [official Github repository](https://github.com/xedin/sasi) for detailed information about the project, issues and releases.

Reference: SASI Empowering Secondary Indexes from our JCG partner Hanneli Tavante at the Planet Cassandra blog.

Rebecca Mills

Rebecca Mills is a Junior Evangelist at DataStax, the company that delivers Apache Cassandra™ to the enterprise. She is interested in the new Cassandra user experience and is involved making it more approachable for everyone. She holds a B.Sc in biochemistry from Memorial University of Newfoundland and has been known to go on a bit about genome analysis.
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Inline Feedbacks
View all comments
Back to top button