
How to: Refine Hive ZooKeeper Lock Manager Implementation

Hive has been using ZooKeeper as a distributed lock manager to support concurrency in HiveServer2. The ZooKeeper-based lock manager works fine in a small-scale environment. However, as more and more users move from HiveServer to HiveServer2 and start to create a large number of concurrent sessions, problems can arise. The major problem is that the number of open connections between HiveServer2 and ZooKeeper keeps rising until the connection limit is hit on the ZooKeeper server side. At that point, ZooKeeper starts rejecting new connections, and all ZooKeeper-dependent flows become unusable. Several Hive JIRAs (such as HIVE-4132, HIVE-5853, and HIVE-8135) were opened to address this problem, and it was recently fixed through HIVE-9119.

Let’s take a closer look at the ZooKeeperHiveLockManager implementation in Hive to see why it caused a problem before, and how we fixed it.

The ZooKeeperHiveLockManager uses simple ZooKeeper APIs to implement a distributed lock. The protocol it uses is listed below.

Clients wishing to obtain a shared lock should do the following:

  1. Call create() to create a node with a pathname “_lockresource_/lock-shared-” with the sequence flag set.
  2. Call getChildren() on the node without setting the watch flag.
  3. If there are no children with a pathname starting with “lock-exclusive-”, then the client acquires the lock and exits.
  4. Otherwise, call delete() to delete the node created in step 1, sleep for a pre-defined time period, and then retry from step 1 until the maximum number of retries is reached.
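To make the shared-lock steps concrete, below is a minimal sketch of the acquisition loop written against the raw ZooKeeper Java API. The class and method names, the retry parameters, and the use of EPHEMERAL_SEQUENTIAL mode are illustrative assumptions, not Hive's actual code.

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SharedLockSketch {
  // Returns the path of the created lock node on success, or null if the lock
  // could not be acquired within maxRetries attempts. Assumes the lockResource
  // parent node already exists.
  static String tryAcquireSharedLock(ZooKeeper zk, String lockResource,
                                     int maxRetries, long sleepMillis)
      throws KeeperException, InterruptedException {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      // Step 1: create a sequential node under the lock resource.
      String myNode = zk.create(lockResource + "/lock-shared-", new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
      // Step 2: list the children without setting a watch.
      List<String> children = zk.getChildren(lockResource, false);
      // Step 3: if no exclusive lock node exists, the shared lock is acquired.
      boolean exclusivePresent = children.stream()
          .anyMatch(child -> child.startsWith("lock-exclusive-"));
      if (!exclusivePresent) {
        return myNode;
      }
      // Step 4: delete our node, sleep, and retry.
      zk.delete(myNode, -1);
      Thread.sleep(sleepMillis);
    }
    return null;
  }
}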

Clients wishing to obtain an exclusive lock should do the following:

  1. Call create() to create a node with a pathname “_lockresource_/lock-exclusive-” with the sequence flag set.
  2. Call getChildren() on the node without setting the watch flag.
  3. If there are no children with a lower sequence number than the node created in step 1, the client acquires the lock and exits.
  4. Otherwise, call delete() to delete the node created in step 1, sleep for a pre-defined time period, and then retry from step 1 until the maximum number of retries is reached.
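The exclusive-lock flow differs from the shared-lock flow only in step 3, where the client compares sequence numbers instead of looking for exclusive siblings. A small sketch of that comparison is below; the helper names are illustrative, and the code relies on ZooKeeper appending a zero-padded 10-digit counter to sequential node names.

import java.util.List;

public class ExclusiveLockCheck {
  // ZooKeeper appends a 10-digit, zero-padded counter to sequential nodes,
  // e.g. "lock-exclusive-0000000042" has sequence number 42.
  static int sequenceOf(String nodeName) {
    return Integer.parseInt(nodeName.substring(nodeName.length() - 10));
  }

  // Step 3: the client holds the exclusive lock only if no sibling has a
  // lower sequence number than the node it created in step 1.
  static boolean holdsExclusiveLock(String myNodeName, List<String> siblings) {
    int mySeq = sequenceOf(myNodeName);
    return siblings.stream()
        .filter(sibling -> !sibling.equals(myNodeName))
        .noneMatch(sibling -> sequenceOf(sibling) < mySeq);
  }
}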

Clients wishing to release a lock should simply delete the node they created in step 1. In addition, if all child nodes have been deleted, delete the parent node as well.
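A matching sketch of the release step, again using the raw ZooKeeper API with illustrative names:

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class ReleaseLockSketch {
  static void releaseLock(ZooKeeper zk, String lockResource, String myNode)
      throws KeeperException, InterruptedException {
    // Delete the node created in step 1; version -1 means "any version".
    zk.delete(myNode, -1);
    // If no child nodes remain, clean up the parent node as well. Another
    // client may create a new child in between, so tolerate that race.
    if (zk.getChildren(lockResource, false).isEmpty()) {
      try {
        zk.delete(lockResource, -1);
      } catch (KeeperException.NotEmptyException ignored) {
        // a new lock node appeared between the check and the delete
      }
    }
  }
}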

The above lock and unlock protocols are simple and straightforward. However, the previous implementation of this protocol did not use the ZooKeeper client properly. For each Hive query, a new ZooKeeper client instance was created to acquire and release locks. That puts a lot of overhead on the ZooKeeper server to handle new connections. In addition, in a multi-session environment, it is easy to hit the ZooKeeper server connection limit if too many concurrent queries are running. Furthermore, this can also happen when users run Hive queries through Hue. Hue does not close the Hive query by default, which means the ZooKeeper client created for that query is never closed. If query volume is high, the ZooKeeper connection limit can be reached very quickly.

Do we really need to create a new ZooKeeper client for each query? We found that it is not necessary. From the above discussion, we can see that the ZooKeeper client is used by HiveServer2 to talk to the ZooKeeper server in order to acquire and release locks. The major workload is on the ZooKeeper server side, not the client side. One ZooKeeper client can be shared by all queries against a HiveServer2 server. With a singleton ZooKeeper client, the server overhead of handling connections is eliminated, and Hue users no longer suffer from the ZooKeeper connection issue.

A singleton ZooKeeper client solves the lock management problems. However, using the ZooKeeper client directly still leaves some extra things for us to handle, such as:

  • Initial connection: the handshake between the ZooKeeper client and server takes some time. The synchronous method calls (e.g. create(), getChildren(), delete()) used by the ZooKeeperHiveLockManager will throw an exception if this handshake has not completed. In this case, we need a latch to control when the ZooKeeper client starts to send method calls to the server (see the sketch after this list).
  • Disconnection and failover: if the singleton ZooKeeper client loses its connection to the server, we need to handle the connection retry and failover to another server in the cluster.
  • Session timeout: if the connection session times out, the singleton ZooKeeper client needs to be closed and re-created.
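The initial-connection concern, for example, can be handled with a watcher that counts down a latch once the session reaches SyncConnected. The sketch below is an illustration of that pattern around a single shared client, not Hive's actual code; the class name and parameters are assumptions, and the disconnection and session-expiry concerns would still need additional handling, which is exactly what Curator provides.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SharedZooKeeperClient {
  private static volatile ZooKeeper client;
  private static final CountDownLatch connected = new CountDownLatch(1);

  // connectString and sessionTimeoutMs would come from Hive configuration.
  static synchronized ZooKeeper get(String connectString, int sessionTimeoutMs)
      throws Exception {
    if (client == null) {
      client = new ZooKeeper(connectString, sessionTimeoutMs, new Watcher() {
        @Override
        public void process(WatchedEvent event) {
          if (event.getState() == Event.KeeperState.SyncConnected) {
            connected.countDown(); // handshake finished; safe to issue calls
          }
        }
      });
    }
    // Block callers until the handshake completes, so synchronous calls such as
    // create() or getChildren() do not fail with a connection-loss exception.
    if (!connected.await(sessionTimeoutMs, TimeUnit.MILLISECONDS)) {
      throw new IllegalStateException("timed out waiting for ZooKeeper connection");
    }
    return client;
  }
}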

Apache Curator is open-source software that handles all of the above scenarios transparently. Curator is a ZooKeeper library originally developed at Netflix, and it provides a high-level API, CuratorFramework, that simplifies using ZooKeeper. By using a singleton CuratorFramework instance in the new ZooKeeperHiveLockManager implementation, we not only fixed the ZooKeeper connection issues but also made the code easier to understand and maintain.
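As a rough illustration of the shape of the fix, the sketch below holds one CuratorFramework for the whole server process; the class name and retry settings are illustrative assumptions, not the exact code committed in HIVE-9119.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class SharedCuratorClient {
  private static volatile CuratorFramework client;

  static synchronized CuratorFramework get(String connectString) {
    if (client == null) {
      // Retry policy: exponential backoff starting at 1 second, at most 3 retries.
      client = CuratorFrameworkFactory.newClient(
          connectString, new ExponentialBackoffRetry(1000, 3));
      client.start();
      // Curator transparently waits for the initial handshake, retries and fails
      // over across the servers in the connect string, and re-creates the
      // underlying ZooKeeper client when the session expires.
    }
    return client;
  }
}

Lock nodes can then be created through the same shared client, for example with client.create().withMode(CreateMode.EPHEMERAL_SEQUENTIAL).forPath(...), instead of constructing a new ZooKeeper connection per query.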

Thanks to the Hive open source community for including this fix in Apache Hive 1.1. This fix has also been included in the latest Hive 0.12 and Hive 0.13 releases and the upcoming Hive 1.0 release of the MapR Distribution.

Na Yang

Na Yang is a staff software engineer at MapR and an Apache Hive contributor. Prior to MapR, Na held numerous software development roles at information technology companies including Ariba, Quova, and Merced Systems, most recently serving as a staff software engineer on the Java Infrastructure team at PayPal. Na received both an MS and a BS in Computer Science from Fudan University in China, and also holds an MS in Computer Engineering from San Jose State University.