Considerable speed improvement in WebHcat HCatalog REST API

Here at Data Republic, we’re committed to delivering technology which makes it simple, secure and fast for our users to handle data exchanges – that’s where our standard data-lake APIs come into play.

The same APIs that make it easy for our Data Contributors to manage dataset permissions are also used by our web UI, Senate, to render metadata definitions; for that, we expose and consume the HCatalog DDL REST API over WebHCat.

Out of the box, Hadoop WebHCat internally invokes the hcat shell script, which spawns a new Java process for every call. That might be fine for occasional administrative tasks, but it's certainly not suitable for interactive web scenarios.

To solve this, our team has come up with small extensions to WebHCat DDL handling which connect to Hive Thrift servers and execute DDL commands over JDBC. Hive JDBC already supports JSON output for DDL, so results can be pumped straight back to the client without any further mapping.
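To get a feel for what the extension does, here is a rough beeline equivalent of a JSON DDL call over JDBC. This is only an illustration, not part of the patch; the hostname, port, database and table names are placeholders, and it assumes hive.ddl.output.format is whitelisted as described in Step 2 below.

```bash
# Ask HiveServer2 for table metadata over JDBC, with DDL output rendered as JSON
beeline -u "jdbc:hive2://hs2:10000/default" \
  --hiveconf hive.ddl.output.format=json \
  -e "DESCRIBE EXTENDED mydatabase.mytable"
```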

## WebHcat over JDBC Architecture

In this post, I'll walk through how to patch and enable this feature in your own WebHCat server. The changes are minimal, and you can always switch back by simply flipping an attribute.

Source code against the Hive master and 2.1 branches is available. We're also in the process of contributing this feature back to Hive with HIVE-15690 with this PR.

Step 1. Add Hive JDBC Driver to WebHCat

```bash
cd hive-hcatalog/share/webhcat/svr/lib
ln -s ../../../../../hive/lib/hive-jdbc-*-standalone.jar .
```
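As a quick, purely optional sanity check, the standalone driver should now be visible from WebHCat's lib directory; the exact version number in the file name will vary with your Hive build:

```bash
# Run from hive-hcatalog/share/webhcat/svr/lib
ls -l hive-jdbc-*-standalone.jar
```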

Step 2. Allow the Client to Define the DDL Output Format

In your Hive servers, modify hiveserver2-site.xml and add this attribute:

```xml
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>hive.ddl.output.format</value>
</property>
```

Restart HiveServer2.
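If you want to confirm the whitelist change took effect, one option is a quick beeline session (the connection URL is a placeholder): with the entry in place, setting the property should no longer be rejected by the SQL standard authorizer.

```bash
# Should succeed; without the whitelist entry the SET statement is rejected
beeline -u "jdbc:hive2://hs2:10000/default" \
  -e "set hive.ddl.output.format=json; show databases;"
```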

Step 3. Build Patched WebHcat

You need to build the patched version and replace the default hive-webhcat.jar with it:

```bash
git clone https://github.com/datarepublic/hive.git
cd hive
mvn -pl hcatalog/webhcat/svr package
```
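If you only need the jar, skipping the tests speeds the build up considerably; this is just a convenience, not a requirement:

```bash
# Build without running tests, then confirm the artifact was produced
mvn -pl hcatalog/webhcat/svr package -DskipTests
ls hcatalog/webhcat/svr/target/hive-webhcat-*.jar
```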

Step 4. Deploy Patched WebHcat to Servers

Copy the patched JAR file from the previous step to the WebHCat server boxes and replace the original one:

```bash
scp hcatalog/webhcat/svr/target/hive-webhcat-*.jar $webhcat-servers:/tmp
ssh $webhcat-servers
cd hive-hcatalog/share/webhcat/svr/lib/
mkdir orig
mv hive-webhcat-*.jar orig
mv /tmp/hive-webhcat-*.jar .
```
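If you have several WebHCat hosts, a loop like the following can save a few keystrokes; the hostnames are placeholders and the paths assume the same layout as above:

```bash
# Roll the patched jar out to each WebHCat host, keeping the original in orig/
for host in webhcat1 webhcat2; do
  scp hcatalog/webhcat/svr/target/hive-webhcat-*.jar "$host":/tmp/
  ssh "$host" 'cd hive-hcatalog/share/webhcat/svr/lib && mkdir -p orig \
    && mv hive-webhcat-*.jar orig/ && mv /tmp/hive-webhcat-*.jar .'
done
```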

Step 5. Enable JDBC DDL Mode

The last step is to tell WebHCat that you would like to use JDBC for DDL calls. For this you need to add two attributes to webhcat-site.xml:

```xml
<property>
  <name>templeton.ddl.mode</name>
  <value>jdbc</value>
</property>
```

```xml
<property>
  <name>hive.jdbc.url</name>
  <value>jdbc:hive2://hs2:10001;transportMode=http;url=/cliserver</value>
</property>
```

If you're in Kerberos secure mode, two extra attributes are required in webhcat-site.xml:

```xml
<property>
  <name>hive.server2.kerberos.keytab</name>
  <value>/etc/security/keytabs/hive.service.keytab</value>
</property>

<property>
  <name>hive.server2.kerberos.principal</name>
  <value>hive/_HOST@REALM</value>
</property>
```

Once you've completed the above, restart the WebHCat server and check the logs:

```bash
tail -f /var/log/webhcat/webhcat.log
```
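For completeness, one way to perform the restart itself is the webhcat_server.sh script that ships with Hive's HCatalog packaging; the path and service user below are assumptions about your layout, and on managed clusters you would normally restart through your cluster manager instead:

```bash
# Stop and start WebHCat using the bundled control script
sudo -u hcat hive-hcatalog/sbin/webhcat_server.sh stop
sudo -u hcat hive-hcatalog/sbin/webhcat_server.sh start
```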

Step 6. Test it

Now, with everything in place, you can make a REST call to list all databases visible to the user joe:

```bash
curl [-u : --negotiate] http://webhcat-server:50111/templeton/v1/ddl/database/?doAs=joe
```

If WebHCat is behind a Knox gateway, you can also use credentials and let Knox handle the Kerberos negotiation:

```bash
curl -u user:pass https://knox-server/gateway/default/templeton/v1/ddl/database
```

Or add a new string column (newcol) to a table (mydatabase.mytable):

```bash
curl -X PUT -H 'Content-Type: application/json' -d '{"type": "string"}' \
  -u user:pass https://knox-server/gateway/default/templeton/v1/ddl/database/mydatabase/table/mytable/column/newcol
```
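The responses are JSON, so piping them through jq helps readability; this is just a convenience and assumes jq is installed and the same Knox credentials as above:

```bash
# Pretty-print the list of databases returned by WebHCat
curl -s -u user:pass "https://knox-server/gateway/default/templeton/v1/ddl/database" | jq .
```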

Note 1: ZooKeeper JDBC URL

I would not recommend using a ZooKeeper quorum JDBC URL here. Although technically there is nothing wrong with it, it may produce inconsistent results if a table's metadata is modified rapidly, because of HiveServer2 metadata caching. If subsequent calls hit another Hive server while the first one is still modifying the ObjectStore, you may get an outdated DDL response. The bottom line is that you need to balance HA against DDL result consistency: with a single HS2 in the URL you always get consistent results at the cost of HA, while with a ZK JDBC URL you get full HA with occasional stale results served from the cache.
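For reference, the two flavours of hive.jdbc.url look roughly like this; the ZooKeeper hosts and namespace below are placeholders for your own quorum:

```
# Single HiveServer2: consistent DDL results, no HA (the URL used in Step 5)
jdbc:hive2://hs2:10001;transportMode=http;url=/cliserver

# ZooKeeper discovery: full HA, but responses may come from another HS2's metadata cache
jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
```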

Note 2: Access Control

As a direct result of executing over JDBC, even metadata access complies with any access policies defined in Hive. If you're in Kerberos secure mode, the JDBC connection will run in impersonation mode, with a URL similar to this:

```
jdbc:hive2://hs2:port/;hive.server2.proxy.user=$USER?hive.ddl.output.format=json;
```

If the requesting user does not have visibility of a database or table due to Hive/Ranger policies, it won't be visible over JDBC and hence won't appear in the results.

Depending on your use case, this can be a huge governance benefit for your data lake. WebHCat backed by hcat does not comply with any access policies and exposes all metadata to direct access by any authenticated user. It is possible to limit results with arguments such as ?like=dbname*, but that is essentially client-side filtering; any malicious user can easily bypass it and submit DDL against other databases. At the time of writing this post, there is active development to fix this with a Ranger plugin for the Hive Metastore.

Note 3: JDBC Connection Pooling

The code patch we've described above is simple and minimal, and there is no connection pooling by default. For interactive web calls, the few milliseconds needed to open a new TCP connection do not noticeably impact users.

Having said that, adding connection pooling is still feasible and makes sense for high-traffic use cases. Bear in mind that in Kerberos mode, connection pooling should be per-user, as each connection is in impersonation mode.