Monday, November 19, 2012

eValhalla User Management

[Previous in this series: eValhalla Setup]

In this installment of the eValhalla development chronicle, we will be tackling what's probably the one common feature of most web-based application - user management. We will implement:
  1. User login
  2. A "Remember me" checkbox for auto-login of users on the same computer
  3. Registration with email validation
The end result can be easily taken and plugged in your next web app! Possible future improvements would be adding captchas during registration and/or login and the ability to edit more extensive profile information.

I have put comments wherever I felt appropriate in the code, which should be fairly straightforward anyway. So I won't be walking you line by line. Rather, I will explain what it does and comment on design decisions or on less obvious parts.   First, let me say a few words about how user sessions are managed.

User Sessions

Since we are relying on a RESTful architecture, we can't have a server hold user sessions. We need to store user session information at the client and transfer it with each request. Well, the whole concept of a user session is kind of irrelevant here since the application logic is mostly residing at the client and the server is mainly consulted as a database. Still, the data needs protection and we need the notion of a user with certain access rights. So we have to solve the problem of authentication. Each request needs to carry enough information for the server to decide whether the request should be honored or not. We cannot just rely on the user id because this way anybody can send a request with anybody else's id. To authenticate the client, the server will first supply it with a special secret key, an authentication token,  that the client must send on each request along with the user id. To obtain that authentication token, the client must however identify themselves by using a password. And that's the purpose of the login operation: obtaining an authentication token for use in subsequent requests. The client will then keep that token together with the user id and submit them as HTTP headers on every request. The natural way to do that with JavaScript is storing the user id and authentication tokens as cookies.

This authentication mechanism is commonly used when working within a RESTful architecture. For more on the subtleties of that approach, just google "user authentication and REST applications". One question is why not just send the password with every request instead of a separate token. That's possible, but more risky - a token is generated randomly and it expires, hence it is harder to guess. The big issue however is XSS (cross-site scripting) attacks. In brief, with XSS an attacker insert HTML code into a field that gets displayed supposedly as just static text to other users (e.g. a blog title) and the code simply does an HTTP request to a malicious server submitting all the users' private cookies with it. To avoid them, we will have to pay special attention on HTML sanitation. That is, we have to properly escape every HTML tag displayed as static text. We can also put that authentication token in an HTTPOnly cookie for extra security.

Implementation - a User Module

Since user management is so common, I made a small effort to build a relatively self-contained module for it. There are no standard packaging mechanisms for the particular technology stack we're using, so you'd just have to copy&paste a few files:

  • /html/ahguser.ht - contains the HTML code for login and registration dialogs as well as top-level Login and Register links that show up right aligned. This depends on the whole Angular+AngularUI+Bootstrap+jQuery environment. 
  • /javascript/ahguser.js - contains the Angular module 'ahgUser' that you can use in your Angular application. This depends on the backend:
  • /scala/evalhalla/user/package.scala - the evalhalla mJson-HGDB backend REST user service. 

The backend can be easily packaged in a jar, mavenized and all, and this is something that I might do near the end of the project. 

The registration process validates the user's email by emailing them a randomly generated UUID (a HyperGraphDB handle) and requiring that they provide it back for validation before they can legitimately log into the site. Since the backend code is simple enough, repetitive even, let's look at one method only, the register REST endpoint:

@POST
@Path("/register")
def register(data:Json):Json = {
  return transact(Unit => {
    normalize(data);
    // Check if we already have that user registered
    var profile = db.retrieve(
        jobj("entity", "user", 
             "email", data.at("email")))
    if (profile != null)
      return ko().set("error", "duplicate")
    // Email validation token
    var token = graph.getHandleFactory()
          .makeHandle().toString()
    db.addTopLevel(data.set("entity", "user")
                       .set("validationToken", token)
                       .delAt("passwordrepeat"))
    evalhalla.email(data.at("email").asString(), 
      "Welcome to eValhalla - Registration Confirmation",
      "Please validate registration with " + token)
    return ok
  })
}

So we see our friends db, transact, jobj etc. from last time. The whole thing is a Scala transaction closure with the much nicer than java.lang.Callable Scala syntax. As a reminder, note that most of the functions and objects referred here are globally declared in the evalhalla package. For example, db is an instance of the HyperNodeJson class. The call to normalize just ensures the email is lower-case, because email in general are case-insensitive. While the logic is fairly straightforward, let me make a few observations about the APIs.

User profile lookup is done with a Json pattern matching.  Recall that the database stores arbitrary Json structures (primitives, arrays, objects and nulls) as HyperGraphDB atoms. It doesn't have the notion of "document collections" like MongoDB for example. This is because each Json structure is just a portion of the whole graph. So to distinguish between different types of objects we define a special JSON property called entity that we reserve as a type name attached to all of our top-level Json entities. Here we are dealing with entities of type "user". Now, each atom in HyperGraphDB and consequently each user profile has a unique identifier - the HyperGraphDB handle (a UUID). There is no notion of a primary key enforced at the database level. We know that an email should uniquely identify a user profile so we perform the lookup with the Json object {entity:"user", email:<the email>} as a pattern. But this uniqueness is enforced at the application level because new profiles are added only precisely via this register method. I've explained the db.addTopLevel method on the  HGDB-mJson wiki page, but here is a summary for the impatient: while db.add will create a duplicate version of the whole Json structure recursively and while db.assertAtom will only create something if it doesn't exist yet, db.addTopLevel will create a new database atom only for the top-level JSON, but perform an assert operation for each of its components.

Finally, note that before adding the profile to the database, we delete the passwordrepeat property. This is because we are storing in the database whatever JSON object the client gives us. And the client is giving us all fields coming from the HTML registration form, pretty much as the form is defined. So we get rid of that unneeded field. Granted, it would be actually better design to remove that property at the client-side since the repeat password validation is done there. But I wanted to illustrate the fact that the JSON data is really flowing as is from the HTML form directly to the database, with no mappings or translations of any sort needed. 

So far, so good. Let's move on the UI side.

Client-Side Application

In addition to AngularJS, I've incorporated the Bootstrap front-end library by Twitter. It looks good and it has some handy components. I've also included the Angular-UI library which has some extras like form validation and it plays well with Bootstrap.

The whole application resides on one top-level HTML page /html/index.html. The main JavaScript program that drives it resides in the /javascript/app.js file. The idea is to dynamically include fragments of HTML code inside a reserved main area while some portions of the page such as the header with user profile functionality remain stable. By convention, I use the extension '.ht' for such HTML fragments instead of '.html' to indicate that the content is not a complete HTML page. Here's the main page in its essence:


<div ng-controller="MainController">
<div ng-include="ng-include" src="'/ahguser.ht'"></div>
<!-- some extra text here... -->
<hr/>
<a href="#m">main</a> |
<a href="#t">test</a>
<hr/>
<ng-view/>  <!-- this is where HTML fragments get included -->

The MainController function is in app.js and it will eventually establish some top-level scoped variables and functions, but it's currently empty. The user module described above is included as a UI component by the <div ng-include...> tag. The login and register links that you see at the top right of the screen with all their functionality come from that component. Finally, just for the fun of it, I made a small menu with two links to switch the main view from one page fragment to another: 'main' and 'test'. This is just to setup a blueprint for future content.

A Note on AngularJS

Let me explain a bit the Angular portion of that code since this framework, as pretentious as it is, lacks in documentation and it's far from being an intuitive API. In fact, I wouldn't recommend yet. My use of it is experimental and the first impressions are a bit ambivalent. Yes, eventually you get stuff to work. However, it's bloated and it fails the first test of any over-reaching framework: simple things are not easy. It seems built by a team of young people that use the word awesome a lot, so we'll see if it meets the second test, that complicated things should be possible. Over the years, I've managed to stay away from other horrific, bloated frameworks, aggressively marketed by a big companies (e.g. EJBs), but here I may be just a bit too pessimistic. Ok, enough.

Angular reads and interprets your HTML before it gives it to the browser. That opens many doors. In particular, it allows you to define custom attributes and tags to implement some dynamic behaviors. It has the notion of an application and the notion of modules. An application is associated with a top-level HTML tag, usually the 'html' tag itself by setting the custom ng-app='appname' attribute. Then appname is declared as a module in JavaScript:

var app = angular.module('appname', [array of module dependencies])

It's not clear what's special about an application vs. mere modules, presumably nothing. Then functionality is attached to markup (the "view") via Angular controllers. Those are JavaScript functions that you write and Angular calls to setup the model bound to the view. Controller functions take any number of parameters and Angular uses a nifty trick here. When you call the toString method of a JavaScript function, it returns its full text as it was originally parsed. That includes the formal arguments, exactly with the names you have listed in the function declaration (unless you've used some sort of minimization/obfuscation tool). So Angular parses that argument list and uses the names of the parameters to determine what you need in your function. For example, when you declare a controller like this:

function MainController($scope, $http) {
}
A call to MainController.toString() returns the string "function MainController($scope, $http) { }". Angular parses that string and determines that you want "$scope" and "$http". It recognizes those names and passes in the appropriate arguments for them. The name "$scope" is a predefined AngularJS name that refers to an object to be populated with the application model. Properties of that object can be bound to form elements, or displayed in a table or whatever. The name "$http" refers to an Angular service that allows you to make AJAX calls. As far as I understand it, any global name registered as a service with Angular can be used in a controller parameter list. There's a dependency injection mechanism that takes care of, among other things, hooking up services in controllers by matching parameter names with globally registered functions. I still haven't figured out what the practical benefit of that is, as opposed to having global JavaScript objects yourself....perhaps in really large applications where different clusters of the overall module dependency graph use the same names for different things.

Beginnings of a Data Service

One of the design goals in this project is to minimize the amount of code handling CRUD data operations. After all, CRUD is pretty standard and we are working with a schema-less flexible database. So we should be able to do CRUD on any kind of structured object we want without having to predefine its structure. In all honesty, I'm not sure this will be as easy as it sounds. As mentioned before, the main difficulty are security access rules. It's certainly doable, but we shall see what kind of complexities it leads to in the subsequent iterations. For now, I've created a small class called DataService that allows you to perform a simple JSON pattern lookup as well as all CRUD operations on any entity identified by its HyperGraphDB handle.

One can experiment with this interface by making $.ajax calls in the browser's REPL. The interface is certainly going to evolve and change, but here are a couple of calls you can try out. I've made the '$http' service available as a global variable:

$http.post("/rest/data/entity", {entity:'story', content:'My story starts with....'})
$http.get("/rest/data/list?pattern=" + JSON.stringify({entity:'story'})).success(function(A) { console.log(A); })
The above creates a new entity in the DB and then retrieves it via a query for all entities of that type (i.e. entity:'story'). You can also play around with $http.put and $http.delete.

Conclusion

All right, this concludes the 2nd iteration of eValhalla. We've implemented a big part of what we'd need for user management. We've explored AngularJS as a viable UI framework and we've laid the groundwork of a data-centered REST service. To get it, follow the same steps as before, but checkout the phase2 GIT tag instead of phase1:


  1. git clone https://github.com/publicvaluegroup/evalhalla.git
  2. cd evalhalla
  3. git checkout phase2
  4. sbt
  5. run


Coming Up...

On the next iteration, we will do some data modeling and define the main entities of our application domain. We will also implement a portion of the UI dealing with submission and listing of stories. 

Sunday, November 4, 2012

HyperGraphDB 1.2 Final Released

Kobrix Software is pleased to announce the release of HyperGraphDB 1.2 Final.

Several bugs were found and corrected during the beta testing period, most notably having to do with indexing.

Go directly to the download page.

HyperGraphDB is a general purpose, free open-source data storage mechanism. Geared toward modern applications with complex and evolving domain models, it is suitable for semantic web, artificial intelligence, social networking or regular object-oriented business applications.
This release contains numerous bug fixes and improvements over the previous 1.1 release. A fairly complete list of changes can be found at the Changes for HyperGraphDB, Release 1.2 wiki page.
  1. Introduction of a new HyperNode interface together with several implementations, including subgraphs and access to remote database peers. The ideas behind are documented in the blog post HyperNodes Are Contexts.
  2. Introduction of a new interface HGTypeSchema and generalized mappings between arbitrary URIs and HyperGraphDB types.
  3. Implementation of storage based on the BerkeleyDB Java Edition (many thanks to Alain Picard and Sebastian Graf!). This version of BerkeleyDB doesn't require native libraries, which makes it easier to deploy and, in addition, performs better for smaller datasets (under 2-3 million atoms).
  4. Implementation of parametarized pre-compiled queries for improved query performance. This is documented in the Variables in HyperGraphDB Queries blog post.
HyperGraphDB is a Java based product built on top of the Berkeley DB storage library.

Key Features of HyperGraphDB include:
  • Powerful data modeling and knowledge representation.
  • Graph-oriented storage.
  • N-ary, higher order relationships (edges) between graph nodes.
  • Graph traversals and relational-style queries.
  • Customizable indexing.
  • Customizable storage management.
  • Extensible, dynamic DB schema through custom typing.
  • Out of the box Java OO database.
  • Fully transactional and multi-threaded, MVCC/STM.
  • P2P framework for data distribution.
In addition, the project includes several practical domain specific components for semantic web, reasoning and natural language processing. For more information, documentation and downloads, please visit the HyperGraphDB Home Page.

Many thanks to all who supported the project and actively participated in testing and development!

Thursday, October 25, 2012

RESTful Services in Java - a Busy Developer's Guide

The Internet doesn't lack expositions on REST architecture, RESTful services, and their implementation in Java. But, here is another one. Why? Because I couldn't find something concise enough to point readers of the eValhalla blog series.

What is REST?
The acronym stands for Representational State Transfer. It refers to an architectural style (or pattern) thought up by one of the main authors of the HTTP protocol. Don't try to infer what the phrase "representational state transfer" could possibly mean. It sounds like there's some transfer of state that's going on between systems, but that's a bit of a stretch. Mostly, there's transfer of resources between clients and servers. The clients initiate requests and get back responses. The responses are resources in some standard media type such as XML  or JSON or HTML. But, and that's a crucial aspect of the paradigm, the interaction itself is stateless. That's a major architectural departure from the classic client-server model of the 90s. Unlike classic client-server, there's no notion of a client session here. 

REST is offered not as a procotol, but as an architectural paradigm. However, in reality we are pretty much talking about HTTP of which REST is an abstraction. The core aspects of the architecture are (1) resource identifiers (i.e. URIs);  (2) different possible representations of resources, or internet media types (e.g. application/json); (3) CRUD operations support for resources like the HTTP methods  GET, PUT, POST and DEL. 

Resources are in principle decoupled from their identifiers. That means the environment can deliver a cached version or it can load balance somehow to fulfill the request. In practice, we all know URIs are actually addresses that resolve to particular domains so there's at least that level of coupling.  In addition, resources are decoupled from their representation. A server may be asked to return HTML or XML or something else. There's content negotiation going on where the server may offer the desired representation or not. The CRUD operations have constraints on their semantics that may or may not appear obvious to you. The GET, PUT and DEL operations require that a resource be identified while POST is supposed to create  a new resource. The GET operation must not have side-effects. So all other things being equal, one should be able to invoke GET many times and get back the same result.  PUT updates a resource, DEL removes it and therefore they both have side-effects just like POST. On the other hand, just like GET, PUT may be repeated multiple times always to the same effect. In practice, those semantics are roughly followed. The main exception is the POST method which is frequently used to send data to the server for some processing, but without necessarily expecting it to create a new resource. 

Implementing RESTful services revolves around implementing those CRUD operations for various resources. This can be done in Java with the help of a Java standard API called JAX-RS.

REST in Java = JAX-RS = JSR 311
In the Java world, when it comes to REST, we have the wonderful JAX-RS. And I'm not being sarcastic! This is one of those technologies that the Java Community Process actually got right, unlike so many other screw ups. The API is defined as JSR 311 and it is at version 1.1, with work on version 2.0 under way. 

The beauty of JAX-RS is that it is almost entirely driven by annotations. This means you can turn almost any class into a RESTful service. You can simply turn a POJO into a REST endpoint by annotating it with JSR 311 annotations. Such an annotated POJOs is called a resource class in JAX-RS terms.

Some of the JAX-RS annotations are at the class level, some at the method level and others at the method parameter level. Some are available both at class and method levels. Ultimately the annotations combine to make a given Java method into a RESTful endpoint accessible at an HTTP-based URL. The annotations must specify the following elements:

  • The relative path of the Java method - this is accomplished with @Path annotation.
  • What the HTTP verb is, i.e. what CRUD operation is being performed - this is done by specifying one of @GET, @PUT, @POST or @DELETE annotations. 
  • The media type accepted (i.e. the representation format) - @Consumes annotation.
  • The media type returned - @Produces annotation.

The two last ones are optional. If omitted, then all media types are assumed possible. Let's look at a simple example and take it apart:

import javax.ws.rs.*;
@Path("/mail")
@Produces("application/json")
public class EmailService
{
    @POST
    @Path("/new")
    public String sendEmail(@FormParam("subject") String subject,
                            @FormParam("to") String to,
                            @FormParam("body") String body)  {
        return "new email sent";
    }
    
    @GET
    @Path("/new")
    public String getUnread()  {
        return "[]";
    }
   
    @DELETE
    @Path("/{id}")
    public String deleteEmail(@PathParam("id") int emailid)  {
        return "delete " + id;
    }
    
    @GET
    @Path("/export")
    @Produces("text/html")
    public String exportHtml(@QueryParam("searchString") 
                             @DefaultValue("") String search) {
        return "<table><tr>...</tr></table>";
    }
}
The class define a RESTful interface for a hypothetical HTTP-based email service. The top-level path mail is relative to the root application path. The root application path is associated with the JAX-RS javax.ws.rs.core.Application  that you extend to plugin into the runtime environment. Then we've declared with the @Produces annotation that all methods in that service produce JSON. This is just a class-default that one can override for individual methods like we've done in the exportHtml method. The sendMail method defines a typical HTTP post where the content is sent as an HTML form. The intent here would be to post to http://myserver.com/mail/new a form for a new email that should be sent out. As you can see, the API allows you to bind each separate form field to a method parameter. Note also that you have a different method for the exact same path. If you do an HTTP get at /mail/new, the Java method annotated with @GET will be called instead. Presumably the semantics of get /mail/new would be to obtain the list of unread emails. Next, note how the path of the deleteEmail method is parametarized by an integer id of the email to delete. The curly braces indicate that "id" is actually a parameter. The value of that parameter is bound to the whatever is annotated with  @PathParam("id"). Thus if we do an HTTP delete at http://myserver.com/mail/453 we would be calling the deleteEmail method with argument emailid=453. Finally, the exportHtml method demonstrates how we can get a handle on query parameters. When you annotate a parameter with @QueryParam("x") the value is taken from the HTTP query parameter named x. The @DefaultValue annotation provides a default in case that query parameter is missing. So, calling http://myserver.org/mail/export?searchString=RESTful will call the exportHtml method with a parameter search="RESTful".

To expose this service, first we need to write an implementation of javax.ws.rs.core.Application. That's just a few lines:

public class MyRestApp extends javax.ws.rs.core.Application {
   public Set>Class> getClasses() {
      HashSet S = new HashSet();
      S.add(EmailService.class);
      return S;
   }
}

How this gets plugged into your server depends on your JAX-RS implementation. Before we leave the API, I should mentioned that there's more to it. You do have access to a Request and Response objects. You have annotations to access other contextual information and metadata like HTTP headers, cookies etc. And you can provide custom serialization and deserialization between media types and Java objects.

RESTful vs Web Services
Web services (SOAP, WSDL) were heavily promoted in the past decade, but they didn't become as ubiquitous as their fans had hoped. Blame XML. Blame the rigidity of the XML Schema strong typing. Blame the tremendous overhead, the complexity of deploying and managing a web service. Or, blame the frequent compatibility nightmares between implementations. Reasons are not hard to find and the end result is that RESTful services are much easier to develop and use. But there is a flip side!

The simplicity of RESTful services means that one has less guidance in how to map application logic to a REST API. One of the issues is that instead of the programmatic types we have in programming languages, we have the Java primitives and media types. Fortunately, JAX-RS allows to implement whatever conversions we want between actual Java method arguments and what gets sent on the wire.  The other issue is the limited set of operations that a REST service can offer. While with web services, you define the operation and its semantics just as in a general purpose programming language, with RESTful you're stuck with get, put, post and delete. So, free from the type mismatch nightmare, but tied into only 4 possible operations. This is not as bad as it seems if you view those operations as abstract, meta operations.

The key point when designing RESTful services, whether you are exposing existing application logic or creating a new one, is to think in terms of data resources. That's not so hard since most of what common business applications do is manipulate data. First, because every single thing is identified as a resource, one must come up with an appropriate naming schema. Because URIs are hierarchical, it is easy to devise a nested structure like /productcategory/productname/version/partno. Second, one must decide what kinds of representations are to be supported, both in output and input. For a modern AJAX webpp, we'd mostly use JSON. I would recommend JSON over XML even in a B2B setting where servers talk to each other.

Finally, one must categorize business operation as one of GET, PUT, POST and DELETE. This is probably a bit less intuitive, but it's just a matter of getting used to. For example, instead of thinking about a "Checkout Shopping Cart" operation, think about POSTing a new order. Instead of thinking about a "Login User" operation think about GETing an authentication token. In general, every business operation manipulates some data in some way. Therefore, every business operation can fit into this crude CRUD model. Clearly, most read-only operations should be a GET. However, sometimes you have to send a large chunk of data to the server in which case you should use POST. For example you could post some very time consuming query that require a lot of text to specify. Then the resource you are creating is for example the query result. Another way to decide if you should POST or no is if you have a unique resource identifier. If not, then use POST. Obviously, operations that cause some data to be removed should be a DELETE. The operations that "store" data are PUT and again POST. Deciding between those two is easy: use PUT whenever you are modifying an existing resource for which you have an identifier. Otherwise, use POST. 

Implementations & Resources
There are several implementations to choose from. Since, I haven't tried them all, I can't offer specific comments. Most of them used to require a servlet containers. The Restlet framework by Jerome Louvel never did, and that's why I liked it. Its documentation leaves to be desired and if you look at its code, it's over-architected to a comical degree, but then what ambitious Java framework isn't. Another newcomer that is strictly about REST and seems lightweight is Wink, an Apache incubated project. I haven't tried it, but it looks promising. And of course, one should not forget the reference implementation Jersey. Jersey has the advantage of being the most up-to-date with the spec at any given time. Originally it was dependent on Tomcat. Nowadays, it seems it can run standalone so it's on par with Restlet which I mentioned first because that's what I have mostly used. 

Here are some further reading resources, may their representational state be transferred to your brain and properly encoded from HTML/PDF to a compact and efficient neural net:
  1. The Wikipedia article on REST is not in very good shape, but still a starting point if you want to dig deeper into the conceptual framework. 
  2. Refcard from Dzone.com: http://refcardz.dzone.com/refcardz/rest-foundations-restful#refcard-download-social-buttons-display 
  3. Wink's User Guide seems well written. Since it's an implementation of JAX-RS, it's a good documentation of that technology.
  4. http://java.dzone.com/articles/putting-java-rest: A fairly good show-and-tell introduction to the JAX-RS API, with a link in there to a more in-depth description of REST concepts by the same author. Worth the read. 
  5.  http://jcp.org/en/jsr/detail?id=311: The official JSR 311 page. Download the specification and API Javadocs from there.
  6. http://jsr311.java.net/nonav/javadoc/index.html: Online access of JSR 311 Javadocs.
If you know of something better, something nice, please post it in a comment and I'll include in this list.

PS: I'm curious if people start new projects with Servlets, JSP/JSF these days? I would be curious as to what the rationale would be to pick those over AJAX + RESTful services communication via JSON. As I said above, this entry is intended to help readers of the eValhalla blogs series which chronicles the development of the eValhalla project following precisely the AJAX+REST model 

Saturday, October 20, 2012

eValhalla Setup

[Previous in this series: eValhalla Kick Off, Next: eValhalla User Management]

The first step in eValhalla after the official kick off is to setup a development environment with all the selected technologies. That's the goal for this iteration. I'll quickly go through the process of gathering the needed libraries and implement a simple login form that ties everything together.

Technologies

Here are the technologies for this project:
  1. Scala programming language - I had a dilemma. Java has a much larger user base and therefore should have been the language of choice for a tutorial/promotional material on HGDB and JSON storage with it. However, this is actually a serious project to go live eventually and I needed an excuse to code up something more serious with Scala, and Scala has enough accumulated merits, so Scala it is. However, I will show some Java code as well,  just in the form of examples, equivalent to the main code.
  2. HyperGraphDB with mJson storage - that's a big part of my motivation to document this development. I think HGDB-mJson are a really neat pair and more people should use them to develop webapps. 
  3. Restlet framework REST - this is one of very few implementations of JSR 311, that is sort of lightweight and has some other extras when you need them. 
  4. jQuery - That's a no brainer.
  5. AngularJS - Another risky choice, since I haven't used this before. I've used KnockoutJS and Require.js, both great frameworks and well-thought out. I've done some ad hoc customization of HTML tags, tried various template engines, AngularJS promises to give me all of those in a single library. So let's give it a shot.
Getting and Running the Code

Before we go any further, I urge you to get, build and run the code. Besides Java and Scala, I encourage you to get a Git client (Git is now supported on Windows as well), and you need the Scala Build Tool (SBT). Then, on a command console, issue the following commands:
  1. git clone https://github.com/publicvaluegroup/evalhalla.git
  2. cd evalhalla
  3. git checkout phase1
  4. sbt
  5. run
Note the 3d step of checking out the  phase1 Git tag - every blog entry is going to be a separate development phase so you can always get the state of the project at a particular blog entry.  If you don't have Git, you can download an archive from:

https://github.com/publicvaluegroup/evalhalla/zipball/phase1

All of the above commands will take a while to execute the first time, especially if you don't have SBT yet. But at the end you should see the something like this on your console:

[info] Running evalhalla.Start 
No config file provided, using defaults at root /home/borislav/evalhalla
checkpoint kbytes:0
checkpoint minutes:0
Oct 18, 2012 12:01:01 AM org.restlet.engine.connector.ClientConnectionHelper start
INFO: Starting the internal [HTTP/1.1] client
Oct 18, 2012 12:01:01 AM org.restlet.engine.connector.ServerConnectionHelper start
INFO: Starting the internal [HTTP/1.1] server on port 8182
Started with DB /home/borislav/evalhalla/db

and you should have a running local server accessible at http://localhost:8182. Hit that URL, type in a username and a password and hit login.

Architectural Overview

The architecture is made up of a minimal set of REST services that essentially offer user login and access-controlled data manipulation to a JavaScript client-side application. The key will be to come up with an access policy that deals gracefully with a schema free database.

The data itself consists of JSON objects stored as a hypergraph using HGDB-mJson. From the client side we can create new objects and store them. We can then query for them or delete them. So it's a bit like the old client-server model from the 90s. HyperGraphDB supports strongly typed data, but we won't be taking advantage of that. Instead, each top-level JSON object will have a special property called entity that will contain the type of the database entity as a string. This way, when we search for all users for example, we'll be searching for all JSON objects with property entity="user".

There are many reasons to go for REST+AJAX rather than, say, servlets. I hardly feel the need to justify it  - it's stateless, you don't have to deal with dispatching, you just design an API, more responsive, we're in 2013 soon after all. The use of JSR 311 allows us to switch server implementations easily. It's pretty well-designed: you annotate your classes and methods with the URI paths they must be invoked for. Then a method's parameters can be bound either to portions of a URI, or to HTTP query parameters or form fields etc.

I'm not sure yet what the REST services will be exactly, but the idea is to keep them very generic so the core could be just plugged for another webapp and writing future applications could be entirely done in JavaScript.

Project Layout

The root folder contains the SBT build file build.sbt, a project folder with some plugin configurations for SBT and a src folder that contains all the code following Maven naming conventions which SBT adopts. The src/main/html and src/main/javascript folders contain the web application. When you run the server with all default options, that's where it serves up the files from. This way, you can modify them and just refresh the page.  Then src/main/scala contains our program and src/main/java some code for Java programmers to look at and copy & paste. The main point of the Java code is really to help people that want to use all those cool libraries but prefer to code in Java instead.

To load the project in Eclipse,  use SBT to generate project files for you. Here's how:
  1. cd evalhalla
  2. sbt
  3. eclipse
  4. exit
Then you'll have a .project and a .classpath file in the current directory, so you can go to your Eclipse and just do "Import Project". Make sure you've run the code before though, in order to force SBT to download all dependencies.

Code Overview

Ok, let's take a look at the code now, all under src/main. First, look at html/index.html, which gets loaded as the default page. It contains just a login form and the interesting part is the function eValhallaController($scope, $http). This function is invoked by AngularJS due to the ng-controller attribute in the body tag. It provides the data model of the HTML underneath and also a login button event handler, all through the $scope parameter. The properties are associated with HTML element via ng-model and buttons to functions with ng-click. An independent tutorial on AngularJS, one of few since it's pretty new, can be found here.

The doLogin posts to /rest/user/login. That's bound to the evalhalla.user.UserService.authenticate method (see user package). The binding is done through the standard JSR 311 Java annotations, which also work in Scala. I've actually done an equivalent version of this class in Java at  java/evalhalla/UserServiceJava. A REST service is essentially a class where some of the public methods represent HTTP endpoints. An instance of such a class must be completely stateless, a JSR 311 implementation is free to create fresh instances for each request. The annotations work by deconstructing an incoming request's URI into relative paths at the class level and then at the method level. So we have the @Path("/user") annotation (or @Path("/user1") for the Java version so they don't conflict).  Note the @Consumes and @Produces annotations at the class level that basically say that all methods in that REST service accept JSON content submitted and return back JSON results. Note further how the authenticate method takes a single Json parameter and returns a Json value. Now, this is mjson.Json and JSR 311 doesn't know about it, but we can tell it to convert to/from in/output stream. This is done in the java/evalhalla/JsonEntityProvider.java class (which I haven't ported to Scala yet). This entity provider and the REST services themselves are plugged into the framework at startup, so before examining the implementation of authenticate, let's look at the startup code.

The Start.scala file contains the main program and the JSR 311 eValhalla application implementation class. The application implementation is only required to provide all REST services as a set of classes that the JSR 311 framework introspects for annotations and for the interfaces they implement. So the entity converter mentioned above, together with both the Scala and Java version of the user service are listed there. The main program itself contains some boilerplate code to initialize the Restlet framework and asks it to serve up some files from the html and javascript folders and it also attaches the JSR 311 REST application under the 'rest' relative path.

An important line in main is evalhalla.init(). This initializes the evalhalla package object defined in scala/evalhalla/package.scala. This is where we put all application global variables and utility methods. This is where we initialized the HyperGraphDB instance. Let's take a closer look. First, configuration is optionally provided as a JSON formatted file, the only possible argument to the main program. All properties of that JSON are optional and have sensible defaults. With respect to deployment configuration, there are two important locations: the database location and the web resources location. The database location, specified with dbLocation, is by default taken to be db under the working directory, from where the application is run. So for example if you've followed the above instructions to run the application from the SBT command prompt for the first time, you'd have a brand new HyperGraphDB instance created under your EVALHALLA_HOME/db. The web resources served up (html, javascript, css, images) are configured with siteLocation the default being src/main so you can modify source and refresh. So here is how the database is created. You should be able to easily follow Scala code even if you're a mainly Java programmer.

    val hgconfig = new HGConfiguration()
    hgconfig.getTypeConfiguration().addSchema(new JsonTypeSchema())
    graph = HGEnvironment.get(config.at("dbLocation").asString(), hgconfig)
    registerIndexers
    db = new HyperNodeJson(graph)

Note that we are adding a JsonTypeSchema to the configuration before opening the database. This is important for the mJson storage implementation that we are mostly going to rely on. Then we create the graph database, create indices (for now just an empty stub) and last but not least create an mJson storage view on the graph database - a HyperNodeJson instance. Please take a moment to go through the wiki page on HGDB-mJson. The graph and db variables above are global variables that we will be accessing from everywhere in our application. 

Some other points of interest here are the utility methods:

  def ok():Json = Json.`object`("ok", True);
  def ko(error:String = "Error occured") = Json.`object`("ok", False, "error", error);

Those are global as well and offer some standard result values from REST services that the client side may rely on. Whenever everything went well on the server, it returns an ok() object that has a boolean true ok property. If something went wrong, the ok boolean of the JSON returned by a REST call is false and the error property provides an error message. Any other relevant data, success or failure, is embedded with those ok or ko objects. 

Lastly, it is common to wrap pieces of code in transactions. After all, we are developing a database backed applications and we want to take full advantage of the ACID capabilities of HyperGraphDB. Scala makes this particularly easy because it supports closures. So we have yet another global utility method that takes a closure and runs it as a HGDB transaction:

  def transact[T](code:Unit => T) = {
    try{
    graph.getTransactionManager().transact(new java.util.concurrent.Callable[T]() {
      def call:T = {
        return code()
      }
    });
    }
    catch { case e:scala.runtime.NonLocalReturnControl[T] => e.value}
  }

This will always create a new transaction. Because BerkeleyDB JE, which is the storage engine by default as of HyperGraphDB 1.2, doesn't supported nested transaction, one must make sure the transact is not called within another transaction. So when we are in a situation where we want to have a transaction and we'd happily be embedded in some top-level one, we can call another global utility function: ensureTx, which behaves like transact except it won't create a new transaction if one is already in effect.

Ok, armed with all these preliminaries, we are now able to examine the authenticate method:

    @POST
    @Path("/login")
    def authenticate(data:Json):Json = {
        return evalhalla.transact(Unit => {
            var result:Json = ok();
            var profile = db.retrieve(jobj("entity", "user", 
                        "email", data.at("email").asString().toLowerCase()))
            if (profile == null) { // automatically register user for now         
              val newuser = data.`with`(jobj("entity", "user"));
              db.addTopLevel(newuser);
            }
            else if (!profile.is("password", data.at("password")))
                result = ko("Invalid user or password mismatch.");
            result;   
        });
    }

The @POST annotation means that this endpoint will be matched only for an HTTP post method. First we do a lookup for the user profile. We do this by pattern matching. We create a Json object that first identifies that we are looking for an object of type user by setting the entity property to "user". Then we provide another property, the user's email which we know is supposed to be unique so we can treat it as a primary key. However, note that neither HyperGraphDB nor its Json storage implementation provides some notion of a primary other than HyperGraphDB atom handles. The HyperNodeJson.retrieve method returns only the first object matching the pattern. If you want an array of all objects matching the pattern use HyperNodeJson.getAll. Note the 'jobj' method call in there: this is a rename in the import section of the Json.object factory method. It is necessary because in Scala object is a keyword. Another way to use a keyword as a method name in Scala beside import rename, on can wrap it in backquotes ` as is done with data.`with` above which is basically an append operation, merging the properties of one Json object into another. The db.addTopLevel is explained in the HGDB-mJson wiki. Also, you may want to refer to the mJson API Javadoc. One last point though about the structure of the authenticate method: there are no local returns. The result variable contains the result and it is written as the last expression of the function and therefore returned as the result. I like local returns actually (i.e. return statement in the middle of the method following if conditions or within loops or whatever), but the way Scala implements them is by throwing a RuntimeException. However, this exception gets caught inside the HyperGraphDB transaction which has a catch all clause and treats exceptions as a true error conditions rather then some control flow mechanism.  This can be fixed inside HyperGraphDB, but avoiding local returns is not such a bad coding practice anyway.

Final Comments

Scala is new to me so take my Scala coding style with a grain of salt. Same goes with AngularJS. I use Eclipse and the Scala IDE plugin from update site  http://download.scala-ide.org/releases-29/stable/site (http://scala-ide.org/download/nightly.html#scala_ide_helium_nightly_for_eclipse_42_juno for Eclipse 4.2). Some of the initial Scala code is translated from equivalent Java code from other projects. If you haven't worked with Scala, I would recommend giving it a try. Especially if, like me, you came to Java from a richer industrial language like C++ and had to give up a lot of expressiveness.

I'll resist the temptation to make this into an in-depth tutorial of everything used to create the project. I'll say more about whatever felt not that obvious and give pointers, but mostly I'm assuming that the reader is browsing the relevant docs alongside reading the code presented here. This blog is mainly a guide.

In the next phase, we'll do the proverbial user registration and put in place some security mechanisms.

Tuesday, October 9, 2012

eValhalla Kick Off

[Next in this series eValhalla Setup]

As promised in this previous blog post, I will now write a tutorial on using the mJson-HGDB backend for implementing REST APIs and web apps. This will be more extensive than originally hinted at. I got involved in a side project to implement a web site where people can post information about failed government projects. This should be pretty straightforward so a perfect candidate for a tutorial. I decided to use that opportunity to document the whole development in the form of blog series, so this will include potentially useful information for people not familiar with some of the web 2.0 libraries such as jQuery and AngularJS which I plan to use. All the code will be available in GitHub. I am actually a Mercurial user, something turned me off from Git when I looked at it before (perhaps just its obnoxious author), but I decided to use this little project as an opportunity to pick up a few new technologies. The others will be Scala (instead of Java) and AngularJS (instead of Knockoutjs).


About eValhalla

Valhalla - the hall of Odin into which the souls of heroes slain in battle and others who have died bravely are received.

The aim is to provide a forum for people involved in (mainly software) projects within government agencies to report anonymously on those projects' failures. Anybody can create a login, without providing much personal information and be guaranteed that whatever information is provided remains confidential if they so choose. Then they can describe projects they have insider information about and that can be of interest to the general public. Those projects could be anything from small scale, internal-use only, local government, to larger-scale publicly visible nation-level government projects.

I won't go into a "mission statement" type of description here. You can see it as a "wiki leaks" type transparency effort, except we'd be dealing with information that is in the public domain, but that so far hasn't had an appropriate outlet. Or you can see it as a fun place to let people vent their frustrations about mis-management, abuses, bad decisions, incompetence etc. Or you can see it as a means to learn from experience in one particular type of software organization: government IT departments. And those are a unique breed. What's unique? Well, the hope is that such an online outlet will make that apparent.

Requirements Laundry List

Here is the list of requirements, verbatim as sent to me by the project initiator:
  • enter project title
  • enter project description
  • enter project narrative
  • enter location
  • tag with failure types
  • tag with subject area/industry sector
  • tag with technologies
  • enter contact info
  • enter project size
  • enter project time frame (year)
  • enter project budget
  • enter outcome (predefined)
  • add lessons learned
  • add pic to project
  • ability to comment on project
  • my projects dashboard (ability to add, edit, delete)
  • projects can be saved as draft and made public later
  • option to be anonymous when adding specific projects
  • ability to create profile (username, userpic, email, organization, location)
  • ability to edit and delete profile
  • administrator ability to feature projects on main page
  • search for projects based on above criteria and tags
  • administrator ability to review projects prior to them being published
Nevermind that the initiator in question is currently pursuing a Ph.D. in requirements engineering. Those are a good enough start. We'll have to implement classic user management and then we have our core domain entity: a failed project, essentially a story decorated with various properties and tags, commented on. As usualy, we'd expect requirements to change as development progresses, new ideas will popup, old ideas will die and force refactorings etc. Nevermind, I will maintain a log of those development activities and if you are following, do not expect anything described to be set in stone.

Architecture

Naturally, the back-end database will be HyperGraphDB with its support of plain JSON-as-a-graph storage as described here. We will also make use of two popular JavaScript libraries: jQuery and AngularJS as well as whatever related plugins come in handy.

Most of the application logic will reside on the client-side. In fact, we will be implementing as much business logic in JavaScript as security concerns allow us. The server will consist entirely of REST services based on the JSR 311 standard so they can be easily run on any of the server software supporting that standard. To make this a more or less self-contained tutorial, I will be describing most of those technologies and standards along the way, at least to the extent that they are being used in our implementation. That is, I will describe as much as needed to help you understand the code. However, you need familiarity with Java and JavaScript  in order to follow.

The data model will be schema-less, yet our JSON will be structured and we will document whenever certain properties are expected to be present and we will follow conventions helps us navigate the data model more easily.

Final Words

So stay tuned. The plan is to start by setting up a development environment, then implement the user management part, then submission of new projects (my project dashboard), online interactions, search, administrative features to manage users, the home page etc.

Tuesday, August 14, 2012

On Testing

More than decade ago, I spend a year doing quality assurance at a big and successful smart card company. It was one of the more intellectually stimulating jobs I've had in the software industry and I ended up developing a scripting language, sadly long lost, to do test automation. Doing testing right can be much harder than writing the software being tested. Especially when you're after 100% quality as is the case with software burned on the chips of millions of smart cards that would have to be thrown away in case of a serious bug discovered post-production. Companies aiming at this level of quality get ISO 900x certified to show off to clients. To get certified, they have to show a QA process that guarantees, to an acceptable degree of confidence, that products delivered work, that the organization is solid, knowledge is preserved etc. etc. The interesting part that I'd like to share with you is the specific software QA approach. It did involve an obscene amount of both documentation and software artifacts that had to be produced in a very rigid formal setting, but the philosophy behind spec-ing out the tests was sound, practical and better than anything I've seen since. 

Dijkstra famously said that testing can only prove the presence of errors, never the absence of errors. True that. To have 100% guarantee that a program works, one would need to produce a mathematical proof of its correctness. Perhaps not sufficient...as Knuth not less famously noted, when sending a program to a colleague, that the program must be used with caution because he only proved it correct, but never tested it. In either case, ultimately the goal is to gain confidence in the quality of a piece of software. We do that by presenting a strong, very, very convincing argument that the program works. 

When discussing testing methodology people generally talk about automated vs. manual testing, test-driven development where test cases are developed before the code or "classic" testing where they are done after development, but rarely do I see people mindful of how tests should be specified. The term test itself is used rather ambiguously to mean the action of testing, or the specification or the process or the development phase. And in some contexts a test means a test case which refers to an input data set, or it refers to an automated program (e.g. a jUnit test case). So let's agree that, whether you code it up or do it manually, a test case consists of a sequence of steps taken to interact with the software being tested and verify the output with the goal of ensuring that said software behaves as expected. So how do you go about deciding what this sequence of steps should be? In other words, how do you gather test requirements? 

Think of it in dialectical terms - you're trying to convince a skeptic that a program works correctly. First you'd have to agree what it means for that program to work correctly. Well, they say, it must match the requirements. So you start by reading all requirements and translating that last statement ("it must match the requirements" ) for each one of them into a corresponding set of test criteria. Naturally, the more detailed the requirements are, the easier that process is. In an agile setting, you might be translating user stories into test criteria. Let's have a simple running example:

Requirement:

Login form should include a captcha protection

Test Criteria:

  • C1 - the login form should display a string as an image that's hard to recognize by a program.
  • C2 - the login form should include an input field that must match the string in the captcha image for login to succeed. 

Notice how the test criteria explicitly state under what conditions one can say that a program works. One can list more criteria with further detail, stating what happens if the captcha doesn't match. What happens after n number of repeats etc. Also, this should make it clear that test criteria are not actual tests. They are not something that can be executed (manually or automatically). In fact, they are to a test program what requirements are to the software being QA-ed. And as with conventional requirements, the more clear you are on your test criteria, the better chance you have in developing adequate tests. 

The crucial point is that when you write a test case, you want to make sure that it is with a well-defined purpose, that it serves as a demonstration that an actual test criterion has been met. And this is what's missing in 90% of testing efforts (well, to be sure this is just anecdotal evidence). People write or perform tests simply for the sake of trying out things. Tests accumulate and if you have a lot of them, it makes it look like you're in good shape. But that's not necessarily the case because tests can only prove the presence of errors, not their absence. To convince your dialectical opponent of the absence of errors, given the agreed upon list of criteria, you'd have to show how your tests prove that the criteria have been met. In other words, you want to ensure that all your test criteria have been covered by appropriate test cases - for each test criterion there is at least one test case that, when successful, shows that this criterion is satisfied. A convenient way to do that is to create a matrix where you list all your criteria in the rows and all your test cases in the columns and checkmark a given cell whenever the test case covers the corresponding criterion, where "covers" means that if the test case succeeds one can be confident that the criterion is met. This implies that the test case itself will have to include all necessary verification steps. Continuing with our simple example, suppose you've developed a few test cases:

  • T1 - test succesful login
  • T2 - test failed login for all 3 fields, bad user, bad password, bad captcha
  • T3 - test captcha quality by running image recognition algos on captcha
 T1T2T3T4T5......
C1  X    
C2XX     
...       

A given test case may cover a certain aspect of the program, but you'd put a checkmark only if it actually verifies the criteria in question. For instance T3 would be loading a login page, but it won't be testing actual login. Similarly, T1 and T2 can observe the captcha, but they won't evaluate its quality. It may appear a bit laborious as an approach. In the aforementioned company, this was all documented ad nauseam. Criteria were classified as "normal", "abnormal", "stress" and what not, reflecting different types of expected behaviors and possible execution contexts. Now, I did warn you - this was a QA process aimed at living up to ISO standards. And it did. But think about the information this matrix provides you. It is a full, detailed spec of your software. It is a full inventory of your test suite. It tells what part of the program is being tested by what test. It shows you immediately if some criteria are not being covered by a test, or not covered enough. If shows you immediately if some criteria are being covered too much, i.e. if some tests are superfluous. When tests fail, it tells you exactly what behaviors of the software are not functioning properly. Recall that one of the main problems with automated testing is the explosion of code that needs to be written to achieve descent coverage. This matrix can go a long way to controlling that code explosion by keeping each test case with a relatively unique purpose. Most importantly, the matrix presents a pretty good argument for the program's correctness - you can see at a glance both how correctness has been defined (the list of criteria) and how it is demonstrated (the list of tests cross-referenced with criteria).

Reading about testing even from big industry names, I have been frequently disappointed at the lack of systematic approach to test requirements. In practice it's even worse. Developers, testers, business people in general have no idea what they are doing when testing. This includes agile teams where tests are sometimes supposed to constitute the specification of the program. That's plain wrong, first because it's code and code is way too low-level to be understood by all stakeholders, hence it can't be a specification that can be agreed upon by different parties. Second, because usually the same people write both the tests and the program tested, the same bugs sneak in both places and never get discovered, the same possibly wrong understanding of the desired behavior is found in both places. So expressing the quality argument (i.e. with the imaginary dialectical adversary) simply in the form of test cases can't cut it. 

That said, I wouldn't advocate following the approach outlined above verbatim and in full detail. But I would recommend having the mental picture of that Criteria x Tests matrix as guide to what you're doing. And if you're building a regression test suite, and especially if some of the tests are manual, it might be worth your while spelling it out in the corporate wiki somewhere.

Cheers,

Boris

Sunday, June 17, 2012

JSON Storage in HyperGraphDB

JSON (http://json.org) came about as a convenient, better-than-XML data format that one could directly work from in JavaScript.  About a year ago, I got a bit disappointed by the JSON libraries out there. JSON is so simple, yet all of them so complicated, so verbose. Most of it comes from the culture of strong typing in Java and from the presumed use of JSON on the server as an intermediate representation to be mapped to Java beans, where the application domain is usually modeled. But JSON is powerful enough on its own if you are willing to trade off type safety for flexibility. So I wrote mJson (for "minimal" Json), a single source file library for working with JSON from Java. For a description see the blog posts:

Using this library, I've been able to avoid creating the usual plethora of Java classes to work with my domain in several applications. Coding with it is pretty neat. One can pretty much view JSON as a general minimalistic, highly flexible modeling tool, a bit like s-expressions or any general enough abstract data structure (another example are first-order logic terms). 

Now, there's a persistence layer for those JSON entities based on HyperGraphDB. Unlike "document-oriented" databases, the representation in HyperGraphDB is really graph-like with all the expected consequences. It is described in the HyperGraphDB Json Wiki Page, pretty stable, convenient and fun to work with. 

If time permits, I will write a few posts in the form of a tutorial to show how to quickly build database backed JSON REST services with mJson and HyperGraphDB. If you have a suggestion of an application domain to cover, please share it in a comment!

In the meantime, I invite you to give it a try! 

Regards,

Boris

Monday, June 11, 2012

HyperGraphDB 1.2 Beta now available

Kobrix Software is pleased to announce the release of HyperGraphDB version 1.2.

HyperGraphDB is a general purpose, free open-source data storage mechanism. Geared toward modern applications with complex and evolving domain models, it is suitable for semantic web, artificial intelligence, social networking or regular object-oriented business applications.

This release contains numerous bug fixes and improvements over the previous 1.1 release. A fairly complete list of changes can be found at the Changes for HyperGraphDB, Release 1.2 wiki page.

  1. Introduction of a new HyperNode interface together with several implementations, including subgraphs and access to remote database peers. The ideas behind are documented in the blog post HyperNodes Are Contexts.
  2. Introduction of a new interface HGTypeSchema and generalized mappings between arbitrary URIs and HyperGraphDB types.
  3. Implementation of storage based on the BerkeleyDB Java Edition (many thanks to Alain Picard and Sebastian Graf!). This version of BerkeleyDB doesn't require native libraries, which makes it easier to deploy and, in addition, performs better for smaller datasets (under 2-3 million atoms).
  4. Implementation of parametarized pre-compiled queries for improved query performance. This is documented in the Variables in HyperGraphDB Queries blog post.

HyperGraphDB is a Java based product built on top of the Berkeley DB storage library.

Key Features of HyperGraphDB include:
  • Powerful data modeling and knowledge representation.
  • Graph-oriented storage.
  • N-ary, higher order relationships (edges) between graph nodes.
  • Graph traversals and relational-style queries.
  • Customizable indexing.
  • Customizable storage management.
  • Extensible, dynamic DB schema through custom typing.
  • Out of the box Java OO database.
  • Fully transactional and multi-threaded, MVCC/STM.
  • P2P framework for data distribution.
In addition, the project includes several practical domain specific components for semantic web, reasoning and natural language processing. For more information, documentation and downloads, please visit the HyperGraphDB Home Page.

Sunday, May 20, 2012

Variables in HyperGraphDB Queries

Introduction
A feature that has been requested on a few occasions is the ability to parametarize HyperGraphDB queries with variables, sort of like the JDBC PreparedStatement allows. That feature has now been implemented with the introduction of named variables. Named variables can be used in place of a regular condition parameters and can contain arbitrary values, whatever is expected by the condition. The benefits are twofold: increase performance by avoiding to recompile queries where only a few condition parameters change and a cleaner code with less query duplication. The performance increase of precompiled queries is quite tangible and shouldn't be neglected. Nearly all predefined query conditions support variables, except for the TypePlusCondition which is expanded into a type disjunction during query compilation and therefore cannot be easily precompiled into a parametarized form. It must also be noted that some optimizations made during the compilation phase, in particular related to index usage, can't be performed when a condition is parametarized. A simple example of this limitation is a parameterized TypeCondition - indices are defined per type, so if the type is not known during query analysis, no index will be used.

The API
The API is designed to fit the existing query condition expressions. A variable is created by calling the hg.var method and the result of that method passed as a condition parameter. For example:

HGQuery<HGHandle> q = hg.make(HGHandle.class, graph).compile(
    hg.and(hg.type(typeHandle),hg.incident(hg.var("targetOfInterest"))));
This will create a query that looks for links of specific type and that point to a given atom, parameterized by the variable "targetOfInterest". Before running the query, one must set a value of the variable:

q.var("targetOfInterest", targetHandle);
q.execute();

It is also possible to specify an initial value of a variable as the second of the hg.var method:


q = hg.make(HGHandle.class, graph).compile(
    hg.and(hg.type(typeHandle),hg.incident(hg.var("targetOfInterest", initialTarget))));
q.execute();
Note the new hg.make static method: it takes the result type and the graph instance and returns a query object that doesn't have a condition attached to it yet. To attach a condition, one must call the HGQuery.compile method immediately after make. This version of make establishes a variable context attached to the newly constructed query. All query conditions constructed after the call to make and before the call to compile will implicitly assume variables belong to that context. That is why to avoid surpises you should call compile right after make.

Once a query is thus created, one can execute it many times by assigning new values to the variables through the hg.var method as shown above.

The Implementation
The requirement of calling HGQuery.compile right of the make method comes from the fact that the existing API constructs conditions in a static environment. The condition is fully constructed before it is passed to the compile method. So to make a variable context available during the condition construction process (the evaluation of the query condition expression), we do something relatively unorthodox here and attach that context (essentially a name/value map) as a thread local variable in the make method. The alternative would have been creating a traversal of the condition expression that collects all variables used within them. Either that traversal process would have to know about all condition types, or the HGQueryCondition would have had to be enhanced with some extra interface for "composite conditions" and "variable used within a condition". I didn't like either possibility. Having a two step process of calling hg.make(...).compile(....) seems clean and concise enough.

Variables are implemented through a couple of new interfaces: org.hypergraphdb.util.Ref, org.hypergraphdb.util.Var. When a condition is created with a constant value (instead of a variable), the org.hypergraphdb.util.Constant implementation of the Ref interface is used. All conditions now contain Ref<T> private member variable where before they contained a member variable of type T.  I won't say more about those reference abstractions, except one should avoid using them, unless one is implementing a custom condition...they are considered internal.

On a side note, this idea of creating a reference abstraction essentially implements the age-old strategy of introducing another level of indirection in order to solve a design problem. I've used this concept in many places now, including for things like caching and dealing with complex scopes in bean containers (Spring), so I spend some time trying to create a more general library around this. Originally I intended to create this library as a separate project and make use of it in HyperGraphDB. But I didn't come up with something satisfactory enough, so instead I created the couple of interfaces mentioned above. However, eventually those interfaces and classes may be replaced by a separate library and ultimately removed from the HyperGraphDB codebase.