
Saturday, February 25, 2017

Prolog, OWL and HyperGraphDB

One of the more attractive aspects of HyperGraphDB is that its model is so general and its API so open, all the way down to the storage layer, that many (meta) models can be implemented very naturally and efficiently on top of it. Not only that, but those meta models coexist and can form interesting computational synergies when implementing actual business applications. By meta model here I mean whatever formalism or data structure one uses to model the domain of an application, e.g. RDF or UML or whatever. As I've pointed out elsewhere, there is a price to pay for this generality. For one, it's very hard to answer the question "what is the model of HyperGraphDB? Is it a graph? Is it a relational model? Is it object-oriented?". The answer is: it is whatever you want it to be. Just as there is no one correct domain model for a given application, there is no one correct meta model. The notion of multi-model or polyglot databases is now becoming somewhat of a buzzword. It naturally attracts attention because, just as you can't solve all computational problems with a single data structure, you can't implement all types of applications with a single data model. The design of HyperGraphDB recognized that very early. As a consequence, meta models like FOL (first-order logic) as implemented in Prolog and DL (description logic) as realized in the OWL 2 standard can become happy teammates in tackling difficult knowledge representation and reasoning problems. In this post, we will demonstrate a simple integration between the TuProlog interpreter and the OWL HyperGraphDB module.

We will use Seco because it is easier to share sample code in a notebook. And besides, if you are playing with HyperGraphDB, Seco is a great tool. If you've seen some of the more recent notebook environments like Jupyter and the like, the notion should be familiar. A tarball with all the code, both the notebook and a Maven project, can be found at the end of this post.

Installing and Running Seco with Prolog and OWL


First, download Seco from:

https://github.com/bolerio/seco/releases/tag/v0.7.0

then unzip the tarball and start it with the run.sh (Mac or Linux) or run.cmd (Windows) script. You should see a Welcome notebook. Read it - it'll give you an intro to the tool.

Seco comes pre-bundled with a few mainstream and not-so-mainstream scripting languages such as BeanShell, JavaScript and JScheme. To add Prolog to the mix, go ahead and download:

https://github.com/bolerio/seco/releases/download/v0.7.0/seco-prolog-0.7.0.jar

put the jar in SECO_HOME/lib and restart Seco. It should pick up the new language automatically. To test it:

  1. Open a new notebook
  2. Right-click anywhere inside and select "Set Default Language" as Prolog
You can now see the Prolog interpreter in action by typing for example:

father('Arya Stark', 'Eddard Stark').
father(computing, 'Alan Turing').


Evaluate the cell by hitting Shift+Enter - because of the dot at the end of each clause, they will be added as new facts. Then you can query them by typing:

father(X, Y)?


Again, evaluate with Shift+Enter - because of the question mark at the end, it will be treated as a query. The output should be a small interactive component that allows you to iterate through the possible solutions of the query. There should be two solutions, one for each declared father.

So far so good. Now let's add OWL. The jar you need can be found in the HyperGraphDB Maven repo:

hgdbowl-1.4-20170225.063004-9.jar

You can just put this jar under SECO_HOME/lib and restart. For simplicity, let's just do that.

Side note: you can also add it to the runtime context classpath (for more, see https://github.com/bolerio/seco/wiki/Short-Seco-Tutorial). The notion of a runtime context in Seco is somewhat analogous to an application in a J2EE container: it has its own class loader with its own runtime dependencies etc. Just like with a Java web server, jars in the lib directory are available to all runtime contexts, while jars added to a particular runtime context are available only to that context. Seco creates a default context so you always have one, but you can of course create others.

With the OWL module installed, we can load an ontology into the database. The sample ontology for this blog can be found at:

https://gist.github.com/bolerio/523d80bb621207e87ff0024a96616255

Save that file as owlsample-1.owl somewhere on your machine. To load it, open another notebook, this time without changing the default language (which will be BeanShell). Then you can load the ontology with the following code, which you should copy and paste into a notebook cell and evaluate with Shift+Enter:

import org.hypergraphdb.app.owl.*;
import org.semanticweb.owlapi.model.*;

File ontologyFile = new File("/home/borislav/temp/owlsample-1.owl");

HGDBOntologyManager manager = HGOntologyManagerFactory.getOntologyManager(niche.getLocation());
HGDBOntology ontology = manager.importOntology(IRI.create(ontologyFile), new HGDBImportConfig());
System.out.println(ontology.getAxioms());


The printout should spit out some axioms on your console. If that works, you have the Prolog and OWL modules running in a HyperGraphDB database instance. In this case, the database instance is the Seco niche (code and notebooks in Seco are automatically stored in a HyperGraphDB instance called the niche). To do it outside Seco, take a look at the annotated Java code in the sample project linked to in the last section.
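
For reference, here is a minimal standalone sketch of the same load outside Seco, using the same calls as in the notebook code above; the database directory and the ontology file path are placeholders to adjust for your machine.

import java.io.File;

import org.hypergraphdb.app.owl.*;
import org.semanticweb.owlapi.model.*;

// Standalone sketch (no Seco): instead of the niche, the ontology manager is
// obtained from a plain HyperGraphDB location on disk. Paths are placeholders.
public class LoadSampleOntology {
    public static void main(String[] args) throws Exception {
        String dbLocation = "/tmp/hgdb-owl-demo";               // any HyperGraphDB directory
        File ontologyFile = new File("/path/to/owlsample-1.owl");

        HGDBOntologyManager manager =
            HGOntologyManagerFactory.getOntologyManager(dbLocation);
        HGDBOntology ontology =
            manager.importOntology(IRI.create(ontologyFile), new HGDBImportConfig());
        System.out.println(ontology.getAxioms());
    }
}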

Prolog Reasoning Over OWL Axioms


So now suppose we want to query OWL data from Prolog, just as any other Prolog predicate, with the ability to backtrack etc. All we need to do is write a HyperGraphDB query expression and associate it with a Prolog predicate. For example, a query expression that will return all OWL object property assertions looks like this:

hg.typePlus(OWLObjectPropertyAssertionAxiom.class)


The reason we use typePlus (meaning atoms of this type and all its sub-types) is that the concrete type of OWL object property axioms will be HGDB-implementation dependent and it's good to remain agnostic of that. Then for the actual binding to a Prolog predicate, one needs to dig into the TuProlog implementation internals just a bit. Here is how it looks:


// the HGPrologLibrary is what integrates the TuProlog interpreter with HGDB 
import alice.tuprolog.hgdb.HGPrologLibrary;

// Here we associate a database instance with a Prolog interpreter instance and that association
// is the HGPrologLibrary
HGPrologLibrary lib = HGPrologLibrary.attach(niche, 
                                             thisContext.getEngine("prolog").getProlog());

// Each library contains a map between Prolog predicates (name/arity) and arbitrary HyperGraphDB
// queries. When the Prolog interpreter sees that predicate, it will invoke the HGDB query
// as a clause store.
lib.getClauseFactory().getPredicateMapping().put(
    "objectProperty/3", 
     hg.typePlus(org.hypergraphdb.app.owl.model.axioms.OWLObjectPropertyAssertionAxiomHGDB.class));

Evaluate the above code in the BeanShell notebook.

With that in place, we can now perform a Prolog query to retrieve all OWL object property assertions from our ontology:


objectProperty(Subject, Predicate, Object)?


Evaluating the above Prolog expression in the Prolog notebook should open up that solution navigator and display the object property triples one by one.

Note: we've been switching between the BeanShell and Prolog notebooks to evaluate code in two different languages. But you can also mix languages in the same notebook. The tarball of the sample project linked below contains a single notebook file called prologandowl.seco which you can File->Import into Seco and evaluate without the copy & paste effort. In that notebook, cells have been individually configured for different languages.

What's Missing?


A tighter integration between this trio of HyperGraphDB, Prolog and OWL would include the following missing pieces:

1. Ability to represent OWL expressions (class and property) in Prolog
2. Ability to assert & retract OWL axioms from Prolog
3. Ability to invoke an OWL reasoner from Prolog
4. Perhaps add a new term type in Prolog representing an OWL entity

I promise to report when that happens. The whole thing would make much more sense if there were an OWL reasoner implementation included in the OWL-HGDB module, instead of the current RAM-limited approach of off-the-shelf reasoners like Pellet and HermiT.

Appendix - Annotated Non-Seco Version

You can find a sample standalone Java project that does exactly the above, consisting of a Maven pom.xml with the necessary dependencies here:

http://www.kobrix.com/samples/hgdb-prologowl-sample.tgz

The tarball also contains a Seco notebook with the above code snippets that you can import to evaluate the cells and see the code in action. The OWL sample ontology is also in there.

Wednesday, November 11, 2015

Announcing Seco 0.6 - Collaborative Scripting For Java

This is a short post to announce an incremental 0.6 release of Seco. The release comes with important bug fixes and a simplified first-time user experience.


Seco is a collaborative scripting development environment for the Java platform. You can write code in many JVM scripting languages. The code editor in Seco is based on the Mathematica notebook UI, but the full GUI is richer and much more ambitious. In a notebook, you can mix rich text with code and output, including interactive components created by your code. This makes Seco a live environment because you can evaluate expressions and immediately see the changes to your program.


Monday, August 31, 2015

Scheduling Tasks and Drawing Graphs — The Coffman-Graham Algorithm

When an algorithm developed for one problem domain applies almost verbatim to another, completely unrelated domain, that is the type of insight, beauty and depth that makes computer science a science in its own right, and not a branch of something else, namely mathematics, as many professionals educated in the field mistakenly believe. For example, one of the common algorithmic problems during the 60s was the scheduling of tasks on multiprocessor machines. The problem is: you are given a large set of tasks, some of which depend on others, that have to be scheduled for processing on N processors in such a way as to maximize processor use. A well-known algorithm for this problem is the Coffman-Graham algorithm. It assumes that there are no circular dependencies between the tasks, as is usually the case when it comes to real-world tasks, except in catch-22 situations at some bureaucracies run amok! To do that, the tasks and their dependencies are modeled as a DAG (a directed acyclic graph). In mathematics, this is also known as a partial order: if a task T1 depends on T2, we say that T2 precedes T1, and we write T2 < T1. The ordering is called partial because not all tasks are related by this precedence relation; some are simply independent of each other and can be safely carried out in parallel.

The Coffman-Graham algorithm works by creating a sequence of execution rounds where at each round at most N tasks execute simultaneously. The algorithm also has to make sure that all dependencies of the tasks in the current round have been executed in previous rounds. Those two constraints are what make the problem non-trivial: we want exactly N tasks at each round of execution if possible, so that all processors get used, and we also have to complete all tasks that precede a given task T before scheduling it. There are 3 basic steps to achieving the objective:


  1. Clean up the graph so that only direct dependencies are represented. So if there is a task A that depends on B and B depends on another task C, we already know that A depends “indirectly” on C (transitivity is one of the defining features of partial orders), so that dependency does not need to be stated explicitly. Sometimes the input of a problem will have such superfluous information, but in fact this could only confuse the algorithm! Removing the indirect dependencies is called transitive reduction, as opposed to the more common operation of transitive closure, which explicitly computes all indirect dependencies.
  2. Order the tasks in a single list so that the dependent ones come after their dependencies and they are sort of evenly spread apart. This is the crucial and most interesting part of the algorithm. So how are we to compare two tasks and decide which one should be run first? The trick is to proceed from the starting tasks, the roots of the graph that don’t have any dependencies whatsoever, and then progressively add tasks that depend only on those, then tasks that depend only on those, etc. This is called a topological ordering of the dependency graph. There are usually many possible such orderings and some of them will lead to a good, balanced distribution of tasks for the purpose of CPU scheduling while others will leave lots of CPUs unused. In step (3) of the algorithm, we are just going to take the tasks one by one from this ordering and assign them to execution rounds as they come. Therefore, to make it so that at each round the number of CPUs used is maximized, the ordering must somehow space the dependencies apart as much as possible. That is, if the order is written as [T1, T2, T3, …, Tn] and if Tj depends on Ti, we want j-i to be as big as possible. Intuitively, this is desirable because the closer they are, the sooner we’d have to schedule Tj for execution after Ti, and since they can’t be executed on the same parallel round, we’d end up with unused CPUs. To space the tasks apart, here is what we do. Suppose we have tasks A and B, with all their dependencies already ordered in our list, and we have to pick which one is going to come next. From A’s dependencies, we take the one most recently placed in the ordering and we check if it comes before or after the most recently placed task from B’s dependencies. If it comes before, then we choose A; if it comes after, then we choose B. If it turns out A and B’s most recently placed dependency is actually the same task that both depend on, we look at the next most recent dependency, etc. This way, by picking the next task as the one whose closest dependency is the furthest away, at every step we space out dependencies in our ordering as much as possible.
  3. Assign tasks to rounds so as to maximize the number of tasks executed on each round. This is the easiest step - we just reap the rewards from doing all the hard work of the previous steps. Going through our ordering [T1, T2, T3, …, Tn], we fill up available CPUs by assigning the tasks one by one. When all CPUs are occupied, we move to the next round and we start filling CPUs again. If, while at a particular round, the next task to be scheduled has a dependency that’s also scheduled for that same round, we have no choice but to leave the remaining CPUs unused and start the next round. The algorithm does not take into account how long each task can potentially take. A minimal code sketch of this step is shown right after this list.
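
To make that last step concrete, here is a minimal Java sketch of the round-assignment phase. It assumes the tasks are already in the topological order produced by step 2; the class and method names are illustrative, not taken from any particular library.

import java.util.*;

// Minimal sketch of step 3 (round assignment), assuming the tasks are already
// topologically ordered as described in step 2.
public class CoffmanGrahamRounds {

    // order: tasks in the topological order from step 2
    // deps:  for each task, the set of tasks it directly depends on
    // cpus:  number of processors, i.e. maximum tasks per round
    public static List<List<Integer>> assignRounds(List<Integer> order,
                                                   Map<Integer, Set<Integer>> deps,
                                                   int cpus) {
        List<List<Integer>> rounds = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        Map<Integer, Integer> roundOf = new HashMap<>(); // task -> index of its round

        for (int task : order) {
            // A task cannot run in the same round as any of its dependencies.
            boolean depInCurrentRound = deps.getOrDefault(task, Collections.emptySet())
                    .stream().anyMatch(d -> roundOf.getOrDefault(d, -1) == rounds.size());
            if (current.size() == cpus || depInCurrentRound) {
                rounds.add(current);           // close the current round
                current = new ArrayList<>();   // and start a new one
            }
            roundOf.put(task, rounds.size());
            current.add(task);
        }
        if (!current.isEmpty())
            rounds.add(current);
        return rounds;
    }

    public static void main(String[] args) {
        // Task 2 depends on 1, task 3 depends on 2, task 4 is independent. With 2 CPUs:
        Map<Integer, Set<Integer>> deps = new HashMap<>();
        deps.put(2, Set.of(1));
        deps.put(3, Set.of(2));
        List<Integer> order = List.of(1, 4, 2, 3);
        System.out.println(assignRounds(order, deps, 2)); // prints [[1, 4], [2], [3]]
    }
}
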
Now, I said that this algorithm is also used to solve a completely different problem. The problem I was referring to is drawing networks in a visually appealing way. This is a pretty difficult problem and there are many different approaches whose effectiveness often depends on the structure of the network. When a network is devoid of cycles (paths from one node back to itself), the Coffman-Graham algorithm just described can be applied!
The idea is to think of the network nodes as the tasks and of the network connections as the dependencies between the tasks, and then build a list of consecutive layers analogous to the task execution rounds. Instead of specifying a number of available CPUs, one specifies how many nodes per layer are allowed, which is generally convenient because the drawing is done on a fixed-width computer screen. Because the algorithm does not like circular dependencies, there is an extra step here to remove a select set of connections so that the network becomes a DAG. This is in addition to the transitive reduction where we only keep direct connections and drop all the rest. Once the algorithm is complete, the drawing of those provisionally removed connections can be performed on top of the nice layering produced. Thus, Coffman-Graham is (also!) one of the hierarchical drawing algorithms, a general framework for graph drawing developed by Kozo Sugiyama.

Tuesday, July 28, 2015

HyperGraphDB 1.3 Released

Kobrix Software is pleased to announce the release of HyperGraphDB 1.3.

This is a maintenance release containing many bug fixes and small improvements. Most of the efforts in this round have gone towards the various application modules built upon the core database facility.

Go directly to the download page.

HyperGraphDB is a general purpose, free open-source data storage mechanism. Geared toward modern applications with complex and evolving domain models, it is suitable for semantic web, artificial intelligence, social networking or regular object-oriented business applications.
This release contains numerous bug fixes and improvements over the previous 1.2 release. A fairly complete list of changes can be found at the Changes for HyperGraphDB, Release 1.3 wiki page.

HyperGraphDB is a Java based product built on top of the Berkeley DB storage library.

Key Features of HyperGraphDB include:
  • Powerful data modeling and knowledge representation.
  • Graph-oriented storage.
  • N-ary, higher order relationships (edges) between graph nodes.
  • Graph traversals and relational-style queries.
  • Customizable indexing.
  • Customizable storage management.
  • Extensible, dynamic DB schema through custom typing.
  • Out of the box Java OO database.
  • Fully transactional and multi-threaded, MVCC/STM.
  • P2P framework for data distribution.
In addition, the project includes several practical domain specific components for semantic web, reasoning and natural language processing. For more information, documentation and downloads, please visit the HyperGraphDB Home Page.

Many thanks to all who supported the project and actively participated in testing and development!

Tuesday, August 26, 2014

Where are the JVM Scripting IDEs?

The rise of scripting languages in the past decade has been spectacular. And since the JVM platform is the largest, a few were designed specifically for it while many others were also implemented on top of it. It is thus that we have JRuby, Jython, Groovy, Clojure, Rhino, JavaFX and the more obscure (read: more fun) things like Prolog and Scheme implementations. Production code is being written, dynamic language code bases are growing, whole projects don't even have any Java code proper. Yet when it comes to tooling, the space is meager to say the least.

What do we have? In the Eclipse world, there's the Dynamic Languages Toolkit which you can explore at http://www.eclipse.org/dltk/, or some individual attempts like http://eclipsescript.org/ for the Rhino JavaScript interpreter or the Groovy plugin at http://groovy.codehaus.org/Eclipse+Plugin. All of those provide means to execute a script inside the Eclipse IDE and possibly syntax highlighting and code completion. The Groovy plugin is really advanced in that it offers debugging facilities, which of course is possible because the Groovy implementation itself has support for it. That's great. But frankly, I'm not that impressed. Scripting seems to me a different beast than normal development. Normally you do scripting via a REPL, which is traditionally a very limited form of UI because it's constrained by the limitations of a terminal console. What text editors do to kind of emulate a REPL is let you select the expression to evaluate as a portion of the text, or take everything on a line, or, if they are more advanced, use the language's syntax to get to the smallest evaluable expression. It still feels a little awkward. NetBeans' support is similar. Still not impressed. "What more do you want?", you may ask. Well, I don't know exactly, but more. There's something I do when I write code in scripting languages, a certain state of mind and a way of approaching problems, that is not the same as with static, verbose languages such as Java.

The truth is the IDE brought something to Java (and Pascal and C++ etc.) that made the vast majority of programmers never want to look back. Nothing of the sort has happened with dynamic languages. What did IDEs bring? Code completion was a latecomer, compared to integrated debugging and the project management abilities. Code completion came in at about the same time as tools to navigate large code bases. Both of those need a structured representation of the code, and until IDEs got powerful and fast enough to quickly generate and maintain in sync such a representation, we only had an editor + debugger + a project file. Now IDEs also include anything and everything around the development process, all with the idea that the programmer should not leave the environment (never mind that we prefer to take a walk outside from time to time - I don't care about your integrated browser, Chrome is an Alt-Tab away!).

Since I've been coding with scripting languages even before they became so hot, I had that IDE problem a long time ago. That is to say, more than 10 years ago. And there was one UI for scripting that I thought was not only quite original, but a great match for the kind of scripting I was usually doing, namely exploring and testing APIs, writing utilities, throwaway programs, prototypes - lots of activities that occasionally occupy a bigger portion of my time than end-user code. That UI was the Mathematica notebook. If you have never heard of it, Mathematica (http://www.wolfram.com/mathematica) is a commercial system that came out in the 90s and has steadily been growing its user base, with even larger ambitions as of late. The heart of it is its term-rewrite programming language, nice graphics and sophisticated math algorithms, but the notion of a notebook, as a better-than-REPL interface, is applicable to any scripting (i.e. evaluation-based, interpreted) language. A notebook is a structured document that has input cells, output cells, groups of cells, groups of groups of cells etc. The output cells contain anything that the input produces, which can be a complex graphic display or even an interactive component. That's perfect! How come we haven't seen it widely applied?

Thus Seco was born. To a first approximation, Seco is just a shell for JVM dynamic languages that imitates Mathematica's notebooks. It has ambitions a bit beyond that, moving towards an experimental notion of software development as a semi-structured, evolutionary process. Because of that grand goal, which should not distract you from the practicality of the tool that I and a few friends and colleagues have been using for years, Seco has a few extras, like the fact that your work is always persisted on disk, and the more advanced zoomable interface that goes beyond the mere notebook concept. The best way to see why this is worth blogging about is to play with it a little. Go visit http://kobrix.com/seco.jsp.

Seco was written almost in its entirety by a former Kobrix Software employee, Konstantin Vandev. It is about a decade old, but active development stopped a few years ago. I took a couple of hours here and there in the past months to fix some bugs and started implementing a new feature for a centralized, searchable repository for notebooks so people can back up their work remotely, access it and/or publish it. That feature is not ready, but I'd like to breathe some life into the project by making a release. So consider this an official Seco 0.5 release which, besides the aforementioned bug fixes, upgrades to the latest version of HyperGraphDB (the backing database where everything gets stored) and removes the dependency on the BerkeleyDB native library, so it's pure Java now.

Monday, November 19, 2012

eValhalla User Management

[Previous in this series: eValhalla Setup]

In this installment of the eValhalla development chronicle, we will be tackling what's probably the one common feature of most web-based applications - user management. We will implement:
  1. User login
  2. A "Remember me" checkbox for auto-login of users on the same computer
  3. Registration with email validation
The end result can be easily taken and plugged into your next web app! Possible future improvements would be adding captchas during registration and/or login and the ability to edit more extensive profile information.

I have put comments wherever I felt appropriate in the code, which should be fairly straightforward anyway. So I won't be walking you through it line by line. Rather, I will explain what it does and comment on design decisions or on less obvious parts. First, let me say a few words about how user sessions are managed.

User Sessions

Since we are relying on a RESTful architecture, we can't have the server hold user sessions. We need to store user session information at the client and transfer it with each request. Well, the whole concept of a user session is kind of irrelevant here since the application logic mostly resides at the client and the server is mainly consulted as a database. Still, the data needs protection and we need the notion of a user with certain access rights. So we have to solve the problem of authentication. Each request needs to carry enough information for the server to decide whether the request should be honored or not. We cannot just rely on the user id because that way anybody could send a request with anybody else's id. To authenticate the client, the server will first supply it with a special secret key, an authentication token, that the client must send on each request along with the user id. To obtain that authentication token, the client must, however, identify itself with a password. And that's the purpose of the login operation: obtaining an authentication token for use in subsequent requests. The client will then keep that token together with the user id and submit them as HTTP headers on every request. The natural way to do that with JavaScript is to store the user id and authentication token as cookies.
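
To illustrate that flow, here is a small, hypothetical Java sketch of issuing and verifying such a token. The class and method names are made up for this example; a real application would persist tokens (e.g. alongside the user profile) and expire them rather than keep them in an in-memory map.

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the token scheme described above; an in-memory map
// keeps the example self-contained.
public class AuthTokenSketch {
    private final Map<String, String> tokenByUser = new ConcurrentHashMap<>();

    // Called once the password check succeeds; the client stores the returned
    // token (typically as a cookie) and sends it back on every request.
    public String issueToken(String userId) {
        String token = UUID.randomUUID().toString();
        tokenByUser.put(userId, token);
        return token;
    }

    // Called for every subsequent request, with the user id and token the
    // client supplied (e.g. as HTTP headers or cookies).
    public boolean isAuthenticated(String userId, String token) {
        return token != null && token.equals(tokenByUser.get(userId));
    }
}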

This authentication mechanism is commonly used when working within a RESTful architecture. For more on the subtleties of that approach, just google "user authentication and REST applications". One question is why not just send the password with every request instead of a separate token. That's possible, but more risky - a token is generated randomly and it expires, hence it is harder to guess. The big issue however is XSS (cross-site scripting) attacks. In brief, with XSS an attacker inserts HTML code into a field that gets displayed, supposedly as just static text, to other users (e.g. a blog title), and the code simply makes an HTTP request to a malicious server submitting all the users' private cookies with it. To avoid such attacks, we will have to pay special attention to HTML sanitization. That is, we have to properly escape every HTML tag displayed as static text. We can also put the authentication token in an HTTPOnly cookie for extra security.

Implementation - a User Module

Since user management is so common, I made a small effort to build a relatively self-contained module for it. There are no standard packaging mechanisms for the particular technology stack we're using, so you'd just have to copy&paste a few files:

  • /html/ahguser.ht - contains the HTML code for login and registration dialogs as well as top-level Login and Register links that show up right aligned. This depends on the whole Angular+AngularUI+Bootstrap+jQuery environment. 
  • /javascript/ahguser.js - contains the Angular module 'ahgUser' that you can use in your Angular application. This depends on the backend:
  • /scala/evalhalla/user/package.scala - the evalhalla mJson-HGDB backend REST user service. 

The backend can be easily packaged in a jar, mavenized and all, and this is something that I might do near the end of the project. 

The registration process validates the user's email by emailing them a randomly generated UUID (a HyperGraphDB handle) and requiring that they provide it back for validation before they can legitimately log into the site. Since the backend code is simple enough, repetitive even, let's look at one method only, the register REST endpoint:

@POST
@Path("/register")
def register(data:Json):Json = {
  return transact(Unit => {
    normalize(data);
    // Check if we already have that user registered
    var profile = db.retrieve(
        jobj("entity", "user", 
             "email", data.at("email")))
    if (profile != null)
      return ko().set("error", "duplicate")
    // Email validation token
    var token = graph.getHandleFactory()
          .makeHandle().toString()
    db.addTopLevel(data.set("entity", "user")
                       .set("validationToken", token)
                       .delAt("passwordrepeat"))
    evalhalla.email(data.at("email").asString(), 
      "Welcome to eValhalla - Registration Confirmation",
      "Please validate registration with " + token)
    return ok
  })
}

So we see our friends db, transact, jobj etc. from last time. The whole thing is a Scala transaction closure, with Scala's much nicer syntax compared to java.lang.Callable. As a reminder, note that most of the functions and objects referred to here are globally declared in the evalhalla package. For example, db is an instance of the HyperNodeJson class. The call to normalize just ensures the email is lower-case, because email addresses are in general case-insensitive. While the logic is fairly straightforward, let me make a few observations about the APIs.

User profile lookup is done with Json pattern matching. Recall that the database stores arbitrary Json structures (primitives, arrays, objects and nulls) as HyperGraphDB atoms. It doesn't have the notion of "document collections" like MongoDB, for example. This is because each Json structure is just a portion of the whole graph. So to distinguish between different types of objects we define a special JSON property called entity that we reserve as a type name attached to all of our top-level Json entities. Here we are dealing with entities of type "user". Now, each atom in HyperGraphDB, and consequently each user profile, has a unique identifier - the HyperGraphDB handle (a UUID). There is no notion of a primary key enforced at the database level. We know that an email should uniquely identify a user profile, so we perform the lookup with the Json object {entity:"user", email:<the email>} as a pattern. But this uniqueness is enforced at the application level, because new profiles are added only via this register method. I've explained the db.addTopLevel method on the HGDB-mJson wiki page, but here is a summary for the impatient: while db.add will create a duplicate version of the whole Json structure recursively and db.assertAtom will only create something if it doesn't exist yet, db.addTopLevel will create a new database atom only for the top-level JSON, but perform an assert operation for each of its components.
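
As a tiny illustration of the difference between those three calls, here is a hedged Java sketch; the package names of HyperNodeJson and JsonTypeSchema are assumptions based on the HGDB-mJson module, and the database location is a placeholder.

import mjson.Json;
import mjson.hgdb.HyperNodeJson;   // package assumed from the HGDB-mJson module
import mjson.hgdb.JsonTypeSchema;  // package assumed from the HGDB-mJson module
import org.hypergraphdb.HGConfiguration;
import org.hypergraphdb.HGEnvironment;
import org.hypergraphdb.HyperGraph;

public class AddFlavors {
    public static void main(String[] args) {
        // Open a graph with the Json type schema registered, as in the setup post.
        HGConfiguration config = new HGConfiguration();
        config.getTypeConfiguration().addSchema(new JsonTypeSchema());
        HyperGraph graph = HGEnvironment.get("/tmp/evalhalla-demo", config); // placeholder
        HyperNodeJson db = new HyperNodeJson(graph);

        Json user = Json.object("entity", "user", "email", "somebody@example.org");
        db.add(user);         // always stores a fresh copy of the whole structure
        db.assertAtom(user);  // stores it only if an identical structure isn't already there
        db.addTopLevel(user); // new atom for the top-level object, assert for its components

        graph.close();
    }
}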

Finally, note that before adding the profile to the database, we delete the passwordrepeat property. This is because we are storing in the database whatever JSON object the client gives us. And the client is giving us all fields coming from the HTML registration form, pretty much as the form is defined. So we get rid of that unneeded field. Granted, it would actually be better design to remove that property on the client side, since the repeat-password validation is done there. But I wanted to illustrate the fact that the JSON data is really flowing as is from the HTML form directly to the database, with no mappings or translations of any sort needed.

So far, so good. Let's move on to the UI side.

Client-Side Application

In addition to AngularJS, I've incorporated the Bootstrap front-end library by Twitter. It looks good and it has some handy components. I've also included the Angular-UI library which has some extras like form validation and it plays well with Bootstrap.

The whole application resides on one top-level HTML page /html/index.html. The main JavaScript program that drives it resides in the /javascript/app.js file. The idea is to dynamically include fragments of HTML code inside a reserved main area while some portions of the page such as the header with user profile functionality remain stable. By convention, I use the extension '.ht' for such HTML fragments instead of '.html' to indicate that the content is not a complete HTML page. Here's the main page in its essence:


<div ng-controller="MainController">
<div ng-include src="'/ahguser.ht'"></div>
<!-- some extra text here... -->
<hr/>
<a href="#m">main</a> |
<a href="#t">test</a>
<hr/>
<ng-view/>  <!-- this is where HTML fragments get included -->
</div>

The MainController function is in app.js and it will eventually establish some top-level scoped variables and functions, but it's currently empty. The user module described above is included as a UI component by the <div ng-include...> tag. The login and register links that you see at the top right of the screen, with all their functionality, come from that component. Finally, just for the fun of it, I made a small menu with two links to switch the main view from one page fragment to another: 'main' and 'test'. This is just to set up a blueprint for future content.

A Note on AngularJS

Let me explain a bit the Angular portion of that code, since this framework, as pretentious as it is, lacks documentation and is far from being an intuitive API. In fact, I wouldn't recommend it yet. My use of it is experimental and the first impressions are a bit ambivalent. Yes, eventually you get stuff to work. However, it's bloated and it fails the first test of any over-reaching framework: simple things are not easy. It seems built by a team of young people that use the word awesome a lot, so we'll see if it meets the second test, that complicated things should be possible. Over the years, I've managed to stay away from other horrific, bloated frameworks aggressively marketed by big companies (e.g. EJBs), but here I may be just a bit too pessimistic. Ok, enough.

Angular reads and interprets your HTML before it gives it to the browser. That opens many doors. In particular, it allows you to define custom attributes and tags to implement some dynamic behaviors. It has the notion of an application and the notion of modules. An application is associated with a top-level HTML tag, usually the 'html' tag itself by setting the custom ng-app='appname' attribute. Then appname is declared as a module in JavaScript:

var app = angular.module('appname', [array of module dependencies])

It's not clear what's special about an application vs. mere modules, presumably nothing. Then functionality is attached to markup (the "view") via Angular controllers. Those are JavaScript functions that you write and Angular calls to set up the model bound to the view. Controller functions take any number of parameters and Angular uses a nifty trick here. When you call the toString method of a JavaScript function, it returns its full text as it was originally parsed. That includes the formal arguments, exactly with the names you have listed in the function declaration (unless you've used some sort of minification/obfuscation tool). So Angular parses that argument list and uses the names of the parameters to determine what you need in your function. For example, when you declare a controller like this:

function MainController($scope, $http) {
}
A call to MainController.toString() returns the string "function MainController($scope, $http) { }". Angular parses that string and determines that you want "$scope" and "$http". It recognizes those names and passes in the appropriate arguments for them. The name "$scope" is a predefined AngularJS name that refers to an object to be populated with the application model. Properties of that object can be bound to form elements, or displayed in a table or whatever. The name "$http" refers to an Angular service that allows you to make AJAX calls. As far as I understand it, any global name registered as a service with Angular can be used in a controller parameter list. There's a dependency injection mechanism that takes care of, among other things, hooking up services in controllers by matching parameter names with globally registered functions. I still haven't figured out what the practical benefit of that is, as opposed to having global JavaScript objects yourself....perhaps in really large applications where different clusters of the overall module dependency graph use the same names for different things.

Beginnings of a Data Service

One of the design goals in this project is to minimize the amount of code handling CRUD data operations. After all, CRUD is pretty standard and we are working with a schema-less, flexible database. So we should be able to do CRUD on any kind of structured object we want without having to predefine its structure. In all honesty, I'm not sure this will be as easy as it sounds. As mentioned before, the main difficulty is security access rules. It's certainly doable, but we shall see what kind of complexities it leads to in the subsequent iterations. For now, I've created a small class called DataService that allows you to perform a simple JSON pattern lookup as well as all CRUD operations on any entity identified by its HyperGraphDB handle.

One can experiment with this interface by making $.ajax calls in the browser's REPL. The interface is certainly going to evolve and change, but here are a couple of calls you can try out. I've made the '$http' service available as a global variable:

$http.post("/rest/data/entity", {entity:'story', content:'My story starts with....'})
$http.get("/rest/data/list?pattern=" + JSON.stringify({entity:'story'})).success(function(A) { console.log(A); })
The above creates a new entity in the DB and then retrieves it via a query for all entities of that type (i.e. entity:'story'). You can also play around with $http.put and $http.delete.

Conclusion

All right, this concludes the 2nd iteration of eValhalla. We've implemented a big part of what we'd need for user management. We've explored AngularJS as a viable UI framework and we've laid the groundwork for a data-centered REST service. To get it, follow the same steps as before, but check out the phase2 Git tag instead of phase1:


  1. git clone https://github.com/publicvaluegroup/evalhalla.git
  2. cd evalhalla
  3. git checkout phase2
  4. sbt
  5. run


Coming Up...

On the next iteration, we will do some data modeling and define the main entities of our application domain. We will also implement a portion of the UI dealing with submission and listing of stories. 

Sunday, November 4, 2012

HyperGraphDB 1.2 Final Released

Kobrix Software is pleased to announce the release of HyperGraphDB 1.2 Final.

Several bugs were found and corrected during the beta testing period, most notably having to do with indexing.

Go directly to the download page.

HyperGraphDB is a general purpose, free open-source data storage mechanism. Geared toward modern applications with complex and evolving domain models, it is suitable for semantic web, artificial intelligence, social networking or regular object-oriented business applications.
This release contains numerous bug fixes and improvements over the previous 1.1 release. A fairly complete list of changes can be found at the Changes for HyperGraphDB, Release 1.2 wiki page. Highlights include:
  1. Introduction of a new HyperNode interface together with several implementations, including subgraphs and access to remote database peers. The ideas behind it are documented in the blog post HyperNodes Are Contexts.
  2. Introduction of a new interface HGTypeSchema and generalized mappings between arbitrary URIs and HyperGraphDB types.
  3. Implementation of storage based on the BerkeleyDB Java Edition (many thanks to Alain Picard and Sebastian Graf!). This version of BerkeleyDB doesn't require native libraries, which makes it easier to deploy and, in addition, performs better for smaller datasets (under 2-3 million atoms).
  4. Implementation of parameterized pre-compiled queries for improved query performance. This is documented in the Variables in HyperGraphDB Queries blog post.
HyperGraphDB is a Java based product built on top of the Berkeley DB storage library.

Key Features of HyperGraphDB include:
  • Powerful data modeling and knowledge representation.
  • Graph-oriented storage.
  • N-ary, higher order relationships (edges) between graph nodes.
  • Graph traversals and relational-style queries.
  • Customizable indexing.
  • Customizable storage management.
  • Extensible, dynamic DB schema through custom typing.
  • Out of the box Java OO database.
  • Fully transactional and multi-threaded, MVCC/STM.
  • P2P framework for data distribution.
In addition, the project includes several practical domain specific components for semantic web, reasoning and natural language processing. For more information, documentation and downloads, please visit the HyperGraphDB Home Page.

Many thanks to all who supported the project and actively participated in testing and development!

Saturday, October 20, 2012

eValhalla Setup

[Previous in this series: eValhalla Kick Off, Next: eValhalla User Management]

The first step in eValhalla after the official kick off is to setup a development environment with all the selected technologies. That's the goal for this iteration. I'll quickly go through the process of gathering the needed libraries and implement a simple login form that ties everything together.

Technologies

Here are the technologies for this project:
  1. Scala programming language - I had a dilemma. Java has a much larger user base and therefore should have been the language of choice for a tutorial/promotional material on HGDB and JSON storage with it. However, this is actually a serious project meant to go live eventually, I needed an excuse to code up something more serious with Scala, and Scala has enough accumulated merits, so Scala it is. That said, I will show some Java code as well, just in the form of examples equivalent to the main code.
  2. HyperGraphDB with mJson storage - that's a big part of my motivation to document this development. I think HGDB and mJson are a really neat pair and more people should use them to develop webapps. 
  3. Restlet framework - this is one of the very few implementations of JSR 311 (REST) that is sort of lightweight and has some other extras when you need them. 
  4. jQuery - That's a no brainer.
  5. AngularJS - Another risky choice, since I haven't used this before. I've used KnockoutJS and Require.js, both great, well-thought-out frameworks. I've done some ad hoc customization of HTML tags and tried various template engines; AngularJS promises to give me all of that in a single library. So let's give it a shot.
Getting and Running the Code

Before we go any further, I urge you to get, build and run the code. Besides Java and Scala, I encourage you to get a Git client (Git is now supported on Windows as well), and you need the Scala Build Tool (SBT). Then, on a command console, issue the following commands:
  1. git clone https://github.com/publicvaluegroup/evalhalla.git
  2. cd evalhalla
  3. git checkout phase1
  4. sbt
  5. run
Note the 3rd step of checking out the phase1 Git tag - every blog entry is going to be a separate development phase, so you can always get the state of the project at a particular blog entry. If you don't have Git, you can download an archive from:

https://github.com/publicvaluegroup/evalhalla/zipball/phase1

All of the above commands will take a while to execute the first time, especially if you don't have SBT yet. But at the end you should see something like this on your console:

[info] Running evalhalla.Start 
No config file provided, using defaults at root /home/borislav/evalhalla
checkpoint kbytes:0
checkpoint minutes:0
Oct 18, 2012 12:01:01 AM org.restlet.engine.connector.ClientConnectionHelper start
INFO: Starting the internal [HTTP/1.1] client
Oct 18, 2012 12:01:01 AM org.restlet.engine.connector.ServerConnectionHelper start
INFO: Starting the internal [HTTP/1.1] server on port 8182
Started with DB /home/borislav/evalhalla/db

and you should have a running local server accessible at http://localhost:8182. Hit that URL, type in a username and a password and hit login.

Architectural Overview

The architecture is made up of a minimal set of REST services that essentially offer user login and access-controlled data manipulation to a JavaScript client-side application. The key will be to come up with an access policy that deals gracefully with a schema-free database.

The data itself consists of JSON objects stored as a hypergraph using HGDB-mJson. From the client side we can create new objects and store them. We can then query for them or delete them. So it's a bit like the old client-server model from the 90s. HyperGraphDB supports strongly typed data, but we won't be taking advantage of that. Instead, each top-level JSON object will have a special property called entity that will contain the type of the database entity as a string. This way, when we search for all users for example, we'll be searching for all JSON objects with property entity="user".

There are many reasons to go for REST+AJAX rather than, say, servlets. I hardly feel the need to justify it - it's stateless, you don't have to deal with dispatching, you just design an API, it's more responsive, and we're in 2013 soon after all. The use of JSR 311 allows us to switch server implementations easily. It's pretty well-designed: you annotate your classes and methods with the URI paths they must be invoked for. Then a method's parameters can be bound either to portions of a URI, or to HTTP query parameters, or to form fields etc.

I'm not sure yet what the REST services will be exactly, but the idea is to keep them very generic so the core could just be plugged into another webapp and writing future applications could be done entirely in JavaScript.

Project Layout

The root folder contains the SBT build file build.sbt, a project folder with some plugin configurations for SBT and a src folder that contains all the code following Maven naming conventions which SBT adopts. The src/main/html and src/main/javascript folders contain the web application. When you run the server with all default options, that's where it serves up the files from. This way, you can modify them and just refresh the page.  Then src/main/scala contains our program and src/main/java some code for Java programmers to look at and copy & paste. The main point of the Java code is really to help people that want to use all those cool libraries but prefer to code in Java instead.

To load the project in Eclipse,  use SBT to generate project files for you. Here's how:
  1. cd evalhalla
  2. sbt
  3. eclipse
  4. exit
Then you'll have a .project and a .classpath file in the current directory, so you can go to your Eclipse and just do "Import Project". Make sure you've run the code before though, in order to force SBT to download all dependencies.

Code Overview

Ok, let's take a look at the code now, all under src/main. First, look at html/index.html, which gets loaded as the default page. It contains just a login form and the interesting part is the function eValhallaController($scope, $http). This function is invoked by AngularJS due to the ng-controller attribute in the body tag. It provides the data model of the HTML underneath and also a login button event handler, all through the $scope parameter. Properties are associated with HTML elements via ng-model and buttons with functions via ng-click. An independent tutorial on AngularJS, one of few since it's pretty new, can be found here.

The doLogin function posts to /rest/user/login. That's bound to the evalhalla.user.UserService.authenticate method (see the user package). The binding is done through the standard JSR 311 Java annotations, which also work in Scala. I've actually done an equivalent version of this class in Java at java/evalhalla/UserServiceJava. A REST service is essentially a class where some of the public methods represent HTTP endpoints. An instance of such a class must be completely stateless; a JSR 311 implementation is free to create fresh instances for each request. The annotations work by deconstructing an incoming request's URI into relative paths at the class level and then at the method level. So we have the @Path("/user") annotation (or @Path("/user1") for the Java version so they don't conflict). Note the @Consumes and @Produces annotations at the class level that basically say that all methods in that REST service accept JSON content and return JSON results. Note further how the authenticate method takes a single Json parameter and returns a Json value. Now, this is mjson.Json and JSR 311 doesn't know about it, but we can tell it how to convert it to and from input/output streams. This is done in the java/evalhalla/JsonEntityProvider.java class (which I haven't ported to Scala yet). This entity provider and the REST services themselves are plugged into the framework at startup, so before examining the implementation of authenticate, let's look at the startup code.
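
Before that, as a brief aside, here is a hedged sketch of what such a JSR 311 entity provider generally looks like for mjson.Json - an illustration of the mechanism, not a copy of the project's actual JsonEntityProvider; the class name is made up.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.lang.annotation.Annotation;
import java.lang.reflect.Type;
import java.nio.charset.StandardCharsets;

import javax.ws.rs.Consumes;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.MultivaluedMap;
import javax.ws.rs.ext.MessageBodyReader;
import javax.ws.rs.ext.MessageBodyWriter;
import javax.ws.rs.ext.Provider;

import mjson.Json;

// Teaches the JSR 311 runtime how to turn a JSON request body into an mjson.Json
// value and a Json result back into a response body.
@Provider
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public class JsonProviderSketch implements MessageBodyReader<Json>, MessageBodyWriter<Json> {

    public boolean isReadable(Class<?> type, Type genericType,
                              Annotation[] annotations, MediaType mediaType) {
        return Json.class.isAssignableFrom(type);
    }

    public Json readFrom(Class<Json> type, Type genericType, Annotation[] annotations,
                         MediaType mediaType, MultivaluedMap<String, String> httpHeaders,
                         InputStream entityStream) throws IOException {
        // Parse the raw request body into a Json value for the resource method.
        return Json.read(new String(entityStream.readAllBytes(), StandardCharsets.UTF_8));
    }

    public boolean isWriteable(Class<?> type, Type genericType,
                               Annotation[] annotations, MediaType mediaType) {
        return Json.class.isAssignableFrom(type);
    }

    public long getSize(Json json, Class<?> type, Type genericType,
                        Annotation[] annotations, MediaType mediaType) {
        return -1; // let the container compute the content length
    }

    public void writeTo(Json json, Class<?> type, Type genericType, Annotation[] annotations,
                        MediaType mediaType, MultivaluedMap<String, Object> httpHeaders,
                        OutputStream entityStream) throws IOException {
        // Serialize the Json result returned by the resource method.
        entityStream.write(json.toString().getBytes(StandardCharsets.UTF_8));
    }
}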

The Start.scala file contains the main program and the JSR 311 eValhalla application implementation class. The application implementation is only required to provide all REST services as a set of classes that the JSR 311 framework introspects for annotations and for the interfaces they implement. So the entity converter mentioned above, together with both the Scala and Java versions of the user service, are listed there. The main program itself contains some boilerplate code to initialize the Restlet framework, asks it to serve up files from the html and javascript folders, and attaches the JSR 311 REST application under the 'rest' relative path.

An important line in main is evalhalla.init(). This initializes the evalhalla package object defined in scala/evalhalla/package.scala. This is where we put all application global variables and utility methods, and it is where the HyperGraphDB instance is initialized. Let's take a closer look. First, configuration is optionally provided as a JSON formatted file, the only possible argument to the main program. All properties of that JSON are optional and have sensible defaults. With respect to deployment configuration, there are two important locations: the database location and the web resources location. The database location, specified with dbLocation, is by default taken to be db under the working directory, from where the application is run. So for example if you've followed the above instructions to run the application from the SBT command prompt for the first time, you'd have a brand new HyperGraphDB instance created under your EVALHALLA_HOME/db. The web resources served up (html, javascript, css, images) are configured with siteLocation, the default being src/main, so you can modify the source and refresh. So here is how the database is created. You should be able to easily follow the Scala code even if you're mainly a Java programmer.

    val hgconfig = new HGConfiguration()
    hgconfig.getTypeConfiguration().addSchema(new JsonTypeSchema())
    graph = HGEnvironment.get(config.at("dbLocation").asString(), hgconfig)
    registerIndexers
    db = new HyperNodeJson(graph)

Note that we are adding a JsonTypeSchema to the configuration before opening the database. This is important for the mJson storage implementation that we are mostly going to rely on. Then we create the graph database, create indices (for now just an empty stub) and last but not least create an mJson storage view on the graph database - a HyperNodeJson instance. Please take a moment to go through the wiki page on HGDB-mJson. The graph and db variables above are global variables that we will be accessing from everywhere in our application. 

Some other points of interest here are the utility methods:

  def ok():Json = Json.`object`("ok", True);
  def ko(error:String = "Error occured") = Json.`object`("ok", False, "error", error);

Those are global as well and offer some standard result values from REST services that the client side may rely on. Whenever everything goes well on the server, it returns an ok() object that has a boolean true ok property. If something went wrong, the ok boolean of the JSON returned by a REST call is false and the error property provides an error message. Any other relevant data, success or failure, is embedded within those ok or ko objects.

Lastly, it is common to wrap pieces of code in transactions. After all, we are developing a database-backed application and we want to take full advantage of the ACID capabilities of HyperGraphDB. Scala makes this particularly easy because it supports closures. So we have yet another global utility method that takes a closure and runs it as a HGDB transaction:

  def transact[T](code:Unit => T) = {
    try{
    graph.getTransactionManager().transact(new java.util.concurrent.Callable[T]() {
      def call:T = {
        return code()
      }
    });
    }
    catch { case e:scala.runtime.NonLocalReturnControl[T] => e.value}
  }

This will always create a new transaction. Because BerkeleyDB JE, which is the default storage engine as of HyperGraphDB 1.2, doesn't support nested transactions, one must make sure transact is not called within another transaction. So when we are in a situation where we want to have a transaction and we'd be happy to be embedded in some top-level one, we can call another global utility function, ensureTx, which behaves like transact except it won't create a new transaction if one is already in effect.

Ok, armed with all these preliminaries, we are now able to examine the authenticate method:

    @POST
    @Path("/login")
    def authenticate(data:Json):Json = {
        return evalhalla.transact(Unit => {
            var result:Json = ok();
            var profile = db.retrieve(jobj("entity", "user", 
                        "email", data.at("email").asString().toLowerCase()))
            if (profile == null) { // automatically register user for now         
              val newuser = data.`with`(jobj("entity", "user"));
              db.addTopLevel(newuser);
            }
            else if (!profile.is("password", data.at("password")))
                result = ko("Invalid user or password mismatch.");
            result;   
        });
    }

The @POST annotation means that this endpoint will be matched only for an HTTP POST method. First we do a lookup for the user profile. We do this by pattern matching. We create a Json object that first identifies that we are looking for an object of type user by setting the entity property to "user". Then we provide another property, the user's email, which we know is supposed to be unique so we can treat it as a primary key. However, note that neither HyperGraphDB nor its Json storage implementation provides a notion of a primary key other than HyperGraphDB atom handles. The HyperNodeJson.retrieve method returns only the first object matching the pattern. If you want an array of all objects matching the pattern, use HyperNodeJson.getAll. Note the 'jobj' method call in there: this is a rename, in the import section, of the Json.object factory method. It is necessary because object is a keyword in Scala. Another way to use a keyword as a method name in Scala, besides an import rename, is to wrap it in backquotes, as is done with data.`with` above, which is basically an append operation merging the properties of one Json object into another. The db.addTopLevel method is explained in the HGDB-mJson wiki. Also, you may want to refer to the mJson API Javadoc. One last point about the structure of the authenticate method: there are no local returns. The result variable contains the result and it is written as the last expression of the function and therefore returned as its value. I actually like local returns (i.e. a return statement in the middle of the method, following if conditions or within loops or whatever), but the way Scala implements them is by throwing a RuntimeException. However, this exception gets caught inside the HyperGraphDB transaction, which has a catch-all clause and treats exceptions as true error conditions rather than as a control flow mechanism. This can be fixed inside HyperGraphDB, but avoiding local returns is not such a bad coding practice anyway.

Final Comments

Scala is new to me, so take my Scala coding style with a grain of salt. The same goes for AngularJS. I use Eclipse and the Scala IDE plugin from the update site http://download.scala-ide.org/releases-29/stable/site (http://scala-ide.org/download/nightly.html#scala_ide_helium_nightly_for_eclipse_42_juno for Eclipse 4.2). Some of the initial Scala code is translated from equivalent Java code from other projects. If you haven't worked with Scala, I would recommend giving it a try, especially if, like me, you came to Java from a richer industrial language like C++ and had to give up a lot of expressiveness.

I'll resist the temptation to make this an in-depth tutorial on everything used to create the project. I'll say more about whatever seems less than obvious and give pointers, but mostly I'm assuming that the reader is browsing the relevant docs alongside reading the code presented here. This blog is mainly a guide.

In the next phase, we'll do the proverbial user registration and put in place some security mechanisms.

Tuesday, October 9, 2012

eValhalla Kick Off

[Next in this series eValhalla Setup]

As promised in this previous blog post, I will now write a tutorial on using the mJson-HGDB backend for implementing REST APIs and web apps. This will be more extensive than originally hinted at. I got involved in a side project to implement a web site where people can post information about failed government projects. This should be pretty straightforward, so it's a perfect candidate for a tutorial. I decided to use that opportunity to document the whole development in the form of a blog series, so this will include potentially useful information for people not familiar with some of the web 2.0 libraries, such as jQuery and AngularJS, which I plan to use. All the code will be available on GitHub. I am actually a Mercurial user; something turned me off from Git when I looked at it before (perhaps just its obnoxious author), but I decided to use this little project as an opportunity to pick up a few new technologies. The others will be Scala (instead of Java) and AngularJS (instead of Knockoutjs).


About eValhalla

Valhalla - the hall of Odin into which the souls of heroes slain in battle and others who have died bravely are received.

The aim is to provide a forum for people involved in (mainly software) projects within government agencies to report anonymously on those projects' failures. Anybody can create a login without providing much personal information, and be guaranteed that whatever information they do provide remains confidential if they so choose. Then they can describe projects they have insider knowledge about and that may be of interest to the general public. Those projects could be anything from small-scale, internal-use-only, local government efforts to larger-scale, publicly visible, national-level government projects.

I won't go into a "mission statement" type of description here. You can see it as a "wiki leaks" type of transparency effort, except we'd be dealing with information that is in the public domain but that so far hasn't had an appropriate outlet. Or you can see it as a fun place for people to vent their frustrations about mismanagement, abuses, bad decisions, incompetence etc. Or you can see it as a means to learn from experience in one particular type of software organization: government IT departments. And those are a unique breed. What's unique about them? Well, the hope is that such an online outlet will make that apparent.

Requirements Laundry List

Here is the list of requirements, verbatim as sent to me by the project initiator:
  • enter project title
  • enter project description
  • enter project narrative
  • enter location
  • tag with failure types
  • tag with subject area/industry sector
  • tag with technologies
  • enter contact info
  • enter project size
  • enter project time frame (year)
  • enter project budget
  • enter outcome (predefined)
  • add lessons learned
  • add pic to project
  • ability to comment on project
  • my projects dashboard (ability to add, edit, delete)
  • projects can be saved as draft and made public later
  • option to be anonymous when adding specific projects
  • ability to create profile (username, userpic, email, organization, location)
  • ability to edit and delete profile
  • administrator ability to feature projects on main page
  • search for projects based on above criteria and tags
  • administrator ability to review projects prior to them being published
Never mind that the initiator in question is currently pursuing a Ph.D. in requirements engineering - those are a good enough start. We'll have to implement classic user management, and then we have our core domain entity: a failed project, essentially a story decorated with various properties and tags and commented on by users. As usual, we should expect requirements to change as development progresses: new ideas will pop up, old ideas will die and force refactorings etc. In any case, I will maintain a log of those development activities, and if you are following along, do not expect anything described here to be set in stone.

Architecture

Naturally, the back-end database will be HyperGraphDB with its support for plain JSON-as-a-graph storage as described here. We will also make use of two popular JavaScript libraries, jQuery and AngularJS, as well as whatever related plugins come in handy.

Most of the application logic will reside on the client side. In fact, we will be implementing as much business logic in JavaScript as security concerns allow us. The server will consist entirely of REST services based on the JSR 311 standard, so they can easily run on any server software supporting that standard. To make this a more or less self-contained tutorial, I will be describing most of those technologies and standards along the way, at least to the extent that they are being used in our implementation. That is, I will describe as much as needed to help you understand the code. However, you need familiarity with Java and JavaScript in order to follow.
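
To fix ideas, here is a minimal, hedged sketch of what a JSR 311 (JAX-RS) resource can look like; the class, paths and method bodies are made up for illustration and are not the actual eValhalla code:

import javax.ws.rs.Consumes;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;

@Path("/projects")
@Produces("application/json")
@Consumes("application/json")
public class ProjectResource {
    @GET
    public String list() {
        return "[]";            // would delegate to the HyperGraphDB/mJson backend
    }

    @POST
    public String submit(String projectJson) {
        return projectJson;     // echo for illustration; real logic belongs in the backend
    }
}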

The data model will be schema-less, yet our JSON will be structured: we will document whenever certain properties are expected to be present, and we will follow conventions that help us navigate the data model more easily.
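
For example, one convention is tagging every top-level document with an "entity" property naming its type, as the user documents elsewhere in this series do. A minimal sketch (the other property names are made up):

import mjson.Json;

public class Conventions {
    // Every top-level document carries an "entity" property naming its type;
    // the remaining properties are illustrative.
    static Json sampleProject() {
        return Json.object(
            "entity", "project",
            "title", "Statewide payroll system rewrite",
            "tags", Json.array("overbudget", "cancelled"),
            "year", 2011);
    }
}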

Final Words

So stay tuned. The plan is to start by setting up a development environment, then implement the user management part, then submission of new projects (my project dashboard), online interactions, search, administrative features to manage users, the home page etc.

Sunday, June 17, 2012

JSON Storage in HyperGraphDB

JSON (http://json.org) came about as a convenient, better-than-XML data format that one could work with directly from JavaScript. About a year ago, I got a bit disappointed by the JSON libraries out there. JSON is so simple, yet all of them are so complicated, so verbose. Most of that comes from the culture of strong typing in Java and from the presumed use of JSON on the server as an intermediate representation to be mapped to Java beans, where the application domain is usually modeled. But JSON is powerful enough on its own if you are willing to trade type safety for flexibility. So I wrote mJson (for "minimal" Json), a single source file library for working with JSON from Java, described in earlier blog posts.

Using this library, I've been able to avoid creating the usual plethora of Java classes to work with my domain in several applications. Coding with it is pretty neat. One can pretty much view JSON as a general, minimalistic, highly flexible modeling tool, a bit like s-expressions or any sufficiently general abstract data structure (another example is first-order logic terms).
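
To give a flavor of the library, here is a small self-contained example (the data is made up):

import mjson.Json;

public class MJsonBasics {
    public static void main(String[] args) {
        // build structures directly, no bean classes needed
        Json user = Json.object("name", "Arya", "house", "Stark")
                        .set("aliases", Json.array("No One", "Cat of the Canals"));
        // navigate and read
        String house = user.at("house").asString();
        // parse and serialize
        Json parsed = Json.read("{\"outcome\": \"failed\", \"year\": 2012}");
        System.out.println(house + " / " + parsed.at("year").asInteger());
        System.out.println(user);
    }
}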

Now there's a persistence layer for those JSON entities based on HyperGraphDB. Unlike "document-oriented" databases, the representation in HyperGraphDB is really graph-like, with all the expected consequences. It is described on the HyperGraphDB Json wiki page, and it is pretty stable, convenient and fun to work with.
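
Here is a hedged sketch of what using it looks like; retrieve and addTopLevel are part of the HGDB-mJson module, while the package name, the constructor and the database location are assumptions made for the sake of the example:

import mjson.Json;
import mjson.hgdb.HyperNodeJson;        // package name assumed
import org.hypergraphdb.HGEnvironment;
import org.hypergraphdb.HyperGraph;

public class JsonStorageDemo {
    public static void main(String[] args) {
        HyperGraph graph = HGEnvironment.get("/tmp/jsondb");    // database location is illustrative
        HyperNodeJson db = new HyperNodeJson(graph);            // constructor assumed to wrap an open graph
        db.addTopLevel(Json.object("entity", "project",
                                   "title", "Census data migration",
                                   "outcome", "cancelled"));
        // pattern-matching retrieval: first document with entity == "project"
        System.out.println(db.retrieve(Json.object("entity", "project")));
        graph.close();
    }
}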

If time permits, I will write a few posts in the form of a tutorial to show how to quickly build database-backed JSON REST services with mJson and HyperGraphDB. If you have a suggestion for an application domain to cover, please share it in a comment!

In the meantime, I invite you to give it a try! 

Regards,

Boris

Monday, June 11, 2012

HyperGraphDB 1.2 Beta now available

Kobrix Software is pleased to announce the release of HyperGraphDB version 1.2.

HyperGraphDB is a general purpose, free open-source data storage mechanism. Geared toward modern applications with complex and evolving domain models, it is suitable for semantic web, artificial intelligence, social networking or regular object-oriented business applications.

This release contains numerous bug fixes and improvements over the previous 1.1 release. A fairly complete list of changes can be found on the Changes for HyperGraphDB, Release 1.2 wiki page. Highlights include:

  1. Introduction of a new HyperNode interface together with several implementations, including subgraphs and access to remote database peers. The ideas behind it are documented in the blog post HyperNodes Are Contexts.
  2. Introduction of a new interface HGTypeSchema and generalized mappings between arbitrary URIs and HyperGraphDB types.
  3. Implementation of storage based on the BerkeleyDB Java Edition (many thanks to Alain Picard and Sebastian Graf!). This version of BerkeleyDB doesn't require native libraries, which makes it easier to deploy and, in addition, performs better for smaller datasets (under 2-3 million atoms).
  4. Implementation of parameterized pre-compiled queries for improved query performance. This is documented in the Variables in HyperGraphDB Queries blog post.

HyperGraphDB is a Java based product built on top of the Berkeley DB storage library.

Key Features of HyperGraphDB include:
  • Powerful data modeling and knowledge representation.
  • Graph-oriented storage.
  • N-ary, higher order relationships (edges) between graph nodes.
  • Graph traversals and relational-style queries.
  • Customizable indexing.
  • Customizable storage management.
  • Extensible, dynamic DB schema through custom typing.
  • Out of the box Java OO database.
  • Fully transactional and multi-threaded, MVCC/STM.
  • P2P framework for data distribution.
In addition, the project includes several practical domain specific components for semantic web, reasoning and natural language processing. For more information, documentation and downloads, please visit the HyperGraphDB Home Page.

Sunday, May 20, 2012

Variables in HyperGraphDB Queries

Introduction
A feature that has been requested on a few occasions is the ability to parameterize HyperGraphDB queries with variables, much as the JDBC PreparedStatement allows. That feature has now been implemented with the introduction of named variables. Named variables can be used in place of regular condition parameters and can contain arbitrary values, whatever is expected by the condition. The benefits are twofold: increased performance, by avoiding recompiling queries where only a few condition parameters change, and cleaner code with less query duplication. The performance gain from precompiled queries is quite tangible and shouldn't be neglected. Nearly all predefined query conditions support variables, except for the TypePlusCondition, which is expanded into a type disjunction during query compilation and therefore cannot easily be precompiled into a parameterized form. It must also be noted that some optimizations made during the compilation phase, in particular those related to index usage, can't be performed when a condition is parameterized. A simple example of this limitation is a parameterized TypeCondition: indices are defined per type, so if the type is not known during query analysis, no index will be used.

The API
The API is designed to fit the existing query condition expressions. A variable is created by calling the hg.var method and passing the result as a condition parameter. For example:

HGQuery<HGHandle> q = hg.make(HGHandle.class, graph).compile(
    hg.and(hg.type(typeHandle),hg.incident(hg.var("targetOfInterest"))));

This will create a query that looks for links of a specific type that point to a given atom, parameterized by the variable "targetOfInterest". Before running the query, one must set a value for the variable:

q.var("targetOfInterest", targetHandle);
q.execute();

It is also possible to specify an initial value for a variable as the second argument of the hg.var method:


q = hg.make(HGHandle.class, graph).compile(
    hg.and(hg.type(typeHandle),hg.incident(hg.var("targetOfInterest", initialTarget))));
q.execute();

Note the new hg.make static method: it takes the result type and the graph instance and returns a query object that doesn't have a condition attached to it yet. To attach a condition, one must call the HGQuery.compile method immediately after make. This version of make establishes a variable context attached to the newly constructed query. All query conditions constructed after the call to make and before the call to compile will implicitly assume their variables belong to that context. That is why, to avoid surprises, you should call compile right after make.

Once a query is thus created, one can execute it many times, assigning new values to the variables through the query's var method as shown above.
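
Putting the pieces together, a single compiled query can serve many executions, re-binding only the variable between runs. A sketch, assuming the graph, the type handle and the list of target handles already exist:

import java.util.List;
import org.hypergraphdb.HGHandle;
import org.hypergraphdb.HGQuery;
import org.hypergraphdb.HGQuery.hg;
import org.hypergraphdb.HGSearchResult;
import org.hypergraphdb.HyperGraph;

public class VariableReuse {
    // Compile the parameterized query once, then execute it for each target,
    // re-binding only the "targetOfInterest" variable between runs.
    static void printIncidentLinks(HyperGraph graph, HGHandle typeHandle, List<HGHandle> targets) {
        HGQuery<HGHandle> q = hg.make(HGHandle.class, graph).compile(
            hg.and(hg.type(typeHandle), hg.incident(hg.var("targetOfInterest"))));
        for (HGHandle target : targets) {
            q.var("targetOfInterest", target);
            HGSearchResult<HGHandle> rs = q.execute();
            try {
                while (rs.hasNext())
                    System.out.println(target + " <- " + rs.next());
            }
            finally {
                rs.close();
            }
        }
    }
}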

The Implementation
The requirement of calling HGQuery.compile right after the make method comes from the fact that the existing API constructs conditions in a static environment. The condition is fully constructed before it is passed to the compile method. So to make a variable context available during the condition construction process (the evaluation of the query condition expression), we do something relatively unorthodox here and attach that context (essentially a name/value map) as a thread-local variable in the make method. The alternative would have been to create a traversal of the condition expression that collects all variables used within it. Either that traversal process would have to know about all condition types, or the HGQueryCondition interface would have to be enhanced with some extra interfaces for "composite conditions" and "variables used within a condition". I didn't like either possibility. Having a two-step process of calling hg.make(...).compile(...) seems clean and concise enough.

Variables are implemented through a couple of new interfaces: org.hypergraphdb.util.Ref and org.hypergraphdb.util.Var. When a condition is created with a constant value (instead of a variable), the org.hypergraphdb.util.Constant implementation of the Ref interface is used. All conditions now contain a Ref<T> private member variable where before they contained a member variable of type T. I won't say more about those reference abstractions, except that one should avoid using them unless one is implementing a custom condition... they are considered internal.

On a side note, this idea of creating a reference abstraction essentially implements the age-old strategy of introducing another level of indirection in order to solve a design problem. I've used this concept in many places now, including for things like caching and dealing with complex scopes in bean containers (Spring), so I spent some time trying to create a more general library around it. Originally I intended to create that library as a separate project and make use of it in HyperGraphDB. But I didn't come up with something satisfactory enough, so instead I created the couple of interfaces mentioned above. Eventually, however, those interfaces and classes may be replaced by a separate library and ultimately removed from the HyperGraphDB codebase.

Saturday, January 1, 2011

HyperNodes Are Contexts

There are a couple of different items on HyperGraphDB's road map that, upon a little reflection, seem to converge on a single abstraction. That abstraction I would call a hypernode, following the hypernode model described in the paper A Graph-Based Data Model and its Ramifications, though it remains to be seen how much of that original model applies in the HyperGraphDB context. The items are the following: (1) local proxies to remote database instances; (2) RAM-only atoms; (3) nested graphs. This blog is about laying out the design space of how those features should be dealt with. Let's examine each of these future features in order.

[1] Local DB Proxies

It would be nice to be able to access a db instance on some remote machine as if it were a local db. Only local databases can be currently opened and there is a good portion of the API and the way it is used that could work all the same when the data is remote. One could imagine opening remote graphs like this:

HyperGraph remoteGraph = HGEnvironment.get("peer:RemotePeerName");

and then work with it as if it were on the local hard drive. This almost makes sense. Most API objects dealing with database access and management, such as the HyperGraph instance itself, the index manager, the type system and the low-level store, could be manipulated remotely with the exact same semantics. However, this would be a real stretch of their original intent. For example, access to low-level storage is meant to support type implementations, in particular predefined types. Would you want a type implementation at machine A to manage storage on machine B? Similarly for the index manager or the event manager - would you want machine A to manage indexing at machine B? A lot of the methods of the HyperGraph class itself are overloaded methods whose sole purpose is to simplify some API calls by delegating to a "full version" of a method. In other words, going that route, we would have to RPC the whole API - an ugly solution with a heavy workload, and potentially overkill. Incidentally, I'm curious whether there's a Java instrumentation tool that would load a class library but dynamically turn every object into a local proxy to a remote instance. Anybody know of such a tool? It should be doable and perhaps useful for many projects... I only found this paper, which is pretty new, and by the looks of the complexities and limitations involved, it's no wonder it hasn't been done before.

Bottom line: remote access to a database instance should be carefully crafted as a separate abstraction that focuses on what's important, an abstraction working at the atom level - CRUD operations, queries and traversals. A couple of design problems here:
(a) What would the interface to a remote instance look like?
(b) How to deal with atom serialization and differences in type systems? The latter problem is addressed to some extent in the current P2P framework.
(c) Does the client of a remote instance need a local database? The easy answer to that is "no, it shouldn't", but that might mean a much harder implementation!

[2] RAM-Only Atoms

In certain applications, it would be useful to work with a hypergraph structure that is not necessarily, or not entirely, persisted on disk. For example, when HyperGraphDB is used as a convenient representation to apply some algorithms to, or when some data is transient and attached to a particular user session in a server-side environment. Or when one wants a client to a HGDB "server" as in the local db proxy described above. Hence it would be convenient to add atoms to the graph without storing them on disk. Such atoms would participate in the graph structure as full-fledged citizens - they would be indexed and queried in the normal way, except that they would only reside in the cache. From an implementation standpoint, the difficulty will be in presenting a unified and coherent view of disk and RAM data. If RAM atoms are to be truly RAM-only, database indices cannot be used to index them. Hence, RAM-only indices must be maintained and dynamically intersected with disk indices during querying. Moreover, it should also be possible to perform queries restricted to atoms in main memory only.

[3] Nested Graphs/Subgraphs

This is what the hypernode model is more or less about. A hypernode is a node in a graph that is itself a graph, composed of atomic nodes and possibly other hypernodes. The generalization is roughly analogous to the generalization of standard graph edges to hyperedges. Thus we have graphs nested within graphs, recursively. A very natural application of the original hypernode model is nested object structures. Within HyperGraphDB, however, we already get nested value structures from the primitive storage graph and we are mostly interested in scoping.

As a side note, scoping and visibility are terms used in programming language theory and are sometimes confused because scoping usually determines the visibility of identifiers. A scope in programming is essentially the context within which things like variables, values and general expressions live (and have meaning/validity). When speaking about knowledge representation, the scoping problem is referred to as the "context problem" - how do you represent contextual knowledge? It is not unusual for the AI community to hijack such terms and basically strip them of their deeper meaning in order to claim a handle on a problem that is otherwise much more difficult, or simply sound sexier in a rather silly way. Contextuality and context-dependent reference resolution is at the core of all of computing (and hence knowledge representation), but that's the subject of another topic, perhaps a whole book that I might write if I become knowledgeable enough and if I happen to stumble upon more interesting insights. It is much deeper, I believe, than scopes and their nesting. Incidentally, another example of such an abused term is the word semantic as in the "semantic web" or "semantic relations" or "semantic search" or what have you. There's nothing semantic about those things, only actual computer programs are semantic in any meaningful sense of the word, and not the representation they work on.

Nevertheless, speaking of "context" when one means "scope" seems common, so I'll use that term, albeit with great reluctance. To represent contextual information in HyperGraphDB, we need some means to state that a certain set of atoms "belongs to" a certain context. A given atom A can live in several different contexts at a time. One way to do that is to manage that set explicitly and say that a context is a set of atoms. Thus a context is a special entity in the database, an explicitly stored, potentially very large set of atoms. If that special entity is an atom in itself, then we have a nested graph, a hypernode. Another way would be to create an atom C that represents a context (i.e. a hypernode) and then link every member A of that context to C with a special-purpose ContextLink. I think this is what they do in OpenCog's atomspace now.

The two approaches have different implications. In the former, all atoms belonging to a context are readily available as an ordered set that can efficiently participate in queries, but given an atom there's no easy way to find all contexts it belongs to (clearly, we should allow for more than one). In the latter approach, getting all atoms in a context requires an extra hop from the ContextLink to the atom target, but then retrieving the contexts an atom belongs to is not much harder. In addition, if we maintain a TargetToTargetIndexer on ContextLinks, we avoid that extra hop of going from the context C through the ContextLink to the enclosed atom A.

The choice between the two alternatives is not obvious. Going the ContextLink route has the clear advantage of building on top of what already exists. On the other hand, nested graphs seem fundamental enough and close enough to a storage abstraction to deserve native support, while a representation through "semantic" ContextLinks should perhaps be left to application-specific notions of contextuality, be they more complex or more ad hoc.
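
To make the ContextLink alternative concrete, here is a hedged sketch: ContextLink itself is hypothetical, while HGPlainLink and the hg query API are existing HyperGraphDB facilities.

import java.util.ArrayList;
import java.util.List;
import org.hypergraphdb.HGHandle;
import org.hypergraphdb.HGPlainLink;
import org.hypergraphdb.HGQuery.hg;
import org.hypergraphdb.HyperGraph;

public class ContextLink extends HGPlainLink {
    // Convention for this sketch: target 0 is the context atom C, target 1 the member atom A.
    public ContextLink(HGHandle... targets) { super(targets); }
    public HGHandle getContext() { return getTargetAt(0); }
    public HGHandle getMember()  { return getTargetAt(1); }

    // Collect the members of a context by querying for all ContextLinks incident to it.
    public static List<HGHandle> membersOf(HyperGraph graph, HGHandle context) {
        List<HGHandle> result = new ArrayList<HGHandle>();
        for (Object obj : hg.getAll(graph, hg.and(hg.type(ContextLink.class), hg.incident(context)))) {
            ContextLink link = (ContextLink) obj;
            if (link.getContext().equals(context))   // incident matches either target position
                result.add(link.getMember());
        }
        return result;
    }
}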

Note that if we had direct support for unordered links in HyperGraphDB, i.e. hyperedges as defined in mathematical hypergraphs, a nested graph context could simply be a hyperedge. Now that the HGLink interface is firmly established as a tuple, it will be hard to add such unordered set links from an API standpoint. However, support for unordered hypergraphs can be achieved through hypernodes-as-sets-of-atoms with their own interface. On this view, a hypernode can be seen either as an unordered link or as a nested graph. Seeing it as an unordered link, however, implies that we'd have to maintain incidence sets that include both standard HGLinks and hypernodes. This in turn means we may run into trouble with the semantics of link querying: does hg.link(A), for example, include only HGLinks pointing to A, or hypernodes as well? If only links, then we'd have to filter out the hypernodes, thus degrading the performance of existing queries. If hypernodes as well, existing queries that expect only links could break on a runtime typecast. Issues to keep in mind...

Finally, note that this whole analysis takes atoms within nested graphs to be first and foremost members of the whole graph. The universe of atoms remains global and an atom remains visible from the global HyperGraph regardless of what nested graphs it belongs to. Hence a nested graph is more like a materialized subgraph. The alternative, where each nested graph lives in its own key-value space, would make it very hard to share atoms between nested graphs. So this limits the scoping function of nested graphs, because a given HGHandle will always refer to the same entity regardless of what nested graph it belongs to. To implement full-blown scoping, we'd need a HGHandle to refer to different things in different contexts. Do we want that capability? Should we reserve it for a higher level where we are resolving actual symbolic names (like type names, for example) rather than "low-level" handles? Perhaps it wouldn't be that hard to contextualize handles, with an extra performance cost of course: a hypernode would simply have more involved 'get' and 'put' operations where the handle is potentially translated to another handle (maybe in another context) in a chain of reference resolutions ultimately bottoming out in the global handle space. But that issue is orthogonal to the ability to represent nested graphs.

Commonality

So what's common between those three features? All of them are about viewing a graph, or a portion thereof, that is not necessarily identical to the database stored on disk, yet is a hypergraph in its own right, and one wants to be able to operate within the context that it represents. There must be a common abstraction in there. Since the name HyperGraph is already taken by the database instance itself, I'd propose to call it HyperNode. A HyperNode would strip down to the essential operations of the HyperGraph class, which would in turn implement the HyperNode interface. This HyperNode interface will replace HyperGraph as the entry point to a graph in many places, be it for querying, traversals or CRUD operations.

So what API should a HyperNode offer? Here's a proposal:

  • get, add, define, replace, remove atoms - but the question is: only the full versions of those methods or all overloaded versions?
  • getIncidenceSet(atomHandle)
  • getType(atomHandle)
  • find(HGQuery)
  • getLocation() - an abstract URI-like location identifying the HyperNode

Those are a bare minimum. Access to an index manager, event manager and cache is clearly outside the scope of this abstraction. However, the question of transactions and the type system is a bit trickier. Being able to enclose operations within a transaction is usually important, but a remote peer might not want to support distributed transactions, while local data (RAM or a nested graph) can be transacted upon from the main HyperGraph instance. As for access to the HGTypeSystem, it is mostly irrelevant except when you need it.
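
Rendered as code, the proposal might look roughly like this; it is a sketch only, and the interface that eventually ships may differ in names and signatures:

import org.hypergraphdb.HGHandle;
import org.hypergraphdb.HGQuery;
import org.hypergraphdb.HGSearchResult;
import org.hypergraphdb.IncidenceSet;

public interface HyperNode {
    <T> T get(HGHandle handle);
    HGHandle add(Object atom);
    void define(HGHandle handle, Object atom);
    boolean replace(HGHandle handle, Object atom);
    boolean remove(HGHandle handle);
    IncidenceSet getIncidenceSet(HGHandle atomHandle);
    HGHandle getType(HGHandle atomHandle);
    <T> HGSearchResult<T> find(HGQuery<T> query);
    String getLocation();   // an abstract URI-like location identifying the HyperNode
}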

The semantics of HyperNode's operations seem intuitively well-defined: do whatever you'd do with a full HyperGraph, except do it within that specific context. For example, an InMemoryHyperNode would add, get, query etc. only the RAM-based atoms. Similarly, a RemoteHyperNode would only operate on remote data. A NestedHyperNode is a bit harder to define: an add will modify the global database and mark the new atom as part of the nested graph, and a query or a get will only return data that's marked as part of the nested graph; but what about a remove? Should it just remove from the nested graph, or remove globally as well? Or try to do the right thing, as in "remove globally only if it's not connected to anything outside of its nested graph"?

Ok, enough. A brain dump of sorts, in need of comments pointing out potential problems and/or offering encouragement that this is on the right track :)

Cheers,

Boris