<h1>Kobrix Software, Official Blog</h1>In this blog, we (<a href="http://www.kobrix.com">http://www.kobrix.com</a>) mix news and technical postings, ranging from nearly trivial tips that might be useful to other programmers, to rants about specific technologies, to more philosophical speculations on computing in general.<br />
<h1>Seco 0.8.0 Release (2017-05-29)</h1><div dir="ltr" style="text-align: left;" trbidi="on">
There is a new release of Seco - 0.8.0. This is yet another incremental improvement of the first-of-its-kind JVM-based notebook environment.<br />
<div>
<br /></div>
<div>
Link: <a href="https://github.com/bolerio/seco/releases/tag/v0.8.0" target="_blank">https://github.com/bolerio/seco/releases/tag/v0.8.0</a></div>
<div>
<br /></div>
<div>
The release is intended to lower the entry barrier further by bundling several languages in the build out of the box, in addition to the JScheme & Beanshell old timers - most notably a new system for symbolic mathematics, similar to Mathematica!</div>
<div>
<ul style="text-align: left;">
<li>Groovy - the latest version (at the time of this writing, of course) is bundled: 2.4.11. There is more work to be done here; defining classes is not supported yet. Groovy's internals offer several different ways to compile or interpret code, and the internal architecture overall is organized around compilation. Even its own REPL had to be patched to allow global variable declarations in a more interactive environment. But Groovy has gained lots of ground and several useful DSLs are built on top of it, so it would be worth spending more time to support it better.</li>
<li>JRuby - version 1.2.0 is bundled (shouldn't be hard to upgrade, probably in the next release). JRuby apparently beats native Ruby on several benchmarks and supports almost the full language (except things like call/cc, which is to be expected on the JVM), so why one doesn't hear about it more is unclear. </li>
<li>JPython - Version 2.5.1 bundled, nothing has changed here.</li>
<li>Prolog (TuProlog) - <i>Sans changement</i> as well...works well!</li>
<li>JavaScript (Rhino) - No changes here either. However, given recent JavaScript implementations for the JVM, an update is in order.</li>
<li>Symja (Symbolic Mathematics) - <b>this is a new biggy</b>! The Seco notebook UI is obviously inspired by (to the point of being a clone of) the Mathematica notebooks. This project brings symbolic mathematics to Seco. It has its own syntax, though I believe it is possible to plug in a Mathematica-compatible syntax. Other goodies such as LaTeX formula display and a plotter for mathematical graphics are within the project scope, but need better integration. The Symja project is by Axel Kramer and can be found at <a href="https://bitbucket.org/axelclk/symja_android_library" target="_blank">https://bitbucket.org/axelclk/symja_android_library</a>. As I understand it, there is a version that works on the Android OS and Axel is mostly focused on maintaining that at the moment. Here are some examples of Symja in action:</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvzrBe6CFKV0sX5LfjUNfNdG4iNPqpXcC-aQtam39tBkZcvxP8swI_phNvZFWbyhvyDR2KOO5OVuvSfbnoutQm3XOqHlUypbhTGtVu8rJbzsBt4G-Rcc8jNJ8Aqmn0vCV1nczxK9eSWh8/s1600/secosymja.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="693" data-original-width="985" height="449" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvzrBe6CFKV0sX5LfjUNfNdG4iNPqpXcC-aQtam39tBkZcvxP8swI_phNvZFWbyhvyDR2KOO5OVuvSfbnoutQm3XOqHlUypbhTGtVu8rJbzsBt4G-Rcc8jNJ8Aqmn0vCV1nczxK9eSWh8/s640/secosymja.png" width="640" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Besides that, a few bugs have been fixed, one of which was very annoying - it prevented long lines in exceptions from wrapping within output cells. If you have used Seco for a while, you have certainly run into that bug...well, now it's gone. </div>
</div>
<div>
<br /></div>
<div>
Another small new feature: one can set a default language for the whole niche via the new <i>Niche</i> menu. Also, the XMPP support for a chat room and for exchanging notebooks is working again (from the Networks menu).</div>
<div>
<br /></div>
<div>
After being dormant for more years than I care to admit, the project seems to be getting some attention lately!</div>
<div>
<br /></div>
<div>
Cheers,</div>
<div>
Boris</div>
<div>
<br /></div>
<div>
<br /></div>
</div>
<h1>Prolog, OWL and HyperGraphDB (2017-02-25)</h1><div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
One of the more attractive aspects of <a href="http://hypergraphdb.org/" target="_blank">HyperGraphDB</a> is that its model is so general and its API so open, all the way down to the storage layer, that many (meta) models can be implemented naturally and efficiently on top of it. Not only that, but those meta models coexist and can form interesting computational synergies when implementing actual business applications. By meta model here I mean whatever formalism or data structure one uses to model the application domain, e.g. <a href="https://www.w3.org/RDF/" target="_blank">RDF</a> or <a href="http://www.uml.org/what-is-uml.htm" target="_blank">UML</a> or whatever. As I've pointed out elsewhere, there is a price to pay for this generality. For one, it's very hard to answer the question "what is the model of HyperGraphDB? is it a graph? is it a relational model? is it object-oriented?". The answer is: it is whatever you want it to be. Just as there is no one correct domain model for a given application, there is no one correct meta-model. Now, the notion of <i>multi-model</i> or <i>polyglot</i> databases is becoming somewhat of a buzzword. It naturally attracts attention because just as you can't solve all computational problems with a single data structure, you can't implement all types of applications with a single data model. The design of HyperGraphDB recognized that very early. As a consequence, meta models like <a href="https://en.wikipedia.org/wiki/First-order_logic" target="_blank">FOL (first-order logic)</a> as implemented in <a href="https://en.wikipedia.org/wiki/Prolog" target="_blank">Prolog</a> and <a href="https://en.wikipedia.org/wiki/Description_logic" target="_blank">DL (description logic)</a> as realized in the <a href="https://www.w3.org/TR/owl2-overview/" target="_blank">OWL 2.0</a> standard can become happy teammates in tackling difficult knowledge representation and reasoning problems. In this post, we will demonstrate a simple integration between the <a href="http://apice.unibo.it/xwiki/bin/view/Tuprolog/WebHome" target="_blank">TuProlog interpreter</a> and the OWL HyperGraphDB module.</div>
<br />
We will use <a href="https://github.com/bolerio/seco" target="_blank">Seco</a> because it is easier to share sample code in a notebook. And besides, if you are playing with HyperGraphDB, Seco is a great tool. If you've seen some of the more recent notebook environments like Jupyter and the like, the notion should be familiar. A tarball with all the code, both the notebook and a Maven project, can be found at the end of this post.<br />
<br />
<h2 style="text-align: left;">
Installing and Running Seco with Prolog and OWL</h2>
<br />
First, download Seco from:<br />
<br />
<a href="https://github.com/bolerio/seco/releases/tag/v0.7.0">https://github.com/bolerio/seco/releases/tag/v0.7.0</a><br />
<br />
then unzip the tarball and start it with the run.sh (Mac or Linux) or run.cmd (Windows) script. You should see a Welcome notebook. Read it - it'll give you an intro to the tool.<br />
<br />
Seco comes pre-bundled with a few mainstream and not-so-mainstream scripting languages such as BeanShell, JavaScript and JScheme. To add Prolog to the mix, go ahead and download:<br />
<br />
<a href="https://github.com/bolerio/seco/releases/download/v0.7.0/seco-prolog-0.7.0.jar">https://github.com/bolerio/seco/releases/download/v0.7.0/seco-prolog-0.7.0.jar</a><br />
<br />
put the jar in SECO_HOME/lib and restart Seco. It should pick up the new language automatically. To test it:<br />
<br />
<ol style="text-align: left;">
<li>Open a new notebook</li>
<li>Right-click anywhere inside and select "Set Default Language" as Prolog</li>
</ol>
You can now see the Prolog interpreter in action by typing for example:<br />
<code><br />
father('Arya Stark', 'Eddard Stark').<br />
father(computing, 'Alan Turing').<br />
</code><br />
<br />
Evaluate the cell by hitting Shift+Enter - because of the dot at the end of each statement, they will be added as new facts. Then you can query them by typing:<br />
<br />
<code> father(X, Y)?<br />
</code><br />
<br />
Again, evaluate with Shift+Enter - because of the question mark at the end, it will be treated as a query. The output should be a small interactive component that allows you to iterate through the possible solutions of the query. There should be two solutions, one for each declared father.<br />
<br />
So far so good. Now let's add OWL. The jar you need can be found in the HyperGraphDB Maven repo:<br />
<br />
<a href="http://hypergraphdb.org/maven/org/hypergraphdb/hgdbowl/1.4-SNAPSHOT/hgdbowl-1.4-20170225.063004-9.jar">hgdbowl-1.4-20170225.063004-9.jar</a><br />
<br />
You can just put this jar under SECO_HOME/lib and restart. For simplicity, let's just do that.<br />
<blockquote class="tr_bq">
<i>Side Note: You can also add it to the runtime context classpath (for more see </i><br />
<i><a href="https://github.com/bolerio/seco/wiki/Short-Seco-Tutorial">https://github.com/bolerio/seco/wiki/Short-Seco-Tutorial</a></i><i>). This notion of a runtime context in Seco is a bit analogous to an application in a J2EE container: it has its own class loader with its own runtime dependencies etc. Just like with a Java web server, jars in the lib directory are available to all runtime contexts, while jars added to a runtime context are available only to that context. Seco creates a default context so you always have one. But you can of course create others.</i></blockquote>
With the OWL module installed, we can load an ontology into the database. The sample ontology for this blog can be found at:<br />
<br />
<a href="https://gist.github.com/bolerio/523d80bb621207e87ff0024a96616255">https://gist.github.com/bolerio/523d80bb621207e87ff0024a96616255</a><br />
<br />
Save that file as owlsample-1.owl somewhere on your machine. To load it, open another notebook, this time without changing the default language (which will be BeanShell). Then you can load the ontology with the following code, which you should copy and paste into a notebook cell and evaluate with Shift+Enter:<br />
<code><br />
import org.hypergraphdb.app.owl.*;<br />
import org.semanticweb.owlapi.model.*;<br />
<br />
File ontologyFile = new File("/home/borislav/temp/owlsample-1.owl");<br />
<br />
HGDBOntologyManager manager = HGOntologyManagerFactory.getOntologyManager(niche.getLocation());<br />
HGDBOntology ontology = manager.importOntology(IRI.create(ontologyFile), new HGDBImportConfig());<br />
System.out.println(ontology.getAxioms());<br />
</code><br />
<br />
The printout should spit some axioms onto your console. If that works, you have the Prolog and OWL modules running in a HyperGraphDB database instance. In this case, the database instance is the Seco <i>niche</i> (code and notebooks in Seco are automatically stored in a HyperGraphDB instance called the <i>niche</i>). To do it outside Seco, take a look at the annotated Java code in the sample project linked to in the last section.<br />
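For a quick idea of what the standalone (non-Seco) version looks like, here is a minimal sketch; the database directory and ontology file path are placeholders of your choosing, and the fully annotated version lives in the sample project linked at the end:<br />
<pre><code>import java.io.File;
import org.hypergraphdb.HGEnvironment;
import org.hypergraphdb.HyperGraph;
import org.hypergraphdb.app.owl.*;
import org.semanticweb.owlapi.model.*;

public class LoadOntology
{
    public static void main(String[] args) throws Exception
    {
        // Open (or create) a standalone HyperGraphDB instance instead of the Seco niche.
        HyperGraph graph = HGEnvironment.get("/var/hgdb/owlsample");
        // The OWL manager is obtained from the database location, just as with the niche above.
        HGDBOntologyManager manager =
            HGOntologyManagerFactory.getOntologyManager(graph.getLocation());
        HGDBOntology ontology = manager.importOntology(
            IRI.create(new File("/home/borislav/temp/owlsample-1.owl")),
            new HGDBImportConfig());
        System.out.println(ontology.getAxioms());
    }
}
</code></pre>
<br />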
<br />
<h2 style="text-align: left;">
Prolog Reasoning Over OWL Axioms</h2>
<br />
So now suppose we want to query OWL data from Prolog, just as any other Prolog predicate, with the ability to backtrack etc. All we need to do is write a HyperGraphDB query expression and associate it with a Prolog predicate. For example, a query expression that will return all OWL object property assertion axioms looks like this:<br />
<code><br />
hg.typePlus(OWLObjectPropertyAssertionAxiom.class)<br />
</code><br />
<br />
The reason we use <i>typePlus</i> (meaning atoms of this type and all its sub-types) is that the concrete type of OWL object property axioms will be HGDB-implementation dependent and it's good to remain agnostic of that. Then for the actual binding to a Prolog predicate, one needs to dig into the TuProlog implementation internals just a bit. Here is how it looks:<br />
<br />
<code></code><br />
<pre><code>// the HGPrologLibrary is what integrates the TuProlog interpreter with HGDB
import alice.tuprolog.hgdb.HGPrologLibrary;

// Here we associate a database instance with a Prolog interpreter instance; that association
// is the HGPrologLibrary.
HGPrologLibrary lib = HGPrologLibrary.attach(niche,
                                             thisContext.getEngine("prolog").getProlog());

// Each library contains a map between Prolog predicates (name/arity) and arbitrary HyperGraphDB
// queries. When the Prolog interpreter sees such a predicate, it will invoke the HGDB query
// as a clause store.
lib.getClauseFactory().getPredicateMapping().put(
    "objectProperty/3",
    hg.typePlus(org.hypergraphdb.app.owl.model.axioms.OWLObjectPropertyAssertionAxiomHGDB.class))
</code></pre>
</div>
<br />
Evaluate the above code in the BeanShell notebook.<br />
<br />
With that in place, we can now perform a Prolog query to retrieve all OWL object property assertions from our ontology:<br />
<br />
<code><br />
objectProperty(Subject, Predicate, Object)?<br />
</code><br />
<br />
Evaluating the above Prolog expression in the Prolog notebook should open up that solution navigator and display the object property triples one by one.<br />
<br />
<b>Note:</b> we've been switching between the BeanShell and Prolog notebooks to evaluate code in two different languages. But you can also mix languages in the same notebook. The tarball of the sample project linked below contains a single notebook file called <i>prologandowl.seco</i> which you can <i>File->Import</i> into Seco and evaluate without the copy & paste effort. In that notebook cells have been individually configured for different languages.<br />
<h2 style="text-align: left;">
What's Missing?</h2>
<br />
A tighter integration between this trio of HyperGraphDB, Prolog and OWL would include the following missing pieces:<br />
<br />
1. Ability to represent OWL expressions (class and property) in Prolog<br />
2. Ability to assert & retract OWL axioms from Prolog<br />
3. Ability to invoke an OWL reasoner from Prolog<br />
4. Perhaps add a new term type in Prolog representing an OWL entity<br />
<br />
I promise to report when that happens. The whole thing would make much more sense if there were an OWL reasoning implementation included in the OWL-HGDB module, instead of the current RAM-limited approach of off-the-shelf reasoners like Pellet and HermiT.<br />
<br />
<b>Appendix - Annotated Non-Seco Version</b><br />
<b><br /></b>
You can find a sample standalone Java project that does exactly the above, consisting of a Maven pom.xml with the necessary dependencies here:<br />
<br />
<a href="http://www.kobrix.com/samples/hgdb-prologowl-sample.tgz">http://www.kobrix.com/samples/hgdb-prologowl-sample.tgz</a><br />
<br />
The tarball also contains a Seco notebook with the above code snippets that you can import to evaluate the cells and see the code in action. The OWL sample ontology is also in there. </div>
<h1>mJson 1.4.0 Released (2016-10-23)</h1><div dir="ltr" style="text-align: left;" trbidi="on">
It is with great pleasure that I hereby announce the official release of mJson 1.4.0. There have been a few improvements and bug fixes, which I list below, with explanations wherever they are needed. The next release will be 2.0, marking a move to Java 8 (a.k.a. the Scala killer).<br />
<br />
I am also announcing a new sibling project: <b>mjson-ext</b> at <a href="https://github.com/bolerio/mjson-ext">https://github.com/bolerio/mjson-ext</a>. The mjson-ext project will be sprinkled with modules implementing extras, including Java Beans serialization, integration with MongoDB, support for native JavaScript objects in embedded browsers and whatever else comes up. If you have some alternative implementation of the mJson interfaces that covers your specific use case, or you have some interesting module, I'd be grateful if you chose to share it so that others can benefit. Thanks!<br />
<br />
<b>Release Notes </b><br />
<br />
So, here is what mJson 1.4.0 is about.<br />
<ul style="text-align: left;">
<li><b>Side-effect free property accessor - </b>this is a case of a bad initial design, corrected after some soul searching. The property accessor is the <i>Json.at(name, defaultValue)</i> method, which returns the value of a property in a JSON object or the provided default value in case the property is absent. Prior to this version, when the object had no property with the given name, the accessor would not only return the default value, but it would also modify the object with <i>set(name, defaultValue)</i>. The idea behind that decision was that the default value would be the same in a given context, and whatever part of the code knows it should set it for other parts to use. And this works elegantly in many situations. But, as with all side effects, the charm of the magic eventually gave way to the dark forces of unexpected behaviors. In this particular case, JSON objects persisted in a database would be systematically polluted with all sorts of defaults that (1) make the objects bigger than necessary and (2) are actually invalid the next time around. And in general, there are cases where the default is valid in one context of use but not another. In sum, I can't believe I made this stupid decision. But I did. And now I'm taking it back. Apologies for the backward-subtly-incompatible <b>improvement</b>. (A short example of the new behavior follows after this list.)</li>
<li><b>java.io.Serializable</b> - The <i>Json</i> class is now <i>java.io.Serializable</i> so that applications relying on Java serialization don't hit a wall trying to use Json within a serializable class. This is useful for example when using mJson in a distributed environment that relies on serialization such as Apache Spark.</li>
<li><b>Parent references</b> - better handling of enclosing/parent <i>Json</i>. In case you missed it, mJson keeps track of the parent of each JSON element. And there is an <i>up()</i> function to access it. That works well when there is always a single parent and the JSON structure is a mere tree. But when we have a more complex <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph" target="_blank">DAG</a> structure, there can be multiple parents. This version handles this case better - the parent <i>Json</i> of an element will be an array in case there are multiple parents. Naturally, since an array member will have a <i>Json</i> array as its parent, that causes ambiguity with multiple parents: is my parent an array that is part of my structure, or do I have multiple parents? At the moment, mJson does not resolve said ambiguity for you, it's your own responsibility to manage that.</li>
<li><b>Merging</b> - when you have two JSON objects X and Y and you want to copy Y ('s properties) into X, there are a lot of potential use cases pertaining to the copy process. It's a bit like cloning a complex structure where you have to decide if you want a deep or a shallow clone. It's of consequence for mutable structures (such as <i>Json</i> objects or arrays) - a deep copy guarantees that mutating the clone won't affect the source, but is of course more costly to do. In the case of merging, you have to also be mindful of what's already in X - do you want to overwrite at the top-level only (shallow) or do you want to recursively merge with existing properties of X. If X and Y are arrays, the use cases are more limited, but one important one is when the arrays are sorted and we want the resulting array to maintain order. All this is covered in an improved <i>Json.with</i> method that accepts a set (to be expanded, of course based on your feedback, dear user) of options driving which part of the tree (ahem, dag) structure is to be cloned or not, merged or sorted or not. Take a look at <a href="https://github.com/bolerio/mjson/wiki/Deep-Merging">https://github.com/bolerio/mjson/wiki/Deep-Merging</a>.</li>
<li><b>Json.Schema</b> - processing JSON schemas may involve resolving URIs. If all is good and all URIs actually point to schemas accessible on the web, things work. But often those URIs are mere identifiers and the schemas are elsewhere, thus making it impossible for mJson to load a schema that makes use of other schemas that make use of other schemas that may or may not make use of the one before - ok, I'll stop. Hence, from now on you can roll your own URI resolution by providing a <i>Json.Function</i> - yes, the same stuff Java 8 has, but we are staying with Java 7 for this release so we have a soon-to-disappear special interface just for that.</li>
<li>...and of course there are some bug fixes, OSGI packaging has been added, and a previously private method is now exposed, namely the <i>Json.factory()</i> method.</li>
</ul>
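To illustrate the first item above, here is a minimal sketch of the new side-effect free accessor in action (the property names are made up):<br />
<pre><code>import mjson.Json;

Json person = Json.object().set("firstName", "Arya");
// As of 1.4.0, asking for a missing property with a default returns the default
// value without touching the object itself.
int age = person.at("age", 30).asInteger();
System.out.println(age);               // 30
System.out.println(person.has("age")); // false - the object was not modified
</code></pre>
<br />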
I love this little project. I always get positive feedback from its small user community. I will therefore share some extra goodies soon in the mjson-ext project and write some other fun functions in mJson itself to help with navigation and finding stuff.<br />
<br />
Cheers,<br />
Boris<br />
<ul style="text-align: left;">
</ul>
</div>
<h1>HBase, Spark and HDFS - Setup and a Sample Application (2015-12-30)</h1><div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://spark.apache.org/" target="_blank">Apache Spark</a> is a framework where the hype is largely justified. It is both innovative as a model for computation and well done as a product. People may be tempted to compare it with another framework for distributed computing that has become popular recently, <a href="http://storm.apache.org/" target="_blank">Apache Storm</a>, e.g. with statements like "Spark is for batch processing while Storm is for streaming". But those are entirely different beasts. Storm is a dataflow framework, very similar to the <a href="http://hypergraphdb.org/learn?page=DataFlow&project=hypergraphdb" target="_blank">HyperGraph DataFlow framework</a> for example, and there are others like it; it's based on a model that has existed at least since the 1970s, even though its authors seem to be claiming credit for it. Spark, on the other hand, is a novel approach to dealing with large quantities of data and complex, arbitrary computations on it. Note the "arbitrary" - unlike Map/Reduce, Spark will let you do anything with the data. I hope to post more about what's fundamentally different between something like Storm and Spark because it is interesting, theoretically. But I highly recommend reading <a href="https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf" target="_blank">the original paper describing RDD</a> (Resilient Distributed Dataset), which is the abstraction at the foundation of Spark.<br />
<br />
This post is just a quick HOWTO if you want to start programming against the popular tech stack trio made up of HDFS (the <a href="https://hadoop.apache.org/" target="_blank">Hadoop Distributed File System</a>), <a href="https://hbase.apache.org/" target="_blank">HBase</a> and <a href="https://spark.apache.org/" target="_blank">Apache Spark</a>. I might follow with the intricacies of cluster setup some time in the future. But the idea here is to give you just a list of minimal steps that will work (fingers crossed). Whenever I mention relative file paths, they are to be resolved within the main installation directory of the component I'm talking about: if I'm talking about Hadoop and I write 'etc/hadoop', I mean 'etc/hadoop' under your Hadoop installation, not '/etc/hadoop' in your root file system!<br />
<br />
All code, including sample data file and Maven POM can be found at the following Github GIST: <a href="https://gist.github.com/bolerio/f70c374c3e96495f5a69">https://gist.github.com/bolerio/f70c374c3e96495f5a69</a><br />
<br />
<b>HBase Install</b><br />
<br />
We start with HBase because in fact you don't even need Hadoop to develop locally with it. There is of course a standard package for HBase and the latest stable release (at the time of this writing) is 1.1.2. But that's not going to do it for us because we want Spark. There is an integration of Spark with HBase that is being included as an official module in HBase, but only as of the latest 2.0.0 version, which is still in an unstable SNAPSHOT state. If you are just starting with these technologies you don't care about having a production blessed version. So, to install HBase, you need to build it first. Here are the steps:<br />
<ol style="text-align: left;">
<li>Clone the whole project from GIT with:<br />
<i> git clone https://github.com/apache/hbase.git</i></li>
<li>Make sure you have the latest Apache Maven (3.0.5 won't work), get 3.3+, if you don't already have it.</li>
<li>Go to the Hbase project directory and build it with:<br />
<i>mvn -DskipTests=true install</i><br />That will put all HBase modules in your local Maven repo, which you'll need for a local Maven-based Spark project.</li>
<li>Then create an hbase distribution with:<br />
<i>mvn -DskipTests=true package assembly:single</i><br />
You will find tarball under <i>hbase-assembly/target/hbase-2.0.0-SNAPSHOT-bin.tar.gz</i>.</li>
<li>This is what you are going to unpack in your installation directory of choice, say /opt/hbase.</li>
<li>Great. Now you have to configure your new hbase installation (untar of the build you just created). Edit the file /opt/hbase/conf/hbase-site.xml to specify where Hbase should store its data. Use this template and replace the paths inside:<br />
<code><pre><configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///home/boris/temp/hbase</value>
</property>
</configuration>
</pre>
</code></li>
<li>Make sure you can start HBase by running:<br />
<i>bin/start-hbase.sh</i> (or .cmd)</li>
<li>Then start the HBase command line shell with:<br />
<i>bin/hbase shell</i><br />
and type the command <i>list</i> to view the list of all available tables - there will be none of course.</li>
</ol>
<b>Spark Development Setup</b><br />
<br />
Spark itself is really the framework you use to do your data processing. It can run in a cluster environment, with a server to which you submit jobs from a client. But to start, you don't actually need any special downloads or deployments. Just including Maven dependencies in your project will bring the framework in, and you can call it from a main program and run it in a single process.<br />
<br />
So assuming you built HBase as explained above, let's set up a Maven project with HBase and Spark. Then we'll get some sample data to play with and go over a sample application that makes use of two different approaches in Spark: the plain API and the SQL-like Spark module, which essentially gives you a scalable, distributed SQL query engine. We won't talk much about the latter here, in part because its HBase integration is sort of an undocumented work in progress. However, you can see an example in the sample code provided in the GIST!<br />
<br />
No point repeating the code in its entirety here, but I will show the relevant parts. The Maven pom only needs to contain the hbase-spark dependency:<br />
<code></code><br />
<pre><code><dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-spark</artifactId>
<version>${hbase.version}</version>
</dependency>
</code></pre>
<br />
That will pull in everything you need to load that into a local HBase instance and process it with Spark. That Spark SQL module I mentioned is a separate dependency only necessary for the SQL portion of the example:<br />
<br />
<pre><code><dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.5.1</version>
</dependency>
</code></pre>
<br />
There are two other auxiliary dependencies of the sample project: one for CSV parsing and the <a href="http://bolerio.github.io/mjson/" target="_blank">mJson library</a>, which you can see in the pom.xml file from the <a href="https://gist.github.com/bolerio/f70c374c3e96495f5a69" target="_blank">Github GIST</a>.<br />
<br />
<b>Playing with Some Data</b><br />
<br />
We will do some processing now with some Open Government data, specifically from Miami-Dade County's list of recently arrested individuals. You can get it from here: <a href="https://opendata.miamidade.gov/Corrections/Jail-Bookings-May-29-2015-to-current/7nhc-4yqn">https://opendata.miamidade.gov/Corrections/Jail-Bookings-May-29-2015-to-current/7nhc-4yqn</a> - export it in Excel CSV format as a file named jaildata.csv. I've already made the export and placed it in the <a href="https://gist.github.com/bolerio/f70c374c3e96495f5a69" target="_blank">GIST</a>, even though that's a bit of a large file. The data set is simple: it contains arrests for a big part of the year 2015. Each record/line has the date the arrest was made, where it was made, the name of the offender, their date of birth and a list of up to 3 charges. We could, for example, find out how popular the "battery" charge is for offenders under 30 compared to offenders over 30.<br />
<br />
The first step is to import the data from the CSV file to HBase (note that Spark could work very well directly from a CSV file). This is done in the ImportData.java program where, in a short main program, we create a connection to HBase, create a brand new table called <i>jaildata</i>, then loop through all the rows in the CSV file to import the non-empty ones. I've annotated the source code directly. The connection assumes a local HBase server running on a default port and that the table doesn't exist yet. Note that data is inserted as a batch of <i>put</i> operations, one per column value. Each put operation specifies the column family, column qualifier and the value, while the version is automatically assigned by the system. Perhaps the most important part in that uploading of data is how the row keys are constructed. HBase gives you complete freedom to create a unique key for each row inserted and it's sort of an art form to pick a good one. In our case, we chose the offender's name combined with the date of the arrest, the assumption being that the same person cannot be arrested twice in the same day (not a very solid assumption, of course).<br />
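Here is a rough sketch of the kind of put operations ImportData.java performs; the variable names, the row key separator and the (absent) error handling are illustrative rather than the exact code from the GIST:<br />
<pre><code>import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Connect to the local HBase instance and open the table created for the import.
Configuration conf = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("jaildata"));
// 'name', 'bookDate' and 'charge1' would come from one parsed CSV line.
// Row key: the offender's name combined with the booking date, as discussed above.
Put put = new Put(Bytes.toBytes(name + "_" + bookDate));
put.addColumn(Bytes.toBytes("arrest"), Bytes.toBytes("name"), Bytes.toBytes(name));
put.addColumn(Bytes.toBytes("arrest"), Bytes.toBytes("bookdate"), Bytes.toBytes(bookDate));
put.addColumn(Bytes.toBytes("charge"), Bytes.toBytes("charge1"), Bytes.toBytes(charge1));
table.put(put);
</code></pre>
<br />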
<br />
Now we can show off some Spark in action. The file is ProcessData.java, which is also annotated, but I'll go over some parts here.<br />
<br />
We have a few context objects and configuration objects that need to be initialized:<br />
<code></code><br />
<pre><code> private void init()
 {
     // Default HBase configuration for connecting to localhost on the default port.
     conf = HBaseConfiguration.create();
     // Simple Spark configuration where everything runs in process using 2 worker threads.
     sparkconf = new SparkConf().setAppName("Jail With Spark").setMaster("local[2]");
     // The Java Spark context provides a Java-friendly interface for working with Spark RDDs.
     jsc = new JavaSparkContext(sparkconf);
     // The HBase context for Spark offers general purpose methods for bulk read/write.
     hbaseContext = new JavaHBaseContext(jsc, conf);
     // The entry point interface for the Spark SQL processing module. SQL-like data frames
     // can be created from it.
     sqlContext = new org.apache.spark.sql.SQLContext(jsc);
 }
</code></pre>
<br />
<div class="p1">
<br />
A configuration object for HBase will tell the client where the server is etc.; in our case the default values for a local server work. Next, the Spark configuration gives it an application name and then tells it where the main driver of the computation is - in our case, we have a local in-process driver that is allowed to use two concurrent threads. The Spark architecture comprises a main driver and then workers spread around in a cluster. A location of a remote main driver might look something like this: <i>spark://masterhost:7077</i> (instead of "local[2]"). Then we create a Spark context, specifically a <i>JavaSparkContext</i>, because the original Scala API is not so easy to work with from Java - possible, just not very user-friendly. Then we create something called a <i>JavaHBaseContext</i>, which comes from the HBase-Spark module and knows how to talk to an HBase instance using the Spark data model - it can do bulk inserts and deletes, it can scan an HBase table as a Spark RDD and more. Finally, we create a context object representing an SQL layer on top of Spark data sets. Note that the SQL context object does not depend on HBase. In fact, different data sources can be brought under the same SQL processing API as long as there is some way to "relationalize" them. For example, there is a module that lets you process data coming from MongoDB as Spark SQL data as well. So in fact, you can have a federated data environment using a Spark cluster to perform relational joins between MongoDB collections and HBase tables (and flat files and ...). </div>
<div class="p1">
<br /></div>
<div class="p1">
Now, reading data from HBase is commonly done by scanning. You can perform some filtering operations, but there's no general purpose query language for it. That's the role of Spark and of other frameworks like <a href="https://phoenix.apache.org/" target="_blank">Apache Phoenix</a>, for example. Also, scanning HBase rows will give you binary values which need to be converted to the appropriate runtime Java type. So for each column value, you have to keep track yourself of what its type is and perform the conversion. The HBase API has a convenience class named <a href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Bytes.html" target="_blank">Bytes</a> that handles all basic Java types. To represent whole records, we use JSON as a data structure, so individual column values are first converted to JSON values with this utility method:</div>
<code></code><br />
<pre><code> static final <T> Json value(Result result, String family, String qualifier, BiFunction<byte[], Integer, T> converter)
 {
     Cell cell = result.getColumnLatestCell(Bytes.toBytes(family), Bytes.toBytes(qualifier));
     if (cell == null)
         return Json.nil();
     else
         return Json.make(converter.apply(cell.getValueArray(), cell.getValueOffset()));
 }
</code></pre>
<br />
<div class="p1">
Given an HBase <i>result</i> row, we create a JSON object for our jail arrests records like this:</div>
<pre><code>
static final Function<Result, Json> tojsonMap = (result) -> {
    Json data = Json.object()
        .set("name", value(result, "arrest", "name", Bytes::toString))
        .set("bookdate", value(result, "arrest", "bookdate", Bytes::toString))
        .set("dob", value(result, "arrest", "dob", Bytes::toString))
        .set("charge1", value(result, "charge", "charge1", Bytes::toString))
        .set("charge2", value(result, "charge", "charge2", Bytes::toString))
        .set("charge3", value(result, "charge", "charge3", Bytes::toString));
    return data;
};
</code></pre>
<br />
<div class="p1">
With this mapping from binary HBase rows to a runtime JSON structure, we can construct a Spark RDD for the whole table as JSON records:</div>
<pre><code>
JavaRDD<Json> data = hbaseContext.hbaseRDD(TableName.valueOf("jaildata"), new Scan())
.map(tuple -> tojsonMap.apply(tuple._2()));
</code></pre>
<br />
<div class="p1">
We can then filter or transform that data any way we want. For example:</div>
<div class="p1">
<br /></div>
<div class="p1">
data = data.filter(j -> j.at("name").asString().contains("John"));</div>
<div class="p1">
<br /></div>
<div class="p1">
would give us a new data set which contains only offenders named John. An instance of JavaRDD is really an abstract representation of a data set. When you invoke filtering or transformation methods on it, it will just produce an abstract representation of a new data set, but it won't actually compute the result. Only when you invoke what is called an <i>action</i> method, one that has to return something other than an RDD as its value, is the lazy computation triggered and an actual data set produced. For instance, <i>collect</i> and <i>count</i> are such action methods.</div>
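<div class="p1">
For instance, here is a rough sketch (not the exact code from the GIST) of how a count like the "Thefts" number below might be obtained; the substring check against the three charge columns is an assumption about how the charges are stored:</div>
<pre><code>// 'filter' is a lazy transformation; 'count' is the action that triggers the actual computation.
long thefts = data.filter(j ->
       j.at("charge1").toString().toLowerCase().contains("theft")
    || j.at("charge2").toString().toLowerCase().contains("theft")
    || j.at("charge3").toString().toLowerCase().contains("theft")).count();
System.out.println("Thefts: " + thefts);
</code></pre>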
<div class="p1">
<br />
Ok, good. Running ProcessData.main should output something like this:<br />
<code></code><br />
<pre><code>Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/29 00:41:00 INFO Remoting: Starting remoting
15/12/29 00:41:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.0.13:55405]
Older 'battery' offenders: 2548, younger 'battery' offenders: 1709, ratio(older/younger)=1.4909303686366295
Thefts: 34368
Contempts of court: 32</code></pre>
<br /></div>
<b>Hadoop</b><br />
<br />
To conclude, I will just show you how to use Hadoop/HDFS to store HBase data instead of the normal OS filesystem. First, download Hadoop from https://hadoop.apache.org/releases.html. I used version 2.7.1. You can unzip the tarball/zip file in a standard location for your OS (e.g. /opt/hadoop on Linux). Two configuration steps are important before you can actually start it:<br />
<ol style="text-align: left;">
<li>Point it to the JVM. It needs at least Java 7. Edit etc/hadoop/hadoop-env.sh (or hadoop-env.cmd) and change the line export JAVA_HOME=${JAVA_HOME} to point to the Java home you want, unless your OS/shell environment already does.</li>
<li>Next you need to configure where Hadoop will store its data and what will be the URL for clients to access it. The URL for clients is configured in the etc/hadoop/core-site.xml file:<br />
<pre><code><configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
</code></pre>
The location for data is in the etc/hadoop/hdfs-site.xml file. And there are in fact two locations: one for the Hadoop Namenode and one for the Hadoop Datanode:<br />
<pre><code>
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///var/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///var/dfs/data</value>
</property>
</configuration></code></pre>
</li>
<li>Before starting HDFS, you need to format the namenode (that's the piece that handles file names and directory names and knows what is where) by doing<br />
<i>bin/hadoop namenode -format</i></li>
<li>When starting hadoop, you only need to start the hdfs file system by running<br />
<i>sbin/start-dfs.sh</i><br />Make sure it has permissions to write to the directories you have configured.<br />
</li>
<li>If after hadoop has started, you don't see the admin web UI at http://localhost:50070, check the logs.</li>
<li>To connect HBase to hadoop you must change the hbase root directory to be an HDFS one:<pre><code> <property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
</code></pre>
</li>
<li>Restarting HBase now will bring you back to an empty database, as you can verify in the HBase shell with the 'list' command. So try running the import program and then the "data crunching" program to see what happens.<br />
</li>
</ol>
</div>
<h1>Announcing Seco 0.6 - Collaborative Scripting For Java (2015-11-11)</h1><div dir="ltr" style="text-align: left;" trbidi="on">
This is a short post to announce an incremental 0.6 release of Seco. The release comes with important bug fixes and a simplified first-time user experience.<br />
<br />
<ul style="text-align: left;">
<li>The project is officially now on Github: <a href="https://github.com/bolerio/seco">https://github.com/bolerio/seco</a></li>
<li>Release notes: <a href="https://github.com/bolerio/seco/wiki/Latest-Release">https://github.com/bolerio/seco/wiki/Latest-Release</a></li>
<li>Downloads: <a href="https://github.com/bolerio/seco/releases">https://github.com/bolerio/seco/releases</a></li>
<li>Installation: <a href="https://github.com/bolerio/seco/wiki/Installing-and-Starting-Seco">https://github.com/bolerio/seco/wiki/Installing-and-Starting-Seco</a></li>
</ul>
<br />
<div>
Seco is a collaborative scripting development environment for the Java platform. You can write code in many JVM scripting languages. The code editor in Seco is based on the Mathematica notebook UI, but the full GUI is richer and much more ambitious. In a notebook, you can mix rich text with code and output, including interactive components created by your code. This makes Seco into a live environment, because you can evaluate an expression and immediately see the changes to your program.<br />
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://camo.githubusercontent.com/59aadc791873d4d0da0446157e8b06acb4546982/687474703a2f2f6b6f627269782e636f6d2f696d616765732f7365636f66756c6c73686f742e706e67" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="512" src="https://camo.githubusercontent.com/59aadc791873d4d0da0446157e8b06acb4546982/687474703a2f2f6b6f627269782e636f6d2f696d616765732f7365636f66756c6c73686f742e706e67" width="640" /></a></div>
</div>
<h1>Scheduling Tasks and Drawing Graphs — The Coffman-Graham Algorithm (2015-08-31)</h1><div dir="ltr" style="text-align: left;" trbidi="on">
When an algorithm developed for one problem domain applies almost verbatim to another, completely unrelated domain, that is the type of insight, beauty and depth that makes computer science a science in its own right, and not a branch of something else, namely mathematics, as many professionals educated in the field mistakenly believe. For example, one of the common algorithmic problems during the 60s was the scheduling of tasks on multiprocessor machines. The problem is: you are given a large set of tasks, some of which depend on others, that have to be scheduled for processing on N processors in such a way as to maximize processor use. A well-known algorithm for this problem is the <a href="https://en.wikipedia.org/wiki/Coffman%E2%80%93Graham_algorithm">Coffman-Graham</a> algorithm. It assumes that there are no circular dependencies between the tasks, as is usually the case when it comes to real world tasks, except in catch-22 situations at some bureaucracies run amok! To model that, the tasks and their dependencies are represented as a DAG (a directed acyclic graph). In mathematics, this is also known as a partial order: if a task T1 depends on T2, we say that T2 precedes T1, and we write T2 < T1. The ordering is called partial because not all tasks are related by this precedence relation; some are simply independent of each other and can be safely carried out in parallel.<br />
<br />
The Coffman-Graham algorithm works by creating a sequence of execution rounds where at each round at most N tasks execute simultaneously. The algorithm also has to make sure that all dependencies of the current round have been executed in previous rounds. Those two constraints are what makes the problem non-trivial: we want exactly N tasks at each round of execution if possible, so that all processors get used, and we also have to complete all tasks that precede a given task T before scheduling it. There are 3 basic steps to achieving the objective:<br />
<br />
<br />
<ol style="text-align: left;">
<li>Clean up the graph so that only direct dependencies are represented. So if there is a task A that depends on B, and B depends on another task C, we already know that A depends "indirectly" on C (transitivity is one of the defining features of partial orders), so that dependency does not need to be stated explicitly. Sometimes the input of a problem will have such superfluous information, but in fact this could only confuse the algorithm! Removing the indirect dependencies is called transitive reduction, as opposed to the more common operation of transitive closure, which explicitly computes all indirect dependencies.</li>
<li>Order the tasks in a single list so that the dependent ones come after their dependencies and they are sort of evenly spread apart. This is the crucial and most interesting part of the algorithm. So how are we to compare two tasks and decide which one should be run first? The trick is to proceed from the starting tasks, the roots of the graph that don't have any dependencies whatsoever, and then progressively add tasks that depend only on them, and then tasks that depend only on those, etc. This is called a topological ordering of the dependency graph. There are usually many possible such orderings and some of them will lead to a good, balanced distribution of tasks for the purpose of CPU scheduling while others will leave lots of CPUs unused. In step (3) of the algorithm, we are just going to take the tasks one by one from this ordering and assign them to execution rounds as they come. Therefore, to make it so that at each round the number of CPUs used is maximized, the ordering must somehow space the dependencies apart as much as possible. That is, if the order is written as [T1, T2, T3, …, Tn] and if Tj depends on Ti, we want j-i to be as big as possible. Intuitively, this is desirable because the closer they are, the sooner we'd have to schedule Tj for execution after Ti, and since they can't be executed on the same parallel round, we'd end up with unused CPUs. To space the tasks apart, here is what we do. Suppose we have tasks A and B, with all their dependencies already placed in our ordering, and we have to pick which one is going to come next. From A's dependencies, we take the one most recently placed in the ordering and we check if it comes before or after the most recently placed task from B's dependencies. If it comes before, then we choose A; if it comes after, then we choose B. If it turns out A and B's most recently placed dependency is actually the same task that both depend on, we look at the next most recent dependency, etc. This way, by picking the next task as the one whose closest dependency is the furthest away, at every step we space out dependencies in our ordering as much as possible.</li>
<li>Assign tasks to rounds so as to maximize the number of tasks executed on each round. This is the easiest step - we just reap the rewards from doing all the hard work of the previous steps. Going through our ordering [T1, T2, T3, …, Tn], we fill up available CPUs by assigning the tasks one by one. When all CPUs are occupied, we move to the next round and start filling CPUs again. If, while at a particular round, the next task to be scheduled has a dependency that's also scheduled for that same round, we have no choice but to leave the remaining CPUs unused and start the next round. The algorithm does not take into account how long each task can potentially take. A small code sketch of steps 2 and 3 is shown right after this list.</li>
</ol>
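<div>
To make the steps concrete, here is a small, self-contained Java sketch of steps 2 and 3; it follows the informal description above (the input is assumed to be already transitively reduced and acyclic), rather than any particular published implementation:</div>
<pre><code>import java.util.*;

class CoffmanGraham
{
    // deps.get(t) holds the direct dependencies of task t; tasks are numbered 0..n-1.
    static List<List<Integer>> schedule(List<Set<Integer>> deps, int cpus)
    {
        int n = deps.size();
        int[] position = new int[n];            // position in the ordering, -1 if not yet placed
        Arrays.fill(position, -1);
        List<Integer> ordering = new ArrayList<>();
        // Step 2: topological ordering that spaces dependencies apart.
        while (ordering.size() < n)
        {
            int best = -1;
            List<Integer> bestKey = null;
            for (int t = 0; t < n; t++)
            {
                if (position[t] >= 0) continue;
                boolean ready = true;
                List<Integer> key = new ArrayList<>();   // positions of t's dependencies
                for (int d : deps.get(t))
                {
                    if (position[d] < 0) { ready = false; break; }
                    key.add(position[d]);
                }
                if (!ready) continue;
                key.sort(Collections.reverseOrder());    // most recently placed dependency first
                // prefer the task whose most recently placed dependency is as early as possible
                if (best < 0 || compare(key, bestKey) < 0) { best = t; bestKey = key; }
            }
            position[best] = ordering.size();
            ordering.add(best);
        }
        // Step 3: fill rounds greedily, closing a round early when a dependency sits in it.
        List<List<Integer>> rounds = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int t : ordering)
        {
            boolean depInRound = false;
            for (int d : deps.get(t)) depInRound = depInRound || current.contains(d);
            if (current.size() == cpus || depInRound)
            {
                rounds.add(current);
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) rounds.add(current);
        return rounds;
    }

    // lexicographic comparison of the descending dependency-position lists
    static int compare(List<Integer> a, List<Integer> b)
    {
        for (int i = 0; i < Math.min(a.size(), b.size()); i++)
            if (!a.get(i).equals(b.get(i)))
                return Integer.compare(a.get(i), b.get(i));
        return Integer.compare(a.size(), b.size());
    }
}
</code></pre>
<div>
For example, with two CPUs and three tasks A, B (no dependencies) and C (depending on both A and B), the schedule comes out as two rounds: [A, B] followed by [C].</div>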
<div>
Now, I said that this algorithm is also used to solve a completely different problem. The problem I was referring to is drawing networks in a visually appealing way. This is a pretty difficult problem and there are many different approaches whose effectiveness often depends on the structure of the network. When a network is devoid of cycles (paths from one node back to itself), the Coffman-Graham algorithm just described can be applied!</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6PW4KI_KC0RyXTY5CjvYrjUspWrsTgT9cvP7cGxEAqZQlf2Ypq4y_IAMbLbac4HfbPidA2D3VgxpkTUj-wu_E1kzcoNeWNA-ZTf3gFD7413FTwFeVLe1YwcPFStpffoehjLw-4RO4xq8/s1600/doc_coffmangraham_graph1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6PW4KI_KC0RyXTY5CjvYrjUspWrsTgT9cvP7cGxEAqZQlf2Ypq4y_IAMbLbac4HfbPidA2D3VgxpkTUj-wu_E1kzcoNeWNA-ZTf3gFD7413FTwFeVLe1YwcPFStpffoehjLw-4RO4xq8/s320/doc_coffmangraham_graph1.png" width="319" /></a></div>
The idea is to think of the network nodes as the tasks and of the network connections as the dependencies between the tasks, and then build a list of consecutive layers analogous to the task execution rounds. Instead of specifying a number of available CPUs, one specifies how many nodes per layer are allowed, which is generally convenient because the drawing is done on a fixed-width computer screen. Because the algorithm does not like circular dependencies, there is an extra step here to remove a select set of connections so that the network becomes a DAG. This is in addition to transitive reduction, where we only keep direct connections and drop all the rest. Once the algorithm is complete, the drawing of those provisionally removed connections can be performed on top of the nice layering produced. Thus, the Coffman-Graham algorithm is (also!) one of the <a href="https://cs.brown.edu/~rt/gdhandbook/chapters/hierarchical.pdf">hierarchical drawing algorithms</a>, a general framework for graph drawing developed by <a href="https://en.wikipedia.org/wiki/Kozo_Sugiyama">Kozo Sugiyama</a>.</div>
</div>
<h1>HyperGraphDB 1.3 Released (2015-07-28)</h1><div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://www.kobrix.com/">Kobrix Software</a> is pleased to announce the release of <a href="http://hypergraphdb.org/" target="_blank">HyperGraphDB</a> 1.3.<br />
<br />
This is a maintenance release containing many bugs fixes and small improvements. Most of the efforts in this round have gone towards the various application modules built upon the core database facility.<br />
<br />
<a href="http://hypergraphdb.org/?project=hypergraphdb&page=Downloads" target="_blank">Go directly to the download page.</a><br />
<br />
HyperGraphDB is a general purpose, free open-source data storage mechanism. Geared toward modern applications with complex and evolving domain models, it is suitable for semantic web, artificial intelligence, social networking or regular object-oriented business applications.<br />
This release contains numerous bug fixes and improvements over the previous 1.2 release. A fairly complete list of changes can be found at the <a href="http://www.hypergraphdb.org/?page=ReleaseNotesOnePointThree&project=hypergraphdb">Changes for HyperGraphDB, Release 1.3</a> wiki page.<br />
<br />
HyperGraphDB is a Java based product built on top of the <a href="http://en.wikipedia.org/wiki/Berkeley_DB">Berkeley DB</a> storage library.<br />
<div>
<br />
Key Features of HyperGraphDB include:</div>
<ul>
<li>Powerful data modeling and knowledge representation.</li>
<li>Graph-oriented storage.</li>
<li>N-ary, higher order relationships (edges) between graph nodes.</li>
<li>Graph traversals and relational-style queries.</li>
<li>Customizable indexing.</li>
<li>Customizable storage management.</li>
<li>Extensible, dynamic DB schema through custom typing.</li>
<li>Out of the box Java OO database.</li>
<li>Fully transactional and multi-threaded, MVCC/STM.</li>
<li>P2P framework for data distribution.</li>
</ul>
<div>
In addition, the project includes several practical domain specific components for semantic web, reasoning and natural language processing. For more information, documentation and downloads, please visit the <a href="http://www.hypergraphdb.org/">HyperGraphDB Home Page</a>.<br />
<br />
Many thanks to all who supported the project and actively participated in testing and development!</div>
</div>
<h1>Jayson Skima - Validating JavaScript Object Notation Data (2014-09-15)</h1><div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
A schema is simply a pattern. A pure form. Computationally, it can be used to try to match an instance against the pattern or to create an instance from it. That's how XML Schema, and even DTD, were traditionally used - mostly for validation, but also as an easy way to create a fill-in-the-blanks type of template. Since JSON has been taking over from XML, the need for a schema eventually (in my case, <i>finally!</i>) overpowered the minimalist instinct of JSON lovers. Thus was born JSON Schema. Judging from the earliest activity on the Google Group where the specification committee hung out (<a href="https://groups.google.com/forum/#!forum/json-schema" target="_blank">https://groups.google.com/forum/#!forum/json-schema</a>), the initial draft was done in 2008 and it was very much inspired by XML Schema. Which is not necessarily a bad thing. For one, the syntax used to define a schema is just JSON. A quick example:<br />
<br />
<pre class="brush: js;">{
"type":"object",
"properties":{
"firstName":{"type":"string"},
"age":{"type":"number"}
},
"required":["firstName", "lastName"]
}
</pre>
<br />
That is a complete, perfectly valid and useless schema, defined to validate some imaginary data describing people. People data will be the running example theme in this post, and if you prefer shopping cart orders, well, sorry to disappoint. So, when we validate against that schema, here is what we are enforcing:<br />
<ol style="text-align: left;">
<li>The thing must be a JSON object because we put <code>"type":"object"</code>.</li>
<li>If it has a <code>firstName</code> property, the value of that property must be a string.</li>
<li>The value of the <code>age</code> property, if present, must be a number.</li>
<li>The properties <code>firstName</code> and <code>lastName</code> are required.</li>
</ol>
Fairly straightforward. Never mind that we haven't defined the format (i.e. the <i>sub-schema</i>) for the <code>lastName</code> property; we are still requiring its presence, we just don't care what the value is going to be. So, this is how it goes: a schema is a JSON object where you specify various constraints. If all the constraints are satisfied by a given JSON datum, the schema matches, otherwise it doesn't. The standard specifies what the possible constraints are, along with a few extra keywords to structure large, complex schemas.<br />
<br />
<h3 style="text-align: left;">
Why Do I Care?</h3>
<div>
I can tell you why I care and then you can decide for yourself. When working with any sort of data structure, if you can't just assume that the structure has the expected form, things become very annoying: paranoia strikes at every corner, you become defensive, and all sorts of mental disorders can ensue. That's why we use strongly typed languages. But if you, like me, have been doing the unorthodox thing and using JSON as your main data structure, instead of spitting out a jar full of beans for your domain model, then this sort of trouble befalls you.<br />
<br />
A few years ago I made the decision to drop the beans for a project and since then I've been using the strategy in a few other, smaller-scale projects where it works even better. But the popular Java libraries for JSON are a disaster. There has already been a JSR (353) about JSON - a "whatever" API - so no wonder it seems dead on arrival, almost as bad as Jackson and Gson. And now Java 9 is promising a "lightweight JSON API" which looks like it might actually be well-designed, though it has different goals than what I need and simplicity is not one of them. So I wrote <a href="http://bolerio.github.io/mjson/" target="_blank">mJson</a>. It is a small, single-Java-file JSON library. I wanted something simple, elegant and powerful. The first two I think I've achieved; the "powerful" part is only half-way there. For instance, many people expect JSON libraries to have JSON <-> POJO mappings and mJson doesn't, though it has extension points to do your own easily (frankly it takes half a day to implement this stuff if you need it so much).<br />
<br />
Modeling with beans offers the type checker's help in validating that structures have the desired form. If you are using JSON only to convert it to Java beans, I suppose the mapping process is a roundabout way to validate, to a certain extent. Otherwise, you either consent to live with the risk of bugs or to the extra bloat needed to defensively code against a structure that may be broken. To avoid these problems, you can write a schema, sort of like your own type rules, and make use of it at strategic points in the program. Like when you are getting data from the internet. Or when you are testing a new module. Not that I'm advocating going for JSON + Schema instead of Java POJOs in all circumstances. But you should try it some time, see where it makes sense. By the way, in addition to being a validation tool, schemas are essentially metadata that represents your model (just like XML Schema).</div>
<div>
Good. Now I want to give you a quick...</div>
<h3>
Crash Course on JSON Schema</h3>
<div>
First, the constraints are organized by the type of the JSON value, which is probably your starting point when describing what a JSON document looks like:</div>
<pre class="brush: js;">{"type": "string"|"object"|"array"|"number"|"boolean"|"null"|"integer"|"any"}
</pre>
As you can see, there are two additional possible types besides the ones you'd expect: <code>integer</code>, to disallow floats, and <code>any</code>, to allow any type (which is the same as omitting the type constraint altogether). <br />
<br />
Now, given the type of a JSON entity, there are further constraints available. Let's start with objects. With <code>properties</code> you describe the format of the object's properties, in the form of a sub-schema for each possible property you want to talk about. You don't have to list all of them, but if your set is exhaustive, you can state that with the <code>additionalProperties</code> keyword:</div>
<pre class="brush: js;">{
"properties": { "firstName": { "type":"string"}, etc.... },
"additionalProperties":false
}
</pre>
That keyword is actually quite versatile. Here we are disallowing any other properties besides the ones explicitly stated. If instead we want to allow the object to have other properties, we can set it to true, or not set it at all. Or, the value of the <code>additionalProperties</code> keyword can alternatively be a sub-schema that specifies the form of all the extra properties in the object.<br />
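For example, to allow extra properties but require that anything we did not explicitly list be a number, we could write something like:<br />
<br />
<pre class="brush: js;">{
    "properties": { "firstName": { "type":"string"} },
    "additionalProperties": { "type":"number" }
}
</pre>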
<br />
We saw how to specify <code>required</code> properties in the example above. Two other options constrain the size of an object: <code>minProperties</code> and <code>maxProperties</code>. And for super duper flexibility in naming, you can use regular expression patterns for property names - this could be useful if you have some format made of letters and numbers, or UUIDs for example. The keyword for that is <code>patternProperties</code>:<br />
<pre class="brush: js;">{
"patternProperties": { "j_[a-zA-Z]+[0-9]+": { "type":"object"} },
"minProperties":1,
"maxProperties":100
}
</pre>
The above allows 1 to 100 properties whose names follow a j_<i>letters_digits</i> pattern. That's it about objects. That's the biggy.<br />
<br />
Validating arrays is mainly about validating their elements, so you provide a sub-schema for the elements with the <code>items</code> keyword. Either you give a single schema or an array of schemas. A single schema will apply to all elements of the array, while an array of schemas has to match element for element (a.k.a. <i>tuple typing</i> - there's an example of that a bit further down). That's the basis. Here are the extras: we have <code>minItems</code> and <code>maxItems</code> to control the size of the array; we have <code>additionalItems</code> which only applies when <code>items</code> is an array and controls what to do with the extra elements when there are some. Similarly to the <code>additionalProperties</code> keyword, you can put false to disallow extra elements or supply a schema to validate them. Finally you can require that all items of an array be distinct with the <code>uniqueItems</code> keyword. Example:<br />
<pre class="brush: js;">{
"items": { "type": "string" },
"uniqueItems":true,
"minItems":10,
"additionalItems":false
}
</pre>
Here we are mandating at least 10 unique strings in an array (note that <code>additionalItems</code> has no effect here, since <code>items</code> is a single schema rather than an array of schemas). That's it for arrays. Numbers and strings are pretty simple. For numbers you can define range and divisibility. The keywords are <code>minimum</code> (for >=), <code>maximum</code> (for <=), <code>exclusiveMinimum</code> (if true, <code>minimum</code> means >), <code>exclusiveMaximum</code> (if true, <code>maximum</code> means <). Strings can be validated through a regex with the <code>pattern</code> keyword and by constraining their length with <code>minLength</code> and <code>maxLength</code>. I hope you don't need examples to see string and number validation in action. The regular expression syntax accepted is ECMA 262 (<a href="http://www.ecma-international.org/publications/standards/Ecma-262.htm" target="_blank">http://www.ecma-international.org/publications/standards/Ecma-262.htm</a>).<br />
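Still, since tuple typing was only mentioned in passing above, here is a small sketch that also exercises the number and string keywords - an array whose first element must be a two-letter uppercase code and whose second must be an integer between 0 and 120, with nothing allowed after that:<br />
<br />
<pre class="brush: js;">{
    "items": [
        { "type": "string", "pattern": "^[A-Z]{2}$" },
        { "type": "integer", "minimum": 0, "maximum": 120 }
    ],
    "additionalItems": false
}
</pre>
Something like <code>["US", 42]</code> matches, while <code>["usa", 42, true]</code> does not.<br />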
<br />
Notice that there aren't any logical operators so far. In a previous iteration of the JSON Schema spec (draft 3), some of those keywords admitted arrays as values with the interpretation of an "or". For example, the <code>type</code> could be <code>["string", "number"]</code>, indicating that a value can be either a string or a number. Those have been abandoned in favor of a comprehensive set of logical operators to combine schemas into more complex validating behavior. Let's go through them: "and" is <code>allOf</code>, "or" is <code>anyOf</code>, "xor" is <code>oneOf</code>, "not" is <code>not</code>. Those are literally to be interpreted as standard logic: <code>not</code> has to be a sub-schema which must not match for validation to succeed, <code>allOf</code> has to be an array of schemas and all of them have to match for the JSON to be accepted. Similarly <code>anyOf</code> is an array of which at least one has to match, while <code>oneOf</code> means that exactly one of the schemas in the array must match the target JSON. For example, to enforce that a person is married, we could declare that it must have either a husband or a wife property, but not both:<br />
<pre class="brush: js;">{ "oneOf":[
{"required":["husband"]},
{"required":["wife"]}
  ]
}
</pre>
If you have a predefined list of values, you could use <code>enum</code>. For example, a gender property has to be either "male" or "female":<br />
<pre class="brush: js;">{
"properties":{
"gender":{"enum":["male", "female"]}
}
}
</pre>
<br />
With that, you know almost everything there is to know about JSON Schema. Almost. Above I mentioned "a few extra keywords to structure large complex schemas". I exaggerated. Actually there is only one such keyword: <code>$ref</code> (the related keyword <code>id</code> is not really needed). <code>$ref</code> allows you to refer to schemas defined elsewhere instead of having to spell out again the same constructs. For example if there is a standard format for address somewhere on the internet, with a schema defined for it and if that schema can be obtained at http://standardschemas.org/address (a made up url), you could do:<br />
<pre class="brush: js;">{
"properties":{
"address":{"$ref":"http://standardschemas.org/address"}
}
}
</pre>
The fun part of <code>$ref</code> is that the URI can be relative to the current schema <b>and</b> you can use a JSON Pointer in a URI fragment (the part after the pound # sign) to refer to a portion of the schema within a document! JSON Pointer is a small RFC (http://tools.ietf.org/html/rfc6901) that specs out Unix path-like expressions to navigate through properties and arrays in a complex JSON structure. For example, the expression <code>/children/0/gender</code> refers to the gender of the first element in a <code>children</code> array property. Note that only the slash is used, no brackets or dots, and that's perfectly enough. If you want to escape a slash inside a property name write <code>~1</code>, and to escape a tilde write <code>~0</code>. To get your hands on some presumably rock-solid zip code validation, for example, you could do: <br />
<pre class="brush: js;">{
"properties":{
"zip":{"$ref":"http://standardschemas.org/address#/postalCode"}
}
}
</pre>
So that means you can define the schemas for your RESTful API at a standard location and publish those and/or refer to them on your API responses. Any JSON validator has to be capable of fetching the right sub-schema and a good implementation will cache them so you don't have to worry about network hops. A reference URI can be relative to the current schema so if you have other schemas on the same base location, they can refer to each other irrespective of where they are deployed. As a special case to that, you can resolve fragments relative to the current schema. For example:<br />
<pre class="brush: js;">{
"myschemas": {
"properName": { "type":"string", "pattern":"[A-Z][a-z]+"}
},
"properties":{
"firstName":{ "$ref":"#/myschemas/properName"},
"lastName":{ "$ref":"#/myschemas/properName"}
}
}
</pre>
Because the JSON Schema specification allows properties that are not keywords, we can just pick a name, like <code>myschemas</code> here, as a placeholder for sub-schemas that we want to reuse. So we've defined that a proper name must start with a capital letter followed by one or more lowercase letters, and then we can reuse that anywhere we want. This is such a common pattern that the specification has defined a keyword as the place for such sub-schemas. This is the <code>definitions</code> keyword, which must appear at the top level, has no role in validation <i>per se</i>, but is just a placeholder for inline schemas. So the above example should be properly rewritten as:<br />
<pre class="brush: js;">{
"definitions": {
"properName": { "type":"string", "pattern":"[A-Z][a-z]+"}
},
"properties":{
"firstName":{ "$ref":"#/definitions/properName"},
"lastName":{ "$ref":"#/definitions/properName"}
}
}
</pre>
To sum up, using the <code>$ref</code> keyword and the <code>definitions</code> placeholder is all you need to structure large schemas, split them into smaller ones, possibly in different documents, refer to standardized schemas over the internet etc. <br />
<br />
<h2 style="text-align: left;">
Resources</h2>
<div>
Now, to make use of JSON Schema, there aren't actually that many implementations available yet. The popular (and bloated) Jackson supports draft 3 so far, and that part doesn't seem actively maintained. One of the JSON Schema spec authors has implemented full support on top of Jackson: <a href="https://github.com/fge/json-schema-validator" target="_blank">https://github.com/fge/json-schema-validator</a>, so you should know about that implementation, especially if you are already a Jackson user. But if you are not, I want to point you to another option that became available recently: <a href="http://bolerio.github.io/mjson/" target="_blank">mJson 1.3</a>, which supports JSON Schema Draft 4 validation:<br />
<br /></div>
<pre class="brush: java;">Json schema = Json.read(new URL("http://mycompany.com/schemas/model");
Json data = Json.object("firstName", "John", "lastName", "Smith").set("children", Json.array().add(/* etc... */));
Json errors = schema.validate(data);
for (Json e: errors.asJsonList())
System.out.println("JSON validation error:" + e);
</pre>
<br />
In all fairness, some of the other libraries also have support for generating JSON based on a schema, with default values specified by the <code>default</code> keyword, which I haven't covered here. mJson doesn't do that yet, but if there's demand I'll put it in. The other keywords I haven't covered are <code>title</code>, <code>description</code> (metadata keywords not used during validation) and <code>id</code>. To become an expert, you can always read the spec. Here it is, alongside some other resources:<br />
<br />
<br />
<ul style="text-align: left;">
<li><a href="http://json-schema.org/" target="_blank">http://json-schema.org</a> - the official website with the specification documents.</li>
<li><a href="http://jsonary.com/documentation/json-schema/" target="_blank">http://jsonary.com/documentation/json-schema/</a> - a very well-written tutorial covering the whole standard.</li>
<li><a href="https://json-schema-validator.herokuapp.com/" target="_blank">https://json-schema-validator.herokuapp.com</a>/ - simple online tool for immediate JSON schema validation, useful during schema development.</li>
<li><a href="http://spacetelescope.github.io/understanding-json-schema/" target="_blank">http://spacetelescope.github.io/understanding-json-schema/</a> - online book, tutorial (Python/Ruby based)</li>
<li><a href="http://www.jsonschema.net/" target="_blank">http://www.jsonschema.net/</a> - a little tool to generate JSON schemas from document samples</li>
<li><a href="https://groups.google.com/forum/#!forum/json-schema" target="_blank">https://groups.google.com/forum/#!forum/json-schema</a> - google group around the JSON Schema standard.</li>
<li><a href="http://bolerio.github.io/mjson">http://bolerio.github.io/mjson</a>/ - minimal Json, by yours truly. </li>
</ul>
<br />
<h3 style="text-align: left;">
For Dessert</h3>
To part ways, I want to leave you with a little gem, one more resource. Somebody came up with a much more concise language for describing JSON structures. It's called Orderly; it compiles into JSON Schema, and I haven't tried it. If you do, please report back. It's at <a href="http://orderly-json.org/" target="_blank">http://orderly-json.org/</a> and it looks like this:<br />
<br />
<pre class="brush: js">object {
string name;
string description?;
string homepage /^http:/;
integer {1500,3000} invented;
}*;
</pre>
<br /></div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1980461574999551012.post-70167146146694374372014-08-26T22:19:00.000-04:002014-08-27T15:39:25.704-04:00Where are the JVM Scripting IDEs?<div dir="ltr" style="text-align: left;" trbidi="on">
The rise of scripting languages in the past decade has been spectacular. And since the JVM platform is the largest, a few were designed specifically for it while many others were also implemented on top of it. It is thus that we have JRuby, Jython, Groovy, Clojure, Rhino, JavaFX and the more obscure (read <i>more fun</i>) things like Prolog and Scheme implementations. Production code is being written, dynamic language code bases are growing, whole projects don't even have any Java code proper. Yet when it comes to tooling, the space is meager to say the least.<br />
<br />
What do we have? In Eclipse world, there's the <i>Dynamic Languages Toolkit </i>which you can explore at <a href="http://www.eclipse.org/dltk/">http://www.eclipse.org/dltk/</a>, or some individual attempts like <a href="http://eclipsescript.org/">http://eclipsescript.org/</a> for the Rhino JavaScript interpreter or the Groovy plugin at <a href="http://groovy.codehaus.org/Eclipse+Plugin">http://groovy.codehaus.org/Eclipse+Plugin</a>. All of those provide means to execute a script inside the Eclipse IDE and possible syntax highlighting and code completion. The Groovy plugin is really advanced in that it offers debugging facilities, which of course is possible because the Groovy implementation itself has support for it. That's great. But frankly, I'm not that impressed. Scripting seems to me a different beast than normal development. Normally you do scripting via a REPL, which is traditionally a very limited form of UI because it's constrained by the limitation of a terminal console. What text editors do to kind of emulate a REPL is let you select the expression to evaluate as a portion of the text, or take everything on a line, or if they are more advanced, then use the language's syntax to get to the smallest evaluate-able expression. It still feels a little awkward. Netbeans' support is similar. Still not impressed. "What more do you want?", you may ask. Well, don't know exactly, but more. There's something I do when I write code in scripting languages, a certain state of mind and a way of approaching problems that is not the same as with the static, verbose languages such as Java.<br />
<br />
The truth is the IDE brought something to Java (and Pascal and C++ etc.) that made the vast majority of programmers never want to look back. Nothing of the sort has happened with dynamic languages. What did IDEs bring? Code completion was a late comer, compared to integrated debugging and the project management abilities. Code completion came in at about the same time as tools to navigate large code bases. Both of those need a structured representation of the code and until IDEs got powerful and fast enough to quickly generate and maintain in sync such a representation, we only had an editor+debugger+a project file. Now IDEs also include anything and everything around the development process, all with the idea that the programmer should not leave the environment (nevermind that we prefer to take a walk outside from time to time - I don't care about your integrated browser, Chrome is an Alt-tab away!).<br />
<br />
Since I've been coding with scripting languages even before they became so hot, I had that IDE problem a long time ago. That is to say, more than 10 years ago. And there was one UI for scripting that I thought was not only quite original, but a great match for the kind of scripting I was usually doing, namely exploring and testing APIs, writing utilities, throw away programs, prototypes, lots of activities that occasionally occupy a bigger portion of my time than end-user code. That UI was the Mathematica notebook. If you have never heard of it, Mathematica<i> (<a href="http://www.wolfram.com/mathematica">http://www.wolfram.com/mathematica</a>) </i>is a commercial system that came out in the 90s and has steadily been growing its user base with even larger ambitions as of late. The heart of it is its term-rewrite programming language, nice graphics and sophisticated math algorithms, but the notion of a notebook, as a better than REPL interface, is applicable to any scripting (i.e. evaluation-based, interpreter) language. A notebook is a structured document that has input cells, output cells, groups of cells, groups of groups of cells etc. The output cells contain anything that the input produces which can be a complex graphic display or even an interactive component. That's perfect! How come we haven't seen it widely applied?<br />
<br />
Thus Seco was born. To a first approximation, Seco is just a shell for JVM dynamic languages that imitates Mathematica's notebooks. It has its own ambitions a bit beyond that, moving towards an experimental notion of software development as a semi-structured evolutionary process. Because of that grand goal, which should not distract you from the practicality of the tool that I and a few friends and colleagues have been using for years, Seco has a few extras, like the fact that your work is always persisted on disk, and a more advanced zoomable interface that goes beyond the mere notebook concept. The best way to see why this is worth blogging about is to play with it a little. Go visit <a href="http://kobrix.com/seco.jsp">http://kobrix.com/seco.jsp</a>.<br />
<br />
Seco was written almost in its entirety by a former Kobrix Software employee, Konstantin Vandev. It is about a decade old, but active development stopped a few years ago. I took a couple of hours here and there in the past months to fix some bugs and started implementing a new feature: a centralized, searchable repository for notebooks so people can back up their work remotely, access it and/or publish it. That feature is not ready, but I'd like to breathe some life into the project by making a release. So consider this <b>an official Seco 0.5 release </b>which, besides the aforementioned bug fixes, upgrades to the latest version of <a href="http://hypergraphdb.org/" target="_blank">HyperGraphDB</a> (the backing database where everything gets stored) and removes the dependency on the BerkeleyDB native library, so it's pure Java now. </div>
Unknownnoreply@blogger.com7tag:blogger.com,1999:blog-1980461574999551012.post-77260988759867791702014-07-21T00:53:00.000-04:002014-09-12T12:24:55.064-04:00Why is the Fundamental Theorem of Software Engineering Fundamental?<div dir="ltr" style="text-align: left;" trbidi="on">Have you heard of the adage "<em>All problems in computer science can be solved by another level of indirection."</em>? If you are a programmer, chances are you have read about it in a book or an article talking about how to best structure the software you write. It's been dubbed <em>the fundamental theorem of software engineering (the FTSE) </em>so you should know what it is about. If you don't then, quickly, go read up on the subject at ... no, I won't do the googling for you.<br />
<br />
A common explanation you will find is that the FTSE is talking about abstraction layers: another level of indirection is achieved by raising the abstraction level. That's a good way to view it, but only some of it. Abstraction, when realized in software, often results in a layer so that the details of whatever is on the other side remain hidden, and a layer causes references to go in round about ways to get to the point, hence indirection. However, there are other forms of indirection where one just wants to reduce coupling between two parts. Here we are not really getting rid of any details, but maybe creating a common language about them. Think of the Adaptor pattern for example. Not convinced? Go read the insightful commentary on the subject by Zed Shaw at http://zedshaw.com/essays/indirection_is_not_abstraction.html.<br />
<br />
So, just for the fun of it, how would you explain the FTSE to a neophyte? Wikipedia defines <a data-mce-href="https://en.wikipedia.org/wiki/Indirection" href="https://en.wikipedia.org/wiki/Indirection" shape="rect" target="_blank">indirection</a> as <em>the ability to reference something using a name, reference, or container instead of the value itself</em>. This is both misleading and revelatory. Misleading because this is a much more general definition than the intuition most programmers have about the FTSE. After all, everything relies on naming and references, so what could a theorem stating the obvious have to teach us? But it's also revelatory because it hints at the fact that much of software engineering, and therefore much of computing, is about answering the equally famous question <em>what's in a name</em>. Therefore, to apply the FTSE in practice we need to allow ourselves to answer that question the best we can at any point in time. That is, we need to be able to define the meaning of a name within any given context. The meaning of a symbol in a software system of even moderate complexity quickly starts exhibiting nuances and goes through changes much like words in natural language. This is because just like natural language is a complicated symbolic model of the messy world, a software system can be similarly characterized.<br />
<br />
In a sense software, any software, is a model of the world. The symbols we use inside a software program acquire meaning (or semantics if you will) by virtue of what the entities they refer to actually do. A function's meaning is its implementation. The meaning of a piece of data is how it's used, how it interacts with other data. And programming types govern that usage to some extent, but mostly it is about what we do with the data inside the program. For instance, to discover the meaning of a field in a data structure about which you have no documentation, you would trace where and how it is being used. New requirements or changes and adjustments to old requirements are nothing more than a refined model of the world and/or system that needs to be reflected in software. This process of refining a model is enriching the meaning of the terms used to describe things and that translates into modifying the semantics of certain symbols in the software. Which is what we call "changing the implementation". Now, the practice of programming is not seen as <em>creating and refining the meaning of symbols</em>, but I believe that is a very important perspective. It is a perspective where the resolution of symbolic references in context is at the foundation of programming, instead of the usual "giving instructions to a machine" view. I came to this conclusion about 15 years ago while designing a programming language and looking at various theoretical constructs in PL design, their limitations and trade-offs. Over the years, I have developed a reflex to see a lot of design problems in terms of the important symbols in play, the context, both lexical and runtime (or static and dynamic if you prefer), and what is the resolution process of those symbolic references. I also see a lot of programming language constructs as being in essence a bag of reference resolution tools. That is, programming constructs are in large part tools a programmer has at their disposal to define the meaning of symbolic references. Let's take a look at some.<br />
<br />
<strong>Variables</strong><br />
<strong><br />
</strong> <br />
<div>Variables are the quintessential programming tool. Probably the first construct you ever learn in programming, directly lifted from algebra, it is the simplest form of abstraction - use a name instead of a value. Of course, as the term "variable" suggests, the whole point is for the thing itself to change, something that is able to vary. So in effect a variable both establishes an identity and an interface for changes to be affected, thus bypassing all sorts of metaphysical problems in one nice engineering sweep. Conceptually variables are direct symbolic references, associations between a name and a value "container". Implementation wise, they are usually the same thing as pointers to memory locations and that's how they've always been understood. In fact, this understanding is a consequence of the fact that in compiled languages the name of a variable disappears from the compiled version, it is replaced by the memory location. A benefit of this strategy is that the cost of using a variable is as low as it can be - just a RAM memory cell access. On the other hand, any flexibility in how the name is to be interpreted at runtime is completely gone.<br />
<br />
Note that we are not talking here about the distinction between "reference variables" vs. "primitive data variables" or whatever. Though that's an important distinction, what we are concerned about is merely the fact that variables are names of things. What is thought of as "reference variables" (or pointers) in various languages has to do with how they are copied during an assignment operation or as a function argument, whether the value is mutable or not etc.<br />
<br />
<strong>Aliases and Macros</strong><br />
<strong><br />
</strong></div>Aliases are relatively uncommon as a separate construct in modern languages. When pointers are assigned to each other, this is called <i>aliasing</i> because we have two names for the same memory location, i.e. two references with the same referent. For example, the assignment of any variable of <em>Object</em> type in Java is considered aliasing. While we do have another level of indirection here since we could change one reference without disturbing the other, this type of aliasing is not particularly interesting. But consider another type of aliasing, through macros (think C #define) where a name is explicitly declared to be a replacement of another name. The indirection here involves an extra translation step and the meaning of the alias in this case is not that it has the same referent, but that its referent is the original name. As a consequence, mentioning the alias at a place in the program where the same symbol is used for an entirely different thing will make the alias refer to that thing. Another benefit of this kind of aliasing is that it can be used to refer to other things besides variables, for example types, especially when a type description is a complex expression that we don't want to write over and over again. Macros are also in the same spirit - languages that recognize the value of compile-time decision making will offer a macro facility. And it is a shame they are not more popular. They just have a bad aura because they are associated with purely textual manipulation, a completely separate language on its own. However, if one sees them as a way to do structured code generation, or compile-time evaluation, they are much more appealing. One has to realize that macros and aliases are about name & conquer just as much as variables and functions are, and that they are in fact a great level of indirection mechanism. A mechanism that occupies another spot in the compile-run time dimension. Speaking of which, the fact that there is nothing in between that strict compile vs. run-time separation is a strong limitation in language expressiveness. Partial evaluation techniques could be what macros at run time look like, but those are still mainly confined to academic research.<br />
<br />
To sum up so far: the key difference between variables and aliases is the timing of the reference resolution process. With variables, the referent is obtained when the program is running while with aliases it is obtained at compile time. In one case we have the context of a running program, in the other the context of a running compiler. <br />
<br />
<strong>Overloading</strong><br />
<strong><br />
</strong> <br />
<div>Overloading is a programming mechanism that allows one to define different implementations of the same function depending on the arguments used during invocation. In other words, overloading lets you associate a different meaning with a given name depending on the syntactical context of usage of that name. It's a context-dependent reference resolution process that happens generally at compile time, hence within a static context. A rough natural language analogue would be homonyms that have different meanings only because used as a different part of speech. For example, in "all rivers flow" and "the flow is smooth" the semantic import is the same, but the strict meaning is different because in one case we have a verb while in the other we have a subject. A variation on the theme is <a href="http://www.gigamonkeys.com/book/object-reorientation-generic-functions.html" target="_blank">Common Lisp and its generic functions </a>where the dispatching can be defined on an actual object, via an equals predicate. In that case the context for the resolution is a dynamic one, hence it has to do more with semantics in a sense. That's more like homonymy where the semantics of a word depend on the semantics of the surrounding words.<br />
<br />
</div><div><strong>Overriding</strong><br />
<strong><br />
</strong></div><div>Overriding is about changing the meaning of a name in a nested lexical scope. I deliberately use the word <em>meaning</em> here to talk about whatever a name refers to, understanding that often the meaning of a symbol in a programming language is simply the referent of that symbol. A nested lexical scope could be a nested function, or a class declaration or some code block that delimits a lexical scope. In some cases, one loses the ability to access the referent from the enclosing scope. In others, the programming language provides a special syntax for that (i.e. the <em>super</em> keyword in Java). Again, we are talking about a reference resolution mechanism in a certain specialized context. The context is completely specified by the lexical scope and is therefore static. Note that people somewhat erroneously use the word <em>overwrite</em> instead of <i>override</i>. The correct term in Computer Science is <i>override</i>. In English, it means to cancel a previous decision whereas <i>overwrite</i> literally means to write over something. More on the mechanics of overriding below.<br />
<strong><br />
</strong> <strong>Classes</strong> </div><div><br />
</div><div>A compound structure such as an object in JavaScript, or a class in Java/C# or a struct in C is, among other things, a context where certain names have meanings. Specifically, the fields and methods that belong to that structure are references that are resolved precisely in the context of the runtime object. Well, actually it depends. An interesting case are the static variables in Java classes. A static variable is a member of the class object rather than of its instances. One way teachers of the language describe static variables is that all objects of that class share the same variable. That's even how the official Java documentation talks about static variables: in terms of variables that all objects share (see <a data-mce-href="http://docs.oracle.com/javase/tutorial/java/javaOO/classvars.html" href="http://docs.oracle.com/javase/tutorial/java/javaOO/classvars.html" shape="rect" target="_blank">http://docs.oracle.com/javase/tutorial/java/javaOO/classvars.html</a>). But that is inaccurate because an object (i.e. an instance of the class) is not the proper context to refer to the static variable. If we have:<br />
<br />
<pre class="brush: java;">class A { public static int a = 0; }
class B extends A { public static int a = 1;}
A x = new B();
System.out.println(x.a); // what are we supposed to see here?
</pre><br />
What does the last statement print out? That sounds like a good entry level job interview question. The question comes down to what is the proper context for the reference resolution of the name <i>a</i>? If we see static variables as variable shared by all objects of the same class, the value should clearly be 1 since x's actual type is <i>B</i> and <i>B.a</i> is what all objects of type <i>B</i> share. But that is not what happens. We get 0 because Java only cares about the declared type which in this is <i>A</i>. The correct way to talk about static variables is as member variables of the object class (in Java, every class is an object at runtime, an object whose type is java.lang.Class). This is why some recent Java compilers issue a warning when you refer to a static variable from an object context (though not the JDK7 compiler!) To be fair to official documentation, the above mentioned tutorial does recommend use of the class rather than the object to refer to static variables. However, the reason given is that otherwise <em>it does not make it clear that they are class variables</em>. Now, if we had non-static member variables? We get the same result: the declared type of the object variable <i>x</i> is what matters, not the actual runtime type. If instead of accessing a public variable, we were accessing a public function then the runtime type would have been the one used to resolve the reference.<br />
<br />
So why is that the case? Because part of the reference resolution process happens at compile time rather than at run-time. Traditionally, in programming languages a name is translated to a memory location during compilation so then at runtime only the location matters and the referent is obtained the fastest possible way. With so called "virtual methods", like non-static methods in Java, we get to do an extra hop to get to the memory location at runtime. So for variables, both static and non-static, and for static methods, the reference resolution is entirely based on the static context (type annotations available at compile time) while for non-static function it becomes dynamic. Why is only one kind of name inside a class's lexical context resolved in this way. No deep reason, just how Java was designed. Of course I could have just said "virtual tables only apply to non-static functions", but that's not the point. The point is that in defining programming constructs, an important part of the semantics of a programming language is based on narrowing down what is the context and the process for reference resolution in all the various ways a symbol can be mentioned in a program. For most mainstream languages, this only applies to identifiers, a special lexical category, but it is in fact more general (e.g. operators in C++ or Prolog, any type of symbol in Scheme). A common name for this kind of deferring of the reference resolution is <em>late binding</em>. And everybody likes it because "it's more flexible". Good.<br />
<br />
<strong>Closures</strong><br />
<strong><br />
</strong> To finish, let me mention closures, a term and a construct that has thankfully made it into mainstream programming recently. The power of closures stems from their ability to capture a computational context as a reference resolution frame. It is a way to transcend the rigid lexical scoping somewhat and have a piece of code rely on a set of variables that are hidden from the users of that code, yet whose lifetime is independent of its execution. So the variables captured in a closure behave like global variables in terms of their lifetime, but like local variables in terms of their scope. So they represent a very particular intersection between the design dimensions of visibility&lifetime. But what that does in effect, to put it in more general terms, is that the meaning of a name is carried over from one context to another without it interfering with meanings of that same name in other contexts. <br />
<br />
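Here is a tiny sketch of the captured-variable behavior just described, in JavaScript simply because closures are easiest to show there (the same idea is now available through Java 8 lambdas, modulo Java's effectively-final restriction):<br />
<br />
<pre class="brush: js;">function makeCounter() {
    var count = 0;              // captured by the closure below
    return function() {
        count = count + 1;      // 'count' outlives makeCounter's invocation...
        return count;
    };
}
var next = makeCounter();
next(); // 1
next(); // 2 -- ...yet no code outside the closure can even name it
</pre>
<br />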
<strong></strong>Okay. So we started with a software engineering principle and we dug somewhat into a few programming language concepts from the perspective of reference resolution. Naming and reference are philosophical problems that are probably going to get resolved/dissolved soon by neuroscience. In the meantime, the cognitive phenomenon and whatever philosophy and linguistics have taught us so far about it could serve a bit more as an inspiration to how people communicate with machines through programming. So you can see where I'm heading with the question posed in the title of this post. It is the reference resolution that is fundamental; indirection is simply what we do to (re)define what that resolution process ultimately looks like. I have more to say about it, but this already got a bit long so I'll stop here.</div></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1980461574999551012.post-29992991344500093262014-02-01T19:27:00.000-05:002014-09-12T12:24:24.451-04:00Application Components with AngularJS<div dir="ltr" style="text-align: left;" trbidi="on"><div>This post is about self-contained business components with <a href="http://angularjs.org/" target="_blank">AngularJS</a> that can be instantiated at arbitrary places in one's application. </div><div><br />
</div><div>With AngularJS, one writes the HTML plainly and then implements behavior in JavaScript via controllers and directives. The controllers are about the model of which the HTML is the view, while directives are about extra tags and attributes that you can extend the HTML with. You are supposed to implement business logic in controllers and UI logic in directives. Good. But there are situations where the distinction is not so clear-cut, in particular when you are building a UI by reusing business functionality in multiple places.</div><div><br />
</div>In large applications, it often happens that the same piece of functionality has to be available in different contexts, in different flows of the user interaction. So it helps to be able to easily package that functionality and plug it in wherever it's needed. For example, a certain domain object should be editable in place, or we need the ability to select among a list of dynamically generated domain objects. Those types of components are <i>application components</i> because they encapsulate reusable business logic and they are even tied to a specific backend, as opposed to, say, <i>UI components</i> which are to be (re)used in different applications and have much wider applicability. Because application components are not instantiated that often (as opposed to UI components which are much more universal), it is frequently tempting to copy&paste code rather than create a reusable software entity. Unless the componentization is really easy and requires almost no extra work.<br />
<div><br />
</div><div>If you are using <a href="http://angularjs.org/" target="_blank">AngularJS</a>, here is a way to easily achieve this sort of encapsulation: </div><div><ol style="text-align: left;"><li>Put the HTML in a file as a self-contained template "partial" (i.e. without the top-level document HTML tags). </li>
<li>Have its controller JavaScript be somehow included in the main HTML page.</li>
<li>Plug it in any other part of the application, like other HTML templates for example. </li>
</ol>This last part cannot be done with AngularJS's API. We have to write the gluing code ourselves. Since we will be plugging by referring to our component in an HTML template, we have to write a custom directive. Instead of writing a separate directive for each component as the AngularJS documentation recommends, we will write one directive that will handle all our components. To be sure, there is a generic directive to include HTML partials in AngularJS, the <i>ng-view</i> directive, but it's limited to swapping the main content of a page, which is too coarse-grained. By contrast, our directive can be used anywhere, nested recursively etc. Here is an example of its usage:</div><div><pre><code>
<be-plug name="shippingAddressList">
<be-model-link from-child-scope="currentSelection"
from-parent-scope="shippingAddress">
</be-model-link></be-plug>
</code></pre></div><div>This little snippet assumes we have an HTML template file called <i>shippingAddressList.ht</i> that lets the user select one among several addresses to ship the shopping cart contents to. We have a top-level tag called <i>be-plug</i> and nested tag called <i>be-model-link</i>. The <i>be-model-link</i> tag associates attributes of the component's model to attributes of the model (i.e. scope in AngularJS's terms) of the enclosing HTML. More on that below. Here is the implementation:<br />
<pre class="brush: js;">app.directive('bePlug', function($compile, $http) {
return {
restrict:'E',
scope : {},
link:function(scope, element, attrs, ctrl) {
var template = attrs.name + ".ht";
$http.get(template).then(function(x) {
element.html(x.data);
$compile(element.contents())(scope);
$.each(scope.modelLinks, function(atParent, atChild) {
// Find a parent scope that has 'atParent' property
var parentScope = scope;
while (parentScope != null &&
!parentScope.hasOwnProperty(atParent))
parentScope = parentScope.$parent;
if (parentScope == null)
throw "No scope with property " + atParent +
", be-plug can't link models";
scope.$$childHead.$watch(atChild, function(newValue) {
parentScope[atParent] = newValue;
});
parentScope.$watch(atParent, function(newValue) {
scope.$$childHead[atChild] = newValue;
});
});
});
}
};
});
</pre></div><div><br />
</div><div>Let's deconstruct the above code. First, make sure you are familiar with <a href="http://www.befundoo.com/university/tutorials/angularjs-directives-tutorial/" target="_blank">how to write directives in AngularJS</a> and you understand what AngularJS scopes are. Next, note that we are creating a scope for our directive by declaring a <i>scope:{}</i> object. The purpose is twofold: (1) don't pollute the parent scope and (2) make sure we have a single child in our scope so we have a handle on the scope of the component we are including.<br />
<br />
Good. Now, let's look at the gist of the directive, its <i>link </i>method. (I'm sure there is some valid reason that method is named "link". Perhaps because we are "linking" an HTML template to its containing element. Or to a model via the scope? Something like that.) In any case, that's where DOM manipulation is done. So here's what's happening in our implementation:<br />
<ul style="text-align: left;"><li>We fetch the HTML template from the server. By naming convention, we expect the file to have extension <i>.ht</i>. The rest of the relative path of the template file is given in the <i>name </i>attribute.</li>
<li>Once the template is loaded, we set it as the HTML content of the directive's element. So the resulting DOM will have a <i>be-plug </i>DOM<i> </i>node which the browser will happily ignore and inside that node there will be our component's HTML template.</li>
<li>Then we "compile" the HTML content using AngularJS's <i>$compile</i> service. This method call is essentially the whole point of the exercise. This is what allows AngularJS to bind model to view, to process any nested directives recursively etc. In short, this is what makes our textual content inclusion into a "runtime component instance". Well, this and also the following:</li>
<li>...the binding of scope attributes between our enclosing element and the component we are including. This binding is achieved in the following <i>for</i> loop by observing variable changes in the scopes of interest.</li>
</ul>That last point needs a bit more explaining. The HTML code that includes our component presumably has some associated model scope with attributes pertaining to business logic. On the other hand, the included component acquires its own scope with its own set of attributes as defined by its own controller. The two scopes end up in a parent-child relationship with the directive's scope (a third one) in between. From an application point of view, we probably have one or several chained parent scopes holding relevant model attributes and we'd want to somehow connect the data in our component model to the data in the enclosing scope. In the example above, we are connecting the <i>shippingAddress</i> attribute of our main application scope to the <i>currentSelection</i> attribute of the address selection component. In the context of the enclosing logic, we are dealing with a "shipping address", but in the context of the address selection component which simply displays a choice of addresses to pick from we are dealing with a "current selection". So we are binding the two otherwise independent concepts.<br />
<br />
To implement this sort of binding of a given pair of model attributes, we need to know: the parent scope, the child scope, the name of the attribute in the parent scope and the name of the attribute in the child scope. To collect the pairs of attributes, we rely on a nested tag called <i>be-model-link</i> implemented as follows:</div><div><pre class="brush: js;">app.directive('beModelLink', function() {
return {
restrict:'E',
link:function(scope, element, attrs, ctrl) {
if (!scope.modelLinks)
scope.modelLinks = {};
scope.modelLinks[attrs.fromParentScope] = attrs.fromChildScope;
}
};
});
</pre></div><div><br />
Because we have not declared a private scope for the <i>be-model-link </i>directive, the scope we get is the one of the parent directive. This gives us the chance to put some data in it. And the data we put is the mapping from parent to child model attributes in the form of a <i>modelLinks </i>object. Note that we refer to this <i>modelLink </i>object in the setup of variable watching in the <i>be-plug</i> directive where we loop over all its properties and use <a href="http://docs.angularjs.org/api/ng.$rootScope.Scope#methods_$watch" target="_blank">AngularJS' $watch mechanism</a> to monitor changes on either side and affect the same change to the linked attributes. To find the correct parent scope, we walk up the chain and get the first one which has the stated <i>from-parent-scope</i> attribute, throwing an error if we can't find it. The child scope is easy because there is only one child scope to our directive.<br />
<br />
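To make the running example more tangible, here is a rough sketch of what the hypothetical <i>shippingAddressList.ht</i> partial and its controller might look like (the controller name and the hard-coded address list are made up purely for illustration):<br />
<pre><code>
<div ng-controller="ShippingAddressListCtrl">
  <ul>
    <li ng-repeat="address in addresses" ng-click="select(address)">
      {{address.street}}, {{address.city}}
    </li>
  </ul>
</div>
</code></pre>
with the controller registered on the application module in a script loaded by the main page:<br />
<pre class="brush: js;">app.controller('ShippingAddressListCtrl', function($scope) {
    // In a real application this list would be fetched from the backend.
    $scope.addresses = [
        { street: "1 Main St", city: "Miami" },
        { street: "2 Oak Ave", city: "Boston" }
    ];
    $scope.currentSelection = null;
    $scope.select = function(address) {
        // Picked up by the be-model-link $watch and copied to 'shippingAddress'.
        $scope.currentSelection = address;
    };
});
</pre>
<br />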
That's about it. We are essentially doing server-side includes like in the good (err..bad) old days, except because of the interactive nature of the "thing" with AJAX and all, and the whole runtime environment created by AngularJS, we have a fairly dynamic component. Hope you find this useful. </div></div>Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-1980461574999551012.post-13043131739561454422013-09-06T22:22:00.001-04:002014-08-31T12:24:25.597-04:00Better-Setter-Getter<div dir="ltr" style="text-align: left;" trbidi="on">
I must confess that it's very difficult to bring myself to write code with a long life expectancy in an alternative JVM language. Scala is slowly taking over and that's great. But if I'm working on an API, an open general purpose one, or some "important" layer within a system, the thought of not using Java doesn't even cross my mind. Not that those other languages are not good. But anyway, in this post I just wanted to complain that I'm tired of seeing, let alone writing getters and setters for bean properties. Since first-class properties don't seem to be on the table for Java, can we at least come up with a better naming convention? What would it take to have a tool supported, alternative Java Beans spec? The addition of closures in Java 8 is perhaps an opportunity and an excuse to revise this (e.g. event listeners can be specified more concisely). Perhaps it's already being discussed in the depths of the Oracle.<br />
<br />
But let's start with the 'getFoo'/'setFoo' pair. For a "real world" example:<br />
<code></code><br />
<pre><code>public class Customer {
private String email;
public void setEmail(String email) { this.email = email; }
public String getEmail() { return email; }
}
</code></pre>
<br />
That convention is ugly and verbose. For one thing, why the need for the 'set' and 'get' prefixes? It is clear that if you are passing an argument to a function then you'd be setting. If there's no argument you are not setting, so you are probably getting as evidenced by the return value. Moreover, the fact that a setter returns no value is pure waste since you could always return <code>this</code>. Why is that important? Method chaining! Yes, method chaining is not to be underestimated and it's not just a trick of the conciseness-conscious coder. Since we in the Western world read left to right, programming languages are written and evaluated left to right. As a consequence, symbols to the left of the current position (of reading, writing or evaluating) tend to establish a context that could just be relied upon rather than repeated. When you put a semi-colon in Java, that's like a reset, context goes away, next statement. But if you are setting up a bunch of properties, as is frequently the case, your next statement will start by re-establishing the context you already had. That's more verbiage there on the screen staring at you and definitely kind of ugly. But it's also more work for the thing doing the evaluation. Values are returned on the call stack, so when your setter returns <code>this</code>, the next set operation doesn't need to push the object on the stack since it's already there. So there's a performance improvement there. Probably quite negligible, but I'd like to point it out because it ties to the observation that, in the kind of programming that consists of moving data in one form from one place to another form somewhere else, which is the bulk of so-called "business" applications, less code does mean better performance. <br />
<br />
The difference between:<br />
<code></code><br />
<pre><code>Customer customer = new Customer();
customer.setFirstName("Bill");
customer.setLastName("Watterson");
customer.setOccupation("cartoonist");
</code></pre>
<br />
and this:<br />
<code></code><br />
<pre><code>Customer customer = new Customer().firstName("Bill").lastName("Watterson").occupation("cartoonist");
</code></pre>
<br />
is not just a matter of taste. Well, unless you really have bad taste that is.<br />
<br />
All it would take is for a couple of IDEs to support the new convention and a small lib that could quickly become ubiquitous. Or perhaps a big lib, something called betterjava.jar that would fix bad API designs from the early, and why not later, days of Java. I'm sure people aren't short of ideas as the Java standard APIs aren't short of problems.</div>
Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-1980461574999551012.post-75973301306419371812012-11-19T15:56:00.003-05:002012-12-22T14:09:43.902-05:00eValhalla User Management<div dir="ltr" style="text-align: left;" trbidi="on">[Previous in this series: <a href="http://hypergraphdb.org/blog?entry=http://www.blogger.com/feeds/1980461574999551012/posts/default/3851901101320193455" target="_blank">eValhalla Setup</a>]<br />
<br />
In this installment of the eValhalla development chronicle, we will be tackling what's probably the one feature common to most web-based applications - user management. We will implement:<br />
<ol style="text-align: left;"><li>User login</li>
<li>A "<i>Remember me</i>" checkbox for auto-login of users on the same computer</li>
<li>Registration with email validation</li>
</ol><div>The end result can be easily taken and plugged in your next web app! Possible future improvements would be adding captchas during registration and/or login and the ability to edit more extensive profile information.<br />
<br />
I have put comments wherever I felt appropriate in the code, which should be fairly straightforward anyway. So I won't be walking you line by line. Rather, I will explain what it does and comment on design decisions or on less obvious parts. First, let me say a few words about how user sessions are managed.</div><div><br />
</div><div><b>User Sessions</b><br />
<br />
Since we are relying on a RESTful architecture, we can't have a server hold user sessions. We need to store user session information at the client and transfer it with each request. Well, the whole concept of a user session is kind of irrelevant here since the application logic is mostly residing at the client and the server is mainly consulted as a database. Still, the data needs protection and we need the notion of a user with certain access rights. So we have to solve the problem of authentication. Each request needs to carry enough information for the server to decide whether the request should be honored or not. We cannot just rely on the user id because this way anybody can send a request with anybody else's id. To authenticate the client, the server will first supply it with a special secret key, an <i>authentication token</i>, that the client must send on each request along with the user id. To obtain that authentication token, the client must however identify themselves by using a password. And that's the purpose of the login operation: obtaining an authentication token for use in subsequent requests. The client will then keep that token together with the user id and submit them as HTTP headers on every request. The natural way to do that with JavaScript is storing the user id and authentication tokens as cookies.<br />
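On the client side, a minimal sketch of that per-request header passing could look something like the AngularJS request interceptor below (the module, cookie and header names are made up - the real code lives in <i>ahguser.js</i> - and this assumes an AngularJS version that supports <code>$httpProvider.interceptors</code> and cookies readable from script):<br />
<pre class="brush: js;">app.factory('authInterceptor', function() {
    // Read a cookie value by name from document.cookie.
    function cookie(name) {
        var m = document.cookie.match(new RegExp('(^|; )' + name + '=([^;]*)'));
        return m ? decodeURIComponent(m[2]) : null;
    }
    return {
        request: function(config) {
            config.headers = config.headers || {};
            // Attach the headers only once the user has logged in and the cookies exist.
            if (cookie('userid')) config.headers['X-EV-User'] = cookie('userid');
            if (cookie('token')) config.headers['X-EV-Token'] = cookie('token');
            return config;
        }
    };
});
app.config(function($httpProvider) {
    $httpProvider.interceptors.push('authInterceptor');
});
</pre>
<br />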
<br />
This authentication mechanism is commonly used when working within a RESTful architecture. For more on the subtleties of that approach, just google "user authentication and REST applications". One question is why not just send the password with every request instead of a separate token. That's possible, but more risky - a token is generated randomly and it expires, hence it is harder to guess. The big issue however is XSS (cross-site scripting) attacks. In brief, with XSS an attacker inserts HTML code into a field that gets displayed supposedly as just static text to other users (e.g. a blog title) and the code simply does an HTTP request to a malicious server submitting all the users' private cookies with it. To avoid such attacks, we will have to pay special attention to HTML sanitization. That is, we have to properly escape every HTML tag displayed as static text. We can also put that authentication token in an <a href="http://www.codinghorror.com/blog/2008/08/protecting-your-cookies-httponly.html" target="_blank">HTTPOnly cookie</a> for extra security.<br />
<br />
<b>Implementation - a User Module</b><br />
<br />
Since user management is so common, I made a small effort to build a relatively self-contained module for it. There are no standard packaging mechanisms for the particular technology stack we're using, so you'd just have to copy&paste a few files:<br />
<br />
<ul style="text-align: left;"><li>/html/ahguser.ht - contains the HTML code for login and registration dialogs as well as top-level <i>Login</i> and <i>Register</i> links that show up right aligned. This depends on the whole Angular+AngularUI+Bootstrap+jQuery environment. </li>
<li>/javascript/ahguser.js - contains the Angular module 'ahgUser' that you can use in your Angular application. This depends on the backend:</li>
<li>/scala/evalhalla/user/package.scala - the evalhalla mJson-HGDB backend REST user service. </li>
</ul><br />
<div>The backend can be easily packaged in a jar, mavenized and all, and this is something that I might do near the end of the project. </div><div><br />
</div><div>The registration process validates the user's email by emailing them a randomly generated UUID (a HyperGraphDB handle) and requiring that they provide it back for validation before they can legitimately log into the site. Since the backend code is simple enough, repetitive even, let's look at one method only, the <i>register</i> REST endpoint:</div><br />
<pre><code>@POST
@Path("/register")
def register(data:Json):Json = {
return transact(Unit => {
normalize(data);
// Check if we already have that user registered
var profile = db.retrieve(
jobj("entity", "user",
"email", data.at("email")))
if (profile != null)
return ko().set("error", "duplicate")
// Email validation token
var token = graph.getHandleFactory()
.makeHandle().toString()
db.addTopLevel(data.set("entity", "user")
.set("validationToken", token)
.delAt("passwordrepeat"))
evalhalla.email(data.at("email").asString(),
"Welcome to eValhalla - Registration Confirmation",
"Please validate registration with " + token)
return ok
})
}</code></pre><br />
<div>So we see our friends <i>db, transact, jobj</i> etc. from last time. The whole thing is a transaction wrapped in a Scala closure, whose syntax is much nicer than java.lang.Callable. As a reminder, note that most of the functions and objects referred to here are globally declared in the <i>evalhalla</i> package. For example, <i>db</i> is an instance of the <a href="http://www.hypergraphdb.org/docs/apps/mjson/mjson/hgdb/HyperNodeJson.html" target="_blank">HyperNodeJson</a> class. The call to <i>normalize</i> just ensures the email is lower-case, because email addresses are generally case-insensitive. While the logic is fairly straightforward, let me make a few observations about the APIs.<br />
<br />
<div style="text-align: left;">User profile lookup is done with Json pattern matching. Recall that the database stores arbitrary Json structures (primitives, arrays, objects and nulls) as HyperGraphDB atoms. It doesn't have the notion of "document collections" like MongoDB for example. This is because each Json structure is just a portion of the whole graph. So to distinguish between different types of objects we define a special JSON property called <i>entity </i>that we reserve as a type name attached to all of our top-level Json entities. Here we are dealing with entities of type <i>"user"</i>. Now, each atom in HyperGraphDB and consequently each user profile has a unique identifier - the HyperGraphDB handle (a UUID). There is no notion of a primary key enforced at the database level. We know that an email should uniquely identify a user profile so we perform the lookup with the Json object {entity:"user", email:<the email>} as a pattern. But this uniqueness is enforced at the application level because new profiles are added only via this <i>register</i> method. I've explained the db.addTopLevel method on the <a href="http://www.hypergraphdb.org/learn?page=Json&project=hypergraphdb" target="_blank">HGDB-mJson</a> wiki page, but here is a summary for the impatient: while <i>db.add</i> will recursively create a duplicate of the whole Json structure and <i>db.assertAtom</i> will only create something if it doesn't exist yet, <i>db.addTopLevel</i> will create a new database atom only for the top-level JSON, but perform an assert operation for each of its components.<br />
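If it helps, here is a tiny, purely illustrative Java snippet (the <i>story</i> object is made up) contrasting the three insert operations just described:<br />
<pre><code>Json story = Json.object("entity", "story",
                         "title", "Project X",
                         "tags", Json.array("government", "IT"));

db.add(story);          // always stores a brand new copy, nested values included
db.assertAtom(story);   // stores the structure only if an identical one isn't already there
db.addTopLevel(story);  // new atom for the top-level object; its components are asserted, i.e. reused
</code></pre>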
<br />
Finally, note that before adding the profile to the database, we delete the <i>passwordrepeat </i>property. This is because we are storing in the database whatever JSON object the client gives us. And the client is giving us all fields coming from the HTML registration form, pretty much as the form is defined. So we get rid of that unneeded field. Granted, it would actually be better design to remove that property on the client side, since the repeat-password validation is done there. But I wanted to illustrate the fact that the JSON data really flows as is from the HTML form directly to the database, with no mappings or translations of any sort needed. </div></div><br />
So far, so good. Let's move on to the UI side.<br />
<br />
<b>Client-Side Application</b></div><div><br />
In addition to AngularJS, I've incorporated the <a href="http://twitter.github.com/bootstrap/index.html" target="_blank">Bootstrap</a> front-end library by Twitter. It looks good and it has some handy components. I've also included the <a href="http://angular-ui.github.com/" target="_blank">Angular-UI</a> library which has some extras like form validation and it plays well with Bootstrap.<br />
<br />
The whole application resides on one top-level HTML page <i>/html/index.html</i>. The main JavaScript program that drives it resides in the <i>/javascript/app.js</i> file. The idea is to dynamically include fragments of HTML code inside a reserved main area while some portions of the page such as the header with user profile functionality remain stable. By convention, I use the extension '.ht' for such HTML fragments instead of '.html' to indicate that the content is not a complete HTML page. Here's the main page in its essence:<br />
<br />
<pre><code>
<div ng-controller="MainController">
<div ng-include="ng-include" src="'/ahguser.ht'"></div>
<!-- some extra text here... -->
<hr/>
<a href="#m">main</a> |
<a href="#t">test</a>
<hr/>
<ng-view/> <!-- this is where HTML fragments get included -->
</code></pre><br />
The MainController function is in <i>app.js</i> and it will eventually establish some top-level scoped variables and functions, but it's currently empty. The user module described above is included as a UI component by the <div ng-include...> tag. The login and register links that you see at the top right of the screen with all their functionality come from that component. Finally, just for the fun of it, I made a small menu with two links to switch the main view from one page fragment to another: 'main' and 'test'. This is just to set up a blueprint for future content.<br />
<br />
<b>A Note on AngularJS</b><br />
<b><br />
</b> Let me explain the Angular portion of that code a bit, since this framework, as pretentious as it is, is lacking in documentation and far from being an intuitive API. In fact, I wouldn't recommend it yet. My use of it is experimental and the first impressions are a bit ambivalent. Yes, eventually you get stuff to work. However, it's bloated and it fails the first test of any over-reaching framework: simple things are not easy. It seems built by a team of young people that use the word <i>awesome </i>a lot, so we'll see if it meets the second test, that complicated things should be possible. Over the years, I've managed to stay away from other horrific, bloated frameworks aggressively marketed by big companies (e.g. EJBs), but here I may be just a bit too pessimistic. Ok, enough.<br />
<br />
Angular reads and interprets your HTML before it gives it to the browser. That opens many doors. In particular, it allows you to define custom attributes and tags to implement some dynamic behaviors. It has the notion of an application and the notion of modules. An application is associated with a top-level HTML tag, usually the 'html' tag itself by setting the custom <i>ng-app='appname'</i> attribute. Then <i>appname </i>is declared as a module in JavaScript:<br />
<br />
var app = angular.module('appname', [array of module dependencies])<br />
<br />
It's not clear what's special about an application vs. mere modules, presumably nothing. Then functionality is attached to markup (the "view") via Angular controllers. Those are JavaScript functions that you write and Angular calls to set up the model bound to the view. Controller functions take any number of parameters and Angular uses a nifty trick here. When you call the <i>toString</i> method of a JavaScript function, it returns its full text as it was originally parsed. That includes the formal arguments, exactly with the names you have listed in the function declaration (unless you've used some sort of minification/obfuscation tool). So Angular parses that argument list and uses the names of the parameters to determine what you need in your function. For example, when you declare a controller like this:<br />
<br />
<pre><code>function MainController($scope, $http) {
}
</code></pre>A call to <code>MainController.toString()</code> returns the string <code>"function MainController($scope, $http) { }"</code>. Angular parses that string and determines that you want "$scope" and "$http". It recognizes those names and passes in the appropriate arguments for them. The name "$scope" is a predefined AngularJS name that refers to an object to be populated with the application model. Properties of that object can be bound to form elements, or displayed in a table or whatever. The name "$http" refers to an Angular <em>service</em> that allows you to make AJAX calls. As far as I understand it, any global name registered as a service with Angular can be used in a controller parameter list. There's a dependency injection mechanism that takes care of, among other things, hooking up services in controllers by matching parameter names with globally registered functions. I still haven't figured out what the practical benefit of that is, as opposed to having global JavaScript objects yourself....perhaps in really large applications where different clusters of the overall module dependency graph use the same names for different things.<br />
<br />
<b>Beginnings of a Data Service</b><br />
<br />
One of the design goals in this project is to minimize the amount of code handling CRUD data operations. After all, CRUD is pretty standard and we are working with a schema-less flexible database. So we should be able to do CRUD on any kind of structured object we want without having to predefine its structure. In all honesty, I'm not sure this will be as easy as it sounds. As mentioned before, the main difficulty is the security access rules. It's certainly doable, but we shall see what kind of complexities it leads to in the subsequent iterations. For now, I've created a small class called <i>DataService </i>that allows you to perform a simple JSON pattern lookup as well as all CRUD operations on any entity identified by its HyperGraphDB handle.<br />
<br />
One can experiment with this interface by making <a href="http://api.jquery.com/jQuery.ajax/" target="_blank">$.ajax</a> calls in the browser's REPL. The interface is certainly going to evolve and change, but here are a couple of calls you can try out. I've made the '$http' service available as a global variable:<br />
<br />
<pre><code>$http.post("/rest/data/entity", {entity:'story', content:'My story starts with....'})
$http.get("/rest/data/list?pattern=" + JSON.stringify({entity:'story'})).success(function(A) { console.log(A); })
</code></pre>The above creates a new entity in the DB and then retrieves it via a query for all entities of that type (i.e. entity:'story'). You can also play around with <i>$http.put</i> and <i>$http.delete</i>.<br />
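On the server side, the endpoints behind those calls are roughly of the following shape. This is a hedged Java sketch: only the <i>/rest/data/entity</i> and <i>/rest/data/list</i> paths come from the calls above, while the method names and bodies are illustrative assumptions rather than the project's actual <i>DataService</i> code:<br />
<pre><code>import javax.ws.rs.*;
import mjson.Json;
import mjson.hgdb.HyperNodeJson;

@Path("/data")
@Produces("application/json")
@Consumes("application/json")
public class DataService
{
    private final HyperNodeJson db;
    public DataService(HyperNodeJson db) { this.db = db; }

    @POST
    @Path("/entity")
    public Json create(Json entity)
    {
        // Store whatever JSON the client sent, sharing (asserting) nested components.
        db.addTopLevel(entity);
        return Json.object("ok", true);
    }

    @GET
    @Path("/list")
    public Json list(@QueryParam("pattern") String pattern)
    {
        // The client passes a JSON.stringify'ed pattern, e.g. {"entity":"story"}.
        // Json.make is used here just to wrap whatever collection getAll returns.
        return Json.object("ok", true,
                           "results", Json.make(db.getAll(Json.read(pattern))));
    }
}
</code></pre>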
<br />
<b>Conclusion</b><br />
<b><br />
</b> All right, this concludes the 2nd iteration of eValhalla. We've implemented a big part of what we'd need for user management. We've explored AngularJS as a viable UI framework and we've laid the groundwork of a data-centered REST service. To get it, follow the same steps as before, but check out the phase2 Git tag instead of phase1:<br />
<br />
<br />
<ol style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.142857551574707px; line-height: 10.285714149475098px;"><li style="margin: 0px 0px 0.25em; padding: 0px;">git clone https://github.com/publicvaluegroup/evalhalla.git</li>
<li style="margin: 0px 0px 0.25em; padding: 0px;">cd evalhalla</li>
<li style="margin: 0px 0px 0.25em; padding: 0px;">git checkout phase2</li>
<li style="margin: 0px 0px 0.25em; padding: 0px;">sbt</li>
<li style="margin: 0px 0px 0.25em; padding: 0px;">run</li>
</ol><br />
<br />
<b>Coming Up...</b><br />
<br />
<div>On the next iteration, we will do some data modeling and define the main entities of our application domain. We will also implement a portion of the UI dealing with submission and listing of stories. </div></div></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1980461574999551012.post-1733049132954858252012-11-04T13:21:00.000-05:002012-11-04T13:32:45.253-05:00HyperGraphDB 1.2 Final Released<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://www.kobrix.com/">Kobrix Software</a> is pleased to announce the release of <a href="http://hypergraphdb.org/" target="_blank">HyperGraphDB</a> 1.2 Final.<br />
<br />
Several bugs were found and corrected during the beta testing period, most notably having to do with indexing.<br />
<br />
<a href="http://www.hypergraphdb.org/downloads" target="_blank">Go directly to the download page.</a><br />
<br />
HyperGraphDB is a general purpose, free open-source data storage mechanism. Geared toward modern applications with complex and evolving domain models, it is suitable for semantic web, artificial intelligence, social networking or regular object-oriented business applications.<br />
This release contains numerous bug fixes and improvements over the previous 1.1 release. A fairly complete list of changes can be found at the <a href="http://www.hypergraphdb.org/learn?page=ReleaseNotesOnePointTwo&project=hypergraphdb">Changes for HyperGraphDB, Release 1.2</a> wiki page.<br />
<ol>
<li>Introduction of a new HyperNode interface together with several implementations, including subgraphs and access to remote database peers. The ideas behind it are documented in the blog post <a href="http://www.hypergraphdb.org/blog?entry=http://www.blogger.com/feeds/1980461574999551012/posts/default/8479247617714194261">HyperNodes Are Contexts</a>.</li>
<li>Introduction of a new interface HGTypeSchema and generalized mappings between arbitrary URIs and HyperGraphDB types.</li>
<li>Implementation of storage based on the BerkeleyDB Java Edition (many thanks to Alain Picard and Sebastian Graf!). This version of BerkeleyDB doesn't require native libraries, which makes it easier to deploy and, in addition, performs better for smaller datasets (under 2-3 million atoms).</li>
<li>Implementation of parameterized pre-compiled queries for improved query performance. This is documented in the <a href="http://www.hypergraphdb.org/blog?entry=http://www.blogger.com/feeds/1980461574999551012/posts/default/7489327504564736467">Variables in HyperGraphDB Queries</a> blog post.</li>
</ol>
HyperGraphDB is a Java based product built on top of the <a href="http://en.wikipedia.org/wiki/Berkeley_DB">Berkeley DB</a> storage library.<br />
<div>
<br />
Key Features of HyperGraphDB include:</div>
<ul>
<li>Powerful data modeling and knowledge representation.</li>
<li>Graph-oriented storage.</li>
<li>N-ary, higher order relationships (edges) between graph nodes.</li>
<li>Graph traversals and relational-style queries.</li>
<li>Customizable indexing.</li>
<li>Customizable storage management.</li>
<li>Extensible, dynamic DB schema through custom typing.</li>
<li>Out of the box Java OO database.</li>
<li>Fully transactional and multi-threaded, MVCC/STM.</li>
<li>P2P framework for data distribution.</li>
</ul>
<div>
In addition, the project includes several practical domain specific components for semantic web, reasoning and natural language processing. For more information, documentation and downloads, please visit the <a href="http://www.hypergraphdb.org/">HyperGraphDB Home Page</a>.<br />
<br />
Many thanks to all who supported the project and actively participated in testing and development!</div>
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1980461574999551012.post-2599857844484665622012-10-25T01:28:00.000-04:002012-10-25T14:13:08.073-04:00RESTful Services in Java - a Busy Developer's Guide<div dir="ltr" style="text-align: left;" trbidi="on">The Internet doesn't lack expositions on REST architecture, RESTful services, and their implementation in Java. But, here is another one. Why? Because I couldn't find something concise enough to point readers of the <a href="http://kobrix.blogspot.com/2012/10/evalhalla-kick-off.html" target="_blank">eValhalla blog series</a>.<br />
<div><br />
</div><div><b>What is REST?</b></div><div>The acronym stands for <i>Representational State Transfer</i>. It refers to an architectural style (or pattern) thought up by one of the main authors of the HTTP protocol. Don't try to infer what the phrase "representational state transfer" could possibly mean. It sounds like there's some transfer of state that's going on between systems, but that's a bit of a stretch. Mostly, there's transfer of resources between clients and servers. The clients initiate requests and get back responses. The responses are resources in some standard media type such as XML or JSON or HTML. But, and that's a crucial aspect of the paradigm, the interaction itself is stateless. That's a major architectural departure from the classic client-server model of the 90s. Unlike classic client-server, there's no notion of a client session here. </div><div><br />
</div><div>REST is offered not as a protocol, but as an architectural paradigm. However, in reality we are pretty much talking about HTTP, of which REST is an abstraction. The core aspects of the architecture are (1) resource identifiers (i.e. URIs); (2) different possible representations of resources, or internet media types (e.g. application/json); (3) CRUD operations support for resources via the HTTP methods GET, PUT, POST and DELETE. </div><div><br />
</div><div>Resources are in principle decoupled from their identifiers. That means the environment can deliver a cached version or it can load balance somehow to fulfill the request. In practice, we all know URIs are actually addresses that resolve to particular domains so there's at least that level of coupling. In addition, resources are decoupled from their representation. A server may be asked to return HTML or XML or something else. There's content negotiation going on where the server may offer the desired representation or not. The CRUD operations have constraints on their semantics that may or may not appear obvious to you. The GET, PUT and DELETE operations require that a resource be identified, while POST is supposed to create a new resource. The GET operation must not have side-effects. So all other things being equal, one should be able to invoke GET many times and get back the same result. PUT updates a resource, DELETE removes it and therefore they both have side-effects just like POST. On the other hand, just like GET, PUT may be repeated multiple times always to the same effect. In practice, those semantics are roughly followed. The main exception is the POST method which is frequently used to send data to the server for some processing, but without necessarily expecting it to create a new resource. </div><div><br />
</div><div>Implementing RESTful services revolves around implementing those CRUD operations for various resources. This can be done in Java with the help of a Java standard API called JAX-RS.</div><div><br />
</div><div><b>REST in Java = </b><b>JAX-RS = JSR 311</b></div><div>In the Java world, when it comes to REST, we have the <i>wonderful</i> JAX-RS. And I'm not being sarcastic! This is one of those technologies that the Java Community Process actually got right, unlike so many other screw ups. The API is defined as JSR 311 and it is at version 1.1, with work on version 2.0 under way. </div><div><br />
The beauty of JAX-RS is that it is almost entirely driven by annotations. This means you can turn almost any class into a RESTful service. You can simply turn a POJO into a REST endpoint by annotating it with JSR 311 annotations. Such an annotated POJO is called a <i>resource class </i>in JAX-RS terms.<br />
<br />
Some of the JAX-RS annotations are at the class level, some at the method level and others at the method parameter level. Some are available both at class and method levels. Ultimately the annotations combine to make a given Java method into a RESTful endpoint accessible at an HTTP-based URL. The annotations must specify the following elements:<br />
<br />
<ul style="text-align: left;"><li>The relative path of the Java method - this is accomplished with <i>@Path</i> annotation.</li>
<li>What the HTTP verb is, i.e. what CRUD operation is being performed - this is done by specifying one of <i>@GET</i>, <i>@PUT</i>,<i> @POST</i> or <i>@DELETE</i> annotations. </li>
<li>The media type accepted (i.e. the representation format) -<i> @Consumes</i> annotation.</li>
<li>The media type returned - <i>@Produces</i> annotation.</li>
</ul><br />
The last two are optional. If omitted, all media types are assumed possible. Let's look at a simple example and take it apart:<br />
<br />
<pre><code>import javax.ws.rs.*;

@Path("/mail")
@Produces("application/json")
public class EmailService
{
@POST
@Path("/new")
public String sendEmail(@FormParam("subject") String subject,
@FormParam("to") String to,
@FormParam("body") String body) {
return "new email sent";
}
@GET
@Path("/new")
public String getUnread() {
return "[]";
}
@DELETE
@Path("/{id}")
public String deleteEmail(@PathParam("id") int emailid) {
return "delete " + emailid;
}
@GET
@Path("/export")
@Produces("text/html")
public String exportHtml(@QueryParam("searchString")
@DefaultValue("") String search) {
return "<table><tr>...</tr></table>";
}
}
</code></pre>The class defines a RESTful interface for a hypothetical HTTP-based email service. The top-level path <i>mail </i>is relative to the root application path. The root application path is associated with the JAX-RS <i>javax.ws.rs.core.Application</i> that you extend to plug into the runtime environment. Then we've declared with the <i>@Produces</i> annotation that all methods in that service produce JSON. This is just a class-default that one can override for individual methods like we've done in the <i>exportHtml </i>method. The <i>sendEmail</i> method defines a typical HTTP post where the content is sent as an HTML form. The intent here would be to post to <i>http://myserver.com/mail/new </i>a form for a new email that should be sent out. As you can see, the API allows you to bind each separate form field to a method parameter. Note also that you have a different method for the exact same path. If you do an HTTP get at <i>/mail/new</i>, the Java method annotated with <i>@GET</i> will be called instead. Presumably the semantics of get <i>/mail/new</i> would be to obtain the list of unread emails. Next, note how the path of the <i>deleteEmail</i> method is parameterized by an integer id of the email to delete. The curly braces indicate that "id" is actually a parameter. The value of that parameter is bound to whatever is annotated with <i> @PathParam("id")</i>. Thus if we do an HTTP delete at <i>http://myserver.com/mail/453</i> we would be calling the <i>deleteEmail </i>method with argument emailid=453. Finally, the <i>exportHtml </i>method demonstrates how we can get a handle on query parameters. When you annotate a parameter with <i>@QueryParam("x") </i>the value is taken from the HTTP query parameter named x. The <i>@DefaultValue</i> annotation provides a default in case that query parameter is missing. So, calling <i>http://myserver.com/mail/export?searchString=RESTful</i> will call the <i>exportHtml</i> method with a parameter search="RESTful".<br />
<br />
To expose this service, first we need to write an implementation of <i>javax.ws.rs.core.Application. </i>That's just a few lines:<br />
<pre><code>
import java.util.HashSet;
import java.util.Set;

public class MyRestApp extends javax.ws.rs.core.Application {
    public Set<Class<?>> getClasses() {
        Set<Class<?>> S = new HashSet<Class<?>>();
        S.add(EmailService.class);
        return S;
    }
}
</code></pre><br />
How this gets plugged into your server depends on your JAX-RS implementation. Before we leave the API, I should mention that there's more to it. You do have access to Request and Response objects. You have annotations to access other contextual information and metadata like HTTP headers, cookies etc. And you can provide custom serialization and deserialization between media types and Java objects. <br />
<br />
</div><div><div><b>RESTful vs Web Services</b></div><div>Web services (SOAP, WSDL) were heavily promoted in the past decade, but they didn't become as ubiquitous as their fans had hoped. Blame XML. Blame the rigidity of the XML Schema strong typing. Blame the tremendous overhead, the complexity of deploying and managing a web service. Or, blame the frequent compatibility nightmares between implementations. Reasons are not hard to find and the end result is that RESTful services are much easier to develop and use. But there is a flip side!</div></div><div><br />
</div><div>The simplicity of RESTful services means that one has less guidance in how to map application logic to a REST API. One of the issues is that instead of the rich programmatic types we have in programming languages, we only have Java primitives and media types. Fortunately, JAX-RS allows us to implement whatever conversions we want between actual Java method arguments and what gets sent on the wire. The other issue is the limited set of operations that a REST service can offer. While with web services you define the operation and its semantics just as in a general purpose programming language, with REST you're stuck with get, put, post and delete. So, free from the type mismatch nightmare, but tied to only 4 possible operations. This is not as bad as it seems if you view those operations as abstract, meta operations.<br />
<br />
</div><div>The key point when designing RESTful services, whether you are exposing existing application logic or creating a new one, is to think in terms of data resources. That's not so hard since most of what common business applications do is manipulate data. First, because every single thing is identified as a resource, one must come up with an appropriate naming scheme. Because URIs are hierarchical, it is easy to devise a nested structure like <i>/productcategory/productname/version/partno. </i>Second, one must decide what kinds of representations are to be supported, both in output and input. For a modern AJAX webapp, we'd mostly use JSON. I would recommend JSON over XML even in a B2B setting where servers talk to each other.<br />
<br />
Finally, one must categorize business operations as one of GET, PUT, POST and DELETE. This is probably a bit less intuitive, but it's just a matter of getting used to it. For example, instead of thinking about a "Checkout Shopping Cart" operation, think about POSTing a new order. Instead of thinking about a "Login User" operation, think about GETting an authentication token. In general, every business operation manipulates some data in some way. Therefore, every business operation can fit into this crude CRUD model. Clearly, most read-only operations should be a GET. However, sometimes you have to send a large chunk of data to the server, in which case you should use POST. For example, you could post some very time-consuming query that requires a lot of text to specify; the resource you are creating is then the query result. Another way to decide whether you should POST is to ask if you have a unique resource identifier: if not, use POST. Obviously, operations that cause some data to be removed should be a DELETE. The operations that "store" data are PUT and again POST. Deciding between those two is easy: use PUT whenever you are modifying an existing resource for which you have an identifier. Otherwise, use POST. </div><div><br />
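As a hedged illustration of this mapping exercise (a made-up order resource, not code from any of the implementations discussed below):<br />
<pre><code>@Path("/orders")
@Produces("application/json")
public class OrderService
{
    // "Checkout shopping cart" becomes: POST a new order resource.
    @POST
    @Consumes("application/json")
    public String checkout(String cartAsJson) { return "{\"orderId\": 42}"; }

    // "Change shipping address" becomes: PUT an update to an existing, identified order.
    @PUT
    @Path("/{id}/address")
    @Consumes("text/plain")
    public String changeAddress(@PathParam("id") int orderId, String address) {
        return "{\"ok\": true}";
    }

    // "Cancel order" becomes: DELETE the identified order resource.
    @DELETE
    @Path("/{id}")
    public String cancel(@PathParam("id") int orderId) {
        return "{\"cancelled\": " + orderId + "}";
    }
}
</code></pre>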
</div><div><b>Implementations & Resources</b></div><div>There are several implementations to choose from. Since I haven't tried them all, I can't offer specific comments. Most of them used to require a servlet container. The <a href="http://www.restlet.org/" target="_blank">Restlet</a> framework by Jerome Louvel never did, and that's why I liked it. Its documentation leaves something to be desired and if you look at its code, it's over-architected to a comical degree, but then what ambitious Java framework isn't. Another newcomer that is strictly about REST and seems lightweight is <a href="http://incubator.apache.org/wink" target="_blank">Wink</a>, an Apache incubated project. I haven't tried it, but it looks promising. And of course, one should not forget the reference implementation <a href="http://jersey.java.net/" target="_blank">Jersey</a>. Jersey has the advantage of being the most up-to-date with the spec at any given time. Originally it was dependent on Tomcat. Nowadays, it seems it can run standalone so it's on par with Restlet, which I mentioned first because that's what I have mostly used. </div><div><br />
</div><div>Here are some further reading resources, may their representational state be transferred to your brain and properly encoded from HTML/PDF to a compact and efficient neural net:</div><div><ol style="text-align: left;"><li>The <a href="http://en.wikipedia.org/wiki/Representational_State_Transfer" target="_blank">Wikipedia article</a> on REST is not in very good shape, but still a starting point if you want to dig deeper into the conceptual framework. </li>
<li>Refcard from Dzone.com: <a href="http://refcardz.dzone.com/refcardz/rest-foundations-restful#refcard-download-social-buttons-display">http://refcardz.dzone.com/refcardz/rest-foundations-restful#refcard-download-social-buttons-display</a> </li>
<li><a href="http://incubator.apache.org/wink/1.1.2/Apache_Wink_User_Guide.pdf" target="_blank">Wink's User Guide</a> seems well written. Since it's an implementation of JAX-RS, it's a good documentation of that technology.</li>
<li><a href="http://java.dzone.com/articles/putting-java-rest">http://java.dzone.com/articles/putting-java-rest</a>: A fairly good show-and-tell introduction to the JAX-RS API, with a link in there to a more in-depth description of REST concepts by the same author. Worth the read. </li>
<li> <a href="http://jcp.org/en/jsr/detail?id=311">http://jcp.org/en/jsr/detail?id=311</a>: The official JSR 311 page. Download the specification and API Javadocs from there.</li>
<li><a href="http://jsr311.java.net/nonav/javadoc/index.html">http://jsr311.java.net/nonav/javadoc/index.html</a>: Online access of JSR 311 Javadocs.</li>
</ol><div>If you know of something better, something nice, please post it in a comment and I'll include it in this list.</div></div><div><br />
</div><div>PS: I'm curious if people start new projects with Servlets, JSP/JSF these days? I would be curious as to what the rationale would be to pick those over AJAX + RESTful services communication via JSON. As I said above, this entry is intended to help readers of the <a href="http://kobrix.blogspot.com/2012/10/evalhalla-kick-off.html" target="_blank">eValhalla blogs series</a> which chronicles the development of the eValhalla project following precisely the AJAX+REST model </div></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1980461574999551012.post-38519011013201934552012-10-20T14:38:00.001-04:002012-12-22T14:08:33.518-05:00eValhalla Setup<div dir="ltr" style="text-align: left;" trbidi="on">[Previous in this series: <a href="http://hypergraphdb.org/blog?entry=http://www.blogger.com/feeds/1980461574999551012/posts/default/7744992131345054709" target="_blank">eValhalla Kick Off</a>, Next: <a href="http://hypergraphdb.org/blog?entry=http://www.blogger.com/feeds/1980461574999551012/posts/default/7597330130641937181">eValhalla User Management</a>]<br />
<br />
The first step in eValhalla after the <a href="http://kobrix.blogspot.com/2012/10/evalhalla-kick-off.html" target="_blank">official kick off</a> is to set up a development environment with all the selected technologies. That's the goal for this iteration. I'll quickly go through the process of gathering the needed libraries and implement a simple login form that ties everything together.<br />
<br />
<b>Technologies</b><br />
<br />
Here are the technologies for this project:<br />
<ol style="text-align: left;"><li>Scala programming language - I had a dilemma. Java has a much larger user base and therefore should have been the language of choice for a tutorial/promotional material on HGDB and JSON storage with it. However, this is actually a serious project to go live eventually and I needed an excuse to code up something more serious with Scala, and Scala has enough accumulated merits, so Scala it is. However, I will show some <b>Java code as well</b>, just in the form of examples, equivalent to the main code.</li>
<li><a href="http://www.hypergraphdb.org/" target="_blank">HyperGraphDB</a> with <a href="http://sharegov.org/mjson/" target="_blank">mJson</a> storage - that's a big part of my motivation to document this development. I think <a href="http://www.hypergraphdb.org/learn?page=Json&project=hypergraphdb" target="_blank">HGDB-mJson</a> are a really neat pair and more people should use them to develop webapps. </li>
<li><a href="http://www.restlet.org/" target="_blank">Restlet framework</a> REST - this is one of very few implementations of JSR 311, that is sort of lightweight and has some other extras when you need them. </li>
<li><a href="http://jquery.org/" target="_blank">jQuery</a> - That's a no brainer.</li>
<li><a href="http://angularjs.org/" target="_blank">AngularJS</a> - Another risky choice, since I haven't used this before. I've used KnockoutJS and Require.js, both great frameworks and well-thought out. I've done some ad hoc customization of HTML tags, tried various template engines, AngularJS promises to give me all of those in a single library. So let's give it a shot.</li>
</ol><div><b>Getting and Running the Code</b></div><div><br />
</div><div>Before we go any further, I urge you to get, build and run the code. Besides Java and Scala, I encourage you to get a <a href="http://git-scm.com/" target="_blank">Git</a> client (Git is now supported on Windows as well), and you need the <a href="http://www.scala-sbt.org/" target="_blank">Scala Build Tool (SBT)</a>. Then, on a command console, issue the following commands:</div><div><ol style="text-align: left;"><li>git clone https://github.com/publicvaluegroup/evalhalla.git</li>
<li>cd evalhalla</li>
<li>git checkout phase1</li>
<li>sbt</li>
<li>run</li>
</ol></div><div>Note the 3rd step of checking out the <i>phase1</i> Git tag - every blog entry is going to be a separate development phase, so you can always get the state of the project at a particular blog entry. If you don't have Git, you can download an archive from:<br />
<br />
<a href="https://github.com/publicvaluegroup/evalhalla/zipball/phase1">https://github.com/publicvaluegroup/evalhalla/zipball/phase1</a><br />
<br />
All of the above commands will take a while to execute the first time, especially if you don't have SBT yet. But at the end you should see something like this on your console:</div><br />
<pre><code>[info] Running evalhalla.Start
No config file provided, using defaults at root /home/borislav/evalhalla
checkpoint kbytes:0
checkpoint minutes:0
Oct 18, 2012 12:01:01 AM org.restlet.engine.connector.ClientConnectionHelper start
INFO: Starting the internal [HTTP/1.1] client
Oct 18, 2012 12:01:01 AM org.restlet.engine.connector.ServerConnectionHelper start
INFO: Starting the internal [HTTP/1.1] server on port 8182
Started with DB /home/borislav/evalhalla/db
</code></pre><br />
<div>and you should have a running local server accessible at http://localhost:8182. Hit that URL, type in a username and a password and hit login.</div><div><br />
<h4>Architectural Overview</h4><div>The architecture is made up of a minimal set of REST services that essentially offer user login and access-controlled data manipulation to a JavaScript client-side application. The key will be to come up with an access policy that deals gracefully with a schema free database.<br />
<br />
The data itself consists of JSON objects stored as a hypergraph using <a href="http://www.hypergraphdb.org/learn?page=Json&project=hypergraphdb" target="_blank">HGDB-mJson</a>. From the client side we can create new objects and store them. We can then query for them or delete them. So it's a bit like the old client-server model from the 90s. HyperGraphDB supports strongly typed data, but we won't be taking advantage of that. Instead, each top-level JSON object will have a special property called <i>entity </i>that will contain the type of the database entity as a string. This way, when we search for all users for example, we'll be searching for all JSON objects with property <i>entity="user"</i>.<br />
<br />
There are many reasons to go for REST+AJAX rather than, say, servlets. I hardly feel the need to justify it - it's stateless, you don't have to deal with dispatching, you just design an API, more responsive, we're in 2013 soon after all. The use of JSR 311 allows us to switch server implementations easily. It's pretty well-designed: you annotate your classes and methods with the URI paths they must be invoked for. Then a method's parameters can be bound either to portions of a URI, or to HTTP query parameters or form fields etc. <br />
<br />
I'm not sure yet what the REST services will be exactly, but the idea is to keep them very generic so the core could be just plugged for another webapp and writing future applications could be entirely done in JavaScript.<br />
<br />
</div><b>Project Layout</b><br />
<br />
The root folder contains the SBT build file <i>build.sbt</i>, a <i>project </i>folder with some plugin configurations for SBT and a <i>src </i>folder that contains all the code following Maven naming conventions which SBT adopts. The <i>src/main/html</i> and <i>src/main/javascript</i> folders contain the web application. When you run the server with all default options, that's where it serves up the files from. This way, you can modify them and just refresh the page. Then <i>src/main/scala</i> contains our program and <i>src/main/java</i> some code for Java programmers to look at and copy & paste. The main point of the Java code is really to help people that want to use all those cool libraries but prefer to code in Java instead.<br />
<br />
To load the project in Eclipse, use SBT to generate project files for you. Here's how:<br />
<ol style="text-align: left;"><li>cd evalhalla</li>
<li>sbt</li>
<li>eclipse</li>
<li>exit</li>
</ol><div>Then you'll have a .project and a .classpath file in the current directory, so you can go to your Eclipse and just do "Import Project". Make sure you've run the code before though, in order to force SBT to download all dependencies.</div><div><b><br />
</b> <b>Code Overview</b></div><br />
Ok, let's take a look at the code now, all under <i>src/main</i>. First, look at <i>html/index.html</i>, which gets loaded as the default page. It contains just a login form and the interesting part is the function <i>eValhallaController($scope, $http). </i>This function is invoked by AngularJS due to the <i>ng-controller</i> attribute in the body tag. It provides the data model of the HTML underneath and also a login button event handler, all through the <i>$scope</i> parameter. The properties are associated with HTML elements via <i>ng-model</i> and buttons with functions via <i>ng-click</i>. An independent tutorial on AngularJS, one of few since it's pretty new, can be found <a href="http://www.yearofmoo.com/2012/08/use-angularjs-to-power-your-web-application.html" target="_blank">here</a>.<br />
<br />
The <i>doLogin</i> posts to <i>/rest/user/login. </i>That's bound to the <i>evalhalla.user.UserService.authenticate</i> method (<a href="https://github.com/publicvaluegroup/evalhalla/blob/master/src/main/scala/evalhalla/user/package.scala" target="_blank">see user package</a>). The binding is done through the standard JSR 311 Java annotations, which also work in Scala. I've actually done an equivalent version of this class in Java at <i><a href="https://github.com/publicvaluegroup/evalhalla/blob/master/src/main/java/evalhalla/UserServiceJava.java" target="_blank">java/evalhalla/UserServiceJava</a></i>. A REST service is essentially a class where some of the public methods represent HTTP endpoints. An instance of such a class must be completely stateless, a JSR 311 implementation is free to create fresh instances for each request. The annotations work by deconstructing an incoming request's URI into relative paths at the class level and then at the method level. So we have the <i>@Path("/user")</i> annotation (or <i>@Path("/user1")</i> for the Java version so they don't conflict). Note the <i>@Consumes</i> and <i>@Produces</i> annotations at the class level that basically say that all methods in that REST service accept JSON content submitted and return back JSON results. Note further how the authenticate method takes a single Json parameter and returns a Json value. Now, this is <i><a href="http://www.sharegov.org/mjson/doc/mjson/Json.html" target="_blank">mjson.Json</a></i> and JSR 311 doesn't know about it, but we can tell it to convert to/from in/output stream. This is done in the <i><a href="https://github.com/publicvaluegroup/evalhalla/blob/master/src/main/java/evalhalla/JsonEntityProvider.java" target="_blank">java/evalhalla/JsonEntityProvider.java</a> </i>class (which I haven't ported to Scala yet). This entity provider and the REST services themselves are plugged into the framework at startup, so before examining the implementation of <i>authenticate, </i>let's look at the startup code.<br />
<br />
The <i><a href="https://github.com/publicvaluegroup/evalhalla/blob/master/src/main/scala/evalhalla/Start.scala" target="_blank">Start.scala</a></i> file contains the main program and the JSR 311 eValhalla application implementation class. The application implementation is only required to provide all REST services as a set of classes that the JSR 311 framework introspects for annotations and for the interfaces they implement. So the entity converter mentioned above, together with both the Scala and Java version of the user service are listed there. The main program itself contains some boilerplate code to initialize the Restlet framework and asks it to serve up some files from the html and javascript folders and it also attaches the JSR 311 REST application under the '<i>rest' </i>relative path.<br />
<br />
An important line in main is <i>evalhalla.init()</i>. This initializes the <i>evalhalla</i> package object defined in <i>scala/evalhalla/package.scala</i>. This is where we put all application global variables and utility methods. This is also where we initialize the HyperGraphDB instance. Let's take a closer look. First, configuration is optionally provided as a JSON formatted file, the only possible argument to the main program. All properties of that JSON are optional and have sensible defaults. With respect to deployment configuration, there are two important locations: the database location and the web resources location. The database location, specified with <i>dbLocation,</i> is by default taken to be <i>db </i>under the working directory, from where the application is run. So for example if you've followed the above instructions to run the application from the SBT command prompt for the first time, you'd have a brand new HyperGraphDB instance created under your <i>EVALHALLA_HOME/db</i>. The web resources served up (html, javascript, css, images) are configured with <i>siteLocation</i>, the default being <i>src/main</i>, so you can modify sources and refresh. So here is how the database is created. You should be able to easily follow the Scala code even if you're mainly a Java programmer.<br />
<br />
<pre><code>val hgconfig = new HGConfiguration()
hgconfig.getTypeConfiguration().addSchema(new JsonTypeSchema())
graph = HGEnvironment.get(config.at("dbLocation").asString(), hgconfig)
registerIndexers
db = new HyperNodeJson(graph)
</code></pre>
<div><br />
</div><div>Note that we are adding a <i>JsonTypeSchema</i> to the configuration before opening the database. This is important for the mJson storage implementation that we are mostly going to rely on. Then we create the graph database, create indices (for now just an empty stub) and last but not least create an mJson storage view on the graph database - a <a href="http://www.hypergraphdb.org/docs/apps/mjson/mjson/hgdb/HyperNodeJson.html" target="_blank">HyperNodeJson</a> instance. Please take a moment to go through the wiki page on <a href="http://www.hypergraphdb.org/learn?page=Json&project=hypergraphdb" target="_blank">HGDB-mJson</a>. The <i>graph</i> and <i>db</i> variables above are global variables that we will be accessing from everywhere in our application. </div><div><br />
</div><div>Some other points of interest here are the utility methods:</div><div><br />
</div><div><div> def ok():Json = Json.`object`("ok", True);</div><div> def ko(error:String = "Error occured") = Json.`object`("ok", False, "error", error);</div></div><div><br />
</div><div>Those are global as well and offer some standard result values from REST services that the client side may rely on. Whenever everything went well on the server, it returns an ok() object that has a boolean true <i>ok </i>property. If something went wrong, the <i>ok </i>boolean of the JSON returned by a REST call is false and the <i>error </i>property provides an error message. Any other relevant data, success or failure, is embedded with those <i>ok</i> or <i>ko</i> objects. </div><div><br />
</div><div>Lastly, it is common to wrap pieces of code in transactions. After all, we are developing a database-backed application and we want to take full advantage of the ACID capabilities of HyperGraphDB. Scala makes this particularly easy because it supports closures. So we have yet another global utility method that takes a closure and runs it as a HGDB transaction:</div><div><br />
</div><div><div> def transact[T](code:Unit => T) = {</div><div> try{</div><div> graph.getTransactionManager().transact(new java.util.concurrent.Callable[T]() {</div><div> def call:T = {</div><div> return code()</div><div> }</div><div> });</div><div> }</div><div> catch { case e:scala.runtime.NonLocalReturnControl[T] => e.value}</div><div> }</div></div><div><br />
</div><div>This will always create a new transaction. Because BerkeleyDB JE, which is the default storage engine as of HyperGraphDB 1.2, doesn't support nested transactions, one must make sure that <i>transact</i> is not called within another transaction. So when we want a transaction but would be happy to run inside an enclosing top-level one, we can call another global utility function: <i>ensureTx, </i>which behaves like <i>transact</i> except it won't create a new transaction if one is already in effect.</div><div><br />
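For Java programmers following along, calling into the same transaction machinery without the closure sugar looks roughly like this (a sketch only; the lookup inside is just filler):<br />
<pre><code>Json result = graph.getTransactionManager().transact(new java.util.concurrent.Callable<Json>() {
    public Json call() {
        // Any database work placed here runs atomically.
        Json profile = db.retrieve(Json.object("entity", "user",
                                               "email", "someone@example.com"));
        return profile == null ? Json.object("ok", false) : Json.object("ok", true);
    }
});
</code></pre>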
</div><div>Ok, armed with all these preliminaries, we are now able to examine the <i>authenticate</i> method:</div><br />
<div> @POST</div><div><div> @Path("/login")</div><div> def authenticate(data:Json):Json = {</div><div> return evalhalla.transact(Unit => {</div><div> var result:Json = ok();</div><div> var profile = db.retrieve(jobj("entity", "user", </div><div> "email", data.at("email").asString().toLowerCase()))</div><div> if (profile == null) { // automatically register user for now </div><div> val newuser = data.`with`(jobj("entity", "user"));</div><div> db.addTopLevel(newuser);</div><div> }</div><div> else if (!profile.is("password", data.at("password")))</div><div> result = ko("Invalid user or password mismatch.");</div><div> result; </div><div> });</div><div> }</div></div><div><br />
</div><div>The <i>@POST</i> annotation means that this endpoint will be matched only for an HTTP post method. First we do a lookup for the user profile. We do this by pattern matching. We create a Json object that first identifies that we are looking for an object of type user by setting the <i>entity </i>property to "user". Then we provide another property, the user's email, which we know is supposed to be unique so we can treat it as a primary key. However, note that neither HyperGraphDB nor its Json storage implementation provides any notion of a primary key other than HyperGraphDB atom handles. The <i>HyperNodeJson.retrieve</i> method returns only the first object matching the pattern. If you want an array of all objects matching the pattern use <i>HyperNodeJson.getAll.</i> Note the '<i>jobj</i>' method call in there: this is a rename in the import section of the <i>Json.object</i> factory method. It is necessary because in Scala <i>object </i>is a keyword. Another way to use a keyword as a method name in Scala, besides an import rename, is to wrap it in backquotes ` as is done with <i>data.`with`</i> above, which is basically an append operation, merging the properties of one Json object into another. The <i>db.addTopLevel</i> is explained in the <a href="http://www.hypergraphdb.org/learn?page=Json&project=hypergraphdb" target="_blank">HGDB-mJson</a> wiki. Also, you may want to refer to the <a href="http://www.sharegov.org/mjson/doc/" target="_blank">mJson API Javadoc</a>. One last point though about the structure of the <i>authenticate</i> method: there are no local returns. The <i>result</i> variable contains the result and it is written as the last expression of the function and therefore returned as the result. I like local returns actually (i.e. return statement in the middle of the method following if conditions or within loops or whatever), but the way Scala implements them is by throwing a RuntimeException. However, this exception gets caught inside the HyperGraphDB transaction which has a catch all clause and treats exceptions as true error conditions rather than a control flow mechanism. This can be fixed inside HyperGraphDB, but avoiding local returns is not such a bad coding practice anyway.</div><div><br />
</div><b>Final Comments</b><br />
<br />
</div><div>Scala is new to me so take my Scala coding style with a grain of salt. Same goes with AngularJS. I use Eclipse and the <a href="http://scala-ide.org/" target="_blank">Scala IDE</a> plugin from update site http://download.scala-ide.org/releases-29/stable/site (http://scala-ide.org/download/nightly.html#scala_ide_helium_nightly_for_eclipse_42_juno for Eclipse 4.2). Some of the initial Scala code is translated from equivalent Java code from other projects. If you haven't worked with Scala, I would recommend giving it a try. Especially if, like me, you came to Java from a richer industrial language like C++ and had to give up a lot of expressiveness.<br />
<br />
I'll resist the temptation to make this into an in-depth tutorial of everything used to create the project. I'll say more about whatever doesn't feel that obvious and give pointers, but mostly I'm assuming that the reader is browsing the relevant docs alongside reading the code presented here. This blog is mainly a guide.<br />
<br />
In the next phase, we'll do the proverbial user registration and put in place some security mechanisms.</div></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1980461574999551012.post-77449921313450547092012-10-09T23:01:00.003-04:002012-11-01T01:06:28.740-04:00eValhalla Kick Off<div dir="ltr" style="text-align: left;" trbidi="on">
[Next in this series <a href="http://hypergraphdb.org/blog?entry=http://www.blogger.com/feeds/1980461574999551012/posts/default/3851901101320193455" target="_blank">eValhalla Setup</a>]<br />
<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
As promised in <a href="http://www.hypergraphdb.org/blog?entry=http://www.blogger.com/feeds/1980461574999551012/posts/default/7424531188775286600" target="_blank">this previous blog post</a>, I will now write a tutorial on using the <a href="http://www.hypergraphdb.org/learn?page=Json&project=hypergraphdb" target="_blank">mJson-HGDB</a> backend for implementing REST APIs and web apps. This will be more extensive than originally hinted at. I got involved in a side project to implement a web site where people can post information about failed government projects. This should be pretty straightforward so a perfect candidate for a tutorial. I decided to use that opportunity to document the whole development in the form of blog series, so this will include potentially useful information for people not familiar with some of the web 2.0 libraries such as jQuery and AngularJS which I plan to use. All the code will be available in GitHub. I am actually a Mercurial user, something turned me off from Git when I looked at it before (perhaps just its obnoxious author), but I decided to use this little project as an opportunity to pick up a few new technologies. The others will be Scala (instead of Java) and AngularJS (instead of Knockoutjs).<br />
<h2>
<br />
</h2>
<h2>
About eValhalla</h2>
<em>Valhalla</em> - the hall of Odin into which the souls of heroes slain in battle and others who have died bravely are received.<br />
<br />
The aim is to provide a forum for people involved in (mainly software) projects within government agencies to report anonymously on those projects' failures. Anybody can create a login, without providing much personal information and be guaranteed that whatever information is provided remains confidential if they so choose. Then they can describe projects they have insider information about and that can be of interest to the general public. Those projects could be anything from small scale, internal-use only, local government, to larger-scale publicly visible nation-level government projects. <br />
<br />
I won't go into a "mission statement" type of description here. You can see it as a "wiki leaks" type transparency effort, except we'd be dealing with information that is in the public domain, but that so far hasn't had an appropriate outlet. Or you can see it as a fun place to let people vent their frustrations about mis-management, abuses, bad decisions, incompetence etc. Or you can see it as a means to learn from experience in one particular type of software organization: government IT departments. And those are a unique breed. What's unique? Well, the hope is that such an online outlet will make that apparent. <br />
<h2>
Requirements Laundry List</h2>
Here is the list of requirements, verbatim as sent to me by the project initiator: <br />
<ul style="font-family: arial;">
<li>enter project title</li>
<li>enter project description</li>
<li>enter project narrative</li>
<li>enter location</li>
<li>tag with failure types</li>
<li>tag with subject area/industry sector</li>
<li>tag with technologies</li>
<li>enter contact info</li>
<li>enter project size</li>
<li>enter project time frame (year)</li>
<li>enter project budget</li>
<li>enter outcome (predefined)</li>
<li>add lessons learned</li>
<li>add pic to project</li>
<li>ability to comment on project</li>
<li>my projects dashboard (ability to add, edit, delete)</li>
<li>projects can be saved as draft and made public later</li>
<li>option to be anonymous when adding specific projects</li>
<li>ability to create profile (username, userpic, email, organization, location)</li>
<li>ability to edit and delete profile</li>
<li>administrator ability to feature projects on main page</li>
<li>search for projects based on above criteria and tags</li>
<li>administrator ability to review projects prior to them being published</li>
</ul>
Never mind that the initiator in question is currently pursuing a Ph.D. in requirements engineering. Those are a good enough start. We'll have to implement classic user management and then we have our core domain entity: a failed project, essentially a story decorated with various properties and tags, commented on. As usual, we'd expect requirements to change as development progresses, new ideas will pop up, old ideas will die and force refactorings etc. Never mind, I will maintain a log of those development activities and if you are following, do not expect anything described to be set in stone. <br />
<h2>
Architecture</h2>
Naturally, the back-end database will be HyperGraphDB with its support of plain JSON-as-a-graph storage as <a href="http://www.hypergraphdb.org/learn?page=Json&project=hypergraphdb" target="_blank">described here</a>. We will also make use of two popular JavaScript libraries: <a href="http://www.jquery.org/" target="_blank">jQuery</a> and <a href="http://angularjs.org/" target="_blank">AngularJS</a> as well as whatever related plugins come in handy. <br />
<br />
Most of the application logic will reside on the client side. In fact, we will be implementing as much business logic in JavaScript as security concerns allow us. The server will consist entirely of REST services based on the <a href="http://jcp.org/en/jsr/detail?id=311" target="_blank">JSR 311</a> standard, so they can easily run on any server software supporting that standard. To make this a more or less self-contained tutorial, I will describe most of those technologies and standards along the way, at least to the extent that they are used in our implementation. That is, I will describe as much as needed to help you understand the code. However, you need familiarity with Java and JavaScript in order to follow.<br />
<br />
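To give a rough idea of the shape of those services, here is a minimal JSR 311 (JAX-RS) resource sketch, shown in Java for brevity; the class name, path and returned payload are illustrative placeholders rather than actual eValhalla code:<br />
<pre style="font-size: 14px;"><code>
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Hypothetical resource: path, class name and payload are placeholders only.
@Path("project")
@Produces(MediaType.APPLICATION_JSON)
public class ProjectResource
{
    @GET
    @Path("{id}")
    public String getProject(@PathParam("id") String id)
    {
        // In the real service this would look up the project atom in HyperGraphDB
        // and return its JSON form; here we simply echo the id back.
        return "{\"id\":\"" + id + "\"}";
    }
}
</code></pre>
<br />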
The data model will be schema-less, yet our JSON will be structured: we will document whenever certain properties are expected to be present, and we will follow conventions that help us navigate the data model more easily. <br />
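For instance, a project entry might be built up with mJson along these lines; this is just a sketch assuming mJson's fluent <i>object</i>/<i>array</i>/<i>set</i>/<i>at</i> calls, and the property names are placeholders that will almost certainly change as the requirements evolve:<br />
<pre style="font-size: 14px;"><code>
import mjson.Json;

// A hypothetical project entry: structured, but with no schema imposed by the store.
// Property names here are illustrative only.
Json project = Json.object()
    .set("type", "project")
    .set("status", "draft")
    .set("title", "Statewide permit tracking system")
    .set("tags", Json.array("procurement", "schedule overrun"))
    .set("budget", 1200000);

String title = project.at("title").asString();  // navigating by convention
</code></pre>
<br />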
<h2>
Final Words</h2>
So stay tuned. The plan is to start by setting up a development environment, then implement the user management part, then the submission of new projects (the "my projects" dashboard), followed by online interactions, search, administrative features to manage users, the home page, and so on.</div>
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1980461574999551012.post-40449498752276958072012-08-14T23:22:00.001-04:002012-08-17T15:32:56.753-04:00On Testing<p>More than a decade ago, I spent a year doing quality assurance at a big and successful smart card company. It was one of the more intellectually stimulating jobs I've had in the software industry and I ended up developing a scripting language, sadly long lost, to do test automation. Doing testing right can be much harder than writing the software being tested. Especially when you're after 100% quality, as is the case with software burned on the chips of millions of smart cards that would have to be thrown away in case of a serious bug discovered post-production. Companies aiming at this level of quality get ISO 900x certified to show off to clients. To get certified, they have to show a QA process that guarantees, to an acceptable degree of confidence, that products delivered work, that the organization is solid, that knowledge is preserved etc. etc. The interesting part that I'd like to share with you is the specific software QA approach. It did involve an obscene amount of both documentation and software artifacts that had to be produced in a very rigid formal setting, but the philosophy behind spec-ing out the tests was sound, practical and better than anything I've seen since. </p><p>Dijkstra famously said that testing can only prove the presence of errors, never the absence of errors. True that. To have a 100% guarantee that a program works, one would need to produce a mathematical proof of its correctness. Perhaps not even sufficient...as Knuth no less famously noted when sending a program to a colleague, the program must be used with caution because he had only proved it correct, never tested it. In either case, ultimately the goal is to gain confidence in the quality of a piece of software. We do that by presenting a strong, very, very convincing argument that the program works. </p><p>When discussing testing methodology, people generally talk about automated vs. manual testing, test-driven development where test cases are developed before the code, or "classic" testing where they are done after development, but rarely do I see people mindful of how tests should be specified. The term <em>test</em> itself is used rather ambiguously to mean the action of testing, or the specification, or the process, or the development phase. And in some contexts a <em>test </em>means a<em> test case</em>, which refers to an input data set, or it refers to an automated program (e.g. a JUnit test case). So let's agree that, whether you code it up or do it manually, a test case consists of a sequence of steps taken to interact with the software being tested and verify the output, with the goal of ensuring that said software behaves as expected. So how do you go about deciding what this sequence of steps should be? In other words, how do you gather test requirements? </p><p>Think of it in dialectical terms - you're trying to convince a skeptic that a program works correctly. First you'd have to agree on what it means for that program to work correctly. Well, they say, it must match the requirements. So you start by reading all the requirements and translating that last statement ("it must match the requirements") for each one of them into a corresponding set of test criteria. Naturally, the more detailed the requirements are, the easier that process is. In an agile setting, you might be translating user stories into test criteria. 
Let's have a simple running example:</p><p><span style="font-size: small;"><span style="text-decoration: underline;">Requirement:</span></span></p><p><span style="font-size: small;">The login form should include <em style="font-size: medium;">captcha</em> protection</span></p><p><span style="text-decoration: underline; font-size: small;">Test Criteria:</span></p><ul><li>C1 - the login form should display a string as an image that's hard to recognize by a program.</li><li>C2 - the login form should include an input field that must match the string in the captcha image for login to succeed. </li></ul><p>Notice how the test criteria explicitly state under what conditions one can say that the program works. One can list more criteria in further detail, stating what happens if the captcha doesn't match, what happens after n repeated failures, etc. Also, this should make it clear that test criteria are not actual tests. They are not something that can be executed (manually or automatically). In fact, they are to a test program what requirements are to the software being QA-ed. And as with conventional requirements, the clearer you are on your test criteria, the better chance you have of developing adequate tests. </p><p>The crucial point is that when you write a test case, you want to make sure that it has a well-defined purpose, that it serves as a demonstration that an actual test criterion has been met. And this is what's missing in 90% of testing efforts (well, to be sure, this is just anecdotal evidence). People write or perform tests simply for the sake of trying out things. Tests accumulate, and if you have a lot of them, it makes it look like you're in good shape. But that's not necessarily the case, because tests can only prove the presence of errors, not their absence. To convince your dialectical opponent of the absence of errors, given the agreed-upon list of criteria, you'd have to show how your tests prove that the criteria have been met. In other words, you want to ensure that all your test criteria have been covered by appropriate test cases - for each test criterion there is at least one test case that, when successful, shows that this criterion is satisfied. A convenient way to do that is to create a matrix where you list all your criteria in the rows and all your test cases in the columns, and checkmark a given cell whenever the test case covers the corresponding criterion, where "covers" means that if the test case succeeds one can be confident that the criterion is met. This implies that the test case itself will have to include all necessary verification steps. Continuing with our simple example, suppose you've developed a few test cases:</p><ul><li>T1 - test successful login</li><li>T2 - test failed login for all 3 fields: bad user, bad password, bad captcha</li><li>T3 - test captcha quality by running image recognition algos on the captcha</li></ul><table border="1"><tbody><tr><td> </td><td>T1</td><td>T2</td><td>T3</td><td>T4</td><td>T5</td><td>...</td><td>...</td></tr><tr><td>C1</td><td> </td><td> </td><td>X</td><td> </td><td> </td><td> </td><td> </td></tr><tr><td>C2</td><td>X</td><td>X</td><td> </td><td> </td><td> </td><td> </td><td> </td></tr><tr><td>...</td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr></tbody></table><p>A given test case may cover a certain aspect of the program, but you'd put a checkmark only if it actually verifies the criterion in question. For instance, T3 would be loading a login page, but it won't be testing actual login. 
Similarly, T1 and T2 can observe the captcha, but they won't evaluate its quality. It may appear a bit laborious as an approach. In the aforementioned company, this was all documented ad nauseam. Criteria were classified as "normal", "abnormal", "stress" and whatnot, reflecting different types of expected behaviors and possible execution contexts. Now, I did warn you - this was a QA process aimed at living up to ISO standards. And it did. But think about the information this matrix provides you. It is a full, detailed spec of your software. It is a full inventory of your test suite. It tells you what part of the program is being tested by what test. It shows you immediately if some criteria are not being covered by a test, or not covered enough. It shows you immediately if some criteria are being covered too much, i.e. if some tests are superfluous. When tests fail, it tells you exactly what behaviors of the software are not functioning properly. Recall that one of the main problems with automated testing is the explosion of code that needs to be written to achieve decent coverage. This matrix can go a long way toward controlling that code explosion by giving each test case a relatively unique purpose. Most importantly, the matrix presents a pretty good argument for the program's correctness - you can see at a glance both how correctness has been defined (the list of criteria) and how it is demonstrated (the list of tests cross-referenced with criteria).</p><p>Reading about testing, even from big industry names, I have frequently been disappointed by the lack of a systematic approach to test requirements. In practice it's even worse. Developers, testers and business people in general have no idea what they are doing when testing. This includes agile teams where tests are sometimes supposed to constitute the specification of the program. That's plain wrong, first because it's code, and code is way too low-level to be understood by all stakeholders, hence it can't be a specification that can be agreed upon by different parties. Second, because usually the same people write both the tests and the program being tested, the same bugs sneak into both places and never get discovered, and the same possibly wrong understanding of the desired behavior is found in both places. So expressing the quality argument (i.e. with the imaginary dialectical adversary) simply in the form of test cases can't cut it. </p><p>That said, I wouldn't advocate following the approach outlined above verbatim and in full detail. But I would recommend keeping the mental picture of that Criteria x Tests matrix as a guide to what you're doing. And if you're building a regression test suite, and especially if some of the tests are manual, it might be worth your while spelling it out in the corporate wiki somewhere.</p><p>Cheers,</p><p>Boris</p>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-1980461574999551012.post-74245311887752866002012-06-17T21:16:00.001-04:002012-06-17T21:18:54.849-04:00JSON Storage in HyperGraphDB<p>JSON (http://json.org) came about as a convenient, better-than-XML data format that one can work with directly from JavaScript. About a year ago, I got a bit disappointed by the JSON libraries out there. JSON is so simple, yet all of them are so complicated, so verbose. Most of it comes from the culture of strong typing in Java and from the presumed use of JSON on the server as an intermediate representation to be mapped to Java beans, where the application domain is usually modeled. 
But JSON is powerful enough on its own if you are willing to trade off type safety for flexibility. So I wrote mJson (for "minimal" Json), a single-source-file library for working with JSON from Java. For a description, see these blog posts:</p><ul><li><a href="http://sharegov.blogspot.com/2011/06/json-library.html">http://sharegov.blogspot.com/2011/06/json-library.html</a></li><li><a href="http://sharegov.blogspot.com/2011/11/mjson-11-released.html">http://sharegov.blogspot.com/2011/11/mjson-11-released.html</a></li></ul><p>Using this library, I've been able to avoid creating the usual plethora of Java classes to work with my domain in several applications. Coding with it is pretty neat. One can pretty much view JSON as a general minimalistic, highly flexible modeling tool, a bit like s-expressions or any general enough abstract data structure (another example is first-order logic terms). </p><p>Now, there's a persistence layer for those JSON entities based on HyperGraphDB. Unlike "document-oriented" databases, the representation in HyperGraphDB is really graph-like, with all the expected consequences. It is described in the <a href="http://www.hypergraphdb.org/learn?page=Json&project=hypergraphdb">HyperGraphDB Json Wiki Page</a>; it is pretty stable, convenient and fun to work with. </p><p>If time permits, I will write a few posts in the form of a tutorial to show how to quickly build database-backed JSON REST services with mJson and HyperGraphDB. If you have a suggestion of an application domain to cover, please share it in a comment!</p><p>In the meantime, I invite you to give it a try! </p><p>Regards,</p><p>Boris</p>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-1980461574999551012.post-33883278833457785672012-06-11T00:42:00.001-04:002012-06-17T21:19:37.930-04:00HyperGraphDB 1.2 Beta now available<p><a href="http://www.kobrix.com/">Kobrix Software</a> is pleased to announce the release of HyperGraphDB version 1.2.</p><div>HyperGraphDB is a general purpose, free open-source data storage mechanism. Geared toward modern applications with complex and evolving domain models, it is suitable for semantic web, artificial intelligence, social networking or regular object-oriented business applications.</div><p>This release contains numerous bug fixes and improvements over the previous 1.1 release. A fairly complete list of changes can be found at the <a href="http://www.hypergraphdb.org/learn?page=ReleaseNotesOnePointTwo&project=hypergraphdb">Changes for HyperGraphDB, Release 1.2</a> wiki page.</p><ol><li>Introduction of a new HyperNode interface together with several implementations, including subgraphs and access to remote database peers. The ideas behind it are documented in the blog post <a href="http://www.hypergraphdb.org/blog?entry=http://www.blogger.com/feeds/1980461574999551012/posts/default/8479247617714194261">HyperNodes Are Contexts</a>.</li><li>Introduction of a new interface HGTypeSchema and generalized mappings between arbitrary URIs and HyperGraphDB types.</li><li>Implementation of storage based on the BerkeleyDB Java Edition (many thanks to Alain Picard and Sebastian Graf!). This version of BerkeleyDB doesn't require native libraries, which makes it easier to deploy and, in addition, performs better for smaller datasets (under 2-3 million atoms).</li><li>Implementation of parameterized pre-compiled queries for improved query performance. 
This is documented in the <a href="http://www.hypergraphdb.org/blog?entry=http://www.blogger.com/feeds/1980461574999551012/posts/default/7489327504564736467">Variables in HyperGraphDB Queries</a> blog post.</li></ol><p>HyperGraphDB is a Java based product built on top of the <a href="http://en.wikipedia.org/wiki/Berkeley_DB">Berkeley DB</a> storage library.</p><div>Key Features of HyperGraphDB include:</div><ul><li>Powerful data modeling and knowledge representation.</li><li>Graph-oriented storage.</li><li>N-ary, higher order relationships (edges) between graph nodes.</li><li>Graph traversals and relational-style queries.</li><li>Customizable indexing.</li><li>Customizable storage management.</li><li>Extensible, dynamic DB schema through custom typing.</li><li>Out of the box Java OO database.</li><li>Fully transactional and multi-threaded, MVCC/STM.</li><li>P2P framework for data distribution.</li></ul><div>In addition, the project includes several practical domain specific components for semantic web, reasoning and natural language processing. For more information, documentation and downloads, please visit the <a href="http://www.hypergraphdb.org/">HyperGraphDB Home Page</a>.</div>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-1980461574999551012.post-74893275045647364672012-05-20T02:05:00.001-04:002012-05-20T02:11:59.957-04:00Variables in HyperGraphDB Queries<div dir="ltr" style="text-align: left;" trbidi="on">
<strong>Introduction</strong><br />
A feature that has been requested on a few occasions is the ability to parameterize HyperGraphDB queries with variables, sort of like the <a href="http://docs.oracle.com/javase/6/docs/api/java/sql/PreparedStatement.html">JDBC PreparedStatement</a> allows. That feature has now been implemented with the introduction of <em>named variables. </em>Named variables can be used in place of regular condition parameters and can contain arbitrary values, whatever is expected by the condition. The benefits are twofold: increased performance, by avoiding recompilation of queries where only a few condition parameters change, and cleaner code with less query duplication. The performance increase from precompiled queries is quite tangible and shouldn't be neglected. Nearly all predefined query conditions support variables, except for the <em>TypePlusCondition</em>, which is expanded into a type disjunction during query compilation and therefore cannot be easily precompiled into a parameterized form. It must also be noted that some optimizations made during the compilation phase, in particular related to index usage, can't be performed when a condition is parameterized. A simple example of this limitation is a parameterized <em>TypeCondition - </em>indices are defined per type, so if the type is not known during query analysis, no index will be used.<br />
<br />
<strong>The API</strong><br />
The API is designed to fit the existing query condition expressions. A variable is created by calling the <em>hg.var</em> method and the result of that method passed as a condition parameter. For example:<br />
<pre style="font-size: 14px;"><code>
HGQuery<HGHandle> q = hg.make(HGHandle.class, graph).compile(
        hg.and(hg.type(typeHandle), hg.incident(hg.var("targetOfInterest"))));</code></pre>
<div>
</div>
This will create a query that looks for links of specific type and that point to a given atom, parameterized by the variable "targetOfInterest". Before running the query, one must set a value of the variable:<br />
<pre style="font-size: 14px;"><code>
q.var("targetOfInterest", targetHandle);
q.execute();
</code></pre>
<pre style="font-size: 14px;"></pre>
It is also possible to specify an initial value for a variable as the second argument of the <em>hg.var</em> method:
<br />
<pre style="font-size: 14px;"><code>
q = hg.make(HGHandle.class, graph).compile(
        hg.and(hg.type(typeHandle), hg.incident(hg.var("targetOfInterest", initialTarget))));
q.execute();
</code></pre>
Note the new <em>hg.make</em> static method: it takes the result type and the graph instance and returns a query object that doesn't have a condition attached to it yet. To attach a condition, one must call the <em>HGQuery.compile</em> method immediately after <em>make</em>. This version of <em>make</em> establishes a variable context attached to the newly constructed query. All query conditions constructed after the call to <em>make</em> and before the call to <em>compile</em> will implicitly assume their variables belong to that context. That is why, to avoid surprises, you should call <em>compile</em> right after <em>make</em>.<br />
<br />
Once a query is thus created, one can execute it many times by assigning new values to the variables through the query's <i>var</i> method as shown above.<br />
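For example, one might reuse the same compiled query across many target atoms roughly like this; it is a sketch only, where <i>targetsOfInterest</i> and <i>process</i> are hypothetical placeholders:<br />
<pre style="font-size: 14px;"><code>
// Sketch: the query q is compiled once and re-executed with a fresh binding each time.
for (HGHandle target : targetsOfInterest)
{
    q.var("targetOfInterest", target);
    HGSearchResult<HGHandle> rs = q.execute();
    try
    {
        while (rs.hasNext())
            process(rs.next());   // application-specific handling of each result
    }
    finally
    {
        rs.close();               // result sets hold cursor resources
    }
}
</code></pre>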
<br />
<strong>The Implementation</strong><br />
<div style="text-align: left;">
The requirement of calling <i>HGQuery.compile</i> right after the <i>make</i> method comes from the fact that the existing API constructs conditions in a static environment. The condition is fully constructed before it is passed to the compile method. So to make a variable context available during the condition construction process (the evaluation of the query condition expression), we do something relatively unorthodox here and attach that context (essentially a name/value map) as a thread-local variable in the <i>make </i>method. The alternative would have been to create a traversal of the condition expression that collects all variables used within it. Either that traversal process would have to know about all condition types, or HGQueryCondition would have had to be enhanced with some extra interfaces for "composite conditions" and "variable used within a condition". I didn't like either possibility. Having a two-step process of calling <i>hg.make(...).compile(....) </i>seems clean and concise enough.<br />
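In outline, the trick looks something like the following; this is a generic illustration of the pattern, not the actual HyperGraphDB internals:<br />
<pre style="font-size: 14px;"><code>
import java.util.HashMap;
import java.util.Map;

// Generic sketch of the thread-local context trick: a per-thread name/value map
// is opened when construction starts, consulted while the condition expression
// is evaluated, and closed when compilation finishes.
class VarContext
{
    private static final ThreadLocal<VarContext> current = new ThreadLocal<VarContext>();
    private final Map<String, Object> vars = new HashMap<String, Object>();

    static VarContext begin()                               // what make() would do
    {
        VarContext ctx = new VarContext();
        current.set(ctx);
        return ctx;
    }
    static VarContext active() { return current.get(); }    // what hg.var(...) would consult
    static void end() { current.remove(); }                  // what compile() would do

    void bind(String name, Object value) { vars.put(name, value); }
    Object lookup(String name) { return vars.get(name); }
}
</code></pre>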
<br />
Variables are implemented through a couple of new interfaces: <i>org.hypergraphdb.util.Ref</i> and <i>org.hypergraphdb.util.Var</i>. When a condition is created with a constant value (instead of a variable), the <i>org.hypergraphdb.util.Constant </i>implementation of the <i>Ref </i>interface is used. All conditions now contain a <i>Ref<T> </i>private member variable where before they contained a member variable of type T. I won't say more about those reference abstractions, except that one should avoid using them unless one is implementing a custom condition...they are considered internal.<br />
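For the curious, the gist of the abstraction is something like this (a simplified sketch, not the exact signatures in <i>org.hypergraphdb.util</i>):<br />
<pre style="font-size: 14px;"><code>
// A condition stores a Ref<T> instead of a plain T, so its parameter can be
// either a fixed constant or a late-bound variable.
interface Ref<T> { T get(); }

class Constant<T> implements Ref<T>
{
    private final T value;
    Constant(T value) { this.value = value; }
    public T get() { return value; }
}

class Var<T> implements Ref<T>
{
    private T value;
    public T get() { return value; }
    public void set(T value) { this.value = value; }
}
</code></pre>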
<br />
On a side note, this idea of creating a reference abstraction essentially implements the age-old strategy of introducing another level of indirection in order to solve a design problem. I've used this concept in many places now, including for things like caching and dealing with complex scopes in bean containers (Spring), so I spent some time trying to create a more general library around it. Originally I intended to create this library as a separate project and make use of it in HyperGraphDB. But I didn't come up with something satisfactory enough, so instead I created the couple of interfaces mentioned above. However, eventually those interfaces and classes may be replaced by a separate library and ultimately removed from the HyperGraphDB codebase.
</div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1980461574999551012.post-84792476177141942612011-01-01T13:30:00.000-05:002011-01-31T11:21:23.118-05:00HyperNodes Are Contexts<p>There are a couple of different items on <a href="http://www.hypergraphdb.org/" target="_blank">HyperGraphDB</a>'s road map that, upon a little reflection, seem to converge on a single abstraction. I would call that abstraction a hypernode, following the <em>hypernode model</em> described in the paper <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.64.4486&rep=rep1&type=pdf">A Graph-Based Data Model and its Ramifications</a>, though it remains to be seen how much of that original model applies in the HyperGraphDB context. The items are the following: (1) local proxies to remote database instances; (2) RAM-only atoms; (3) nested graphs. This post is about laying out the design space of how those features should be dealt with. Let's examine each of these future features in order.</p><p><strong>[1] Local DB Proxies</strong></p><div>It would be nice to be able to access a db instance on some remote machine as if it were a local db. Only local databases can currently be opened, and there is a good portion of the API, and of the way it is used, that could work all the same when the data is remote. One could imagine opening remote graphs like this:</div><div><br /></div><div><i>HyperGraph remoteGraph = HGEnvironment.get("peer:RemotePeerName");</i></div><div><br /></div><div>and then working with it as if it were on the local hard drive. This almost makes sense. Most API objects dealing with database access and management, such as the HyperGraph instance itself, the index manager, the type system and the low-level store, could be manipulated remotely with the exact same semantics. However, this would be a real stretch of their original intent. For example, access to low-level storage is meant to support type implementations, in particular predefined types. Would you want a type implementation at machine A to manage storage on machine B? Similarly for the index manager or the event manager - would you want machine A to manage indexing at machine B? A lot of the methods of the HyperGraph class itself are overloaded methods whose sole purpose is simplifying some API calls by delegating to a "full version" of a method. In other words, going that route, we would have to make the whole API RPC-ed, an ugly solution with a heavy workload and potentially overkill. Incidentally, I'm curious if there's a Java instrumentation tool that would load a class library but dynamically turn every object into a local proxy to a remote instance. Anybody know of such a tool? It should be doable and perhaps useful for many projects... I only found <a href="http://www.cs.purdue.edu/homes/hosking/papers/gpce09.pdf">this paper</a>, which is pretty new, and by the looks of the complexities and limitations involved, no wonder it hasn't been done before. Bottom line: remote access to a database instance should be carefully crafted as a separate abstraction that focuses on what's important, an abstraction that works at the atom level - CRUD operations, queries and traversals. A couple of design problems here:</div><div> (a) What would the interface to a remote instance look like?</div><div> (b) How to deal with atom serialization and differences in type systems? The latter problem is addressed to some extent in the current P2P framework.</div><div> (c) Does the client of a remote instance need a local database? 
The easy answer to that is "no, it shouldn't", but that might mean a much harder implementation!</div><div><br /></div><div><div><strong>[2] RAM-Only Atoms</strong></div><div><strong><br /></strong></div><div>In certain applications, it would be useful to work with a hypergraph structure that is not necessarily or entirely persisted on disk. For example, when HyperGraphDB is used as a convenient representation to apply some algorithms to, or when some data is transient and attached to a particular user session in a server-side environment. Or if one wants a client to an HGDB "server" as in the local db proxy described above. Hence it would be convenient to add atoms to the graph without storing them on disk. Such atoms will participate in the graph structure as full-fledged citizens - they would be indexed and queried in the normal way, except that they would only reside in the cache. From an implementation standpoint, the difficulty will be in merging disk and RAM data into a unified and coherent view. If RAM atoms are to be truly RAM only, database indices cannot be used to index them. Hence, RAM-only indices must be maintained and dynamically intersected with disk indices during querying. Moreover, it would only make sense to be able to perform queries restricted to atoms in main memory. </div></div><div><br /></div><div><strong>[3] Nested Graphs/Subgraphs</strong></div><div><strong><br /></strong></div><div>This is what the hypernode model is more or less about. A hypernode is a node in a graph that is itself a graph, composed of atomic nodes and possibly other hypernodes. The generalization is roughly analogous to the generalization of standard graph edges to hyperedges. Thus we have graphs nested within graphs, recursively. A very natural application of the original hypernode model is nested object structures. Within HyperGraphDB, however, we already get nested value structures from the primitive storage graph and we are mostly interested in <em>scoping</em>.</div><div><br /></div><div>As a side note, scoping and visibility are terms used in programming language theory and are sometimes confused because scoping usually determines the visibility of identifiers. A <em>scope</em> in programming is essentially the context within which things like variables, values and general expressions live (and have meaning/validity). When speaking about knowledge representation, the scoping problem is referred to as the "context problem" - how do you represent contextual knowledge? It is not unusual for the AI community to hijack such terms and basically strip them of their deeper meaning in order to claim a handle on a problem that is otherwise much more difficult, or simply to sound sexier in a rather silly way. Contextuality and context-dependent reference resolution are at the core of all of computing (and hence knowledge representation), but that's the subject of another topic, perhaps a whole book that I might write if I become knowledgeable enough and if I happen to stumble upon more interesting insights. It is much deeper, I believe, than scopes and their nesting. Incidentally, another example of such an abused term is the word <em>semantic </em>as in the "semantic web" or "semantic relations" or "semantic search" or what have you. 
There's nothing semantic about those things; only actual computer programs are semantic in any meaningful sense of the word, not the representations they work on.</div><div><br /></div><div>Nevertheless, speaking of "context" when one means "scope" seems common so I'll use that term, albeit with great reluctance. To represent contextual information in HyperGraphDB, we need some means to state that a certain set of atoms "belongs to" a certain context. A given atom <em>A</em> can live in several different contexts at a time. One way to do that is to manage that set explicitly and say that a context is a set of atoms. Thus a context is a special entity in the database, an explicitly stored, potentially very large set of atoms. If that special entity is an atom in itself, then we have a nested graph, a hypernode. Another way would be to create an atom <em>C </em>that represents a context (i.e. a hypernode) and then link every member <em>A</em> of that context to <em>C </em>with a special-purpose <em>ContextLink</em>. I think this is what they do in OpenCog's atomspace now. The two approaches have different implications. In the former, all atoms belonging to a context are readily available as an ordered set that can efficiently participate in queries, but given an atom there's no easy way to find all contexts it belongs to (clearly, we should allow for more than one). In the latter approach, getting all atoms in a context requires an extra hop from the <em>ContextLink</em> to the atom target, but then retrieving the contexts an atom belongs to is not much harder. In addition, if we maintain a <a href="http://www.hypergraphdb.org/docs/javadoc/org/hypergraphdb/indexing/TargetToTargetIndexer.html" target="_blank">TargetToTargetIndexer</a> on <em>ContextLinks</em>, we avoid that extra hop of going from the context C to the ContextLink to the enclosed atom A. The choice between the two alternatives is not obvious. Going the <em>ContextLink</em> route has the clear advantage of building on top of what already exists. On the other hand, nested graphs seem fundamental enough, and close enough to a storage abstraction, to deserve native support, while a representation through a "semantic" <em>ContextLink </em>should perhaps be left to application-specific notions of contextuality, more complex or more ad hoc.</div><div><br /></div><div>Note that if we had direct support for unordered links in HyperGraphDB, i.e. hyperedges as defined in mathematical hypergraphs, a nested graph context could simply be a hyperedge. Now that the <i>HGLink</i> interface is firmly established as a tuple, it will be hard to add such unordered set links from an API standpoint. However, support for unordered hypergraphs can be achieved through hypernodes-as-sets-of-atoms with their own interface. On this view, a hypernode can be seen either as an unordered link or as a nested graph. To be seen as an unordered link, however, implies that we'd have to maintain incidence sets to include both standard <i>HGLinks</i> <strong>and </strong>hypernodes. This in turn means we may run into trouble with the semantics of link querying: does <em>hg.link(A)</em>, for example, include only <em>HGLinks </em>pointing to A, or hypernodes as well? If only links, then we'd have to filter out the hypernodes, thus degrading performance of existing queries. If hypernodes as well, existing queries that expect only links could break with a runtime typecast. 
Issues to keep in mind...</div><div><br /></div><div>Finally, note that this whole analysis takes atoms within nested graphs to be first and foremost members of the whole graph. The universe of atoms remains global and an atom remains visible from the global <em>HyperGraph</em> regardless of what nested graphs it belongs to. Hence a nested graph is more like a materialized subgraph. The alternative, where each nested graph lives in its own key-value space, would make it very hard to share atoms between nested graphs. So this limits the scoping function of nested graphs because a given <em>HGHandle</em> will always refer to the same entity regardless of what nested graph it belongs to. To implement full-blown scoping, we'd need an <em>HGHandle</em> to refer to different things in different contexts. Do we want that capability? Should we reserve it for a higher level where we are resolving actual symbolic names (like type names, for example) rather than "low-level" handles? Perhaps it wouldn't be that hard to contextualize handles, with an extra performance cost of course: a hypernode would simply have more involved 'get' and 'put' operations where the handle is potentially translated to another handle (maybe in another context) in a chain of reference resolutions ultimately bottoming out in the global handle space. But that issue is orthogonal to the ability to represent nested graphs.</div><p><strong>Commonality</strong></p><p>So what's common between those three features above? All of them are about viewing a graph, or a portion thereof, that is not necessarily identical to the database stored on disk, yet is a hypergraph in its own right, and one wants to be able to operate within the context that it represents. There must be a common abstraction in there. Since the name <em>HyperGraph</em> is already taken by the database instance itself, I'd propose to use <em>HyperNode</em>. A <em>HyperNode</em> would strip down the essential operations of the <em>HyperGraph</em> class, which would implement the <em>HyperNode</em> interface. This <em>HyperNode </em>interface will replace <em>HyperGraph </em>as the entry point to a graph, be it for querying, traversals or CRUD operations, in many places.</p><p>So what API should a <em>HyperNode </em>offer? Here's a proposal:</p><ul><li>get, add, define, replace, remove atoms, but the question is: only full versions of those methods or all overloaded versions?</li><li>getIncidenceSet(atomHandle)</li><li>getType(atomHandle)</li><li>find(HGQuery)</li><li>getLocation() - an abstract URI-like location identifying the HyperNode </li></ul><p>Those are the bare minimum. Access to an index manager, event manager and cache is clearly outside the scope of this abstraction. However, the question of transactions and the type system is a bit trickier. Being able to enclose operations within a transaction is usually important, but a remote peer might not want to support distributed transactions, while local data (RAM or a nested graph) can be transacted upon from the main <em>HyperGraph</em> instance. As for access to the <em>HGTypeSystem, </em>it is mostly irrelevant except when you need it. </p><p>The semantics of <em>HyperNode</em>'s operations seem intuitively well-defined: do whatever you'd do with a full <em>HyperGraph</em>, except do it within that specific context. For example, an InMemoryHyperNode would <em>add</em>, <em>get</em>, query, etc. only the RAM-based atoms. Similarly, a RemoteHyperNode would only operate on remote data. 
A NestedHyperNode is a bit harder to define: an <em>add</em> will modify the global database and mark the new atom as part of the nested graph, and a query or a <em>get</em> will only return data that's marked as part of the nested graph; but what about a <em>remove</em>? Should it just remove from the nested graph, or remove globally as well? Or try to do the right thing, as in "remove globally only if it's not connected to anything outside of its nested graph"?</p><p>Ok, enough. A brain dump of sorts, in need of comments to point out potential problems and/or encouragement that this is going on the right track :)</p><p>Cheers,</p><p>Boris</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1980461574999551012.post-86628508038284975272010-12-13T11:19:00.000-05:002010-12-13T11:59:43.109-05:00HyperGraphDB 1.1 Released<a href="http://www.kobrix.com/">Kobrix Software</a> is pleased to announce the first official release of HyperGraphDB version 1.1.<span><span> </span></span><div><span><span><br /></span></span></div><div><span><span>HyperGraphDB is a general purpose, free open-source data storage mechanism. Geared toward modern applications with complex and evolving domain models, it is suitable for semantic web, artificial intelligence, social networking or regular object-oriented business applications.</span></span><div><span><span><br />HyperGraphDB is a Java-based product built on top of the <a href="http://en.wikipedia.org/wiki/Berkeley_DB">Berkeley DB</a> storage library.</span></span></div><div><span><br />Key Features of HyperGraphDB include:</span></div><ul> <li>Powerful data modeling and knowledge representation.</li> <li>Graph-oriented storage.</li> <li>N-ary, higher order relationships (edges) between graph nodes.</li> <li>Graph traversals and relational-style queries.</li> <li>Customizable indexing.</li> <li>Customizable storage management.</li> <li>Extensible, dynamic DB schema through custom typing.</li> <li>Out of the box Java OO database.</li> <li>Fully transactional and multi-threaded, MVCC/STM.</li> <li>P2P framework for data distribution.</li></ul><div>In addition, the project includes several practical domain-specific components for semantic web, reasoning and natural language processing. For more information, documentation and downloads, please visit the <a href="http://www.hypergraphdb.org/">HyperGraphDB Home Page</a>.</div></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1980461574999551012.post-16225895891237644282010-08-04T14:47:00.000-04:002010-08-04T15:19:24.711-04:00HyperGraphDB at Strange Loop 2010I will be giving a brief talk on HyperGraphDB at the Strange Loop conference in St. Louis on October 14. The talk will focus on the HyperGraphDB data model, its architecture, and why it's well suited for complex software systems, as opposed to other models - SQL, NoSQL, or conventional graph databases.<br /><br />This conference is highly recommended! Judging by the program and the list of speakers, it is truly as the organizers promote it: from developers, for developers. It is about cutting-edge technology, it is about everything hot going on these days in the world of software development, it is technical and it looks fun. 
So, please come by!<br /><br />Website: <a href="http://strangeloop2010.com/">http://strangeloop2010.com/</a><br /><br />You are encouraged to register with the website and interact with the speakers online before the conference.<br /><br />Cheers,<br />BorisUnknownnoreply@blogger.com0tag:blogger.com,1999:blog-1980461574999551012.post-75744460362553925102010-07-09T15:45:00.000-04:002010-07-09T15:51:49.068-04:00HyperGraphDB at IWGD 2010<div>The architecture of HyperGraphDB will be presented at the First International Workshop on Graph Database during WAIM 2010 (the Web Age Information Management conference), taking place on July 15-17 in Jiuzhai Valley, China. For more information, please see:</div><div><br /></div><div><a href="http://www.icst.pku.edu.cn/IWGD2010/index.html">http://www.icst.pku.edu.cn/IWGD2010/index.html</a></div><div><br /></div><div>The presentation will be given by Borislav Iordanov; it will focus on the unique HyperGraphDB data model and type system, and discuss some of the architectural choices and their impact on performance. The accompanying paper can be found here:</div><div><br /></div><div>http://kobrix.com/documents/hypergraphdb.pdf</div><div><br /></div><div><br /></div>