Saturday, January 1, 2011

HyperNodes Are Contexts

There are a couple of different items on HyperGraphDB's road map that, upon a little reflection, seem to converge on a single abstraction. The abstraction I would call hypernode following the hypernode model as described by the paper A Graph-Based Data Model and its Ramifications though it remains to be seen how much of that original model applies in the HyperGraphDB context. The items are the following: (1) local proxies to remote database instances; (2) RAM-only atoms; (3) nested graphs. This blog is about laying the design space of how those features should be dealt with. Let's examine each of these future features in order.

[1] Local DB Proxies

It would be nice to be able to access a db instance on some remote machine as if it were a local db. Only local databases can be currently opened and there is a good portion of the API and the way it is used that could work all the same when the data is remote. One could imagine opening remote graphs like this:

HyperGraph remoteGraph = HGEnvironment.get("peer:RemotePeerName");

and then work with it as if it was on the local hard-drive. This almost makes sense. Most API objects dealing with database access and management, such as the HyperGraph instance itself, the index manager, type system, the low-level store could be manipulated remotely with the exact same semantics. However, this would be a real stretch of their original intent. For example, access to low-level storage is meant to support type implementations, in particular predefined types. Would you want a type implementation at machine A to manage storage on machine B? Similarly for the index manager or the event manager - would you want machine A to manage indexing at machine B? A lot of the methods of the HyperGraph class itself are overloaded methods whose sole purpose is simplifying some API calls by delegating to a "full version" of a method. In other words, going that route, we would have to make the whole API be RPC-ed, an ugly solution with a heavy workload and potentially an overkill. Incidentally, I'm curious if there's an instrumentation Java tool that would load a class-library, but transform dynamically every object as a local proxy to a remote instance. Anybody know of such a tool? Should be doable and perhaps useful for many projects... I only found this paper, which is pretty new and by the looks of the complexities and limitations involved, no wonder it hasn't been done before. Bottom line: remote access to database instance should be carefully crafted as a separate abstraction that focuses on what's important, an abstraction that is working at the atom level - CRUD operations, queries and traversals. A couple of design problems here:
(a) What would the interface to a remote instance look like?
(b) How to deal with atom serialization and differences in type systems? The latter problem is addressed to some extent in the current P2P framework.
(c) Does the client of a remote instance need a local database? The easy answer to that is "no, it shouldn't", but that might mean a much harder implementation!

[2] RAM-Only Atoms

In certain applications, it would be useful to work with a hypergraph structure that is not necessarily or entirely persisted on disk. For example, when HyperGraphDB is used as a convenient representation to apply some algorithms to, or when some data is transient and attached to a particular user session in a server-side environment. Or if one wants a client to a HGDB "server" as in the local db proxy described above. Hence it would be convenient to add atoms to the graph without storing them on disk. Such atoms will participate in the graph structure as full-fledged citizens - they would be indexed and queried in the normal way, except that they would only reside in the cache. For an implementation standpoint, the difficulty will be in merging a unified and coherent view of disk and RAM data. If RAM atoms are to be truly RAM only, databases indices cannot be used to index them. Hence, RAM-only indices must be maintained and dynamically intersected with disk indices during querying. Moreover, it would only make sense to be able to perform queries restricted to atoms in main memory.

[3] Nested Graphs/Subgraphs

This is what the hypernode model is more or less about. A hypernode is a node in a graph that is itself a graph, composed of atomic nodes and possibly other hypernodes. The generalization is roughly analogous to the generalization of standard graph edges to hyperedges. Thus we have graphs nested within graphs, recursively. A very natural application of the original hypernode model is nested object structures. Within HyperGraphDB, however, we already get nested value structures from the primitive storage graph and we are mostly interested in scoping.

As a side note, scoping and visibility are terms used in programming language theory and are sometimes confused because scoping usually determines the visibility of identifiers. A scope in programming is essentially the context within which things like variables, values and general expressions live (and have meaning/validity). When speaking about knowledge representation, the scoping problem is referred to as the "context problem" - how do you represent contextual knowledge? It is not unusual for the AI community to hijack such terms and basically strip them of their deeper meaning in order to claim a handle on a problem that is otherwise much more difficult, or simply sound sexier in a rather silly way. Contextuality and context-dependent reference resolution is at the core of all of computing (and hence knowledge representation), but that's the subject of another topic, perhaps a whole book that I might write if I become knowledgeable enough and if I happen to stumble upon more interesting insights. It is much deeper, I believe, than scopes and their nesting. Incidentally, another example of such an abused term is the word semantic as in the "semantic web" or "semantic relations" or "semantic search" or what have you. There's nothing semantic about those things, only actual computer programs are semantic in any meaningful sense of the word, and not the representation they work on.

Nevertheless, speaking of "context" when one means "scope" seems common so I'll use that term, albeit with great reluctance. To represent contextual information in HyperGraphDB, we need some means to state that a certain set of atoms "belongs to" a certain context. A given atom A can live in several different contexts at a time. One way to do that is to manage that set explicitly and say that a context is a set of atoms. Thus a context is a special entity in the database, an explicitly stored, potentially very large set of atoms. If that special entity is an atom in itself, then we have a nested graph, a hypernode. Another way would be to create an atom C that represents a context (i.e. a hypernode) and then link every member A of that context to C with a special purpose ContextLink. I think this is what they do in OpenCog's atomspace now. The two approaches have different implications. In the former, all atoms belonging to a context are readily available as an ordered set that can efficiently participate in queries, but given an atom there's no easy way to find all contexts it belongs to (clearly, we should allow for more than one). In the latter approach, getting all atoms in a context requires an extra hop from the ContextLink to the atom target, but then retrieving the contexts an atom belongs to is not much harder. In addition, if we maintain a TargetToTargetIndexer on ContextLinks, we avoid that extra hop of going from the context C to the ContextLink to the enclosed atom A. The choice between the two alternatives is not obvious. Going the ContextLink route has the clear advantage of building on top of what already exists. On the other hand, nested graphs seems something fundamental enough and close enough to a storage abstraction to deserve native support while a representation through "semantic" ContextLink should be perhaps left to application specific notions of contextuality, more complex or more ad hoc, regardless.

Note that if we had direct support for unordered links in HyperGraphDB, i.e. hyperedges as defined in mathematical hypergraphs, a nested graph context could simply be a hyperedge. Now that the HGLink interface is firmly established as a tuple, it will be hard to add such unordered set links from an API standpoint. However, support for unordered hypergraphs can be achieved through hypernodes-as-sets-of-atoms with their own interface. On this view, a hypernode can be seen either as an unordered link or as a nested graph. To be seen as an unordered link, however, implies that we'd have to maintain incidence sets to include both standard HGLinks and hypernodes. This in turn means we may run into trouble with the semantics of link querying: does for example include only HGLinks pointing to A or hypernodes as well? If only links, then we'd have to filter out the hypernodes, thus degrading performance of existing queries. If hypernodes as well, existing queries that expect only links could break a runtime typecast. Issues to keep in mind...

Finally, note that this whole analysis takes atoms within nested graphs to be first and foremost members of the whole graph. The universe of atoms remains global and an atom remains visible from the global HyperGraph regardless of what nested graphs it belongs to. Hence a nested graph is more like a materialized subgraph. The alternative where each nested graph lives in its own key-value space would make it very hard to share atoms between nested graphs. So this limits the scoping function of nested graphs because a given HGHandle will always refer to the same entity regardless of what nested graph it belongs. To implement full-blown scoping, we'd need a HGHandle to refer to different things in different contexts. Do we want that capability? Should we reserve it for a higher-level where we are resolving actual symbolic names (like type names for example) rather than "low-level" handles? Perhaps it wouldn't be that hard to contextualize handles, with an extra performance cost of course: a hypernode would simply have a more involved 'get' and 'put' operations where the handle is potentially translated to another handle (maybe in another context) in a chain of reference resolutions ultimately bottoming out in the global handle space? But that issue is orthogonal to the ability to represent nested graphs.


So what's common between those three features above? All of them are about a viewing a graph, or a portion thereof, that is not necessarily identical to the database stored on disk, yet it is a hypergraph in its own right and one wants to be able to operate within the context that it represents. There must be a common abstraction in there. Since the name HyperGraph is already taken by the database instance itself, I'd propose to use HyperNode. A HyperNode would strip down the essential operations of a the HyperGraph class, which would implement the HyperNode interface. This HyperNode interface will replace HyperGraph as the entry point to a graph, be it for querying, traversals or CRUD operations in many places.

So what API should a HyperNode offer? Here's a proposal:

  • get, add, define, replace, remove atoms, but question is: only full versions of those methods or all overloaded versions?
  • getIncidenceSet(atomHandle)
  • getType(atomHandle)
  • find(HGQuery)
  • getLocation() - an abstract URI-like location identifying the HyperNode

Those are a bare minimum. Access to an index manager, event manager and cache are clearly outside of the scope of this abstraction. However, the question of transactions and the type system is a bit trickier. Being able to enclose operations within a transaction is usually important, but a remote peer might not want to support distributed transactions while local data (RAM or a nested graph) can be transacted upon from the main HyperGraph instance. As for access to the HGTypeSystem, it is mostly irrelevant except when you need it.

The semantics of HyperNode's operations seem intuitively well-defined: do whatever you'd do with a full HyperGraph except do it within that specific context. For example, an InMemoryHyperNode would add, get query etc. only the RAM-based atoms. Similarly a RemoteHyperNode would only operate on remote data. A NestedHyperNode is a bit harder to define: an add will modify the global database and mark the new atom as part of the nested graph, a query or a get will only return data that's marked as part of the nested graph; but what about a remove? should it just remove from the nest graph, or remove globally as well? or try to do the right thing as in "remove globally only if it's not connected to anything outside of its nested graph"?

Ok, enough. A brain dump of sorts, in need of comments to point out potential problems and/or encouragement that this is going on the right track :)