Saturday, January 1, 2011

HyperNodes Are Contexts

There are a couple of different items on HyperGraphDB's road map that, upon a little reflection, seem to converge on a single abstraction. I would call that abstraction a hypernode, following the hypernode model described in the paper A Graph-Based Data Model and its Ramifications, though it remains to be seen how much of that original model applies in the HyperGraphDB context. The items are the following: (1) local proxies to remote database instances; (2) RAM-only atoms; (3) nested graphs. This post is about laying out the design space of how those features should be dealt with. Let's examine each of these future features in order.

[1] Local DB Proxies

It would be nice to be able to access a db instance on some remote machine as if it were a local db. Currently, only local databases can be opened, yet a good portion of the API, and the way it is used, could work all the same when the data is remote. One could imagine opening remote graphs like this:

HyperGraph remoteGraph = HGEnvironment.get("peer:RemotePeerName");

and then work with it as if it were on the local hard drive. This almost makes sense. Most API objects dealing with database access and management, such as the HyperGraph instance itself, the index manager, the type system and the low-level store, could be manipulated remotely with the exact same semantics. However, this would be a real stretch of their original intent. For example, access to low-level storage is meant to support type implementations, in particular predefined types. Would you want a type implementation at machine A to manage storage on machine B? Similarly for the index manager or the event manager - would you want machine A to manage indexing at machine B? Many of the methods of the HyperGraph class itself are overloads whose sole purpose is to simplify API calls by delegating to a "full version" of the method. In other words, going that route, we would have to RPC the whole API - an ugly solution with a heavy workload, and potentially overkill. Incidentally, I'm curious if there's a Java instrumentation tool that would load a class library but dynamically transform every object into a local proxy to a remote instance. Anybody know of such a tool? It should be doable and perhaps useful for many projects... I only found this paper, which is pretty new, and by the looks of the complexities and limitations involved, it's no wonder it hasn't been done before. Bottom line: remote access to a database instance should be carefully crafted as a separate abstraction that focuses on what's important, an abstraction working at the atom level - CRUD operations, queries and traversals. A couple of design problems here:
(a) What would the interface to a remote instance look like?
(b) How to deal with atom serialization and differences in type systems? The latter problem is addressed to some extent in the current P2P framework.
(c) Does the client of a remote instance need a local database? The easy answer to that is "no, it shouldn't", but that might mean a much harder implementation!
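To make question (a) concrete, here is one possible shape of such an atom-level remote interface, sketched as self-contained Java with a toy in-process stand-in. All names and signatures are hypothetical, not part of the actual HyperGraphDB API; handles are modeled as plain strings.

```java
import java.util.*;

// Hypothetical atom-level view of a remote HyperGraphDB instance.
// Only CRUD and queries are exposed -- no index manager, no event
// manager, no low-level storage access.
interface RemoteAtomView {
    Object get(String handle);
    String add(Object atom);
    boolean remove(String handle);
    List<String> find(String query);
}

// A toy in-process stand-in, useful only for thinking about the
// contract; a real proxy would serialize calls to the remote peer.
class InProcessAtomView implements RemoteAtomView {
    private final Map<String, Object> atoms = new LinkedHashMap<>();
    private int next = 0;

    public Object get(String handle) { return atoms.get(handle); }

    public String add(Object atom) {
        String h = "handle-" + (next++);
        atoms.put(h, atom);
        return h;
    }

    public boolean remove(String handle) { return atoms.remove(handle) != null; }

    public List<String> find(String query) {
        // Placeholder matching: a real implementation would ship the
        // query expression to the remote peer for evaluation.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Object> e : atoms.entrySet())
            if (String.valueOf(e.getValue()).contains(query))
                result.add(e.getKey());
        return result;
    }
}
```

Whether such a view also needs traversal operations, or whether traversals can be expressed on top of `find` and `get`, is part of the same design question.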

[2] RAM-Only Atoms

In certain applications, it would be useful to work with a hypergraph structure that is not necessarily, or not entirely, persisted on disk. For example, when HyperGraphDB is used as a convenient representation to apply some algorithms to, or when some data is transient and attached to a particular user session in a server-side environment. Or when one wants a client to a HGDB "server" as in the local db proxy described above. Hence it would be convenient to add atoms to the graph without storing them on disk. Such atoms would participate in the graph structure as full-fledged citizens - they would be indexed and queried in the normal way, except that they would reside only in the cache. From an implementation standpoint, the difficulty will be in presenting a unified and coherent view of disk and RAM data. If RAM atoms are to be truly RAM-only, the database indices cannot be used to index them. Hence, RAM-only indices must be maintained and dynamically combined with disk indices during querying. Moreover, it would make sense to be able to perform queries restricted to atoms in main memory only.
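For illustration, here is a minimal sketch of the querying side of that unified view: hits for a query condition come from a disk-backed index and from a RAM-only index, both ordered by handle, and must be merged into one ordered, duplicate-free result set. Handles are modeled here as plain longs; this is not the actual HyperGraphDB index API.

```java
import java.util.*;

// Sketch of merging sorted result sets from a disk index and a
// RAM-only index into a single ordered scan.
class MergedIndexScan {
    // Merge two sorted, duplicate-free handle lists into one sorted list.
    static List<Long> merge(List<Long> diskHits, List<Long> ramHits) {
        List<Long> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < diskHits.size() && j < ramHits.size()) {
            long d = diskHits.get(i), r = ramHits.get(j);
            if (d < r) { out.add(d); i++; }
            else if (r < d) { out.add(r); j++; }
            else { out.add(d); i++; j++; } // same handle in both -- keep once
        }
        while (i < diskHits.size()) out.add(diskHits.get(i++));
        while (j < ramHits.size()) out.add(ramHits.get(j++));
        return out;
    }
}
```

For conjunctive query conditions, the analogous operation is a sorted intersection rather than a union, but the lazy, cursor-style merging problem is the same.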

[3] Nested Graphs/Subgraphs

This is what the hypernode model is more or less about. A hypernode is a node in a graph that is itself a graph, composed of atomic nodes and possibly other hypernodes. The generalization is roughly analogous to the generalization of standard graph edges to hyperedges. Thus we have graphs nested within graphs, recursively. A very natural application of the original hypernode model is nested object structures. Within HyperGraphDB, however, we already get nested value structures from the primitive storage graph and we are mostly interested in scoping.

As a side note, scoping and visibility are terms used in programming language theory and are sometimes confused, because scoping usually determines the visibility of identifiers. A scope in programming is essentially the context within which things like variables, values and general expressions live (and have meaning/validity). When speaking about knowledge representation, the scoping problem is referred to as the "context problem" - how do you represent contextual knowledge? It is not unusual for the AI community to hijack such terms and basically strip them of their deeper meaning in order to claim a handle on a problem that is otherwise much more difficult, or simply to sound sexier in a rather silly way. Contextuality and context-dependent reference resolution are at the core of all of computing (and hence knowledge representation), but that's the subject of another topic, perhaps a whole book that I might write if I become knowledgeable enough and happen to stumble upon more interesting insights. It is much deeper, I believe, than scopes and their nesting. Incidentally, another example of such an abused term is the word semantic, as in the "semantic web" or "semantic relations" or "semantic search" or what have you. There's nothing semantic about those things; only actual computer programs are semantic in any meaningful sense of the word, not the representations they work on.

Nevertheless, speaking of "context" when one means "scope" seems common, so I'll use that term, albeit with great reluctance. To represent contextual information in HyperGraphDB, we need some means to state that a certain set of atoms "belongs to" a certain context. A given atom A can live in several different contexts at a time. One way to do that is to manage that set explicitly and say that a context is a set of atoms. Thus a context is a special entity in the database: an explicitly stored, potentially very large set of atoms. If that special entity is an atom in itself, then we have a nested graph, a hypernode. Another way would be to create an atom C that represents a context (i.e. a hypernode) and then link every member A of that context to C with a special-purpose ContextLink. I think this is what they do in OpenCog's atomspace now. The two approaches have different implications. In the former, all atoms belonging to a context are readily available as an ordered set that can efficiently participate in queries, but given an atom there's no easy way to find all the contexts it belongs to (clearly, we should allow for more than one). In the latter approach, getting all atoms in a context requires an extra hop from the ContextLink to the atom target, but retrieving the contexts an atom belongs to is not much harder. In addition, if we maintain a TargetToTargetIndexer on ContextLinks, we avoid that extra hop of going from the context C to the ContextLink to the enclosed atom A. The choice between the two alternatives is not obvious. Going the ContextLink route has the clear advantage of building on top of what already exists. On the other hand, nested graphs seem fundamental enough, and close enough to a storage abstraction, to deserve native support, while a representation through "semantic" ContextLinks should perhaps be left to application-specific notions of contextuality, be they more complex or more ad hoc.
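A toy model of the ContextLink alternative may help make the trade-off concrete: a context is an atom C, each member A is tied to C by a (C, A) link, and both lookup directions reduce to scanning those pairs. This is illustrative code only, not the HyperGraphDB API; atoms are modeled as strings.

```java
import java.util.*;

// Toy model of contexts-via-ContextLinks: each entry in 'links' stands
// for one ContextLink, modeled as a (context, member) pair.
class ContextLinks {
    private final List<String[]> links = new ArrayList<>();

    void addToContext(String context, String member) {
        links.add(new String[] { context, member });
    }

    // All members of a context: the extra hop through the ContextLink
    // that a TargetToTargetIndexer would let us skip.
    List<String> membersOf(String context) {
        List<String> out = new ArrayList<>();
        for (String[] l : links)
            if (l[0].equals(context)) out.add(l[1]);
        return out;
    }

    // All contexts an atom belongs to -- easy in this representation,
    // hard in the explicit atom-set representation.
    List<String> contextsOf(String member) {
        List<String> out = new ArrayList<>();
        for (String[] l : links)
            if (l[1].equals(member)) out.add(l[0]);
        return out;
    }
}
```

In a real database both scans would of course be index lookups over incidence sets rather than linear scans.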

Note that if we had direct support for unordered links in HyperGraphDB, i.e. hyperedges as defined in mathematical hypergraphs, a nested graph context could simply be a hyperedge. Now that the HGLink interface is firmly established as a tuple, it will be hard to add such unordered set links from an API standpoint. However, support for unordered hypergraphs can be achieved through hypernodes-as-sets-of-atoms with their own interface. On this view, a hypernode can be seen either as an unordered link or as a nested graph. To be seen as an unordered link, however, implies that we'd have to maintain incidence sets that include both standard HGLinks and hypernodes. This in turn means we may run into trouble with the semantics of link querying: does hg.link(A), for example, include only HGLinks pointing to A, or hypernodes as well? If only links, then we'd have to filter out the hypernodes, degrading the performance of existing queries. If hypernodes as well, existing queries that expect only links could break with a runtime typecast. Issues to keep in mind...

Finally, note that this whole analysis takes atoms within nested graphs to be first and foremost members of the whole graph. The universe of atoms remains global and an atom remains visible from the global HyperGraph regardless of what nested graphs it belongs to. Hence a nested graph is more like a materialized subgraph. The alternative, where each nested graph lives in its own key-value space, would make it very hard to share atoms between nested graphs. This limits the scoping function of nested graphs, because a given HGHandle will always refer to the same entity regardless of what nested graph it belongs to. To implement full-blown scoping, we'd need an HGHandle to refer to different things in different contexts. Do we want that capability? Should we reserve it for a higher level where we are resolving actual symbolic names (like type names, for example) rather than "low-level" handles? Perhaps it wouldn't be that hard to contextualize handles, at an extra performance cost of course: a hypernode would simply have more involved 'get' and 'put' operations, where the handle is potentially translated to another handle (maybe in another context) in a chain of reference resolutions ultimately bottoming out in the global handle space. But that issue is orthogonal to the ability to represent nested graphs.
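The contextualized-handle idea can be sketched as a chain of per-context handle translations that bottoms out in the global handle space. This is purely hypothetical - nothing like it exists in HyperGraphDB today - and handles are modeled as plain strings.

```java
import java.util.*;

// Sketch of contextualized handle resolution: each nested graph may
// translate a handle to another handle; resolution chains through the
// enclosing contexts until it reaches the global handle space.
class ContextualResolver {
    private final Map<String, String> translations = new HashMap<>();
    private final ContextualResolver parent; // null means the global space

    ContextualResolver(ContextualResolver parent) { this.parent = parent; }

    void translate(String local, String target) {
        translations.put(local, target);
    }

    // Follow the translation chain outward; a handle with no translation
    // anywhere is already global and resolves to itself.
    String resolve(String handle) {
        String t = translations.get(handle);
        if (t != null)
            handle = t;
        return parent == null ? handle : parent.resolve(handle);
    }
}
```

A contextualized 'get' would then be `get(resolve(handle))`, which is exactly where the extra performance cost mentioned above comes from.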

Commonality

So what's common between those three features above? All of them are about viewing a graph, or a portion thereof, that is not necessarily identical to the database stored on disk, yet is a hypergraph in its own right, within whose context one wants to be able to operate. There must be a common abstraction in there. Since the name HyperGraph is already taken by the database instance itself, I'd propose to call it HyperNode. A HyperNode would strip down to the essential operations of the HyperGraph class, which would itself implement the HyperNode interface. This HyperNode interface would replace HyperGraph as the entry point to a graph - be it for querying, traversals or CRUD operations - in many places.

So what API should a HyperNode offer? Here's a proposal:

  • get, add, define, replace and remove atoms - but the question is: only the full versions of those methods, or all the overloaded versions?
  • getIncidenceSet(atomHandle)
  • getType(atomHandle)
  • find(HGQuery)
  • getLocation() - an abstract URI-like location identifying the HyperNode
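As a self-contained Java sketch, the interface might look roughly like the following, together with a trivial RAM-backed implementation to exercise the contract. Types are simplified placeholders - String handles instead of HGHandle, Object queries instead of HGQuery - and the signatures are guesses at what the eventual API might look like, not the real thing.

```java
import java.util.*;

// Sketch of the proposed HyperNode interface, "full" method versions only.
interface HyperNode {
    String add(Object atom, String typeHandle);
    Object get(String handle);
    boolean replace(String handle, Object atom, String typeHandle);
    boolean remove(String handle);
    List<String> getIncidenceSet(String atomHandle);
    String getType(String atomHandle);
    List<String> find(Object query);
    String getLocation(); // abstract URI-like location of this HyperNode
}

// Trivial RAM-backed implementation -- roughly the shape an
// InMemoryHyperNode might take, minus indexing and incidence tracking.
class ToyHyperNode implements HyperNode {
    private final Map<String, Object> atoms = new HashMap<>();
    private final Map<String, String> types = new HashMap<>();
    private int next = 0;

    public String add(Object atom, String typeHandle) {
        String h = "h" + (next++);
        atoms.put(h, atom);
        types.put(h, typeHandle);
        return h;
    }
    public Object get(String handle) { return atoms.get(handle); }
    public boolean replace(String handle, Object atom, String typeHandle) {
        if (!atoms.containsKey(handle)) return false;
        atoms.put(handle, atom);
        types.put(handle, typeHandle);
        return true;
    }
    public boolean remove(String handle) {
        types.remove(handle);
        return atoms.remove(handle) != null;
    }
    public List<String> getIncidenceSet(String atomHandle) { return new ArrayList<>(); }
    public String getType(String atomHandle) { return types.get(atomHandle); }
    public List<String> find(Object query) { return new ArrayList<>(); }
    public String getLocation() { return "mem://local"; }
}
```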

Those are the bare minimum. Access to an index manager, event manager and cache is clearly outside the scope of this abstraction. However, the question of transactions and the type system is a bit trickier. Being able to enclose operations within a transaction is usually important, but a remote peer might not want to support distributed transactions, while local data (RAM or a nested graph) can be transacted upon from the main HyperGraph instance. As for access to the HGTypeSystem, it is mostly irrelevant - except when you need it.

The semantics of HyperNode's operations seem intuitively well-defined: do whatever you'd do with a full HyperGraph, except do it within that specific context. For example, an InMemoryHyperNode would add, get, query, etc. only the RAM-based atoms. Similarly, a RemoteHyperNode would only operate on remote data. A NestedHyperNode is a bit harder to define: an add will modify the global database and mark the new atom as part of the nested graph, while a query or a get will only return data that's marked as part of the nested graph. But what about a remove? Should it just remove the atom from the nested graph, or remove it globally as well? Or try to do the right thing, as in "remove globally only if it's not connected to anything outside of its nested graph"?
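That "do the right thing" policy can be sketched as a simple membership check over the atom's incidence set: remove globally only when every link pointing to the atom is itself a member of the nested graph. Membership and incidence are modeled here with plain sets and maps for illustration; none of this is actual API.

```java
import java.util.*;

// Sketch of the conditional-remove policy for a NestedHyperNode.
class NestedRemovePolicy {
    // members:   handles of atoms marked as part of the nested graph
    // incidence: atom handle -> handles of links pointing to that atom
    static boolean removeGloballyToo(String atom,
                                     Set<String> members,
                                     Map<String, Set<String>> incidence) {
        for (String link : incidence.getOrDefault(atom, Collections.emptySet()))
            if (!members.contains(link))
                return false; // referenced from outside the nested graph
        return true;          // nothing outside depends on it
    }
}
```

The remove operation would then always unmark membership and, when this check passes, also delete the atom from the global database.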

Ok, enough. A brain dump of sorts, in need of comments to point out potential problems and/or encouragement that this is going on the right track :)

Cheers,

Boris

Monday, December 13, 2010

HyperGraphDB 1.1 Released

Kobrix Software is pleased to announce the first official release of HyperGraphDB version 1.1.

HyperGraphDB is a general purpose, free open-source data storage mechanism. Geared toward modern applications with complex and evolving domain models, it is suitable for semantic web, artificial intelligence, social networking or regular object-oriented business applications.

HyperGraphDB is a Java based product built on top of the Berkeley DB storage library.

Key Features of HyperGraphDB include:
  • Powerful data modeling and knowledge representation.
  • Graph-oriented storage.
  • N-ary, higher order relationships (edges) between graph nodes.
  • Graph traversals and relational-style queries.
  • Customizable indexing.
  • Customizable storage management.
  • Extensible, dynamic DB schema through custom typing.
  • Out of the box Java OO database.
  • Fully transactional and multi-threaded, MVCC/STM.
  • P2P framework for data distribution.
In addition, the project includes several practical domain specific components for semantic web, reasoning and natural language processing. For more information, documentation and downloads, please visit the HyperGraphDB Home Page.

Wednesday, August 4, 2010

HyperGraphDB at Strange Loop 2010

I will be giving a brief talk on HyperGraphDB at the Strange Loop conference in St. Louis on October 14. The talk will focus on the HyperGraphDB data model and architecture, and on why it's well suited for complex software systems, as opposed to other models - SQL, NoSQL or conventional graph databases.

This conference is highly recommended! Judging by the program and the list of speakers, it is truly as the organizers promote it: from developers, for developers. It is about cutting-edge technology, it is about everything hot going on these days in the world of software development, it is technical and it looks fun. So, please come by!

Website: http://strangeloop2010.com/

You are encouraged to register with the website, and interact with the speakers online before the conference.

Cheers,
Boris

Friday, July 9, 2010

HyperGraphDB at IWGD 2010

The architecture of HyperGraphDB will be presented at The First International Workshop on Graph Database during WAIM 2010 (the Web Age Information Management conference) taking place on July 15-17 in Jiuzhai Valley, China. For more information, please see:


The presentation will be given by Borislav Iordanov and it will focus on the unique HyperGraphDB data model and type system, and discuss some of the architectural choices and their impact on performance. The accompanying paper can be found here:

http://kobrix.com/documents/hypergraphdb.pdf


Thursday, June 17, 2010

HyperGraphDB 1.1 Alpha Released

Kobrix Software is pleased to announce the release of HyperGraphDB 1.1 Alpha. HyperGraphDB is a general purpose, extensible, portable, distributed, embeddable, open-source data storage mechanism. Designed specifically for artificial intelligence and semantic web projects, it can also be used as an embedded object-oriented database for projects of all sizes.

This is an initial, alpha release of the next version of HyperGraphDB. A complete list of changes is available at:

http://code.google.com/p/hypergraphdb/wiki/NextReleaseNotes

To download, please visit the HyperGraphDB project home page at Google.

For more information about HyperGraphDB's features, documentation and additional pointers, please visit the HyperGraphDB home page at Kobrix.

Sunday, May 30, 2010

Poetry About Our Art

About 14 years ago, at the peak of the object-oriented programming craze, I came across a publication called On the Origin of Objects, randomly placed amidst the stream of software/OO engineering books at a very reliable computer science bookstore. Surely, I thought, this would get to the bottom of things: somebody had dissected the notion of objects and outlined all the fundamentals I would need to be a successful programmer, without the need to skim through endless pages describing ridiculous design processes, trivial principles, inapplicable rules of thumb and what not. It turned out the book had nothing to do with object-oriented programming.

On the Origin of Objects (amazon link) is about metaphysics with computation as the starting point. The author is one of the deepest and most original thinkers I've come across - Brian Cantwell Smith (wikipedia link). Needless to say, I couldn't read it at the time, I wasn't ready for it. That came about 7 years later and it was a memorable, mentally reinvigorating experience. Smith writes beautifully and dances around his insights with such grace and depth. He writes about the kind of computation we do day to day, the real stuff, and he puts common conceptual problems we programmers face into the center stage of philosophy in a way that gives our work those extra dimensions that scientists seem to have always enjoyed - a fundamental, very real connection to the physical world, including, at an even deeper level, a connection with us intentional beings, a set of problems that arise naturally from the practice of our profession, yet quickly reach the most difficult metaphysics in a way that no other practice does.

There are a few (not many) articles by Prof. Smith that you can find on the internet, all of them worth reading. However, the purpose of this blurb is to bring his latest work to the attention of whoever comes across it. For the past few years, I've been eagerly monitoring and awaiting the publication of Age of Significance, which is supposed to be in 7 volumes. The book's website, http://ageofsignificance.org/, hadn't changed until just two months ago, when it was announced that individual chapters will be published monthly. So far, only the introduction has been posted at http://ageofsignificance.org/aos/en/toc.html and I believe that an attentive read would make my seeming infatuation with this work understandable. Originally, I intended to write a summary of that introduction, highlighting the main points, most of which I'm already familiar with from previous writings (Smith's and others), but I wouldn't want to butcher it. It is philosophy at its best. And it is about the foundations of computing, that which we (should, to say the least) care about. I will just quote the conclusion for the hardcore philosophy skeptics:


Throughout the whole project, I have received countless communications from programmers and computer scientists - over coffee, in the midst of debugging sessions, in conferences, at bars and via email - hoping that it might be possible to say what they feel they know, the absence of which, in some cases, has led to almost existential frustration.

That is pretty much how I've felt more often than not as a programmer. And that is why, to me Smith's work is pure poetry, as philosophy used to be seen at the time of Plato anyway.

Cheers,
Boris

Tuesday, March 9, 2010

Seco 0.3 Released

Kobrix Software is pleased to announce the release of Seco 0.3. Seco, formerly known as Scriba, is a scripting development environment for JVM-based dynamic languages. Seco has been in active development and use for the past several years and it is a perfect companion for the serious Java developer.

Key features include:

- Support of many popular JVM languages with syntax highlighting and code completion for most.
- Advanced script editing interface based on structured notebooks as popularized by the Mathematica system.
- A WYSIWYG HTML editor for documentation.
- An infinite, zoomable 2D canvas for arbitrary layout of components, container nesting and more.
- Full workspace automatically persisted in an embedded HyperGraphDB database.
- Support for importing 3rd-party libraries in multiple evaluation contexts.
- Based on the JSR 223 standard for language interoperability - all languages share the same runtime context.
- Real-time collaboration and exchange of components and notebooks via a P2P network.

Seco is perfect not only for prototyping, testing and experimentation, but is also the ideal tool for learning a given JVM language or a new Java library. It can be used to build complete interactive applications embedded within the environment itself, similarly to a live system like Squeak!

Seco is free, open-source, LGPL licensed software.

To download and for more information, please visit the Seco home page at Kobrix.