Monday, January 25, 2010

A Feedforward Neural Net with HyperGraphDB

One obvious application of a graph database such as HyperGraphDB is the implementation of artificial neural networks (NNs). In this post, I will show one such implementation based on the practical and informative book "Programming Collective Intelligence" by Toby Segaran. In chapter 4, the author shows how one can incorporate user feedback to improve search result ranking. The idea is to train a neural network to rank possible URLs given a keyword combination. It is assumed that users will perform a search, then examine the list of results (with their titles, summaries, type of document etc.) and finally click on the one that's most likely to contain the information they are looking for. That click action provides the feedback needed to train the neural network. With training, the network will get increasingly better at ranking the most relevant documents for a given keyword combination. Not only that, but it will also make rather good guesses for searches it has never seen before.

If you are not familiar with neural networks and don't own the aforementioned book, the main Wikipedia article on the subject is always a good place to start. Also, I can recommend this introductory book, where a couple of chapters are freely available. For those too busy/lazy for a thorough introduction to NNs, here is a brief summary of the particular neural net we'll be implementing: a 3-layered feedforward NN. It consists of 3 layers of neurons connected with synapses. The neurons and synapses are abstract models of their brain counterparts. Each synapse connects two neurons, its input and its output, and has an associated real number called its strength. There's an input layer, a hidden layer and an output layer, as shown in this picture. The network is feedforward because neurons from the input layer "feed" the hidden layer neurons, which in turn "feed" the output layer neurons, but there is no looping back. The NN is executed in steps: first, all input neurons are assigned a real number that comes from somewhere in the "outside world": that's their activation level. Second, the activation level of all hidden neurons is calculated based on the input neurons connected to them and the strengths of those connections (synapses). Finally, the same calculation is applied to the output neurons, yielding a set of real numbers, which is the NN's output. To make the network produce the correct outputs, it is trained by an algorithm known as backpropagation: run the network to produce output, compare with the expected output and adjust the synaptic strengths to reduce the delta between the two.
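To make the activation step concrete, here is a minimal sketch of how a single neuron's activation level could be computed from its inputs. The method name and the parallel arrays are just illustrative assumptions; tanh is the activation function we'll end up using later in the post:

// Sketch of the per-neuron activation computation: sum the activation levels
// of the connected input neurons, each multiplied by the strength of its
// synapse, then squash the sum with an activation function (tanh here).
// 'inputActivations' and 'synapseStrengths' are hypothetical parallel arrays.
double activate(double[] inputActivations, double[] synapseStrengths)
{
    double sum = 0.0;
    for (int i = 0; i < inputActivations.length; i++)
        sum += inputActivations[i] * synapseStrengths[i];
    return Math.tanh(sum);
}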

In "Programming Collective Intelligence", the input neurons correspond to individual search keywords and their activation level is set to 1 while the output neurons correspond to URLs and their final activation level provides their ranking. The hidden neurons correspond to combination of keywords (i.e. actual searches performed). In our implementation, the inputs and outputs are going to be arbitrary HyperGraphDB atoms that are part of whatever representation of whatever domain one is working with.

As with most software, neural network implementations usually have two representations - a runtime RAM representation and a permanent storage representation (e.g. in an SQL database). But one of the core principles behind HyperGraphDB is to eliminate this distinction and embed the database in-process as a direct extension of RAM. This way one doesn't have to worry about caching frequently used data, or about what's in RAM and what's not, at least not as much. For example, while the set of connections between two consecutive layers is usually represented as a weight matrix, we will now see how a direct HyperGraphDB representation can be used without much penalty!

When you think about representing a neural network as a graph, the most intuitive way is by creating a node for each neuron and a link for each synaptic connection:
public class Neuron { ... }

public class Synapse extends HGPlainLink
{
    private double strength;

    public Synapse(HGHandle...args) { super(args); }

    // The synapse is an ordered link: target 0 is the input neuron,
    // target 1 is the output neuron.
    public HGHandle getInput() { return getTargetAt(0); }
    public HGHandle getOutput() { return getTargetAt(1); }
    public double getStrength() { return strength; }
    public void setStrength(double strength) { this.strength = strength; }
}
To get all inputs of a neuron:
List<Synapse> inputs = hg.getAll(graph,
    hg.and(hg.type(Synapse.class),
           hg.orderedLink(hg.anyHandle(), neuronHandle)));
This will work just fine. When time comes to activate a neuron, we loop through the above list and apply whatever formula we have chosen for our model. But there's a better way: each neuron can simply be a link binding together its input neurons and it can contain the strengths of all synapses in a double[]:
public class Neuron extends HGPlainLink
{
    // Synaptic strengths, one per input neuron (i.e. per link target).
    private double [] weights;

    public Neuron(HGHandle...inputNeurons)
    {
        super(inputNeurons);
    }

    public double fire(Map<HGHandle, Double> inputLayer, ...) { ... }

    // etc...
}
In this way, the inputs are readily available and the activation of a neuron can be easily calculated. Notice how the generality of the hypergraph model allows us to represent the same graph structure, but in a much more compact way! Had we stored the synapses as explicit links, we would have ended up with an explosion of extra entities in the database and with a less efficient way to execute the network, to the point where a separate runtime representation would probably have been warranted. Here we make use of the ability of an edge to hold an arbitrary number of atoms, including other edges: the artificial neurons are, in a sense, interlinked recursively.
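As a quick illustration of this representation, here is how a hidden neuron binding two input atoms might be added to the graph. The keyword atoms are made-up placeholders for the example; only the Neuron constructor and the standard graph.add() call are taken from the code above:

// Hypothetical example: two keyword atoms serve as inputs and a hidden
// Neuron links them directly; the synaptic strengths live inside the neuron.
HGHandle kwJava = graph.add("java");
HGHandle kwDatabase = graph.add("database");
HGHandle hidden = graph.add(new Neuron(kwJava, kwDatabase));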

With this schema, a very large network can be represented, of which only the portions needed at a given moment are loaded in memory. A search application for a corporate site with, say, several hundred thousand URLs and a few thousand recurring searches will yield a NN whose number of HyperGraphDB atoms is within a small factor of those figures.

Now, the input layer in our representation doesn't actually consist of Neuron instances, but of arbitrary atoms. For a search application those will simply be search terms. The output layer, however, is made of Neurons, and we need a mechanism that associates an output Neuron with an output atom. One option is to create a Neuron subclass that holds a reference to the associated atom. Another option is to create a new link for the association:
public class NeuronOutput extends HGPlainLink
{
    public NeuronOutput(HGHandle...args)
    {
        super(args);
    }

    public HGHandle getNeuron()
    {
        return getTargetAt(0);
    }

    public HGHandle getOutputAtom()
    {
        return getTargetAt(1);
    }
}
With both approaches, some bookkeeping will be required to remove the output Neuron if the output atom itself is removed. The NeuronOutput link representation incurs a slight performance hit because of the extra hop to get from output neuron to output atom, but that should be negligible for our purposes.
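For instance, the bookkeeping with the NeuronOutput representation might look roughly like this. This is only a sketch, assuming the application triggers the cleanup when it deletes the output atom; 'outputAtom' is a hypothetical handle:

// Sketch: when an output atom is removed, also remove its NeuronOutput link
// and the associated output Neuron.
HGHandle linkHandle = hg.findOne(graph,
    hg.and(hg.type(NeuronOutput.class), hg.incident(outputAtom)));
if (linkHandle != null)
{
    NeuronOutput link = graph.get(linkHandle);
    HGHandle neuronHandle = link.getNeuron();
    graph.remove(linkHandle);
    graph.remove(neuronHandle);
}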

With that in place, we create a NeuralNet class to hold runtime information about an active portion of the whole network stored in the database. The class is intended to be lightweight, instantiated on demand and used within a single thread (e.g. for processing a search request). It implements the feedforward algorithm, which should be fast, and the training algorithm, which can afford to be slower since it will only be used during idle time. For example, a search engine could process the accumulated feedback at night by training the neural net on the daily activity. To represent the activation of neurons at a given layer, we have an auxiliary ActivationMap class - a HashMap that returns a default value (usually 0) for missing keys (a minimal sketch of it follows the NeuralNet listing below). Activating a neuron involves summing all of its inputs multiplied by the corresponding synaptic strengths and then applying an activation function to the sum. The activation function could be a threshold function or a sigmoid function, like the tanh we use here. An abbreviated version of the NeuralNet class looks like this:
public class NeuralNet
{
    private HyperGraph graph;
    private ActivationFunction activationFunction = new TanhFun();
    Map<HGHandle, Double> inputLayer = null;
    Map<HGHandle, Double> hiddenLayer = null;
    Map<HGHandle, Double> outputLayer = null;

    // Compute the activations of all neurons downstream of the given layer.
    private Map<HGHandle, Double> activateNextLayer(Map<HGHandle, Double> previousLayer,
                                                    Map<HGHandle, Double> nextLayer)
    {
        for (Map.Entry<HGHandle, Double> in : previousLayer.entrySet())
        {
            Collection<Neuron> downstream = hg.getAll(graph,
                    hg.and(hg.type(Neuron.class),
                           hg.incident(in.getKey())));
            for (Neuron n : downstream)
            {
                HGHandle nHandle = graph.getHandle(n);
                if (!nextLayer.containsKey(nHandle))
                    nextLayer.put(nHandle,
                                  n.fire(previousLayer, activationFunction));
            }
        }
        return nextLayer;
    }

    public NeuralNet(HyperGraph graph)
    {
        this.graph = graph;
    }

    public Map<HGHandle, Double> getOutputLayer()
    {
        return outputLayer;
    }

    // Execute the network on the given input activations, loading only
    // the relevant neurons from the database as we go.
    public void feedforward(ActivationMap inputs)
    {
        inputLayer = inputs;
        hiddenLayer = activateNextLayer(inputLayer, new ActivationMap(0.0));
        outputLayer = activateNextLayer(hiddenLayer, new ActivationMap(0.0));
    }

    public void train(Collection<HGHandle> inputs, Collection<HGHandle> outputs,
                      HGHandle selectedOutput)
    {
        // Make sure every output atom has a Neuron connected to a hidden
        // neuron for this particular input combination.
        Collection<HGHandle> outputNeurons = updateNeuralStructure(inputs, outputs);
        ActivationMap inputMap = new ActivationMap(0.0);
        for (HGHandle in : inputs)
            inputMap.put(in, 1.0);
        // Translate the selected output atom into its output Neuron.
        selectedOutput = hg.findOne(graph,
                hg.apply(hg.targetAt(graph, 0),
                         hg.and(hg.type(NeuronOutput.class),
                                hg.incident(selectedOutput))));
        Map<HGHandle, Double> outputMap = new HashMap<HGHandle, Double>();
        for (HGHandle h : outputNeurons)
            outputMap.put(h, 0.0);
        outputMap.put(selectedOutput, 1.0);
        feedforward(inputMap);
        backpropagate(outputMap);
    }
}
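The ActivationMap used above is not shown in the listing. A minimal sketch of it, assuming it simply extends HashMap and supplies a default value for missing keys as described earlier, could look like this (the actual class in the repository may differ in detail):

// A HashMap of activation levels that returns a default value (e.g. 0.0)
// for neurons that have not been activated yet.
public class ActivationMap extends HashMap<HGHandle, Double>
{
    private double defaultValue;

    public ActivationMap(double defaultValue)
    {
        this.defaultValue = defaultValue;
    }

    @Override
    public Double get(Object key)
    {
        Double value = super.get(key);
        return value == null ? defaultValue : value;
    }
}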
The feedforward method takes an input map and executes the network, loading only relevant neurons on demand. The result of the execution will be stored in the outputLayer member variable.

The train method takes a collection of input atoms, a collection of output atoms and the single output atom that was selected amongst all outputs. It starts by making sure that all output atoms have a corresponding neuron that is connected to an appropriate hidden neuron for this particular input combination. All this is done in the call to updateNeuralStructure. Then the method creates an ActivationMap representing the expected activation of the output layer: 0 for all outputs except the selected one, which has an activation of 1. Then the network is executed on the input to produce its own result for the output layer (stored in the outputLayer member variable). Finally, backpropagation adjusts the synaptic strengths based on the difference between the expected and the calculated output. A version of train that uses a full map of expected output values could be trivially added here.
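Putting it together, here is roughly how a search application might use the class. The keyword and URL handles below (kwJava, kwDatabase, urlA, urlB, urlClicked) are purely hypothetical placeholders:

// Rank the output atoms for a search on two keywords.
NeuralNet net = new NeuralNet(graph);
ActivationMap query = new ActivationMap(0.0);
query.put(kwJava, 1.0);
query.put(kwDatabase, 1.0);
net.feedforward(query);
Map<HGHandle, Double> ranking = net.getOutputLayer(); // output Neuron handle -> score

// Later, when the user clicks on one of the results (urlClicked),
// use that feedback to train the network.
net.train(Arrays.asList(kwJava, kwDatabase),
          Arrays.asList(urlA, urlB, urlClicked),
          urlClicked);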

Explaining how backpropagation works and the math behind it is beyond the scope of this post. So we'll leave it at that. The full code is available from HyperGraphDB's SVN repository:


This post is intended mainly as a sample usage of HyperGraphDB. However, the code should be usable in a real world application. Get it from SVN, try it and write back if you have questions, bug reports or improvement suggestions.

Cheers,
Boris

Monday, January 4, 2010

HyperGraphDB 1.0 Released

Kobrix Software is pleased to announce the first official release of the HyperGraphDB database system. HyperGraphDB 1.0 is a general purpose, extensible, portable, distributed, embeddable, open-source data storage mechanism. Designed specifically for artificial intelligence and semantic web projects, it can also be used as an embedded object-oriented database for projects of all sizes.

HyperGraphDB is a Java based product built on top of the Berkeley DB storage library. It can be used as a single in-process database bound to a location on the local disk or within a "cloud" of networked database instances communicating and sharing data in a P2P (peer-to-peer) fashion.

Key Features of HyperGraphDB include:
  • Storage of generalized hypergraphs.
  • Open, extensible type system.
  • Basic query system and graph traversal algorithms.
  • Out-of-box support for Java object storage.
  • Thread-safe transactions.
  • P2P framework for data distribution.
For more information, documentation and downloads, please visit the HyperGraphDB Home Page.

Friday, January 1, 2010

Parsing the Google Wiki Format

Since we moved most of our projects to the Google Code servers and started using Google's wiki for documentation, I've been wanting to use that wiki as a publishing tool for anything related to a specific project. This saves the coding and administrative effort of managing everything on our own server - installing, administering or coding up yet again something that's been done a gazillion times and is already covered by blogs and wikis. Google offers an API to access the blogs, so we can embed the blog on the website easily, and there are enough examples on the internet of how to do that. The easiest thing to do is just copy and modify the examples from the Google Code Playground. However, I couldn't find an API, nor 3rd party code around the internet, for the wikis. Google's wikis are a bit lousy, admittedly, but they are good enough for most purposes, so I decided to write a parser that translates a wiki page to HTML, and now we can display docs on our website. Here is that code in the hope that it could benefit someone else.

First, Google stores the wiki pages of a project in that project's SVN repository. The URL to the latest version of a wiki page looks like this:

"http://" + project + ".googlecode.com/svn/wiki/" + page + ".wiki";

Here project is the name of your project at Google and page is the name of your wiki page. So a plain wiki file can be downloaded with a standard HTTP client.
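For example, fetching the raw wiki text with the standard java.net and java.io classes might look like this; the helper method name is just an assumption for illustration, and any HTTP client would do:

// Downloads the raw .wiki source for a given project and page.
static String fetchWikiSource(String project, String page) throws IOException
{
    String url = "http://" + project + ".googlecode.com/svn/wiki/" + page + ".wiki";
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new URL(url).openStream(), "UTF-8"));
    try
    {
        StringBuilder wikiText = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null)
            wikiText.append(line).append('\n');
        return wikiText.toString();
    }
    finally
    {
        reader.close();
    }
}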

So the parser, a class called GoogleWikiViewer, just downloads a .wiki from SVN, converts it to HTML and sends it back to the browser in a simple Java servlet that looks like this:

protected void doPost(HttpServletRequest request,
                      HttpServletResponse response)
    throws ServletException, java.io.IOException
{
    // ... your HTML preamble here, e.g. response.getWriter().println("");
    String project = request.getParameter("project");
    String wikipage = request.getParameter("page");
    if (!(project == null || project.length() == 0
          || wikipage == null || wikipage.length() == 0))
    {
        GoogleWikiViewer viewer = new GoogleWikiViewer(project, wikipage, "wikishow");
        viewer.toHtml(response.getWriter());
    }
    // ... finish HTML document output
}

In the above, the 'wikishow' parameter is the name of the servlet using that code snippet. GoogleWikiViewer uses it to construct URLs of the form "wikishow?project=theproject&page=thepage". I know the trend these days is to write this stuff in JavaScript, but JavaScript takes longer to debug, and in any case, if there's interest, this could easily be ported over to JavaScript and made into an AJAX library. For now, it's just a simple, standalone Java class:

Download GoogleWikiViewer.java.

Please comment with bug reports/fixes. Bear in mind that this was written in a couple of hours; it doesn't cover everything (e.g. comments and gadgets are not supported) and it probably has a bug here and there. But the HyperGraphDB wiki pages get displayed correctly, which means it covers the most commonly used features.