Journal Updates
As part of the Canadian Distributed Mentorship Program, here is where I will be posting my journal entries relating to the work done.
Week Eight: June 19 - June 23
This week, Susanna Still is in town. She is working on a paper on Active Learning and Optimal Predictions (no link yet...). Doina and she thought I might be interested in implementing the algorithm outlined in the paper, so on Monday we had a meeting to discuss it.
Susanna's paper presents an information-theoretic approach to the problem of learning an internal representation of an environment. As opposed to the Reinforcement Learning methods I have described in other entries, the learner does not seek to maximize the total or average reward, but rather to maximize the predictive information that its internal model carries about the world. This approach has several things in common with Predictive State Representations (PSRs). The system is partially observable, which means that the agent receives observations, rather than actual state information, from the environment. Based on a history of available information reaching time T into the past, as well as a set of probabilistic rules, the agent should choose an action such that the internal representation retains maximal predictive information. Since the history has finite length, one tries to find a compressed representation of the available past that reflects both the current internal model and the available observations.
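In other words (and this is my own paraphrase in information-bottleneck style, not a formula lifted from the paper), the representation s of the past should be chosen to solve something like:

```latex
\max_{p(s \mid \mathrm{past})} \; I(s;\, \mathrm{future}) \;-\; \lambda\, I(s;\, \mathrm{past})
```

where the first term rewards predictive power, the second charges for every bit of the past the representation keeps, and lambda sets the exchange rate between the two.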
Of course, a trade-off between exploration and control needs to be found. As such, the optimal solution contains two terms: the first term says that, given a past, the optimal action should minimally perturb the system; the second term promotes exploration, and ensures that the optimal action induces a distribution over the future that is far from average (i.e., the agent should look for unlikely futures that have not been seen before).
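Just to pin down how those two terms might fight each other in code, here is a tiny sketch of scoring a candidate action with two KL-divergence terms; the function names and the exact divergence form are my guesses at the structure, not the paper's actual solution:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Discrete KL divergence D(p || q), with smoothing to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def score_action(p_future_action, p_future_unperturbed, p_future_average, lam):
    """Control term: stay close to the future the system would have had anyway.
    Exploration term: prefer futures far from the average. lam trades them off."""
    control = kl(p_future_action, p_future_unperturbed)  # perturbation cost
    explore = kl(p_future_action, p_future_average)      # novelty bonus
    return -control + lam * explore
```

Under this toy scoring, the "optimal" action would just be the argmax of the scores over the candidate actions.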
Paper abstracts and I are very good friends. They summarize the information core dump that is about to happen in a friendly, feel-good paragraph. If you're lucky, you can catch the key words, and chew on them, so that by the time they are actually introduced in the paper, you feel like you've known them forever. A really good abstract can save you reading the entire paper. Paper abstracts are great.
Graphs and diagrams also deserve a worthy mention. They soften the blow of algorithm results. Sometimes the paper introduces something so revolutionary and creative that you feel you've just been told that 1+1=3. Sometimes you are actually told that 1+1=3. It might alarm you and leave you entirely unconvinced... but ah! Add a graph, some plotted results, an "x vs. y" diagram, and the world is right again. Everyone loves the "x vs. y" diagrams.
Such were the papers I had previously read. I knew this was a mathy, theory-heavy paper, but I felt I was ready. Worst-case scenario, I'd just skip the proofs and go with the flow. I wish that had been the worst-case scenario. On the [0,1] scale, this was the 0 that never, ever happens. After the first brief skim, the paper was 17 pages of sigmas, math formulae, conditional probabilities, and the odd English word in between.
I was a lost and confused llama.
By Thursday-ish, I was on my 5th try at the paper, and things were finally making sense. The proofs in the appendix were still a bit of a mystery, but I could take my time with those after I was done the algorithm. In theory, the algorithm sounds easy: all I needed to do was get the input distributions, type in the system of equations, iteratively solve it and test it. However, as I started thinking about it, it became obvious that this was not going to go smoothly. The optimal solution consists of iteratively solving a system of 7 equations. There are two input distributions needed for this to be possible: p(past), or what is the probability that a particular action:observation pair will be seen, and p(future|past,action), which is the probabillity that a particular observation will be seen, given a past history, and an action. And, despite my best intentions, I had no idea how to get the latter. I met with Susanna, who explained to me that I had to simulate this distribution, by generating a number of pasts and futures. She also gave me code that her and her students have started writing, but never finished, for an algorithm very similar to this, in hopes it might help. And so I started reading the code...
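Since I have to simulate p(future|past,action) by generating pasts and futures anyway, here is a rough sketch of the kind of counting estimator I have in mind; the environment interface `step` and every name in it are hypothetical placeholders of mine, not Susanna's code:

```python
import numpy as np
from collections import defaultdict

def estimate_distributions(step, actions, n_runs, past_len, future_len, seed=0):
    """Estimate p(past) and p(future | past, action) by brute-force simulation.

    `step(state, action, rng) -> (new_state, observation)` is a hypothetical
    environment interface; a "past" is a tuple of past_len (action, observation)
    pairs, and a "future" is the tuple of future_len observations following
    the branching action.
    """
    rng = np.random.default_rng(seed)
    past_counts = defaultdict(int)
    future_counts = defaultdict(lambda: defaultdict(int))

    for _ in range(n_runs):
        state = None  # assume step() treats state=None as a fresh start
        history = []
        # roll out the past under random actions
        for _ in range(past_len):
            a = actions[rng.integers(len(actions))]
            state, obs = step(state, a, rng)
            history.append((a, obs))
        past = tuple(history)
        # take the branching action, then roll out the future
        branch = actions[rng.integers(len(actions))]
        a, future = branch, []
        for _ in range(future_len):
            state, obs = step(state, a, rng)
            future.append(obs)
            a = actions[rng.integers(len(actions))]
        past_counts[past] += 1
        future_counts[(past, branch)][tuple(future)] += 1

    # normalize the counts into probability tables
    total = sum(past_counts.values())
    p_past = {p: c / total for p, c in past_counts.items()}
    p_future = {k: {f: c / sum(fc.values()) for f, c in fc.items()}
                for k, fc in future_counts.items()}
    return p_past, p_future
```

The obvious worry is how many runs it takes before each (past, action) pair has been seen often enough for the counts to mean anything; I suspect that is exactly where this will stop being easy.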