The L.E.N.S.E. Project

(Learning from Expensive, Noisy, Slow Experts)

Version 0.2
This is a casually written draft document to explain the project.

Latest News

Rough implementation is now up on GitHub
Check out our early results.

Abstract

We explore making time-sensitive predictions on an infinite stream of “problem instances” by selectively querying humans for noisy partial information, and exploiting previous observations to train a classifier that can incrementally take over from the humans. This stream is assumed to be too large to be kept in memory, so querying backwards over previously seen examples is considered impractical.

A limited version of this setting has been independently proposed several times under different names over the past two decades. It’s been called “Active Learning in a streaming setting“, “Online Active Learning” or “Label Efficient Learning,” to name a few. This is generally proposed under two limitations: the “problem instances” must be multi-class classification problems, and the learning algorithm optimizes a Frequentist objective that does not provide a way for you to specify the complicated trade-off between expense, time delay, and expected number of errors.

This project proposes two improvements on the historical work:

We propose a way to do structured instance prediction in the general case, by which we mean we can handle arbitrary multi-variable problem instances, with factor-graph structure. We also allow collection of human opinions about one variable at a time. (We stress arbitrary factor graphs, because there have been domain specific online structured prediction systems in the past, e.g. dependency parsing by Dan Roth)
We also provide a novel fully Bayesian approach to Online Active Learning, which can optimize arbitrary loss functions.

The L.E.N.S.E. motto: “Fake it until you make it”

Humans are already AI complete, so why not fake hyper intelligent AI by just asking for help once in a while? Generally, this doesn’t scale because it’s too expensive. Our trick is to learn from the human responses we get, and phase out humans as we learn. In short, the LENSE approach to hard AI: “fake it until you make it.”

Goals:

Provide arbitrary precision classifiers, by asking humans.
Make our classifiers cheap by incorporating machine learning.
Be useful to the real world.

Motivating Example:

You answer the phone for a Chinese Food restaurant. You start to notice that every call follows pretty much the same flow, and you want to know if it would be possible to teach a computer to take over the phones for you. We’ll call this the Chinese Restaurant Process (pun intended)

the Chinese Restaurant Process (pun intended)

If we could have AI advanced enough to figure out the decisions at the red parts, we can let MacSpeech do the rest. No such AI exists at the moment, so we’ll have to fake it. LENSE aims to do exactly that.

People are AI complete. Why is this news?

Well, it isn’t, quite. The trick is money. People want to get paid for the work they do. Businesses don’t like how that scales.

Computers can learn from examples in real time, and not bother humans with questions the model already knows about. This will save money. Costs should scale at roughly $O(log(N))$ , instead of $O(N)$ for the no-learning case, as the computer learns to handle the common kinds of queries independently. The tagline here is decreasing marginal cost.

relative scaling costs

A simple example

Let’s pick apart the confirmation node on the Chinese Restaurant example, and talk about exactly how L.E.N.S.E. works in a very simple case.

confirmation detection example

First, we get a “prior belief” over the possible values of $x$ , $P_{\theta}(x)$ , according to our model. This comes from a standard machine learning model, which we’ll explain in the next section.

In the absence of other information, $P(x)$ , our belief over the true value of $x$ , is equal to $P_{\theta}(x)$ . $P(x)$ can be more or less “peaky.” For a less peaky $P(x)$ , we can increase the peakiness of our distribution over $x$ by querying a human, and using Bayes’ rule: $P(x|h) = P(x)*P(h|x)$ . Once sufficient mass is concentrated in one class, we turn in our result.

Deciding whether to query another human or turn in our current best guess is complicated. We resort to Bayesian Decision Theory here, where we specify some loss function $L(c,t,E)$ which we seek to minimize. The inputs are: $c$ is the financial cost, $t$ is the time, and $E$ is the expectation that our guess is wrong ( $E = 1 - P(guess)$ ). We make decisions in order to minimize expected loss. We get to this in the section after next.

Learning $P_{\theta}(x)$ and $P(h|x)$

The key to making LENSE work is that its parameters for the model $P_{\theta}(x)$ are learned over time. We want fewer and fewer queries to be made to humans as (features, opinion) pairs accumulate and the parameters are refined. As an added bonus, LENSE should also learn the parameters for $P(h|x)$ as accurately as possible.

We will perform this learning in a PGM structure, taking advantage of existing algorithms for handling learning with unobserved variables.

Note: In the naive case I present here, this can be done with EM in batches over all currently available data. It turns out Percy has a paper on doing it in the online case, so that’s no problem: Online EM for Unsupervised Models (NAACL ‘09)

In our Chinese Restaurant confirmation system example, we have the following structure (which can be extended, we discuss that later):

plate model

The only thing not specified here is exactly how the model weights combine with features to create $P_{\theta}(x)$ , and how human error weights combine with the hidden true value $P(h|x)$ . This is by design, to allow flexibility within the framework.

$P_{\theta}(x)$ and $P(h|x)$ can be any functions that output valid probability distributions. This includes logistic regression, neural networks ending in a normalized layer, and everything in between.

Minimizing $L(c,t,E)$

The decision to query humans is made based on our current beliefs $P(x)$ , and our preferences about outcomes, as expressed in $L(c,t,E)$ .

In some settings, the cost of a 1% chance of error is far higher than the cost of asking 3 more people their opinions (ex. lawyers collecting case data). In other settings, the cost of an additional 500ms of latency is unacceptable (ex. hedge fund news analysis, consumer products). In still others, cost is the primary concern and both accuracy and latency aren’t as important (ex. corpus construction for ML). This is all encoded precisely by $L(c,t,E)$ .

Bayesian statisticians are in general a pessimistic bunch, looking at every outcome as incurring some loss, and we follow their lead. Loss is a function of the cost $c$ , the time $t$ , and the expectation of error, $E$ . $c$ and $t$ are trivial to calculate, and $E$ is calculated by taking the probability under our current $P(x)$ that our guess $g$ is incorrect. That’s $E = 1-P(g)$ .

Making no assumptions about the form of $L(c,t,E)$ , we are trying to make a decision that minimizes our expected loss. We can look at this as a game tree, where our “opponent” (the crowd we’re querying) is playing the statistical strategy $P(h) = P(h|x)P_{\theta}(x)$ . We can calculate $L(c,t,E)$ in closed form whenever we choose to “turn in” our best guess at a node in our game tree, since we can calculate $E$ and we know $c$ , and can take an expectation over $t$ at that point.

query game structure

There’s lots of existing work on how to optimally play a game like this. Exhaustive tree traversal is impossible in the general case, because of the exponential size of the tree. In those situations, there are quite a few approximate algorithms to guess expectations via sampling paths in the tree.

The multivariable case

So far we’ve fleshed out what $K$ -way classification looks like. Lots of problems can be reduced to a (potentially huge) sequence of $K$ -way classification tasks, but that can be very inefficient.

Let’s imagine using LENSE to do accurate, low-latency NER for use in some pipeline system (say this is for that hedge fund that accidentally thought Anne Hathaway positive mentions applied to Berkshire Hathaway). That is generally represented by a sequence model, because the NER classification of a token’s neighbor effects its own NER classification.

We want to be able to ask humans about single tokens in the sequence, since that is much faster and cheaper than asking for labels of the whole sequence. We’d like to be able to use information about some tokens to effect others, and to know which tokens to ask queries about.

Going to arbitrary factor-graphs is a relatively straightforward change to the machinery we’ve already built up. Learning doesn’t need to be changed, we just insert the factor graph where we used to just have a single node “True X,” and hang human observations off of the individual nodes within the factor graph that they were observations for. Our loss function $L(c,t,E)$ needs to be generalized so that $E$ is a vector, where each entry $E_i$ the probability of getting the class of node $i$ in the graph wrong. Our game player for minimizing expected loss also now needs to generalize to decide which variable to query for, if it chooses to ask humans. Otherwise everything is exactly as before.

anne hathaway ner example

There are a lot of interesting real world tasks that can be cast as MAP inference over a set of variables. In these tasks, it’s often (though not always) faster to query humans about individual variables than whole groups, and these settings especially should see a huge speedup with LENSE.

In full generality

LENSE expects queries to arrive in the form $(G, f(x))$ . Here $G$ is a factor graph of random variables we want the values for, where each factor represents feature values (but no weights or probabilistic interpretation).

$f(x)$ is a function that we can pass a subset of variables, and it will return us a human opinion on them at some cost we know, and with some human error rate we don’t know. $f(x)$ is responsible for phrasing the question to humans, making the query, and returning an answer.

LENSE will attempt to respond with the MAP assignment to $G$ . It will do this by calling $f(x)$ on subsets of variables to gather evidence. The number of times $f(x)$ will be called, and on which variables, is balanced by a Bayesian Decision Theoretic model that seeks to find the MAP assignment in minimal time, with minimum monetary cost, and with maximum accuracy, which are balanced according to an equation you specify.

For every stream of queries to LENSE that are from the same “task”, there are 3 things you need to give it:

$P_{\theta}(x)$ : the kind of model to use for your problem (log-linear, SVR, NN, etc). Your must have some statistical interpretation, and the better tuned the better.
$P(h|x)$ : the kind of model to use for modeling human error, where $h$ is the human response, and $x$ is the true value. This could be as simple as “With probability $\epsilon$ human chooses uniformly at random, otherwise $h = x$ ”, or as complicated as a rich template model with lots of complicated shared unobserved parameters.
$L(c,t,E)$ : your loss function, which you would like minimized, as a function of money spend $m$ , time delay for responses $t$ , and expected error $E$ . This can be arbitrarily complex, but a simple example is $L(c,t,E) = c + 0.5t^2 + 4E$ , where the user is less concerned by cost than error rate (by a factor of 4), and really doesn’t want long delays (hence the square term).

Once you’ve specified all of this, and are sending queries of the form $(G,f(x))$ , it’s up to LENSE to behave such that your loss function is minimized (getting high accuracy, low cost, and short delays), and to learn parameters to both of your models (human error and priors).

That’s all folks!

That’s all that LENSE is so far. What follows are some suggestions for ways to contribute to the project both in a way that leads to academic papers and in ways that are useful to others (and sometimes both at the same time).

A Central Implementation:

The unsexy first order of business is a central implementation people can use as a platform for experimentation. That means abstract classes for each of the subcomponents, so changes are easily pluggable. It also means a lot of plumbing and machinery to get the whole thing working, and some careful interface design.

A GitHub repo is up, but currently at a very early stage, with no documentation at all. It implements all factor-graph calculations using FACTORIE. Contributors are of course welcome.

Research Ideas

LENSE can be broken down into several subcomponents, which can each be improved individually, more or less. This provides an opportunity to fruitfully work completely separately, but with mutual goals, which I’ve always found was the most productive way to go.

Better ways to model $P(h|x)$ :

This is a hot topic in the ML community because it leads to more accurate corpus construction. Lots of models have been tried, each one with more hidden variables and complicated plate diagrams than the last. NIPS has held a workshop on this for the last two years, which has yielded some papers on error modeling.

Related Work:

The right objective for $P_{\theta}(x)$ and $P(h|x)$ :

Any model for $P_{\theta}(x)$ that gets used with LENSE needs to provide accurate statistical estimates, and must return a uniform distribution when it doesn’t know. This is more important than F1. That means that regularization is very important, and that we can’t expect optimal performance from many off-the-shelf algorithms (basically all of classification, and even some regression setups).

Picking the right training objective for this is key. This should probably involve a reference to $L(c,t,E)$ , because a stronger regularizer will increase your expected cost $c$ and delay $t$ in return for better guarantees about your guess of your error expectation $E$ being an overestimate.

Related Work:

Statistical Decision Theory and Bayesian Analysis (this is a whole book, but its concept of applying a loss function to statistical estimation of values might be a good place to start, beginning on p. 24)

Interface design to optimize human response time when asking questions:

There’s a human interface problem of how we ask questions to minimize delay. This has been explored in the past in HCI journals like CHI, so that’s an excellent place to start.

Related Work:

Cognitively Inspired Task Design to Improve User
Performance on Crowdsourcing Platforms (CHI ‘14)

A richer set of choices to optimize $L(c,t,E)$ over:

We could have a richer set of choices for the algorithm. For example, what if there are multiple available humans, each with a different length of queue of requests (effecting expected delay) and a different accuracy and rate of completion. We would then have to choose between humans.

And then what about the case where there is a parallel stream of requests appearing at LENSE, and we have to account for queuing theory as we traverse the game tree of because possible human response times will change.

Related Work:

The L.E.N.S.E. Project

Latest News

Abstract

The L.E.N.S.E. motto: “Fake it until you make it”

Goals:

Motivating Example:

People are AI complete. Why is this news?

A simple example

Learning P_{\theta}(x) and P(h|x)

Minimizing L(c,t,E)

The multivariable case

In full generality

That’s all folks!

A Central Implementation:

Research Ideas

Better ways to model P(h|x):

The right objective for P_{\theta}(x) and P(h|x):

Interface design to optimize human response time when asking questions:

A richer set of choices to optimize L(c,t,E) over:

Learning $P_{\theta}(x)$ and $P(h|x)$

Minimizing $L(c,t,E)$

Better ways to model $P(h|x)$ :

The right objective for $P_{\theta}(x)$ and $P(h|x)$ :

A richer set of choices to optimize $L(c,t,E)$ over: