Knowledge Intensive Machine Learning, Part 1
Machine learning is all about learning from extensional information (datapoints). The incorporation of intensional information (domain knowledge) is generally done by hand via data encoding and feature and model selection. Doing this well is arguably the most difficult part of building a successful machine learning system. Wouldn't it be nice if our learning algorithm could make direct use of qualitative domain knowledge, say for model or feature selection, or for reducing the parameter space? This is the idea behind "knowledge intensive machine learning".
Probably the best (though almost trivial) example of this is the simple Bayesian network. A Bayes net is a probabilistic model that can be learned from data, but for which the user can very easily specify high-level qualitative domain knowledge (causal structure or conditional independences) that significantly constrains the parameter space. What else do we have? Not much.
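To get a feel for how much that constraint buys you, here's a toy back-of-the-envelope sketch (the number of variables and the parent bound are invented for illustration): a full joint over binary variables needs exponentially many parameters, while a Bayes net with a bounded number of parents per node needs only a handful.

```python
# Toy illustration (numbers are invented): parameter counts for a full
# joint distribution vs. a Bayes net over the same binary variables.
n_vars = 10
full_joint = 2 ** n_vars - 1           # 1023 free parameters for the full joint

max_parents = 2                        # structural knowledge: at most 2 parents per node
bayes_net = n_vars * 2 ** max_parents  # 40 free parameters (one per node per parent configuration)

print(full_joint, bayes_net)           # 1023 40
```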
Here's a motivating example: suppose you are building a simple anti-smoking propaganda Bayes net. You have domain knowledge that tells you which nodes should be connected (for example, smoking and lung cancer). But we know more than that structure: the more a person smokes, the more at risk they are for lung cancer. The qualitative reasoning folks would write this as "SmokingFrequency Q+ LungCancer", indicating a qualitative (perhaps stochastic) monotonic influence. Now, instead of using that statement for qualitative simulation, we use it to put a prior on our CPTs that enforces stochastic monotonicity, which yields better classifiers in low-data situations.
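As a rough sketch of what "SmokingFrequency Q+ LungCancer" means at the CPT level (the CPT values below are made up for illustration): for any two smoking levels s < s', the distribution over lung-cancer severity given s' should be stochastically larger than the one given s, i.e. its CDF should lie at or below the other's.

```python
import numpy as np

# Sketch (made-up numbers): check the Q+ constraint on a CPT.
# Rows index the ordered parent values (smoking: never, occasional, heavy);
# columns index the ordered child values (cancer risk: low, medium, high).
def is_stochastically_monotone(cpt):
    cdfs = np.cumsum(cpt, axis=1)              # per-row CDFs over the child values
    # each successive row's CDF must lie at or below the previous row's, pointwise
    return bool(np.all(np.diff(cdfs, axis=0) <= 1e-12))

cpt = np.array([[0.95, 0.04, 0.01],
                [0.85, 0.10, 0.05],
                [0.70, 0.20, 0.10]])
print(is_stochastically_monotone(cpt))         # True
```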
3 Comments:
So is the basic idea to put hard constraints on the space of distributions? What sort of constraints do you think are reasonable? For instance, when Shore and Johnson talked about inference in cases where the information is provided in the form of constraints over the space of distributions, they allowed any constraint that forms a closed, convex set (for instance, the set of all distributions with given moments m1, m2, m3 is closed and convex, and the same holds for monotonicity). Is that too broad, though?
By Yaroslav Bulatov, at 1:12 AM
Well, for now we use hard constraints, but only because that's the simplest case to try first, and because MAP then becomes a constrained MLE, which is tractable. The right thing to do (still assuming a Bayesian framework) is to use a soft prior that encodes the strength of our belief in the domain knowledge. We already have some results indicating that this should provide better performance across a wider range of training set sizes.
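To make the constrained MLE concrete, here's a minimal sketch (the counts are made up): for a binary child with an ordered parent, maximizing the likelihood subject to p1 <= p2 <= ... <= pk is just weighted isotonic regression of the empirical frequencies, which pool-adjacent-violators (PAVA) solves.

```python
# Sketch (invented counts): hard-constrained MLE for P(cancer=1 | smoking level)
# under p_1 <= p_2 <= ... <= p_k, via pool-adjacent-violators (PAVA).
def monotone_mle(successes, totals):
    # each block: [pooled frequency, pooled count, number of parent values pooled]
    blocks = []
    for s, t in zip(successes, totals):
        blocks.append([s / t, t, 1])
        # pool adjacent violators until the pooled frequencies are non-decreasing
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    # expand pooled blocks back to one estimate per parent value
    return [m for m, _, n in blocks for _ in range(n)]

# smoking levels: never, light, heavy; raw frequencies 0.10, 0.02, 0.15 violate monotonicity
print(monotone_mle([5, 1, 9], [50, 50, 60]))   # approximately [0.06, 0.06, 0.15]
```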
Let me get back to you on the rest of that question (and the question of whether the Bayesian framework is the best one for this problem). I want to do some more reading...
By E-Rock, at 12:11 AM
You're right, these ideas do seem very conducive to maxent methods. My limited background is very Bayesian, so it will take some thought to apply monotonicities to the maxent framework. In the Bayesian framework, there are no theoretical limits on the constraints; you just need to be able to actually implement them as a prior and do inference under that prior. Does that answer your question?
By E-Rock, at 9:42 AM