Probabilistic Inference with Naive Bayes
Saturday, 23 January 2016 · 21 min read · machine-learning

Today, let’s take a look at Naive Bayes from the ground up, and build a working implementation in JavaScript.
For one of my side projects, I’m working on a feature that lets you perform simple data analysis on small datasets via the browser. One use case is the automatic assignment of categories to a set of items (i.e. supervised learning.)
Since users’ datasets are small (N<60), we want to keep our model as simple as possible so it won’t overfit. This usually means making simple assumptions about the distribution the data comes from.
There are a handful of machine learning classifiers we can use: Naive Bayes, k-Nearest Neighbours, and linear SVMs. I also considered Decision Trees and MLPs, which we could visualize in the browser using client-side visualization libraries.
Naive Bayes is a simple probabilistic classifier that works surprisingly well despite its naive independence assumption, and has historically been used in document classification.
Intro to Bayes’ Theorem
I’m going to assume you know some basic probability theory.
As a refresher, here are a few definitions:
- Let A and B be events that can happen with some probability
- P(A): the probability that A happens
- P(A|B): the probability that A happens, given that B happens (conditional probability)
According to Bayes’ Theorem:

\[P(A|B) = \frac {P(B|A)P(A)} {P(B)}\]

That is, the probability P(A|B) of A happening given B is equal to the probability P(B|A) of B happening given A, times the probability P(A) of A happening, divided by the probability P(B) of B happening.
To clarify this idea further, let’s work through an example:
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, only 0.008 of the entire population has this cancer.

If the lab test returns a positive result for a new patient, should he be diagnosed as having cancer or not?
Based on the above, we have the following probabilities:
P(cancer) = 0.008 P(!cancer) = 0.992
P(+|cancer) = 0.98 P(-|cancer) = 0.02
P(+|!cancer) = 0.03 P(-|!cancer) = 0.97
We can then calculate the conditional probabilities of patients having cancer:
P(cancer|+) = P(+|cancer)P(cancer) / P(+)
P(!cancer|+) = P(+|!cancer)P(!cancer) / P(+)
P(+|cancer)P(cancer) = 0.0078
P(+|!cancer)P(!cancer) = 0.0298
P(cancer|+) = 0.0078 / (0.0078 + 0.0298) = 0.21
P(!cancer|+) = 0.79
Since P(!cancer|+) > P(cancer|+), we conclude that even when a new patient gets a positive test result, he is still more likely not to have cancer.
Bayesian learning algorithms calculate explicit probabilities for hypotheses, which typically requires initial knowledge of many probabilities. In practice, this means you’ll need labeled data beforehand (supervised learning).
Supervised Learning with Naive Bayes
The above example works for a single attribute: a test result. In real-world cases, however, we’ll often need more attributes to improve the accuracy of our predictions. That’s where Naive Bayes comes in.
A Naive Bayes classifier is a simple classification method based on Bayes’ Theorem and the assumption of conditional independence between attributes. That is, for attributes A and B and class Y:

\[P(A \land B|Y) = P(A|Y)P(B|Y)\]
Now, let’s say each sample in our training data looks as follows:

a1, ..., aN, Vj

where ai is the attribute at the ith position, and Vj is the jth class label in the dataset. According to our conditional independence assumption:
P(a1, ..., aN|Vj) = P(a1|Vj) P(a2|Vj) ... P(aN|Vj)
Applying Bayes’ Theorem, we’ll then be able to find P(Vj|a1, ..., aN), which is the probability that a given sample of attributes a1, ..., aN belongs to class Vj. This is the end result of our classification!
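Since the denominator P(a1, ..., aN) is the same for every class, we can drop it and simply pick the class that maximizes the numerator, giving the standard Naive Bayes decision rule:

\[V_{NB} = \underset{V_j}{\operatorname{argmax}}\ P(V_j) \prod_{i=1}^{N} P(a_i|V_j)\]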
Notice that Naive Bayes is an eager learner: it calculates all the probabilities beforehand in a preprocessing step, making classification time close to zero. This is in contrast to lazy learners such as k-Nearest Neighbours, which spend almost no time in training but more at classification time.
Naive Bayes Learner Implementation
Given the following training data, let’s build a Naive Bayes classifier to answer the following supervised learning question: is today a good day to Play outside?
Outlook | Temperature | Humidity | Wind | Play? |
---|---|---|---|---|
Sunny | 85 | 85 | Weak | No |
Sunny | 80 | 90 | Strong | No |
Overcast | 83 | 86 | Weak | Yes |
Rain | 70 | 96 | Weak | Yes |
Rain | 68 | 80 | Weak | Yes |
Rain | 65 | 70 | Strong | No |
Overcast | 64 | 65 | Strong | Yes |
Sunny | 72 | 95 | Weak | No |
Sunny | 69 | 70 | Weak | Yes |
Rain | 75 | 80 | Weak | Yes |
Sunny | 75 | 70 | Strong | Yes |
Overcast | 72 | 90 | Strong | Yes |
Overcast | 81 | 75 | Weak | Yes |
Rain | 71 | 91 | Strong | No |
As a recap, we want to calculate all the individual probabilities P(ai|Vj), so that we can calculate P(Vj|a1, ..., aN).
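As a quick sanity check against the table above: 9 of the 14 samples are labeled Yes, 2 of those have Outlook = Sunny, and Outlook takes 3 distinct values. With the Laplace smoothing used in the implementation below (add 1 to the count, and the number of distinct values to the denominator), we get:

\[P(\text{Sunny}|\text{Yes}) = \frac{2 + 1}{9 + 3} = 0.25\]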
Implementation
NaiveBayes.prototype.train = function() {
  var frequencies = {};
  var probabilities = {};
  var labelIndex = this.labelIndex;
  var columns = this.columns;
  var columnTypes = this.columnTypes;
  var data = this.data;
  var labelValues = _.map(data, function(sample) {
    return sample[labelIndex];
  });
  var labelKey = columns[labelIndex];
  // Calculate class frequencies/probabilities
  frequencies[labelKey] = _.countBy(data, function(sample) {
    return sample[labelIndex];
  });
  probabilities[labelKey] = _.mapObject(frequencies[labelKey], function(v, k) {
    return v / data.length;
  });
  // Calculate conditional probability for each non-numeric column value and class pair
  _.each(_.uniq(labelValues), function(labelValue) {
    _.each(_.without(columns, columns[labelIndex]), function(column, columnIndex) {
      var isNumericColumn = (columnTypes[columnIndex] === 'number');
      if (isNumericColumn) {
        return;
      }
      frequencies[column] = frequencies[column] || {};
      probabilities[column] = probabilities[column] || {};
      var columnValues = _.uniq(_.pluck(data, columnIndex));
      _.each(columnValues, function(columnValue) {
        frequencies[column][columnValue] = frequencies[column][columnValue] || {};
        probabilities[column][columnValue] = probabilities[column][columnValue] || {};
        var count = _.size(_.filter(data, function(sample) {
          return _.isEqual(sample[columnIndex], columnValue) &&
            _.isEqual(sample[labelIndex], labelValue);
        }));
        frequencies[column][columnValue][labelValue] = count;
        // Laplace smoothing: add 1 to the count and the number of distinct
        // column values to the denominator, so unseen values never get P = 0
        probabilities[column][columnValue][labelValue] =
          (count + 1) / (frequencies[labelKey][labelValue] + columnValues.length);
      });
    });
  });
  // Calculate frequencies, mean, standard deviation for numeric attributes
  var numericColumns = [];
  _.each(this.columnTypes, function(type, index) {
    var isNumericColumn = (columnTypes[index] === 'number');
    if (!isNumericColumn) {
      return;
    }
    numericColumns.push({name: columns[index], index: index});
  });
  _.each(numericColumns, function(obj) {
    var columnName = obj.name;
    var columnIndex = obj.index;
    frequencies[columnName] = {};
    probabilities[columnName] = {};
    // Group column values by sample label
    _.each(_.uniq(labelValues), function(labelValue) {
      var samples = _.filter(data, function(sample) {
        return sample[labelIndex] === labelValue;
      });
      frequencies[columnName][labelValue] = _.pluck(samples, columnIndex);
      probabilities[columnName][labelValue] = {};
      probabilities[columnName][labelValue].mean = math.mean(frequencies[columnName][labelValue]);
      probabilities[columnName][labelValue].std = math.std(frequencies[columnName][labelValue]);
    });
  });
  this.frequencies = frequencies;
  this.probabilities = probabilities;
  this.trainedAt = Date.now();
  return true;
};
NaiveBayes.prototype.predict = function(sample) {
  var blindColumnValues = _.without(this.columns, this.columns[this.labelIndex]);
  // Copy columnTypes before splicing, so we don't mutate this.columnTypes
  var blindColumnTypes = this.columnTypes.slice();
  blindColumnTypes.splice(this.labelIndex, 1);
  var errors = validateSample(this.data,
    blindColumnValues,
    blindColumnTypes,
    sample,
    {allowNull: true});
  if (!_.isEmpty(errors)) {
    throw new Error('ValidationError: ' + errors.join());
  }
  if (this.eagerTraining && this.hasDirtySamples) {
    this.train();
  }
  var answer = {};
  var columnTypes = this.columnTypes;
  var probabilities = this.probabilities;
  var labelKey = this.columns[this.labelIndex];
  // Start each class score from its prior probability P(Vj)
  _.each(_.keys(probabilities[labelKey]), function(labelValue) {
    answer[labelValue] = probabilities[labelKey][labelValue];
  });
  var attributeColumns = _.without(this.columns, this.columns[this.labelIndex]);
  _.each(attributeColumns, function(columnName, columnIndex) {
    var sampleValue = sample[columnIndex];
    if (_.isNull(sampleValue)) {
      // Skipped column: contribute nothing (multiply by 1)
      return;
    }
    var isNumericColumn = (columnTypes[columnIndex] === 'number');
    if (isNumericColumn) {
      // Numeric attribute: evaluate the per-class normal density
      _.each(_.keys(probabilities[columnName]), function(labelValue) {
        var obj = probabilities[columnName][labelValue];
        var probability = numericProbability(sampleValue, obj.mean, obj.std);
        answer[labelValue] = (answer[labelValue] * probability) || probability;
      });
    } else {
      // Categorical attribute: look up the smoothed conditional probability
      var columnValueProbabilities = probabilities[columnName][sampleValue];
      _.each(_.keys(columnValueProbabilities), function(labelValue) {
        answer[labelValue] = (answer[labelValue] * columnValueProbabilities[labelValue]) || columnValueProbabilities[labelValue];
      });
    }
  });
  var keys = _.keys(answer);
  var verboseAnswer = _.max(keys, function(k) { return answer[k]; });
  if (!this.verbose) {
    return verboseAnswer;
  }
  answer.answer = verboseAnswer;
  return answer;
};
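One practical caveat with predict: multiplying many small probabilities together can underflow to zero in floating point. A common remedy (not used above) is to sum log-probabilities instead; since log is monotonic, the winning class is unchanged. A minimal sketch of the idea:

// Hypothetical helper (not part of node-bayes): score a class by the
// sum of log-probabilities rather than their raw product, avoiding
// floating-point underflow when many attributes are multiplied.
function logScore(attributeProbabilities) {
  return attributeProbabilities.reduce(function(sum, p) {
    return sum + Math.log(p); // log(a * b) = log(a) + log(b)
  }, 0);
}
// e.g. compare logScore([0.25, 0.02, 0.9]) across classes instead of
// comparing 0.25 * 0.02 * 0.9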
The full implementation can be seen on GitHub: node-bayes
Handling numeric attributes
One consideration we have to take into account is how to handle numeric attributes such as Temperature and Humidity in our dataset.

We estimate the probability of a yet-unseen numeric value using a normal distribution, parameterized by the mean and standard deviation of the column’s sample values for each class.
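Concretely, for a numeric value x, with μ and σ the mean and standard deviation of the column’s values within class Vj, we evaluate the normal density:

\[P(x|V_j) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\]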
// Calculate frequencies, mean, standard deviation for numeric attributes
var numericColumns = [];
_.each(this.columnTypes, function(type, index) {
  var isNumericColumn = (columnTypes[index] === 'number');
  if (!isNumericColumn) {
    return;
  }
  numericColumns.push({name: columns[index], index: index});
});
_.each(numericColumns, function(obj) {
  var columnName = obj.name;
  var columnIndex = obj.index;
  frequencies[columnName] = {};
  probabilities[columnName] = {};
  // Group column values by sample label
  _.each(_.uniq(labelValues), function(labelValue) {
    var samples = _.filter(data, function(sample) {
      return sample[labelIndex] === labelValue;
    });
    frequencies[columnName][labelValue] = _.pluck(samples, columnIndex);
    probabilities[columnName][labelValue] = {};
    probabilities[columnName][labelValue].mean = math.mean(frequencies[columnName][labelValue]);
    probabilities[columnName][labelValue].std = math.std(frequencies[columnName][labelValue]);
  });
});
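The predict step above calls a numericProbability helper to turn a sample value into a probability using these per-class parameters. A minimal sketch of such a helper, assuming it simply evaluates the normal density (the actual helper in node-bayes may differ):

// Gaussian (normal) probability density for a numeric attribute value,
// given the per-class mean and standard deviation computed in train().
function numericProbability(x, mean, std) {
  var exponent = Math.exp(-Math.pow(x - mean, 2) / (2 * Math.pow(std, 2)));
  return exponent / (std * Math.sqrt(2 * Math.PI));
}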
Node.js module
I’ve written a Node.js module you can use to perform the above training and classification steps:
npm install node-bayes --save
// Require module
var bayes = require('node-bayes');
var TRAINING_COLUMNS = ['weather', 'temperature', 'humidity', 'wind', 'play?'];
var TRAINING_DATA = [
  ['Sunny',85,85,'Weak','No'],
  ['Sunny',80,90,'Strong','No'],
  ['Overcast',83,86,'Weak','Yes'],
  ['Rain',70,96,'Weak','Yes'],
  ['Rain',68,80,'Weak','Yes'],
  ['Rain',65,70,'Strong','No'],
  ['Overcast',64,65,'Strong','Yes'],
  ['Sunny',72,95,'Weak','No'],
  ['Sunny',69,70,'Weak','Yes'],
  ['Rain',75,80,'Weak','Yes'],
  ['Sunny',75,70,'Strong','Yes'],
  ['Overcast',72,90,'Strong','Yes'],
  ['Overcast',81,75,'Weak','Yes'],
  ['Rain',71,91,'Strong','No']
];
var cls = new bayes.NaiveBayes({
  columns: TRAINING_COLUMNS,
  data: TRAINING_DATA,
  verbose: true
});
cls.train();
var answer = cls.predict(['Sunny', 66, 90, 'Strong']);
console.log(answer);
You can view the source on GitHub. Pull requests are welcome!
In Python
In the Python data analysis ecosystem, there is a breadth of libraries containing implementations of Naive Bayes, such as GaussianNB from sklearn:
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
iris = datasets.load_iris()
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print("Number of mislabeled points out of a total %d points : %d" % (iris.data.shape[0],(iris.target != y_pred).sum()))
In Closing
Naive Bayes is a simple but powerful supervised learning algorithm. Let me know what you think! Pull requests to node-bayes are welcome.