Probabilistic Inference with Naive Bayes

Today, let’s take a look at Naive Bayes from the ground up, and walk through a working implementation in JavaScript.

For one of my side projects, I’m working on a feature that lets you perform simple data analysis on small datasets in the browser. One use case is the automatic assignment of categories to a set of items (i.e. supervised learning).

Since users’ datasets are small (N<60), we want to keep our model as simple as possible so it won’t overfit. This usually means making simple assumptions about the distribution the data comes from.

There are a handful of machine learning classifiers we could use: Naive Bayes, k-nearest neighbours, and linear SVMs. I also considered decision trees and multilayer perceptrons (MLPs), which can be visualized in the browser using client-side visualization libraries.

Naive Bayes is a simple probabilistic classifier that works well in practice despite its strong independence assumption, and it has historically been used in document classification.

Intro to Bayes’ Theorem

I’m going to assume you know some basic probability theory.

As a refresher, here are a few definitions:

  • Let A and B be events that can happen with some probability
  • P(A): probability that A happens
  • P(A|B): probability that A happens, given that B happens (conditional probability)

According to Bayes’ Theorem:

\[P(A|B) = \frac {P(B|A)P(A)} {P(B)}\]

That is, the probability P(A|B) of A happening given B is equal to the probability P(B|A) of B happening given A, times the probability P(A) of A happening, divided by the probability P(B) of B happening.

To clarify this idea further, let’s work through an example:

A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the cancer is actually present, and a correct negative result in only 97% of the cases in which it is not present. Furthermore, 0.008 of the entire population has this cancer.

For a new patient whose lab test returns a positive result, should he be diagnosed as having cancer or not?

Based on the above, we have the following probabilities:

P(cancer) = 0.008			P(!cancer) = 0.992
P(+|cancer) = 0.98			P(-|cancer) = 0.02
P(+|!cancer) = 0.03			P(-|!cancer) = 0.97

We can then calculate the conditional probabilities of patients having cancer:

P(cancer|+) = P(+|cancer)P(cancer) / P(+)
P(!cancer|+) = P(+|!cancer)P(!cancer) / P(+)

P(+|cancer)P(cancer) = 0.0078
P(+|!cancer)P(!cancer) = 0.0298
P(cancer|+) = 0.0078 / (0.0078 + 0.0298) = 0.21
P(!cancer|+) = 0.79

Since P(!cancer|+) > P(cancer|+), we conclude that even if a new patient gets a positive test result, he most likely does not have cancer.
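
We can sanity-check this arithmetic with a few lines of JavaScript (a quick throwaway sketch, separate from the classifier we’ll build later):

// Priors and test characteristics from the example above
var pCancer = 0.008;
var pNotCancer = 0.992;
var pPosGivenCancer = 0.98;
var pPosGivenNotCancer = 0.03;

// Numerators of Bayes' Theorem
var cancerScore = pPosGivenCancer * pCancer;          // ~0.0078
var notCancerScore = pPosGivenNotCancer * pNotCancer; // ~0.0298

// Normalize by P(+) = sum of both numerators
var pPos = cancerScore + notCancerScore;
console.log(cancerScore / pPos);    // ~0.21 = P(cancer|+)
console.log(notCancerScore / pPos); // ~0.79 = P(!cancer|+)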

Bayesian learning algorithms calculate explicit probabilities for hypotheses, which typically requires initial knowledge of many probabilities. In practice, this means you’ll need labeled data beforehand (supervised learning).

Supervised Learning with Naive Bayes

The above example works for a single attribute: a test result. However, in real-world cases we’ll often need more attributes in order to improve the accuracy of our predictions. That’s where supervised learning comes in.

A Naive Bayes classifier is a simple classification method that is based on Bayes’ Theorem and the assumption of conditional independence.

That is:

P(A ^ B | Y) = P(A | Y) P(B | Y)
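
For example, if Y is “has the flu”, we assume that a fever (A) and a sore throat (B) occur independently of each other once we know the patient has the flu.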

Now, let’s say our training data looks as follows:

a1, ..., aN, Vj

where ai is the attribute at the ith position, and Vj is the jth class label in the dataset. According to our conditional independence assumption:

P(a1, ..., aN|Vj) = P(a1|Vj) P(a2|Vj) ... P(aN|Vj)

According to Bayes’ Theorem, we’ll then be able to find:

P(Vj|a1, ..., aN)

which is the probability that a given sample of attributes a1, ..., aN belongs to a class Vj. This is the end result of our classification!
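
Note that the denominator P(a1, ..., aN) is the same for every class, so we can drop it and simply pick the class Vj that maximizes the numerator. This is the maximum a posteriori (MAP) decision rule:

\[V_{MAP} = \arg\max_{V_j} P(V_j) \prod_{i=1}^{N} P(a_i|V_j)\]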

Notice that Naive Bayes is an eager learner: it calculates all the probabilities beforehand in a preprocessing step, making classification time close to zero. This is in contrast to lazy learners such as k-nearest neighbours, which spend less time in training and more at classification time.

Naive Bayes Learner Implementation

Given the following training data, let’s build a Naive Bayes classifier to answer the following supervised learning question: is today a good day to Play outside?

Outlook   Temperature  Humidity  Wind    Play?
Sunny     85           85        Weak    No
Sunny     80           90        Strong  No
Overcast  83           86        Weak    Yes
Rain      70           96        Weak    Yes
Rain      68           80        Weak    Yes
Rain      65           70        Strong  No
Overcast  64           65        Strong  Yes
Sunny     72           95        Weak    No
Sunny     69           70        Weak    Yes
Rain      75           80        Weak    Yes
Sunny     75           70        Strong  Yes
Overcast  72           90        Strong  Yes
Overcast  81           75        Weak    Yes
Rain      71           91        Strong  No

As a recap, we want to calculate all the individual probabilities P(ai|Vj) along with the class priors P(Vj), so that we can calculate P(Vj|a1, ..., aN).
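
For instance, counting rows in the table above:

P(Yes) = 9/14 ≈ 0.64			P(No) = 5/14 ≈ 0.36
P(Sunny|Yes) = 2/9			P(Sunny|No) = 3/5

(The implementation below applies Laplace smoothing when estimating conditional probabilities, so its numbers will differ slightly from these raw counts.)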

Implementation

NaiveBayes.prototype.train = function() {
  var frequencies = {};
  var probabilities = {};
  var labelIndex = this.labelIndex;
  var columns = this.columns;
  var columnTypes = this.columnTypes;
  var data = this.data;
  var labelValues = _.map(data, function(sample) {
    return sample[labelIndex];
  });
  var labelKey = columns[labelIndex];

  // Calculate class frequencies/probabilities
  frequencies[labelKey] = _.countBy(data, function(sample) {
    return sample[labelIndex];
  });
  probabilities[labelKey] = _.mapObject(frequencies[labelKey], function(v, k) {
    return v / data.length;
  });

  // Calculate conditional probability for each non-numeric column value and class pair
  _.each(_.uniq(labelValues), function(labelValue) {
    _.each(_.without(columns, columns[labelIndex]), function(column, columnIndex) {
      var isNumericColumn = (columnTypes[columnIndex] === 'number');
      if (isNumericColumn) {
        return; // numeric columns are summarized separately below
      }

      frequencies[column] = frequencies[column] || {};
      probabilities[column] = probabilities[column] || {};

      var columnValues =  _.uniq(_.pluck(data, columnIndex));
      _.each(columnValues, function(columnValue) {
        frequencies[column][columnValue] = frequencies[column][columnValue] || {};
        probabilities[column][columnValue] = probabilities[column][columnValue] || {};
        var count = _.size(_.filter(data, function(sample) {
          return _.isEqual(sample[columnIndex], columnValue) &&
                 _.isEqual(sample[labelIndex], labelValue);
        }));
        frequencies[column][columnValue][labelValue] = count;
        // Laplace (add-one) smoothing: an unseen value/class pair gets a
        // small non-zero probability instead of zeroing out the product
        probabilities[column][columnValue][labelValue] = (count + 1) / (frequencies[labelKey][labelValue] + columnValues.length);
      });
    });
  });

  // Calculate frequencies, mean, standard deviation for numeric attributes
  var numericColumns = [];
  _.each(this.columnTypes, function(type, index) {
    var isNumericColumn = (columnTypes[index] === 'number');
    if (!isNumericColumn) {
      return;
    }
    numericColumns.push({name: columns[index], index: index});
  });
  _.each(numericColumns, function(obj) {
    var columnName = obj.name;
    var columnIndex = obj.index;
    frequencies[columnName] = {};
    probabilities[columnName] = {};

    // Group column values by sample label
    _.each(_.uniq(labelValues), function(labelValue) {
      var samples = _.filter(data, function(sample) {return sample[labelIndex] === labelValue;});
      frequencies[columnName][labelValue] = _.pluck(samples, columnIndex);

      probabilities[columnName][labelValue] = {};
      probabilities[columnName][labelValue].mean = math.mean(frequencies[columnName][labelValue]);
      probabilities[columnName][labelValue].std = math.std(frequencies[columnName][labelValue]);
    });
  });

  this.frequencies = frequencies;
  this.probabilities = probabilities;
  this.trainedAt = Date.now();

  return true;
};
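
A couple of notes on train: the (count + 1) / (classCount + columnValues.length) expression is Laplace (add-one) smoothing, which keeps a previously unseen attribute/class pair from assigning a zero probability and wiping out the entire product at prediction time. Numeric columns are not counted this way; they’re summarized by a per-class mean and standard deviation, which we’ll use below.
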
NaiveBayes.prototype.predict = function(sample) {
  var blindColumnValues = _.without(this.columns, this.columns[this.labelIndex]);
  // Copy before splicing so we don't mutate this.columnTypes
  var blindColumnTypes = this.columnTypes.slice();
  blindColumnTypes.splice(this.labelIndex, 1);
  var errors = validateSample(this.data,
                              blindColumnValues,
                              blindColumnTypes,
                              sample,
                              {allowNull: true});
  if (!_.isEmpty(errors)) {
    throw new Error('ValidationError: ' + errors.join());
  }
  if (this.eagerTraining && this.hasDirtySamples) {
    this.train();
  }

  var answer = {};
  var columnTypes = this.columnTypes;
  var probabilities = this.probabilities;
  var attributeColumns = _.without(this.columns, this.columns[this.labelIndex]);
  _.each(attributeColumns, function(columnName, columnIndex) {
    var columnValueProbabilities = probabilities[columnName][sample[columnIndex]];

    // Calculate probabilities
    var isNumericColumn = (columnTypes[columnIndex] === 'number');
    if (isNumericColumn) {
      var sampleValue = sample[columnIndex];
      _.each(_.keys(probabilities[columnName]), function(labelValue) {
        if (_.isNull(labelValue)) { // Skipped columns
          answer[labelValue] = (answer[labelValue] * 1) || 1;
        }

        var obj = probabilities[columnName][labelValue];
        var mean = obj.mean;
        var std = obj.std;

        var probability = numericProbability(sampleValue, mean, std);
        answer[labelValue] = (answer[labelValue] * probability) || probability;
      });
    } else {
      _.each(_.keys(columnValueProbabilities), function(labelValue) {
        if (_.isNull(labelValue)) { // Skipped columns
          answer[labelValue] = (answer[labelValue] * 1) || 1;
        }

        answer[labelValue] = (answer[labelValue] * columnValueProbabilities[labelValue]) || columnValueProbabilities[labelValue];
      });
    }
  });

  var keys = _.keys(answer);
  var verboseAnswer = _.max(keys, function(k) { return answer[k];});

  if (!this.verbose) {
    return verboseAnswer;
  }

  answer.answer = verboseAnswer;
  return answer;
};

The full implementation can be seen on GitHub: node-bayes

Handling numeric attributes

One consideration is how to handle numeric attributes such as Temperature and Humidity in our dataset.

We calculate the probability of a yet-unseen numeric value by assuming the column’s values follow a normal distribution, using the mean and standard deviation of the sample values in that column for each class.
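
Concretely, for a class whose samples have mean μ and standard deviation σ, the density assigned to an observed value x is:

\[f(x) = \frac {1} {\sqrt{2\pi\sigma^2}} e^{-\frac {(x-\mu)^2} {2\sigma^2}}\]

The training step below computes the per-class mean and standard deviation that parameterize this density: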

  // Calculate frequencies, mean, standard deviation for numeric attributes
  var numericColumns = [];
  _.each(this.columnTypes, function(type, index) {
    var isNumericColumn = (columnTypes[index] === 'number');
    if (!isNumericColumn) {
      return;
    }
    numericColumns.push({name: columns[index], index: index});
  });
  _.each(numericColumns, function(obj) {
    var columnName = obj.name;
    var columnIndex = obj.index;
    frequencies[columnName] = {};
    probabilities[columnName] = {};

    // Group column values by sample label
    _.each(_.uniq(labelValues), function(labelValue) {
      var samples = _.filter(data, function(sample) {return sample[labelIndex] === labelValue;});
      frequencies[columnName][labelValue] = _.pluck(samples, columnIndex);

      probabilities[columnName][labelValue] = {};
      probabilities[columnName][labelValue].mean = math.mean(frequencies[columnName][labelValue]);
      probabilities[columnName][labelValue].std = math.std(frequencies[columnName][labelValue]);
    });
  });
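
The numericProbability helper referenced in predict evaluates this Gaussian density. Here’s a minimal sketch of what it looks like (the actual helper in node-bayes may differ slightly in its details):

// Gaussian probability density of a sample value x,
// given a class's mean and standard deviation
function numericProbability(x, mean, std) {
  var variance = std * std;
  var exponent = Math.exp(-Math.pow(x - mean, 2) / (2 * variance));
  return exponent / Math.sqrt(2 * Math.PI * variance);
}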

Node.js module

I’ve written a Node.js module you can use to perform the above training and classification steps:

npm install node-bayes --save

// Require the module
var bayes = require('node-bayes');

var TRAINING_COLUMNS = ['weather', 'temperature', 'humidity', 'wind', 'play?'];
var TRAINING_DATA = [
    ['Sunny',85,85,'Weak','No'],
    ['Sunny',80,90,'Strong','No'],
    ['Overcast',83,86,'Weak','Yes'],
    ['Rain',70,96,'Weak','Yes'],
    ['Rain',68,80,'Weak','Yes'],
    ['Rain',65,70,'Strong','No'],
    ['Overcast',64,65,'Strong','Yes'],
    ['Sunny',72,95,'Weak','No'],
    ['Sunny',69,70,'Weak','Yes'],
    ['Rain',75,80,'Weak','Yes'],
    ['Sunny',75,70,'Strong','Yes'],
    ['Overcast',72,90,'Strong','Yes'],
    ['Overcast',81,75,'Weak','Yes'],
    ['Rain',71,91,'Strong','No']
];

var cls = new bayes.NaiveBayes({
  columns: TRAINING_COLUMNS,
  data: TRAINING_DATA,
  verbose: true
});
cls.train();
var answer = cls.predict(['Sunny', 66, 90, 'Strong']);
console.log(answer);
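
Since verbose is set to true, predict returns an object that maps each class label to its (unnormalized) probability score, plus an answer key holding the winning label; with verbose off, it returns just the label.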

You can view the source on GitHub. Pull requests are welcome!

In Python

In the Python data analysis ecosystem, there is a breadth of libraries that contain implementations of Naive Bayes, such as GaussianNB from sklearn:

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)

print("Number of mislabeled points out of a total %d points : %d" % (iris.data.shape[0],(iris.target != y_pred).sum()))
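
Note that GaussianNB models every attribute as Gaussian, which suits the all-numeric iris dataset; for categorical attributes like Outlook, scikit-learn offers variants such as MultinomialNB and CategoricalNB.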

In Closing

Naive Bayes is a simple but powerful supervised learning algorithm. Let me know what you think! Pull requests to node-bayes are welcome.