Understanding Bayes’ Theorem and Naive Bayes in Python: A Practical Guide from Theory to Spam Classifier

If you’ve ever tried to explain Bayes’ theorem to someone at a party, you know the look you get. That glazed-over, polite nod that says “I’ll remember your name but not a single word you just said.” And yet, Bayes’ theorem is one of those quietly powerful tools in the machine learning toolbox that’s simultaneously simple, annoyingly subtle, and sneakily everywhere.

To warm you up, let’s start with a simple example. Imagine you go to the doctor for a routine check-up, and they run a screening test for a rare condition. Only about 1% of people have it. The test is quite accurate: if you have the condition, it returns positive 95% of the time. But even healthy people can sometimes test positive — about 5% of the time.

Now, suppose your test comes back positive. Intuitively, you might think there’s a 95% chance you have the condition, but Bayes’ theorem tells a different story. Because the condition is so rare, the probability that you actually have it given a positive test is only about 16%. That is the essence of Bayes’ theorem: it updates your initial belief (that the condition is rare) with new evidence (the positive test), giving a realistic picture rather than relying on the test’s accuracy alone.

This is basically the same question your spam filter asks when it sees an email with the word “lottery” in it. Or your medical AI asks when it sees a weird anomaly in an MRI scan. Bayes’ theorem is the little math engine that turns “I know some things about the world” into “here’s how likely it is this thing is actually happening.”

So, What Is Bayes’ Theorem, Really?

Mathematically, Bayes’ theorem says:

\(P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)}\)

Where:

  • \(P(H \mid E)\) is the probability of your hypothesis \(H\) given the evidence \(E\). This is what you actually want to know.
  • \(P(E \mid H)\) is the probability of seeing that evidence if your hypothesis is true.
  • \(P(H)\) is your prior belief about the hypothesis before you saw the evidence.
  • \(P(E)\) is the probability of the evidence happening at all.

If you’re allergic to formulas, here’s the short translation: Bayes’ theorem is a recipe for updating your beliefs when new evidence shows up. You start with a prior belief (the prior), you look at how likely that evidence is if your belief were true (likelihood), and then you adjust based on how common the evidence is in general (normalization).

A Practical Example

Imagine you go to the doctor for a routine check-up, and they run a screening test for a certain condition. The condition is relatively rare — about 1% of people have it. So your prior probability is:

\(P(H) = 0.01 \)

Now, suppose the test is pretty accurate: if someone does have the condition, it correctly returns positive 95% of the time:

\(P(E \mid H) = 0.95\)

But the test isn’t perfect. Even if someone is healthy, it can still give a false positive 5% of the time. That means the overall probability of seeing a positive result, \(P(E)\), includes both true positives and false positives. Let’s calculate it.

\(P(E) = P(E \mid H) \cdot P(H) + P(E \mid \neg H) \cdot P(\neg H) \\
= (0.95 \cdot 0.01) + (0.05 \cdot 0.99) \\
= 0.0095 + 0.0495 = 0.059 \)

Now, if you test positive, Bayes tells us the probability you actually have the condition is:

\(P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)} = \frac{0.95 \cdot 0.01}{0.059} \approx 0.16\)

So even after a positive result, the chance you truly have the condition is only about 16%.

This is Bayes’ theorem in action: your prior (1%) gets updated by the evidence (a positive test). The result isn’t 95%, because false positives still happen fairly often relative to how rare the condition is. Instead, your updated belief — the posterior — is a much more realistic 16%.
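
If you prefer to see the numbers fall out of code, here is a minimal Python sketch of the same calculation; the variable names are just for illustration.

# Quick numeric check of the screening-test example
prior = 0.01             # P(H): 1% of people have the condition
sensitivity = 0.95       # P(E | H): test is positive when the condition is present
false_positive = 0.05    # P(E | not H): test is positive for a healthy person

evidence = sensitivity * prior + false_positive * (1 - prior)   # P(E)
posterior = sensitivity * prior / evidence                      # P(H | E)

print(f"P(E)     = {evidence:.4f}")    # 0.0590
print(f"P(H | E) = {posterior:.4f}")   # ~0.1610, i.e. about 16%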

The Naive Bayes Twist

The “naive” part of Naive Bayes is that it assumes all features are independent given the class. In reality, features often hang out in cliques—words in spam emails are not independent; “Nigerian” and “prince” tend to co-occur suspiciously. But this assumption makes the math deliciously easy and surprisingly effective.

The classifier basically multiplies together a bunch of likelihoods and priors to get posterior probabilities for each class, then picks the highest one.
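
Written out for features \(x_1, \dots, x_n\), the decision rule is:

\(\hat{y} = \underset{c}{\arg\max}\; P(c) \cdot \prod_{i=1}^{n} P(x_i \mid c)\)

Each factor \(P(x_i \mid c)\) is estimated independently, which is exactly where the “naive” assumption pays off computationally.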

Let’s Play with Tiny Data

We’ll start with a hand-crafted example because nothing teaches you like making the math hurt just enough to remember it.

Imagine a dataset of weather conditions and whether people decide to go for a picnic:

Weather   Windy   Picnic
Sunny     No      Yes
Sunny     Yes     Yes
Rainy     No      No
Sunny     No      Yes
Rainy     Yes     No
Rainy     No      No

We want to predict: If it’s sunny and windy, will people go for a picnic?

Step 1: Calculate priors.

  • \(P(\text{Picnic} = \text{Yes}) = 3/6 = 0.5.\)
  • \(P(\text{Picnic} = \text{No}) = 3/6 = 0.5.\)

Step 2: Likelihoods.

  • \(P(\text{Sunny} \mid \text{Yes}) = 3/3 = 1.0.\)
  • \(P(\text{Windy} \mid \text{Yes}) = 1/3 \approx 0.333.\)
  • \(P(\text{Sunny} \mid \text{No}) = 0/3 = 0.0.\)
  • \(P(\text{Windy} \mid \text{No}) = 1/3 \approx 0.333.\)

Step 3: Apply Bayes.

For “Yes”:

\(P(\text{Yes} \mid \text{Sunny, Windy}) \propto 1.0 \cdot 0.333 \cdot 0.5 = 0.1665 \)

For “No”:

\(P(\text{No} \mid \text{Sunny, Windy}) \propto 0.0 \cdot 0.333 \cdot 0.5 = 0 \)

Normalize: Yes wins, because No is literally zero. Prediction: Yes, people will picnic.
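
As a quick sanity check, here is the same arithmetic in a few lines of Python (the 0.1665 above comes from rounding 1/3 to 0.333; the exact value is 1/6):

# Unnormalized posterior scores for the picnic example
score_yes = 0.5 * 1.0 * (1 / 3)   # P(Yes) * P(Sunny | Yes) * P(Windy | Yes)
score_no  = 0.5 * 0.0 * (1 / 3)   # P(No)  * P(Sunny | No)  * P(Windy | No)

print(f"Yes score: {score_yes:.4f}")   # 0.1667
print(f"No score:  {score_no:.4f}")    # 0.0000
print("Prediction:", "Yes" if score_yes > score_no else "No")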

This example is absurdly simple, but it’s exactly the same reasoning your email filter or medical AI is doing—just at massive scale and with features that don’t have human-readable names.

Before we jump into the code, here’s a little side note: if you’re into machine learning and Python tricks, you might enjoy what we share over at @machinelearningsite. Alright, back to business—time to build our Naive Bayes classifier in Python.

A Practical Python Example

Let’s implement a Naive Bayes classifier in Python from scratch first, so you can see the moving parts.

from collections import defaultdict
import math

class NaiveBayes:
    def __init__(self):
        self.feature_counts = {}
        self.class_counts = defaultdict(int)
        self.vocab = set()
        print("Initialized NaiveBayes classifier")

    def train(self, X, y):
        print("\n--- Training started ---")
        self.feature_counts = {c: defaultdict(int) for c in set(y)}
        for features, label in zip(X, y):
            print(f"Training on sample {features} with label {label}")
            self.class_counts[label] += 1
            for feature in features:
                self.feature_counts[label][feature] += 1
                self.vocab.add(feature)
                print(f"Incremented count: class={label}, feature={feature}, "
                      f"count={self.feature_counts[label][feature]}")
        print("Class counts:", dict(self.class_counts))
        print("Vocabulary:", self.vocab)
        print("--- Training finished ---\n")

    def predict(self, features):
        print(f"\n--- Prediction for {features} ---")
        total_count = sum(self.class_counts.values())
        scores = {}
        for c in self.class_counts:
            prior = math.log(self.class_counts[c] / total_count)
            print(f"\nClass: {c}")
            print(f" Prior (log): {prior}")
            likelihood = 0
            for feature in self.vocab:
                count = self.feature_counts[c][feature] + 1  # Laplace smoothing
                total = sum(self.feature_counts[c].values()) + len(self.vocab)
                if feature in features:
                    contrib = math.log(count / total)
                    print(f"  Feature {feature} present → contrib {contrib}")
                    likelihood += contrib
                else:
                    contrib = math.log(1 - (count / total))
                    print(f"  Feature {feature} absent → contrib {contrib}")
                    likelihood += contrib
            scores[c] = prior + likelihood
            print(f" Total log-probability for class {c}: {scores[c]}")
        prediction = max(scores, key=scores.get)
        print(f"Predicted class: {prediction}")
        print("--- Prediction finished ---\n")
        return prediction


# Training data: weather + windy status
X = [
    ("Sunny", "NotWindy"),
    ("Sunny", "Windy"),
    ("Rainy", "NotWindy"),
    ("Sunny", "NotWindy"),
    ("Rainy", "Windy"),
    ("Rainy", "NotWindy"),
]
y = ["Yes", "Yes", "No", "Yes", "No", "No"]

nb = NaiveBayes()
nb.train(X, y)
print("Final prediction:", nb.predict(("Sunny", "Windy")))  # Expected 'Yes'

When the classifier is first created, it starts with an empty structure: no class counts, no feature counts, and an empty vocabulary. During training, each sample is processed in turn. For every label, the class counter is updated, and for every feature within that label, the feature count is incremented. Over time, this builds a picture of how frequently each feature appears under each class. For example, if “Sunny” is often paired with the class “Yes,” then the model captures that association directly in its counts. The vocabulary grows as well, ensuring the model is aware of all possible features it may encounter.

Once training is complete, prediction begins with the computation of priors. These are the baseline probabilities of each class based on how often they appeared in the training data. The model then examines the given features against its vocabulary. If a feature is present, it calculates how likely it is to see that feature under the current class; if the feature is absent, it factors in the likelihood of not observing it. To avoid the problem of unseen features having zero probability, Laplace smoothing is applied by adding one to every count. All these likelihoods are accumulated in logarithmic form, which prevents numerical underflow and makes the computation more stable.
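
In symbols, writing \(\text{count}(f, c)\) for the number of times feature \(f\) was seen with class \(c\) and \(|V|\) for the vocabulary size, the smoothed probability the code computes is:

\(P(f \mid c) = \frac{\text{count}(f, c) + 1}{\sum_{f'} \text{count}(f', c) + |V|}\)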

The final step is to combine the prior with the accumulated likelihood for each class, producing a score that represents the log-posterior probability. The class with the highest score is selected as the prediction. In the given example, the combination of “Sunny” and “Windy” aligns more strongly with the “Yes” class, so the model confidently returns “Yes” as the result.
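
Putting the pieces together, the log-space score the code assigns to class \(c\) for a given set of observed features is:

\(\text{score}(c) = \log P(c) + \sum_{f\ \text{present}} \log P(f \mid c) + \sum_{f\ \text{absent}} \log\bigl(1 - P(f \mid c)\bigr)\)

with both sums running over the whole vocabulary.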

Understanding the Example in the Context of Bayes’ Theorem

In the Naive Bayes classifier we just built, every step corresponds directly to Bayes’ theorem. The class label is our hypothesis \(H\), and the features are the evidence \(E\). When the model starts prediction, the first thing it calculates is the prior \(P(H)\), which comes from the frequency of each class in the training data. This tells us how likely each class is before we even look at the features.

Next comes the likelihood \(P(E \mid H)\). Because we assume that features are conditionally independent given the class (the “naive” assumption), we can compute the probability of all features by multiplying the probabilities of each individual feature. If a feature is present, we use its probability under the class; if it’s absent, we use the probability of not seeing it. Laplace smoothing ensures that no feature ever has zero probability, even if it was missing in the training examples.

The denominator \(P(E)\) — the overall probability of seeing the evidence — never has to be calculated directly, because it is the same for all classes. In practice, the classifier only needs to compare the numerators \(P(E \mid H) \cdot P(H)\) across classes. The class with the largest value gives us the posterior \(P(H \mid E)\), which is the updated belief after considering the evidence.

So, when the model predicts “Yes” for the features “Sunny” and “Windy,” it is literally applying Bayes’ theorem. It starts with the prior probability of “Yes,” weighs it by how likely “Sunny” and “Windy” are under “Yes,” and then compares the result against the corresponding calculation for “No.” The highest score becomes the prediction.
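
One nicety: the classifier only returns the winning label, but if you want actual posterior probabilities you can exponentiate and normalize the log scores. Here is a small, self-contained sketch (the example scores are made up for illustration):

import math

def normalize_log_scores(log_scores):
    """Convert per-class log scores into posterior probabilities."""
    max_log = max(log_scores.values())   # subtract the max for numerical stability
    exp_scores = {c: math.exp(s - max_log) for c, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {c: v / total for c, v in exp_scores.items()}

# Hypothetical log scores, like the ones printed by predict()
print(normalize_log_scores({"Yes": -2.1, "No": -4.7}))   # "Yes" gets roughly 93% of the probability mass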

How Machine Learning Uses This

In machine learning, especially in classification problems, you’re constantly playing this game: You have prior beliefs about which class something belongs to, you see evidence (features), and you want to update those beliefs. Naive Bayes classifiers literally build an entire career around this.

Imagine your spam filter. Before it reads your email, maybe it thinks 50% of messages are spam (that’s \(P(H)\)). It reads “lottery” in the subject line (the evidence \(E\)), and based on its training, it knows:

  • \(P(\text{“lottery”} \mid \text{spam})\) is high, like 0.8.
  • \(P(\text{“lottery”} \mid \text{not spam})\) is low, like 0.01.

Bayes’ theorem lets it instantly compute how much more spammy that email now looks.
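
With those illustrative numbers, the update is a back-of-the-envelope calculation; a minimal sketch (the 0.8 and 0.01 likelihoods are the made-up figures from above):

# How much more spammy does "lottery" make the email?
p_spam = 0.5                 # prior P(spam)
p_word_given_spam = 0.8      # P("lottery" | spam)
p_word_given_ham = 0.01      # P("lottery" | not spam)

p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)   # P("lottery")
posterior = p_word_given_spam * p_spam / p_word

print(f"P(spam | 'lottery') = {posterior:.3f}")   # ~0.988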

Spam Classifier with Naive Bayes

In this example, we’ll build a small spam filter using Python. We’ll load a dataset of messages, convert the text into numerical features, train a Naive Bayes classifier, and evaluate its performance. Through this, you’ll see how Naive Bayes uses probabilities to make predictions, all happening under the hood while you focus on the data and workflow.

# Step 1: Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Step 2: Load dataset
# Example dataset
data = {
    'message': [
        "Win a free iPhone now",
        "Meeting scheduled at 10am",
        "Limited offer just for you",
        "Project update attached",
        "Congratulations, you won!",
        "Lunch plans today?"
    ],
    'label': ["Spam", "Not Spam", "Spam", "Not Spam", "Spam", "Not Spam"]
}
df = pd.DataFrame(data)

# Step 3: Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label'], test_size=0.33, random_state=42
)

# Step 4: Convert text messages to numerical features
vectorizer = CountVectorizer()  # Bag-of-words representation
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

# Step 5: Train Naive Bayes classifier
nb_model = MultinomialNB()
nb_model.fit(X_train_vect, y_train)

# Step 6: Make predictions
y_pred = nb_model.predict(X_test_vect)

# Step 7: Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Where Naive Bayes Does the Work

  1. Training (nb_model.fit):
    Naive Bayes calculates the priors \(P(\text{Spam})\) and \(P(\text{Not Spam})\) from how many examples of each class appear in the training data. It also calculates the likelihoods \(P(\text{word} \mid \text{class})\) for every word in the vocabulary. For example, it counts how often "free" appears in spam versus non-spam messages. Laplace smoothing ensures that words that never appear in one class don’t break the calculation. (The sketch after this list shows how to inspect these learned values.)
  2. Prediction (nb_model.predict):
    For a new message, Naive Bayes uses Bayes’ theorem internally to compute the posterior probability of each class: \(P(\text{class} \mid \text{message}) \propto P(\text{class}) \cdot \prod_{i} P(\text{word}_i \mid \text{class})\).
    The class with the highest posterior probability is chosen as the prediction.
  3. Feature Representation:
    The CountVectorizer converts text into numerical vectors of word counts. Each word is treated as a feature, and Naive Bayes assumes these features are conditionally independent given the class. These counts feed directly into the likelihood computations.
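
For a peek under the hood, the fitted model exposes exactly the quantities discussed above; here is a minimal inspection sketch, assuming the nb_model and vectorizer from the code earlier are still in scope:

# Inspect what MultinomialNB learned during fit()
import numpy as np

words = vectorizer.get_feature_names_out()        # the bag-of-words vocabulary
print("Classes:", nb_model.classes_)
print("Log priors:", nb_model.class_log_prior_)   # log P(class)

# feature_log_prob_ holds log P(word | class), one row per class
for cls, log_probs in zip(nb_model.classes_, nb_model.feature_log_prob_):
    top = np.argsort(log_probs)[::-1][:3]         # three highest-probability words for this class
    print(cls, "->", [words[i] for i in top])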

Summary

In this blog, we explored the fundamentals of Bayes’ Theorem and how it helps us update our beliefs when new evidence appears. We started with a simple, intuitive example to see the theorem in action and understand the concept of prior, likelihood, and posterior probabilities in a tangible way.

We then moved to a hands-on Python implementation, building a Naive Bayes classifier from scratch. This allowed us to see how Bayes’ Theorem underlies real-world machine learning models, turning theoretical probabilities into actionable predictions.

Finally, we connected these ideas to practical machine learning tasks by implementing a spam classifier in Python. This example demonstrated how Naive Bayes is used for text classification, showing the flow from raw data to feature extraction, training, prediction, and evaluation. Through this journey, we saw how a probabilistic model like Naive Bayes can solve real problems efficiently, despite its simplicity.

By the end of this blog, you should have a clear understanding of Bayes’ Theorem, how Naive Bayes works, and how it fits into practical machine learning pipelines, making tasks like spam detection both approachable and effective.

If you found this probabilistic modeling interesting, you might want to explore how we measure a model’s performance in practice. The next guide on the Confusion Matrix, Precision, and Recall dives into evaluating classifiers, giving you the tools to assess models like Naive Bayes with confidence.

If you’re curious about more machine learning experiments, Python tips, and occasional memes, you might enjoy looking into @machinelearningsite.
