Who hasn’t wondered what those movie reviews on IMDb are really saying beneath the surface? I mean, “this film was okay” could mean anything from “I’d watch it again tomorrow” to “I’d rather staple my eyelids shut.” Today, we’re going to build a sentiment analysis system that cuts through the ambiguity like a hot knife through butter—using BERT and TensorFlow. Stick with me, and by the end of this post, you’ll have a model that can sniff out sarcasm better than your ex.

Why BERT? Because Language is Complicated (and Fun)

Before we dive into the code, let’s address the elephant in the room: why use BERT for sentiment analysis when simpler models exist? Well, because language is full of delightful contradictions. Take this gem: “This restaurant’s food was so bad it’s almost good.” A basic model would call that negative. BERT? It understands that the “almost” is doing the heavy lifting here. BERT (Bidirectional Encoder Representations from Transformers) processes text bidirectionally, capturing context from both sides of each word. It’s like having a conversation where you actually listen before responding—rare in politics, but essential in NLP.

Setting Up Our Digital Workshop

Let’s get our hands dirty. First, the boring but necessary setup:

# Install the essentials - think of this as stocking your kitchen before cooking
# (note: tensorflow-text must match your installed TensorFlow version)
!pip install -q tensorflow-text tensorflow-hub
!pip install -q tf-models-official
import os
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization
import matplotlib.pyplot as plt

Feeling that excitement yet? It’s like assembling your superhero toolkit before battle. Only our battle is against ambiguous sentiment, and our weapons are… well, more code.
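
If you want to double-check that the kitchen really is stocked, a quick version print never hurts. Your exact versions will differ; this is purely a sanity check:

# Confirm the stack imported cleanly - exact versions will vary
print("TensorFlow version:", tf.__version__)
print("TF Hub version:", hub.__version__)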

The Data Dilemma: Finding Our Sentiment Playground

For this adventure, we’ll use the classic IMDb movie reviews dataset—50,000 brutally honest labeled opinions about films, split evenly into 25,000 for training and 25,000 for testing. Some are eloquent, some are passionate rants, all perfect for training our sentiment sniffer.

# Load the IMDb dataset - 25k labeled training reviews, 25k for testing
dataset = tf.keras.utils.get_file(
    fname="aclImdb.tar.gz", 
    origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
    extract=True,
    cache_dir='.',
    cache_subdir=''
)
# Process the dataset into manageable chunks
def get_sentiment_data(subset):
    texts, labels = [], []
    # On disk, the classes live in 'neg' and 'pos' subdirectories
    for label, label_name in enumerate(['neg', 'pos']):  # 0 = negative, 1 = positive
        label_dir = os.path.join(os.path.dirname(dataset), f'aclImdb/{subset}/{label_name}')
        for fname in os.listdir(label_dir):
            if fname.endswith('.txt'):
                with open(os.path.join(label_dir, fname), encoding='utf-8') as f:
                    texts.append(f.read())
                labels.append(label)
    return np.array(texts), np.array(labels)
train_texts, train_labels = get_sentiment_data('train')
test_texts, test_labels = get_sentiment_data('test')
# Shuffle the training data because randomness is the spice of (model) life
indices = np.arange(len(train_texts))
np.random.shuffle(indices)
train_texts = train_texts[indices]
train_labels = train_labels[indices]
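
Before moving on, it pays to peek at what we just loaded—a quick sanity check that the labels are balanced and the reviews actually look like reviews:

# Sanity check the loaded data
print(f"Training samples: {len(train_texts)}, test samples: {len(test_texts)}")
print(f"Training label counts (neg, pos): {np.bincount(train_labels)}")
print("Sample review:", train_texts[0][:200])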

Now we have our raw materials. Remember, garbage in = garbage out. But since IMDb reviews are famously passionate (and occasionally unhinged), we’re in good hands.

Choosing Our BERT: Not All Heroes Wear Capes

TensorFlow Hub offers several BERT variants. For sentiment analysis, we want something balanced between performance and efficiency. I’m partial to small_bert/bert_en_uncased_L-4_H-512_A-8—it’s like the compact sports car of BERT models: not the biggest, but zippy enough for our needs. This architecture diagram shows how BERT fits into our sentiment analysis pipeline:

graph LR
    A[Raw Text] --> B[Preprocessing]
    B --> C[BERT Encoder]
    C --> D[Pooling Layer]
    D --> E[Classification Head]
    E --> F[Sentiment Score]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#bbf,stroke:#333,stroke-width:2px

The Magic of Preprocessing: Making Text BERT-Friendly

BERT doesn’t eat raw text for breakfast—it needs carefully prepared token sandwiches. This is where many tutorials glaze over the details, leaving you scratching your head when your model spits out nonsense. Let’s do it right.

# Load the BERT preprocessor from TensorFlow Hub as a Keras layer
# (hub.KerasLayer works both for eager calls and inside a Keras functional model)
preprocessor = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
# Create a function to preprocess our text
def preprocess_texts(texts):
    return preprocessor(tf.constant(texts))
# Test it on a sample
sample_text = ['This movie was absolutely fantastic!']
preprocessed = preprocess_texts(sample_text)
print("Input Word IDs shape:", preprocessed['input_word_ids'].shape)
print("Input Mask shape:", preprocessed['input_mask'].shape)
print("Input Type IDs shape:", preprocessed['input_type_ids'].shape)

You’ll see output like:

Input Word IDs shape: (1, 128)
Input Mask shape: (1, 128)
Input Type IDs shape: (1, 128)

Ah, the beautiful sight of properly shaped tensors! Let’s talk about what these mean:

  • input_word_ids: The tokenized text converted to numbers (from BERT’s vocabulary)
  • input_mask: 1s where we have real tokens, 0s for padding (BERT ignores these)
  • input_type_ids: For distinguishing between two sentences (0 for first, 1 for second)

BERT expects exactly this trio of inputs—like a carefully prepared three-course meal.
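
Want to see the trio up close? In BERT’s uncased vocabulary, ID 101 is the [CLS] token that opens every sequence and 102 is the [SEP] token that closes it, so both should be visible right near the front of our sample:

# Peek at the raw token IDs and mask: 101 = [CLS], 102 = [SEP]
print(preprocessed['input_word_ids'][0, :10])
print(preprocessed['input_mask'][0, :10])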

Building Our Model: The Heart of the Operation

Now for the grand finale: assembling our sentiment analysis machine. This is where BERT meets our humble classification head.

# Load the actual BERT encoder - trainable=True so its weights get fine-tuned too
bert_model = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1",
    trainable=True
)
# Build our model architecture - simple but powerful
def build_sentiment_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    # First, preprocess the text using BERT's preprocessor
    preprocessed_text = preprocessor(text_input)
    # Then send it through BERT
    outputs = bert_model(preprocessed_text)
    # The "pooled_output" is the representation of the entire sequence
    # We'll use this for classification
    pooled_output = outputs["pooled_output"]
    # Add a dropout layer to prevent overfitting (BERT's kryptonite)
    dropout = tf.keras.layers.Dropout(0.1)(pooled_output)
    # Final classification layer - simple but effective
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(dropout)
    # Create and return the model
    model = tf.keras.Model(inputs=text_input, outputs=outputs)
    return model
sentiment_model = build_sentiment_model()
# Compile with a learning rate that won't make BERT have an existential crisis
loss = tf.keras.losses.BinaryCrossentropy()
metrics = tf.metrics.BinaryAccuracy()
# Configure the AdamW optimizer with a warmup + decay learning rate schedule
# (we'll hold out 20% of the data for validation during fit, so this step
# count is a slight overestimate - that only makes the schedule a bit conservative)
steps_per_epoch = len(train_labels) // 32  # batch size of 32
num_train_steps = steps_per_epoch * 3      # 3 epochs
num_warmup_steps = int(0.1 * num_train_steps)
init_lr = 3e-5
optimizer = optimization.create_optimizer(
    init_lr=init_lr,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    optimizer_type='adamw'
)
sentiment_model.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=metrics
)
# Let's see what we've built
sentiment_model.summary()

There’s something deeply satisfying about seeing that model summary. It’s like peeking under the hood of a finely tuned engine. Notice how most of the parameters belong to BERT itself—that’s why it’s called fine-tuning, not re-training.
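
Before committing to a long training run, a ten-second smoke test is cheap insurance. The untrained model should accept a raw string and hand back a single sigmoid probability (hovering somewhere near 0.5, since the classification head starts out with random weights):

# Smoke test: one raw string in, one probability out
print(sentiment_model(tf.constant(['What a wonderful movie!'])))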

Training: Where the Rubber Meets the Road

Time to train our model. This is where patience pays off. Grab a coffee. Or three.

# Create a TensorBoard callback for monitoring progress
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="./logs")
# Train the model - don't blink, or you might miss the magic
print("Starting training...")
history = sentiment_model.fit(
    x=train_texts,
    y=train_labels,
    validation_split=0.2,
    batch_size=32,
    epochs=3,
    callbacks=[tensorboard_callback]
)
# Plot the training results because pretty pictures make everything better
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['binary_accuracy'], label='Training Accuracy')
plt.plot(history.history['val_binary_accuracy'], label='Validation Accuracy')
plt.title('Accuracy Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

While it trains, let me share a secret: fine-tuning BERT is like teaching a world-class sommelier to identify cheap wine. The model already understands language deeply—it just needs to adapt its expertise to our specific task. That’s why we don’t need dozens of epochs; BERT catches on quickly.
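
And since we’re being patient anyway: if you experiment with more than three epochs, consider adding an EarlyStopping callback (a standard Keras built-in) so training halts when validation loss stops improving and you keep the best weights rather than the most overfit ones:

# Optional: stop when validation loss stops improving, keep the best weights
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=1, restore_best_weights=True)
# Then pass callbacks=[tensorboard_callback, early_stop] to fit()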

Testing Our Creation: Does It Actually Work?

Let’s verify our model isn’t just memorizing IMDb reviews. Time for the moment of truth.

# Evaluate on the test set - evaluate() returns [loss, accuracy] here
test_loss, test_accuracy = sentiment_model.evaluate(test_texts, test_labels, verbose=1)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
# Let's probe deeper with a confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
# Get predictions
probabilities = sentiment_model.predict(test_texts)
predictions = (probabilities > 0.5).astype(int).flatten()
# Print detailed classification report
print("\nClassification Report:")
print(classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
# Plot confusion matrix because visualizations make us look smart
plt.figure(figsize=(8, 6))
cm = confusion_matrix(test_labels, predictions)
sns.heatmap(
    cm, 
    annot=True, 
    fmt='d', 
    cmap='Blues',
    xticklabels=['Negative', 'Positive'],
    yticklabels=['Negative', 'Positive']
)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

If all went well, you should be seeing accuracy in the 85-90% range—which is pretty impressive considering we only did minimal fine-tuning. Remember: BERT brings the language understanding, but we provide the task-specific expertise.

Building Our Prediction Function: Making It Practical

Now for the fun part—using our model to analyze real sentiment. We’ll create a function that takes raw text and returns its sentiment.

def predict_sentiment(texts):
    """Predict sentiment for one or more texts.
    Args:
        texts: Either a single string or a list of strings
    Returns:
        List of sentiment predictions ('positive' or 'negative')
    """
    # Ensure texts is a list
    if isinstance(texts, str):
        texts = [texts]
    # Get predictions
    probabilities = sentiment_model.predict(texts)
    # Flatten the (n, 1) output so we can index into the labels list
    predictions = (probabilities > 0.5).astype(int).flatten()
    # Convert to human-readable labels
    labels = ['negative', 'positive']
    return [labels[int(p)] for p in predictions]
# Let's test with some classic examples
test_cases = [
    "This movie was a complete disaster. Worst acting I've ever seen.",
    "I loved everything about this film - the story, acting, and cinematography!",
    "It was okay, I guess. Nothing special but not terrible either.",
    "This is so good it's bad. Or is it so bad it's good? I can't tell anymore.",
    "Bahubali is a blockbuster Indian movie that was released in 2015. It is the first part of a two-part epic saga that tells the story of a legendary hero who fights for his kingdom and his love. The movie has received rave reviews from critics and audiences alike for its stunning visuals, spectacular action scenes, and captivating storyline."
]
results = predict_sentiment(test_cases)
# Display results in a satisfying format
for text, result in zip(test_cases, results):
    print(f"\nText: {text[:100]}{'...' if len(text) > 100 else ''}")
    print(f"Sentiment: {result.upper()}")

You should see output like this:

Text: This movie was a complete disaster. Worst acting I've ever seen.
Sentiment: NEGATIVE
Text: I loved everything about this film - the story, acting, and cinematography!
Sentiment: POSITIVE
Text: It was okay, I guess. Nothing special but not terrible either.
Sentiment: POSITIVE
Text: This is so good it's bad. Or is it so bad it's good? I can't tell anymore.
Sentiment: POSITIVE
Text: Bahubali is a blockbuster Indian movie that was released in 2015. It is the first part of a two-part ep...
Sentiment: POSITIVE

Ah, validation! Our model correctly identified the nuanced “okay, I guess” as slightly positive (most reviews saying “it was okay” lean positive in IMDb context) and handled the sarcasm detector challenge admirably.

Practical Tips from the Trenches

After building dozens of these models, here are my hard-earned tips you won’t find in most tutorials:

  1. Batch size matters more than you think: Smaller batches (16-32) often work better with BERT. It’s like feeding a gourmet chef—they work better with smaller, carefully prepared portions.
  2. Don’t skip the warmup steps: Those initial warmup steps in the optimizer aren’t just for show. They let BERT adjust gently to your task, rather than diving in headfirst.
  3. Watch for overfitting: BERT has millions of parameters. If your validation loss starts rising while training loss keeps falling, you’re memorizing noise. Bump the dropout rate up a bit, or train for fewer epochs.
  4. Context length is crucial: The standard 128 tokens covers most sentences, but movie reviews can be long. If accuracy plateaus, try increasing the sequence length in preprocessing (the packing step takes a seq_length argument; see the sketch after this list).
  5. Save your model properly:
    # Save the complete model
    sentiment_model.save('imdb_sentiment_bert_model')
    # Load it later with
    # loaded_model = tf.keras.models.load_model('imdb_sentiment_bert_model', custom_objects={'KerasLayer':hub.KerasLayer})
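
On tip 4: the preprocessor we loaded from TF Hub exposes its tokenizer and its input-packing step as separate sub-models, and the packing step accepts a seq_length argument. Here’s a sketch of wiring those pieces into a custom preprocessing model with longer sequences; 256 is an illustrative choice, and longer sequences do cost memory and training time:

# Sketch: custom preprocessing with a longer sequence length (256 is illustrative)
bert_preprocess = hub.load("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
text_in = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
tokens = hub.KerasLayer(bert_preprocess.tokenize)(text_in)
packed = hub.KerasLayer(
    bert_preprocess.bert_pack_inputs,
    arguments=dict(seq_length=256))([tokens])
custom_preprocessor = tf.keras.Model(text_in, packed)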
    

Why This Beats Rule-Based Approaches

You might be thinking: “Can’t I just use a sentiment lexicon with positive/negative words?” Sure, but that approach would fail horribly on sentences like:

“The special effects were unbelievable… in how terrible they were.”

Rule-based systems can’t handle that kind of nuance. BERT, however, understands context. It knows “unbelievable” usually means “amazing,” but when followed by “in how terrible they were,” the meaning flips completely. This contextual understanding is why BERT and its descendants have revolutionized NLP—it’s not just about individual words, but how they work together in the symphony of language.
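
Don’t just take my word for it; once your model is trained, feed that exact sentence through our predict_sentiment function and see which way it leans (no promises here, since a small BERT fine-tuned for three epochs can still trip over layered sarcasm):

# Put the tricky example in front of our model
print(predict_sentiment("The special effects were unbelievable... in how terrible they were."))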

Final Thoughts: Your Sentiment Analysis Journey Begins Now

We’ve journeyed from raw text to a working sentiment analysis model in just a few hundred lines of code. Not bad, right? Remember, this is just the beginning. You can:

  • Fine-tune on your specific domain (product reviews, social media, customer support tickets)
  • Add more layers for complex sentiment tasks (aspect-based sentiment analysis)
  • Combine with other models for even better accuracy

BERT isn’t magic—it’s math wrapped in attention mechanisms—but it feels pretty close when it correctly identifies that “This movie was not terrible” is actually a backhanded compliment. So go forth! Fine-tune BERT on your data, and may your sentiment scores be ever in your favor. And if your model starts predicting your coffee is negative before you’ve had your first cup—well, it might just be onto something. Happy coding!