The Problem Nobody Asked For (But Everyone Needs)
You know that feeling when you open a news app and it’s just… noise? Thousands of articles screaming for attention, none of them knowing anything about you, your interests, or why you’d actually want to read about quantum computing when you’re clearly a sports enthusiast at 6 AM before your coffee kicks in. That’s the problem we’re solving today.

News recommendation systems are the unsung heroes of content discovery. They’re the difference between users spending five minutes scrolling aimlessly (and leaving) versus spending thirty minutes genuinely engaged with content that actually matters to them. And if you’re building a news aggregator, nailing this is what separates you from the noise.

The good news? You don’t need a PhD in machine learning or a team of a hundred engineers to build something that works remarkably well. You need understanding, strategy, and the right approach for your use case.
Understanding the Recommendation Landscape
Before we start writing code, let’s establish what we’re actually trying to solve. A news recommendation system needs to handle three fundamental challenges:
- Cold Start Problem: When you have new users or new articles, you have no historical interaction data to work with
- Sparsity: Users interact with a tiny fraction of available articles, leaving most preferences unknown
- Diversity vs. Relevance: Recommending only what users already like gets boring; recommending too much novelty loses them

There are three primary approaches to solving these challenges, each with its own personality:

Content-Based Filtering analyzes the properties of articles (categories, keywords, topics) and matches them with user preferences. Think of it as a librarian who remembers you love detective novels, so they keep recommending detective novels. Simple, predictable, but occasionally boring.

Collaborative Filtering looks at what users similar to you have enjoyed. It’s social-network-aware: if 1,000 people just like you read and loved an article, you probably will too. This can surface unexpected gems but struggles when data is sparse.

Hybrid Approaches combine both methods, stealing the best parts of each while trying to patch their weaknesses. This is usually where the magic happens.
The Architecture Blueprint
Let’s look at how a production-ready system actually fits together. Here’s what’s happening at each stage:

- Parser: Extracts articles from RSS feeds, pulling headlines, abstracts, categories, URLs, and metadata.
- Content Pipeline: Cleans, tokenizes, and vectorizes article content. This is where you extract the features the recommendation engine will actually use.
- Interaction Logger: Records every click, read, share, or save. This is your gold mine of data. Timestamps matter more than you’d think.
- Recommendation Engine: The brain of the operation. It processes both content features and user behavior patterns.
- Ranking Module: Takes raw recommendation scores and applies business logic. Maybe you want to ensure diversity, avoid duplicates, or boost fresh content.
- API Endpoint: Serves recommendations to your clients in real time.
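The parsing stage never gets its own step below, so here is a minimal sketch of what it might look like using the feedparser library, mapped onto the Article dataclass defined in Step 1. Treat the field mapping as an assumption: real feeds vary in which fields they expose, and the helper name is my own.

import time
from datetime import datetime
from typing import List

import feedparser  # pip install feedparser

def parse_feed(feed_url: str, source_name: str, category: str) -> List["Article"]:
    """Turn one RSS feed into Article objects for the content pipeline."""
    parsed = feedparser.parse(feed_url)
    articles = []
    for entry in parsed.entries:
        published = entry.get("published_parsed")  # time.struct_time or None
        articles.append(Article(
            article_id=entry.get("id", entry.get("link", "")),
            title=entry.get("title", ""),
            content=entry.get("summary", ""),
            category=category,
            url=entry.get("link", ""),
            published_at=datetime.fromtimestamp(time.mktime(published)) if published else datetime.now(),
            source=source_name,
            keywords=[tag["term"] for tag in entry.get("tags", [])],
        ))
    return articles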
Implementation: Let’s Build This Thing
Step 1: Setting Up Your Foundation
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from typing import List, Dict, Tuple
from dataclasses import dataclass
from collections import defaultdict
@dataclass
class Article:
"""Represents a news article with its metadata"""
article_id: str
title: str
content: str
category: str
url: str
published_at: datetime
source: str
keywords: List[str]
@dataclass
class UserInteraction:
"""Records how a user interacted with an article"""
user_id: str
article_id: str
interaction_type: str # 'click', 'read', 'share', 'save'
timestamp: datetime
engagement_score: float # 0.0 to 1.0
class NewsRecommendationSystem:
"""Main recommendation engine"""
def __init__(self, articles: List[Article], interactions: List[UserInteraction]):
self.articles = {a.article_id: a for a in articles}
self.interactions = interactions
self.user_profiles = self._build_user_profiles()
self.article_vectors = self._vectorize_articles()
def _build_user_profiles(self) -> Dict[str, Dict]:
"""Build preference profiles based on user interactions"""
profiles = defaultdict(lambda: {
'categories': defaultdict(float),
'keywords': defaultdict(float),
'sources': defaultdict(float),
'interaction_count': 0,
'last_active': None
})
for interaction in self.interactions:
article = self.articles[interaction.article_id]
user_id = interaction.user_id
profile = profiles[user_id]
# Weight interactions by recency and engagement
time_weight = self._calculate_time_weight(interaction.timestamp)
final_weight = interaction.engagement_score * time_weight
profile['categories'][article.category] += final_weight
profile['sources'][article.source] += final_weight
for keyword in article.keywords:
profile['keywords'][keyword] += final_weight
profile['interaction_count'] += 1
profile['last_active'] = max(
profile['last_active'] or interaction.timestamp,
interaction.timestamp
)
return dict(profiles)
def _calculate_time_weight(self, interaction_time: datetime) -> float:
"""Recent interactions matter more than old ones"""
days_ago = (datetime.now() - interaction_time).days
# Exponential decay: halve the weight every 30 days
return 0.5 ** (days_ago / 30.0)
def _vectorize_articles(self) -> Dict[str, np.ndarray]:
"""Convert articles to numerical vectors"""
# For simplicity, we'll use a basic approach
# In production, use TF-IDF, word embeddings, or transformers
vectors = {}
unique_keywords = set()
unique_categories = set()
for article in self.articles.values():
unique_keywords.update(article.keywords)
unique_categories.add(article.category)
keyword_list = sorted(list(unique_keywords))
category_list = sorted(list(unique_categories))
for article in self.articles.values():
vector = []
# Add category one-hot encoding
for cat in category_list:
vector.append(1.0 if article.category == cat else 0.0)
# Add keyword presence
for kw in keyword_list:
vector.append(1.0 if kw in article.keywords else 0.0)
vectors[article.article_id] = np.array(vector)
return vectors
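As the comment in _vectorize_articles says, one-hot categories and keywords are a deliberately simple starting point. Here is a rough sketch of what a TF-IDF upgrade could look like with scikit-learn; the helper name, the choice to concatenate title and content, and the max_features cap are illustrative choices, not part of the system above.

from typing import Dict, List

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize_articles_tfidf(articles: List[Article]) -> Dict[str, np.ndarray]:
    """TF-IDF alternative to the one-hot _vectorize_articles above."""
    corpus = [f"{a.title} {a.content}" for a in articles]
    vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
    matrix = vectorizer.fit_transform(corpus)  # sparse matrix, shape (n_articles, n_terms)
    return {
        article.article_id: matrix[i].toarray().ravel()
        for i, article in enumerate(articles)
    }

The rest of the pipeline doesn’t change: the recommenders only ever see an article_id-to-vector mapping, so you can swap vectorizers without touching them.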
Step 2: Content-Based Recommendations
from sklearn.metrics.pairwise import cosine_similarity
class ContentBasedRecommender:
"""Recommends articles similar to user's reading history"""
def __init__(self, system: NewsRecommendationSystem):
self.system = system
def recommend(self, user_id: str, top_k: int = 10) -> List[Tuple[str, float]]:
"""Get content-based recommendations for a user"""
if user_id not in self.system.user_profiles:
return self._get_popular_articles(top_k)
user_profile = self.system.user_profiles[user_id]
user_interactions = [
i for i in self.system.interactions if i.user_id == user_id
]
if not user_interactions:
return self._get_popular_articles(top_k)
# Build user preference vector based on articles they've interacted with
user_vector = self._build_user_vector(user_interactions)
# Calculate similarity between user preferences and all articles
recommendations = {}
for article_id, article_vector in self.system.article_vectors.items():
# Skip articles user has already read
if any(i.article_id == article_id for i in user_interactions):
continue
            similarity = cosine_similarity(
                [user_vector],
                [article_vector]
            )[0][0]  # extract the scalar from the 1x1 similarity matrix
            recommendations[article_id] = float(similarity)
# Sort by similarity and return top-k
sorted_recs = sorted(
recommendations.items(),
            key=lambda x: x[1],  # sort by similarity score
reverse=True
)
return sorted_recs[:top_k]
def _build_user_vector(self, interactions: List[UserInteraction]) -> np.ndarray:
"""Average vectors of articles user has interacted with"""
vectors = []
for interaction in interactions:
if interaction.article_id in self.system.article_vectors:
vectors.append(self.system.article_vectors[interaction.article_id])
if not vectors:
            return np.zeros(next(iter(self.system.article_vectors.values())).shape)
return np.mean(vectors, axis=0)
def _get_popular_articles(self, top_k: int) -> List[Tuple[str, float]]:
"""Fallback: return trending articles when no user history exists"""
article_scores = defaultdict(float)
# Count interactions per article with recency weighting
for interaction in self.system.interactions:
time_weight = self.system._calculate_time_weight(interaction.timestamp)
article_scores[interaction.article_id] += time_weight
sorted_articles = sorted(
article_scores.items(),
            key=lambda x: x[1],  # sort by popularity score
reverse=True
)
return sorted_articles[:top_k]
Step 3: Collaborative Filtering
class CollaborativeRecommender:
"""Recommends articles based on similar users"""
def __init__(self, system: NewsRecommendationSystem):
self.system = system
self.user_similarity_cache = {}
def recommend(self, user_id: str, top_k: int = 10) -> List[Tuple[str, float]]:
"""Get collaborative filtering recommendations"""
# Find similar users
similar_users = self._find_similar_users(user_id, n=50)
if not similar_users:
return []
# Collect articles from similar users
recommendations = self._aggregate_similar_user_articles(
user_id,
similar_users
)
# Sort by score
sorted_recs = sorted(
recommendations.items(),
            key=lambda x: x[1],  # sort by aggregated score
reverse=True
)
return sorted_recs[:top_k]
def _find_similar_users(self, user_id: str, n: int = 50) -> List[Tuple[str, float]]:
"""Find users with similar reading preferences"""
user_interactions = {
i.article_id: i.engagement_score
for i in self.system.interactions
if i.user_id == user_id
}
if not user_interactions:
return []
user_similarities = []
for other_user_id in self.system.user_profiles.keys():
if other_user_id == user_id:
continue
other_interactions = {
i.article_id: i.engagement_score
for i in self.system.interactions
if i.user_id == other_user_id
}
# Calculate overlap-based similarity
common_articles = set(user_interactions.keys()) & set(other_interactions.keys())
if not common_articles:
continue
similarity = sum(
min(user_interactions[aid], other_interactions[aid])
for aid in common_articles
) / max(
len(user_interactions),
len(other_interactions)
)
user_similarities.append((other_user_id, similarity))
        return sorted(user_similarities, key=lambda x: x[1], reverse=True)[:n]
def _aggregate_similar_user_articles(
self,
user_id: str,
similar_users: List[Tuple[str, float]]
) -> Dict[str, float]:
"""Collect articles from similar users weighted by similarity"""
user_read_articles = {
i.article_id for i in self.system.interactions
if i.user_id == user_id
}
article_scores = defaultdict(float)
for similar_user_id, similarity_score in similar_users:
for interaction in self.system.interactions:
if interaction.user_id == similar_user_id:
if interaction.article_id not in user_read_articles:
weighted_score = interaction.engagement_score * similarity_score
article_scores[interaction.article_id] += weighted_score
return dict(article_scores)
Step 4: Hybrid Recommendation Engine
class HybridRecommender:
"""Combines content-based and collaborative filtering"""
def __init__(
self,
system: NewsRecommendationSystem,
content_weight: float = 0.6,
collaborative_weight: float = 0.4
):
self.system = system
self.content_recommender = ContentBasedRecommender(system)
self.collaborative_recommender = CollaborativeRecommender(system)
self.content_weight = content_weight
self.collaborative_weight = collaborative_weight
def recommend(self, user_id: str, top_k: int = 10) -> List[Tuple[str, float]]:
"""Get hybrid recommendations"""
# Get recommendations from both approaches
content_recs = self.content_recommender.recommend(user_id, top_k=top_k*2)
collaborative_recs = self.collaborative_recommender.recommend(user_id, top_k=top_k*2)
# Merge and weight scores
merged_scores = defaultdict(float)
for article_id, score in content_recs:
merged_scores[article_id] += score * self.content_weight
for article_id, score in collaborative_recs:
merged_scores[article_id] += score * self.collaborative_weight
# Apply diversity and freshness heuristics
final_recommendations = self._apply_ranking_heuristics(
merged_scores,
user_id
)
sorted_recs = sorted(
final_recommendations.items(),
            key=lambda x: x[1],  # sort by blended score
reverse=True
)
return sorted_recs[:top_k]
def _apply_ranking_heuristics(
self,
scores: Dict[str, float],
user_id: str
) -> Dict[str, float]:
"""Apply business logic to improve ranking"""
adjusted_scores = {}
for article_id, base_score in scores.items():
article = self.system.articles[article_id]
score = base_score
# Boost fresh articles (published in last 24 hours)
hours_old = (datetime.now() - article.published_at).total_seconds() / 3600
if hours_old < 24:
score *= 1.2
# Avoid over-representation from single source
same_source_count = sum(
1 for other_id, _ in scores.items()
if self.system.articles[other_id].source == article.source
)
if same_source_count > 3:
score *= 0.9
# Category diversity boost
            # .get() keeps this safe for cold-start users who have no profile yet
            user_categories = self.system.user_profiles.get(user_id, {}).get('categories', {})
if article.category not in user_categories or user_categories[article.category] < 0.5:
score *= 1.1
adjusted_scores[article_id] = score
return adjusted_scores
Step 5: Evaluation Metrics That Actually Matter
Here’s where most people skip ahead. Don’t. Measuring what matters is the difference between a system that works and one that looks like it works:
class RecommendationMetrics:
"""Evaluate recommendation quality"""
@staticmethod
    def area_under_curve(predictions: List[Tuple[str, float]],
                         ground_truth: List[str]) -> float:
        """
        AUC measures the ability to rank relevant items higher than irrelevant ones.
        Ranges from 0.5 (random) to 1.0 (perfect).
        """
        labels = [1 if article_id in ground_truth else 0
                  for article_id, _ in predictions]
        n_pos = sum(labels)
        n_neg = len(labels) - n_pos
        if n_pos == 0 or n_neg == 0:
            return 0.5  # undefined without both relevant and irrelevant items
        # Count (relevant, irrelevant) pairs that are ranked in the correct order
        correctly_ordered = 0
        negatives_below = 0
        for label in reversed(labels):  # walk from the worst-ranked item upward
            if label == 0:
                negatives_below += 1
            else:
                correctly_ordered += negatives_below
        return correctly_ordered / (n_pos * n_neg)
@staticmethod
def mean_reciprocal_rank(predictions: List[Tuple[str, float]],
ground_truth: List[str]) -> float:
"""
MRR evaluates how quickly you find the first relevant item.
Good for understanding if relevant articles appear early.
"""
for rank, (article_id, _) in enumerate(predictions, 1):
if article_id in ground_truth:
return 1.0 / rank
return 0.0
@staticmethod
def ndcg_at_k(predictions: List[Tuple[str, float]],
ground_truth: List[str],
k: int = 10) -> float:
"""
nDCG@k balances relevance and position, with diminishing returns further down.
        A standard ranking metric in search and recommendation evaluation.
"""
dcg = 0.0
for position, (article_id, _) in enumerate(predictions[:k], 1):
if article_id in ground_truth:
dcg += 1.0 / np.log2(position + 1)
# Ideal DCG: all relevant items first
ideal_dcg = sum(
1.0 / np.log2(i + 1)
for i in range(1, min(len(ground_truth), k) + 1)
)
return dcg / ideal_dcg if ideal_dcg > 0 else 0.0
Step 6: Real-Time API Service
from flask import Flask, request, jsonify
from datetime import datetime
app = Flask(__name__)
# Initialize your recommendation system
articles = [
Article(
article_id="1",
title="Breaking: AI Discovers New Particle",
content="Researchers at CERN...",
category="science",
url="https://news.example.com/ai-particle",
published_at=datetime.now(),
source="Science Daily",
keywords=["AI", "physics", "CERN", "breakthrough"]
),
# ... more articles
]
interactions = [
UserInteraction(
user_id="user_123",
article_id="1",
interaction_type="read",
timestamp=datetime.now(),
engagement_score=0.9
),
# ... more interactions
]
system = NewsRecommendationSystem(articles, interactions)
recommender = HybridRecommender(system, content_weight=0.6, collaborative_weight=0.4)
@app.route('/api/recommendations', methods=['GET'])
def get_recommendations():
"""Serve recommendations in real-time via API"""
user_id = request.args.get('user_id')
top_k = int(request.args.get('top_k', 10))
# Optionally filter by topic
topic_filter = request.args.get('topic')
if not user_id:
return jsonify({'error': 'user_id parameter required'}), 400
try:
recommendations = recommender.recommend(user_id, top_k=top_k*2)
# Apply topic filter if specified
if topic_filter:
recommendations = [
(aid, score) for aid, score in recommendations
if system.articles[aid].category == topic_filter
][:top_k]
else:
recommendations = recommendations[:top_k]
result = []
for article_id, score in recommendations:
article = system.articles[article_id]
result.append({
'article_id': article_id,
'title': article.title,
'url': article.url,
'category': article.category,
'source': article.source,
'score': float(score),
'published_at': article.published_at.isoformat()
})
return jsonify({
'user_id': user_id,
'recommendations': result,
'timestamp': datetime.now().isoformat()
})
except Exception as e:
return jsonify({'error': str(e)}), 500
@app.route('/api/feedback', methods=['POST'])
def log_feedback():
"""Record user interactions for continuous learning"""
data = request.json
interaction = UserInteraction(
user_id=data['user_id'],
article_id=data['article_id'],
interaction_type=data['interaction_type'], # 'click', 'read', 'share'
timestamp=datetime.now(),
engagement_score=data.get('engagement_score', 0.5)
)
system.interactions.append(interaction)
# Optionally retrain or update profiles here
system.user_profiles = system._build_user_profiles()
return jsonify({'status': 'logged'}), 200
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000, debug=False)
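For completeness, here’s what a client call against this service might look like with the requests library, using the sample user and article IDs from the data above:

import requests

BASE = "http://localhost:5000"

# Fetch five science recommendations for a known user
resp = requests.get(
    f"{BASE}/api/recommendations",
    params={"user_id": "user_123", "top_k": 5, "topic": "science"},
)
print(resp.json())  # {'user_id': ..., 'recommendations': [...], 'timestamp': ...}

# Log that the user read one of the recommended articles
requests.post(
    f"{BASE}/api/feedback",
    json={
        "user_id": "user_123",
        "article_id": "1",
        "interaction_type": "read",
        "engagement_score": 0.8,
    },
)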
Deployment Considerations for Real-World Scenarios
Batch vs. Real-Time: You don’t always need real-time recommendations. For email digests with personalized recommendations, batch processing overnight is perfectly fine and much cheaper. Use real-time only when users are actively browsing.

Filtering Business Rules: After your model generates recommendations, you’ll want to filter them through business logic. Maybe exclude articles from sources you’re not featuring this week, or ensure diversity of topics. Don’t let your ML model have the final say; it’s an advisor, not a dictator.

Cold Start Handling: For new users with no history, lean on content-based filtering and trending articles. For new articles with no interactions, boost them slightly if they match trending topics. Don’t panic; this gets better as data accumulates.

Scalability Challenges: If you’re operating at scale, consider these:
- Cache user profiles and rebuild them every 6-24 hours instead of real-time
- Use approximate nearest neighbors (ANN) for fast vector similarity searches (see the sketch after this list)
- Implement recommendation batching to reduce API overhead
- Store pre-computed recommendations for common user segments
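To make the ANN point concrete, here’s a minimal sketch of the build-once, query-fast pattern. It uses scikit-learn’s exact NearestNeighbors as a stand-in; a real deployment would swap in an approximate index such as FAISS or Annoy, but the shape of the code stays the same. The helper name and the n_neighbors value are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Build the index once (e.g. whenever article vectors are refreshed)
article_ids = list(system.article_vectors.keys())
matrix = np.stack([system.article_vectors[aid] for aid in article_ids])
index = NearestNeighbors(n_neighbors=20, metric="cosine").fit(matrix)

def nearest_articles(user_vector: np.ndarray, k: int = 10):
    """Return the k article IDs closest to a user preference vector."""
    distances, indices = index.kneighbors(user_vector.reshape(1, -1), n_neighbors=k)
    return [(article_ids[i], 1.0 - d) for i, d in zip(indices[0], distances[0])]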
The Metrics You Actually Care About
Remember earlier when we built those evaluation functions? Here’s what they mean in practice:
- AUC ≥ 0.75: Your system is distinguishing relevant articles from noise reasonably well
- MRR ≥ 0.5: Users find something interesting within the first few recommendations
- nDCG@10 ≥ 0.6: Your top-10 list has a good balance between relevance and position

Don’t obsess over perfect scores. Real recommendation systems live in the 0.7-0.85 range and still drive massive engagement improvements.
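To make those thresholds concrete, here’s a toy run of the Step 5 metrics on a hypothetical ranked list; the article IDs and ground truth are invented purely for illustration:

# Five ranked predictions; the user actually engaged with a1, a3, and a9
predictions = [("a1", 0.92), ("a2", 0.85), ("a3", 0.80), ("a4", 0.71), ("a5", 0.65)]
ground_truth = ["a1", "a3", "a9"]

print(RecommendationMetrics.area_under_curve(predictions, ground_truth))     # ~0.83
print(RecommendationMetrics.mean_reciprocal_rank(predictions, ground_truth)) # 1.0, first hit at rank 1
print(RecommendationMetrics.ndcg_at_k(predictions, ground_truth, k=5))       # ~0.70

Note that "a9" was relevant but never surfaced: MRR doesn’t notice, while nDCG does penalize the miss, which is exactly why you track more than one metric.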
Future Enhancements Worth Exploring
Once you’ve got the basics working, here are the next frontiers:

Sequential Models: Instead of treating article reads as independent events, model them as sequences. What someone reads at 6 AM differs from what they read at 9 PM. RNNs and transformers can capture this.

Context-Aware Sampling: Instead of random sampling for training, sample articles that are contextually relevant to user behavior patterns. This improves both model quality and training efficiency.

Embeddings at Scale: Move beyond simple TF-IDF to learned embeddings. Either fine-tune pre-trained models like BERT for news articles, or train your own domain-specific embeddings.

Multi-Objective Optimization: Optimize for multiple goals simultaneously: relevance, diversity, freshness, and business metrics. This is where recommendation systems stop being academic exercises and become business tools. A small sketch of this idea follows.
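Here’s a minimal sketch of the multi-objective idea: a scorer that blends relevance, freshness, and diversity through explicit weights. The objective set and the weights are invented for illustration; in practice you’d tune them against real engagement data or learn them directly.

import numpy as np

def multi_objective_score(
    relevance: float,        # score from the hybrid recommender
    hours_old: float,        # article age in hours
    category_share: float,   # fraction of recent recommendations already in this category
    weights: tuple = (0.7, 0.2, 0.1),  # relevance, freshness, diversity -- illustrative
) -> float:
    """Blend several competing objectives into a single ranking score."""
    freshness = float(np.exp(-hours_old / 24.0))  # decays over roughly a day
    diversity = 1.0 - category_share              # reward underrepresented categories
    w_rel, w_fresh, w_div = weights
    return w_rel * relevance + w_fresh * freshness + w_div * diversity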
Final Thoughts: The Human Element
Here’s something they don’t teach in machine learning courses: the best recommendation system can fail spectacularly if users don’t understand why they’re seeing what they’re seeing. Include explanations. Tell users why you’re recommending an article. “Because you read about climate science yesterday and this is the latest development” builds trust. “Because algorithms” builds resentment. (A tiny sketch of such an explanation helper closes out this post.)

The code you’ve seen here is production-ready but not production-complete. You’ll want monitoring, A/B testing infrastructure, feedback loops, and continuous retraining. You’ll need to handle edge cases, rate limiting, and graceful degradation when services fail. But now you have the foundation. You understand how the pieces fit together, why each component exists, and what each approach trades off against the others. That’s enough to build something real.

The path from “interesting prototype” to “drives meaningful engagement” is mostly engineering and experimentation. Start with the hybrid approach, measure everything, and iterate based on what your actual users do, not what your models predict they’ll do. Your users will thank you. Or at least, they’ll stop immediately closing your app in disgust. Which, honestly, is a victory in the news industry.
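Here is a tiny sketch of what that explanation layer could look like, built on the user profiles from Step 1; the helper name and the wording templates are hypothetical:

def explain_recommendation(system: NewsRecommendationSystem,
                           user_id: str, article_id: str) -> str:
    """Produce a short, human-readable reason for a recommendation."""
    article = system.articles[article_id]
    profile = system.user_profiles.get(user_id)
    if not profile:
        return "Trending with readers right now."
    if profile['categories'].get(article.category, 0.0) > 0:
        return f"Because you've been reading {article.category} stories lately."
    shared = set(article.keywords) & set(profile['keywords'].keys())
    if shared:
        return f"Related to topics you follow: {', '.join(sorted(shared)[:3])}."
    return "Something a little different, picked by readers like you."

Even a canned sentence like this does more for trust than a perfectly tuned but unexplained ranking.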
