🤖 Machine Learning Models

Overview

Smart Mobility Predictor employs an ensemble of machine learning models, each optimized for specific prediction tasks.

1. Travel Time Prediction (Regression)

Problem Definition

Objective: Predict the travel time (in minutes) for a given origin-destination pair considering current conditions.

Challenge: Travel time is non-linear, influenced by multiple factors with complex interactions.

Model Selection

Tried Approaches:

Linear Regression: R² = 0.72 (too simple)
Random Forest: R² = 0.84 (good)
XGBoost: R² = 0.86 (better - chosen)
Ensemble (RF + XGBoost): R² = 0.87 ✅ (best)

Final Choice: XGBoost with Random Forest post-processing

Input Features

Feature	Type	Range	Example	Importance
distance_km	Numeric	0-50	18.5	12%
current_congestion	Numeric	0-100	78	32%
congestion_1h_lag	Numeric	0-100	72	15%
congestion_3h_lag	Numeric	0-100	68	8%
rolling_7day_avg	Numeric	0-100	62	10%
hour_of_day	Numeric	0-23	17	18%
day_of_week	Categorical	0-6	5 (Friday)	5%
is_weekend	Boolean	0-1	0	2%
is_holiday	Boolean	0-1	0	1%
weather_condition	Categorical	0-5	2 (rain)	7%
temperature	Numeric	-10 to 50	16	3%
precipitation	Numeric	0-100	2	4%
visibility_km	Numeric	0-10	8	2%
wind_speed	Numeric	0-50	12	1%
road_type	Categorical	highway/arterial/local	highway	6%
special_event_nearby	Boolean	0-1	0	4%

Model Architecture

XGBoost Parameters:

n_estimators: 200        # Number of boosting rounds
max_depth: 7             # Tree depth (prevents overfitting)
learning_rate: 0.05      # Shrinkage factor
subsample: 0.8           # Row sampling ratio
colsample_bytree: 0.8    # Feature sampling ratio
objective: reg:squarederror
eval_metric: mae
early_stopping_rounds: 50

Random Forest Ensemble:

n_estimators: 100
max_depth: 15
min_samples_split: 5
min_samples_leaf: 2
max_features: sqrt
bootstrap: True

Output


json
{
  "predicted_travel_time": 38,
  "unit": "minutes",
  "confidence_interval": {
    "lower": 33,
    "upper": 43,
    "confidence_level": 0.95
  },
  "feature_contributions": {
    "current_congestion": 12.4,
    "hour_of_day": 6.8,
    "congestion_1h_lag": 5.7,
    "distance_km": 2.2,
    "other": 10.9
  },
  "model_uncertainty": 0.18
}

Performance Metrics

Metric	Value	Target	Status
Mean Absolute Error (MAE)	4.2 min	< 5 min	✅
Root Mean Square Error (RMSE)	6.8 min	< 8 min	✅
R² Score	0.87	> 0.85	✅
Median Absolute Percentage Error	8.3%	< 10%	✅
90th Percentile Error	12.5 min	< 15 min	✅

Model Evaluation

Cross-Validation Results:

Fold 1: MAE = 4.1 min
Fold 2: MAE = 4.3 min
Fold 3: MAE = 4.2 min
Fold 4: MAE = 4.0 min
Fold 5: MAE = 4.4 min
───────────────────────
Mean: 4.2 min (±0.18)

Performance by Condition:

Scenario	MAE	Sample Size
Free flow (low congestion)	2.8 min	25,000
Moderate congestion	4.2 min	180,000
Heavy congestion	5.9 min	95,000
Rainy conditions	5.1 min	40,000
Peak hours (8-9 AM, 5-6 PM)	4.8 min	120,000

Real-World Example

Input:


json
{
  "distance_km": 18.5,
  "current_congestion": 78,
  "hour_of_day": 17,
  "day_of_week": 5,
  "weather_condition": "light_rain",
  "temperature": 16,
  "precipitation": 2
}

Output:


json
{
  "predicted_travel_time": 38,
  "confidence_interval": {
    "lower": 33,
    "upper": 43
  },
  "factors_increasing_time": [
    "High congestion (78%): +8 minutes",
    "Peak hour (5:00 PM): +5 minutes",
    "Rain condition: +2 minutes"
  ]
}

2. Traffic Congestion Classification

Problem Definition

Objective: Classify traffic condition into three categories:

🟢 Low (0-25% congestion): Free flow
🟡 Medium (25-75% congestion): Moderate delays
🔴 High (75-100% congestion): Severe congestion

Challenge: Class imbalance (20% high, 50% medium, 30% low)

Model Selection

Options Considered:

Logistic Regression: F1 = 0.78 (baseline)
SVM: F1 = 0.84
Random Forest: F1 = 0.87 ✅ (chosen)
XGBoost: F1 = 0.86

Final Choice: Random Forest with class weights

Input Features

Same as regression model (16 features):

Traffic temporal features (lag, rolling averages)
Temporal features (hour, day, holiday)
Weather conditions
Road characteristics

Model Architecture


python
RandomForestClassifier(
    n_estimators=150,
    max_depth=10,
    min_samples_split=5,
    class_weight='balanced',  # Handle imbalance
    random_state=42
)

Output


json
{
  "predicted_class": "HIGH",
  "probability": 0.89,
  "class_probabilities": {
    "LOW": 0.05,
    "MEDIUM": 0.06,
    "HIGH": 0.89
  },
  "confidence_threshold": 0.7,
  "risk_level": "elevated"
}

Performance Metrics

Classification Report:

Class	Precision	Recall	F1-Score	Support
Low	0.91	0.85	0.88	65,000
Medium	0.84	0.88	0.86	110,000
High	0.89	0.85	0.87	45,000
Weighted Avg	0.88	0.86	0.87	220,000

Confusion Matrix:

                Predicted
                Low   Med   High
Actual Low      55k   8k    2k
       Medium   6k    97k   7k
       High     2k    5k    38k

Evaluation Metrics

Metric	Value
Accuracy	86%
Precision (weighted)	88%
Recall (weighted)	86%
F1-Score (weighted)	87%
ROC-AUC (macro)	0.94
Matthews Correlation Coefficient	0.79

3. Urban Congestion Clustering

Problem Definition

Objective: Identify geographic zones with similar traffic patterns without labels.

Use Case: Understand city congestion zones, predict patterns by zone, support urban planning.

Model Selection

Algorithms Evaluated:

K-Means (Final choice)
- Simple, efficient
- Silhouette score: 0.68
- Interpretable clusters
DBSCAN
- Density-based
- Silhouette score: 0.62
- Better for irregular shapes
Hierarchical Clustering
- Bottom-up approach
- Dendrogram visualization
- Slower (not scalable)

Input Features (Aggregated)


json
{
  "avg_congestion_level": 65,          // Historical mean
  "peak_hour": 17,                     // Most congested hour
  "peak_hour_congestion": 82,          // Congestion at peak
  "std_dev_congestion": 18,            // Variability
  "rolling_7day_trend": -2,            // Trend (improving)
  "road_density": 3.5,                 // Km of roads per km²
  "intersection_count": 12,            // Major intersections
  "latitude": 36.85,                   // Geographic location
  "longitude": 10.32,
  "proximity_to_center": 2.5           // Distance to city center
}

Cluster Results

Optimal K = 5 (Elbow Method & Silhouette Score)

K=2: Silhouette = 0.45
K=3: Silhouette = 0.58
K=4: Silhouette = 0.65
K=5: Silhouette = 0.68 ← OPTIMAL
K=6: Silhouette = 0.64
K=7: Silhouette = 0.59

Identified Clusters

Cluster 0: Downtown Hub

Average congestion: 78%
Peak: 5-6 PM (88%)
Characteristics: High intersection count, dense network
Roads affected: Avenue H. Bourguiba, Rue de Carthage
Recommendation: Invest in public transit

Cluster 1: Suburban Ring

Average congestion: 52%
Peak: 8-9 AM (68%)
Characteristics: Moderate density, mixed use
Recommendation: Implement demand management

Cluster 2: Peripheral Areas

Average congestion: 25%
Peak: None (smooth flow)
Characteristics: Low density, residential
Recommendation: Monitor for growth impacts

Cluster 3: Commercial Zones

Average congestion: 65%
Peak: 11 AM - 1 PM (74%)
Characteristics: Retail/office, parking lots
Recommendation: Encourage off-peak shopping

Cluster 4: Transit Corridors

Average congestion: 48%
Peak: 7-8 AM (72%)
Characteristics: Main arterials, high bus volume
Recommendation: Dedicated bus lanes

Performance Metrics

Metric	Value	Interpretation
Silhouette Score	0.68	Good cluster separation
Davies-Bouldin Index	0.82	Low (good)
Calinski-Harabasz Index	142.5	High (good)
Inertia	3,847	Cluster compactness

4. Route & Transport Recommendation System

Architecture

Multi-Stage Recommendation Process:

1. Route Generation
   ├─ Generate 3-5 candidate routes
   └─ Use routing API (OSRM, Mapbox)

2. Time Prediction
   ├─ Apply regression model to each
   └─ Get travel time + confidence

3. Congestion Risk Assessment
   ├─ Apply classification model
   └─ Get risk level + probability

4. Scoring & Ranking
   ├─ Compute composite score
   ├─ Consider user preferences
   └─ Rank by preference weights

5. Recommendation
   └─ Return top 3 with explanations

Scoring Formula

SCORE = w1 * TIME_SCORE + w2 * RISK_SCORE + w3 * COST_SCORE + w4 * EMISSIONS_SCORE

Where:
- w1 = 0.35 (time priority for "fast" mode)
- w2 = 0.25 (avoid high-risk routes)
- w3 = 0.20 (budget consideration)
- w4 = 0.20 (environmental impact)

TIME_SCORE = 1 - (predicted_time / max_possible_time) × 0.8 + confidence × 0.2
RISK_SCORE = 1 - congestion_probability_high
COST_SCORE = 1 - (fuel_cost / max_cost)
EMISSIONS_SCORE = 1 - (co2_kg / max_emissions)

All scores normalized to [0, 1]

User Preference Profiles

Profile 1: Speed Priority


json
{
  "name": "Fast Commuter",
  "weights": {
    "time": 0.50,
    "risk": 0.20,
    "cost": 0.20,
    "emissions": 0.10
  },
  "transport_preference": ["car"],
  "avoid_highways": false
}

Profile 2: Eco-Conscious


json
{
  "name": "Green Advocate",
  "weights": {
    "time": 0.20,
    "risk": 0.15,
    "cost": 0.15,
    "emissions": 0.50
  },
  "transport_preference": ["public_transit", "bike"],
  "avoid_highways": true
}

Profile 3: Budget-Focused


json
{
  "name": "Budget Traveler",
  "weights": {
    "time": 0.20,
    "risk": 0.20,
    "cost": 0.50,
    "emissions": 0.10
  },
  "transport_preference": ["public_transit"],
  "max_cost": 2.00
}

Example Output


json
{
  "recommendations": [
    {
      "rank": 1,
      "route_id": "route_001",
      "name": "Via Route de Tunis",
      "distance": 18.5,
      "predicted_time": 32,
      "time_confidence": 0.91,
      "congestion_risk": "MEDIUM",
      "risk_probability": 0.45,
      "transport_mode": "car",
      "cost_estimate": 2.45,
      "emissions_kg_co2": 2.8,
      "score": 9.2,
      "explanation": "Recommended for speed. Smooth sailing on highway.",
      "waypoints": ["Tunis Marine", "Route de Tunis", "La Marsa"]
    },
    {
      "rank": 2,
      "route_id": "route_002",
      "name": "Public Transit",
      "distance": 18.5,
      "predicted_time": 48,
      "time_confidence": 0.92,
      "congestion_risk": "NONE",
      "risk_probability": 0.0,
      "transport_mode": "public_transit",
      "cost_estimate": 0.95,
      "emissions_kg_co2": 0.4,
      "score": 8.1,
      "explanation": "Eco-friendly. Avoid traffic entirely.",
      "transit_legs": [
        {
          "type": "metro",
          "line": "Line 1",
          "duration": 28
        },
        {
          "type": "walk",
          "duration": 8
        }
      ]
    }
  ]
}

Model Training Pipeline

Data Preparation

Dataset Size:

Total samples: 730 million (2 years)
Training: 60% (438M)
Validation: 20% (146M)
Test: 20% (146M)

Temporal Stratification:

Training: Jan 2022 - Aug 2023
Validation: Sep 2023 - Oct 2023
Test: Nov 2023 - Dec 2023

(Prevents data leakage from future information)

Feature Engineering Pipeline


python
def create_features(data):
    """
    Transform raw data into ML features
    """
    # 1. Temporal features
    data['hour'] = data['timestamp'].dt.hour
    data['day_of_week'] = data['timestamp'].dt.dayofweek
    data['is_weekend'] = data['day_of_week'].isin([5, 6])
    
    # 2. Lag features
    for lag in [1, 3, 6, 12, 24]:
        data[f'congestion_lag_{lag}h'] = \
            data.groupby('segment_id')['congestion'].shift(lag)
    
    # 3. Rolling aggregates
    for window in [6, 24, 168]:  # hours
        data[f'congestion_roll_{window}'] = \
            data.groupby('segment_id')['congestion'].rolling(window).mean()
    
    # 4. Interaction features
    data['rain_and_congestion'] = data['is_raining'] * data['congestion']
    data['peak_hour_factor'] = (data['hour'].isin([8, 9, 17, 18])).astype(int)
    
    return data

Model Training Script


python
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
import pickle

# Load prepared data
X_train, y_train = load_training_data()
X_val, y_val = load_validation_data()

# Train XGBoost
xgb_model = XGBRegressor(
    n_estimators=200,
    max_depth=7,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8
)
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    early_stopping_rounds=50,
    verbose=10
)

# Train Random Forest
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=15,
    min_samples_split=5
)
rf_model.fit(X_train, y_train)

# Ensemble predictions
xgb_pred = xgb_model.predict(X_val)
rf_pred = rf_model.predict(X_val)
ensemble_pred = (xgb_pred + rf_pred) / 2

# Save models
with open('xgb_model.pkl', 'wb') as f:
    pickle.dump(xgb_model, f)

with open('rf_model.pkl', 'wb') as f:
    pickle.dump(rf_model, f)

Model Monitoring & Drift Detection

Prediction Drift

Monitor actual vs. predicted values:


python
def detect_drift(actual, predicted):
    """
    Check if model predictions are drifting
    """
    mae = mean_absolute_error(actual, predicted)
    if mae > THRESHOLD:
        alert("MAE exceeded threshold!")
        trigger_retraining()

Data Drift

Check for changes in input data distribution:

If recent_data_mean significantly differs from training_data_mean:
    → Data drift detected
    → Retrain model with recent data
    → Update feature statistics

Retraining Strategy

Schedule:

Weekly Full Retraining: Use 2+ weeks of recent data
Daily Incremental: Update model with latest 24h
Monthly Major Version: Full pipeline review

Triggers:

Prediction MAE > 7 min (threshold)
Data drift detected (KL divergence)
New model performance > current by 5%

Documentation

🤖 Machine Learning Models

Overview

1. Travel Time Prediction (Regression)

Problem Definition

Model Selection

Input Features

Model Architecture

Output

Performance Metrics

Model Evaluation

Real-World Example

2. Traffic Congestion Classification

Problem Definition

Model Selection

Input Features

Model Architecture

Output

Performance Metrics

Evaluation Metrics

3. Urban Congestion Clustering

Problem Definition

Model Selection

Input Features (Aggregated)

Cluster Results

Identified Clusters

Performance Metrics

4. Route & Transport Recommendation System

Architecture

Scoring Formula

User Preference Profiles

Example Output

Model Training Pipeline

Data Preparation

Feature Engineering Pipeline

Model Training Script

Model Monitoring & Drift Detection

Prediction Drift

Data Drift

Retraining Strategy