Documentation

🤖 Machine Learning Models

Overview

Smart Mobility Predictor employs an ensemble of machine learning models, each optimized for specific prediction tasks.


1. Travel Time Prediction (Regression)

Problem Definition

Objective: Predict the travel time (in minutes) for a given origin-destination pair considering current conditions.

Challenge: Travel time is non-linear, influenced by multiple factors with complex interactions.

Model Selection

Tried Approaches:

  1. Linear Regression: R² = 0.72 (too simple)
  2. Random Forest: R² = 0.84 (good)
  3. XGBoost: R² = 0.86 (better - chosen)
  4. Ensemble (RF + XGBoost): R² = 0.87 ✅ (best)

Final Choice: XGBoost with Random Forest post-processing


Input Features

FeatureTypeRangeExampleImportance
distance_kmNumeric0-5018.512%
current_congestionNumeric0-1007832%
congestion_1h_lagNumeric0-1007215%
congestion_3h_lagNumeric0-100688%
rolling_7day_avgNumeric0-1006210%
hour_of_dayNumeric0-231718%
day_of_weekCategorical0-65 (Friday)5%
is_weekendBoolean0-102%
is_holidayBoolean0-101%
weather_conditionCategorical0-52 (rain)7%
temperatureNumeric-10 to 50163%
precipitationNumeric0-10024%
visibility_kmNumeric0-1082%
wind_speedNumeric0-50121%
road_typeCategoricalhighway/arterial/localhighway6%
special_event_nearbyBoolean0-104%

Model Architecture

XGBoost Parameters:

n_estimators: 200        # Number of boosting rounds
max_depth: 7             # Tree depth (prevents overfitting)
learning_rate: 0.05      # Shrinkage factor
subsample: 0.8           # Row sampling ratio
colsample_bytree: 0.8    # Feature sampling ratio
objective: reg:squarederror
eval_metric: mae
early_stopping_rounds: 50

Random Forest Ensemble:

n_estimators: 100
max_depth: 15
min_samples_split: 5
min_samples_leaf: 2
max_features: sqrt
bootstrap: True

Output

json
{ "predicted_travel_time": 38, "unit": "minutes", "confidence_interval": { "lower": 33, "upper": 43, "confidence_level": 0.95 }, "feature_contributions": { "current_congestion": 12.4, "hour_of_day": 6.8, "congestion_1h_lag": 5.7, "distance_km": 2.2, "other": 10.9 }, "model_uncertainty": 0.18 }

Performance Metrics

MetricValueTargetStatus
Mean Absolute Error (MAE)4.2 min< 5 min
Root Mean Square Error (RMSE)6.8 min< 8 min
R² Score0.87> 0.85
Median Absolute Percentage Error8.3%< 10%
90th Percentile Error12.5 min< 15 min

Model Evaluation

Cross-Validation Results:

Fold 1: MAE = 4.1 min
Fold 2: MAE = 4.3 min
Fold 3: MAE = 4.2 min
Fold 4: MAE = 4.0 min
Fold 5: MAE = 4.4 min
───────────────────────
Mean: 4.2 min (±0.18)

Performance by Condition:

ScenarioMAESample Size
Free flow (low congestion)2.8 min25,000
Moderate congestion4.2 min180,000
Heavy congestion5.9 min95,000
Rainy conditions5.1 min40,000
Peak hours (8-9 AM, 5-6 PM)4.8 min120,000

Real-World Example

Input:

json
{ "distance_km": 18.5, "current_congestion": 78, "hour_of_day": 17, "day_of_week": 5, "weather_condition": "light_rain", "temperature": 16, "precipitation": 2 }

Output:

json
{ "predicted_travel_time": 38, "confidence_interval": { "lower": 33, "upper": 43 }, "factors_increasing_time": [ "High congestion (78%): +8 minutes", "Peak hour (5:00 PM): +5 minutes", "Rain condition: +2 minutes" ] }

2. Traffic Congestion Classification

Problem Definition

Objective: Classify traffic condition into three categories:

  • 🟢 Low (0-25% congestion): Free flow
  • 🟡 Medium (25-75% congestion): Moderate delays
  • 🔴 High (75-100% congestion): Severe congestion

Challenge: Class imbalance (20% high, 50% medium, 30% low)

Model Selection

Options Considered:

  1. Logistic Regression: F1 = 0.78 (baseline)
  2. SVM: F1 = 0.84
  3. Random Forest: F1 = 0.87 ✅ (chosen)
  4. XGBoost: F1 = 0.86

Final Choice: Random Forest with class weights


Input Features

Same as regression model (16 features):

  • Traffic temporal features (lag, rolling averages)
  • Temporal features (hour, day, holiday)
  • Weather conditions
  • Road characteristics

Model Architecture

python
RandomForestClassifier( n_estimators=150, max_depth=10, min_samples_split=5, class_weight='balanced', # Handle imbalance random_state=42 )

Output

json
{ "predicted_class": "HIGH", "probability": 0.89, "class_probabilities": { "LOW": 0.05, "MEDIUM": 0.06, "HIGH": 0.89 }, "confidence_threshold": 0.7, "risk_level": "elevated" }

Performance Metrics

Classification Report:

ClassPrecisionRecallF1-ScoreSupport
Low0.910.850.8865,000
Medium0.840.880.86110,000
High0.890.850.8745,000
Weighted Avg0.880.860.87220,000

Confusion Matrix:

                Predicted
                Low   Med   High
Actual Low      55k   8k    2k
       Medium   6k    97k   7k
       High     2k    5k    38k

Evaluation Metrics

MetricValue
Accuracy86%
Precision (weighted)88%
Recall (weighted)86%
F1-Score (weighted)87%
ROC-AUC (macro)0.94
Matthews Correlation Coefficient0.79

3. Urban Congestion Clustering

Problem Definition

Objective: Identify geographic zones with similar traffic patterns without labels.

Use Case: Understand city congestion zones, predict patterns by zone, support urban planning.

Model Selection

Algorithms Evaluated:

  1. K-Means (Final choice)

    • Simple, efficient
    • Silhouette score: 0.68
    • Interpretable clusters
  2. DBSCAN

    • Density-based
    • Silhouette score: 0.62
    • Better for irregular shapes
  3. Hierarchical Clustering

    • Bottom-up approach
    • Dendrogram visualization
    • Slower (not scalable)

Input Features (Aggregated)

json
{ "avg_congestion_level": 65, // Historical mean "peak_hour": 17, // Most congested hour "peak_hour_congestion": 82, // Congestion at peak "std_dev_congestion": 18, // Variability "rolling_7day_trend": -2, // Trend (improving) "road_density": 3.5, // Km of roads per km² "intersection_count": 12, // Major intersections "latitude": 36.85, // Geographic location "longitude": 10.32, "proximity_to_center": 2.5 // Distance to city center }

Cluster Results

Optimal K = 5 (Elbow Method & Silhouette Score)

K=2: Silhouette = 0.45
K=3: Silhouette = 0.58
K=4: Silhouette = 0.65
K=5: Silhouette = 0.68 ← OPTIMAL
K=6: Silhouette = 0.64
K=7: Silhouette = 0.59

Identified Clusters

Cluster 0: Downtown Hub

  • Average congestion: 78%
  • Peak: 5-6 PM (88%)
  • Characteristics: High intersection count, dense network
  • Roads affected: Avenue H. Bourguiba, Rue de Carthage
  • Recommendation: Invest in public transit

Cluster 1: Suburban Ring

  • Average congestion: 52%
  • Peak: 8-9 AM (68%)
  • Characteristics: Moderate density, mixed use
  • Recommendation: Implement demand management

Cluster 2: Peripheral Areas

  • Average congestion: 25%
  • Peak: None (smooth flow)
  • Characteristics: Low density, residential
  • Recommendation: Monitor for growth impacts

Cluster 3: Commercial Zones

  • Average congestion: 65%
  • Peak: 11 AM - 1 PM (74%)
  • Characteristics: Retail/office, parking lots
  • Recommendation: Encourage off-peak shopping

Cluster 4: Transit Corridors

  • Average congestion: 48%
  • Peak: 7-8 AM (72%)
  • Characteristics: Main arterials, high bus volume
  • Recommendation: Dedicated bus lanes

Performance Metrics

MetricValueInterpretation
Silhouette Score0.68Good cluster separation
Davies-Bouldin Index0.82Low (good)
Calinski-Harabasz Index142.5High (good)
Inertia3,847Cluster compactness

4. Route & Transport Recommendation System

Architecture

Multi-Stage Recommendation Process:

1. Route Generation
   ├─ Generate 3-5 candidate routes
   └─ Use routing API (OSRM, Mapbox)

2. Time Prediction
   ├─ Apply regression model to each
   └─ Get travel time + confidence

3. Congestion Risk Assessment
   ├─ Apply classification model
   └─ Get risk level + probability

4. Scoring & Ranking
   ├─ Compute composite score
   ├─ Consider user preferences
   └─ Rank by preference weights

5. Recommendation
   └─ Return top 3 with explanations

Scoring Formula

SCORE = w1 * TIME_SCORE + w2 * RISK_SCORE + w3 * COST_SCORE + w4 * EMISSIONS_SCORE

Where:
- w1 = 0.35 (time priority for "fast" mode)
- w2 = 0.25 (avoid high-risk routes)
- w3 = 0.20 (budget consideration)
- w4 = 0.20 (environmental impact)

TIME_SCORE = 1 - (predicted_time / max_possible_time) × 0.8 + confidence × 0.2
RISK_SCORE = 1 - congestion_probability_high
COST_SCORE = 1 - (fuel_cost / max_cost)
EMISSIONS_SCORE = 1 - (co2_kg / max_emissions)

All scores normalized to [0, 1]

User Preference Profiles

Profile 1: Speed Priority

json
{ "name": "Fast Commuter", "weights": { "time": 0.50, "risk": 0.20, "cost": 0.20, "emissions": 0.10 }, "transport_preference": ["car"], "avoid_highways": false }

Profile 2: Eco-Conscious

json
{ "name": "Green Advocate", "weights": { "time": 0.20, "risk": 0.15, "cost": 0.15, "emissions": 0.50 }, "transport_preference": ["public_transit", "bike"], "avoid_highways": true }

Profile 3: Budget-Focused

json
{ "name": "Budget Traveler", "weights": { "time": 0.20, "risk": 0.20, "cost": 0.50, "emissions": 0.10 }, "transport_preference": ["public_transit"], "max_cost": 2.00 }

Example Output

json
{ "recommendations": [ { "rank": 1, "route_id": "route_001", "name": "Via Route de Tunis", "distance": 18.5, "predicted_time": 32, "time_confidence": 0.91, "congestion_risk": "MEDIUM", "risk_probability": 0.45, "transport_mode": "car", "cost_estimate": 2.45, "emissions_kg_co2": 2.8, "score": 9.2, "explanation": "Recommended for speed. Smooth sailing on highway.", "waypoints": ["Tunis Marine", "Route de Tunis", "La Marsa"] }, { "rank": 2, "route_id": "route_002", "name": "Public Transit", "distance": 18.5, "predicted_time": 48, "time_confidence": 0.92, "congestion_risk": "NONE", "risk_probability": 0.0, "transport_mode": "public_transit", "cost_estimate": 0.95, "emissions_kg_co2": 0.4, "score": 8.1, "explanation": "Eco-friendly. Avoid traffic entirely.", "transit_legs": [ { "type": "metro", "line": "Line 1", "duration": 28 }, { "type": "walk", "duration": 8 } ] } ] }

Model Training Pipeline

Data Preparation

Dataset Size:

  • Total samples: 730 million (2 years)
  • Training: 60% (438M)
  • Validation: 20% (146M)
  • Test: 20% (146M)

Temporal Stratification:

Training: Jan 2022 - Aug 2023
Validation: Sep 2023 - Oct 2023
Test: Nov 2023 - Dec 2023

(Prevents data leakage from future information)


Feature Engineering Pipeline

python
def create_features(data): """ Transform raw data into ML features """ # 1. Temporal features data['hour'] = data['timestamp'].dt.hour data['day_of_week'] = data['timestamp'].dt.dayofweek data['is_weekend'] = data['day_of_week'].isin([5, 6]) # 2. Lag features for lag in [1, 3, 6, 12, 24]: data[f'congestion_lag_{lag}h'] = \ data.groupby('segment_id')['congestion'].shift(lag) # 3. Rolling aggregates for window in [6, 24, 168]: # hours data[f'congestion_roll_{window}'] = \ data.groupby('segment_id')['congestion'].rolling(window).mean() # 4. Interaction features data['rain_and_congestion'] = data['is_raining'] * data['congestion'] data['peak_hour_factor'] = (data['hour'].isin([8, 9, 17, 18])).astype(int) return data

Model Training Script

python
from sklearn.ensemble import RandomForestRegressor from xgboost import XGBRegressor import pickle # Load prepared data X_train, y_train = load_training_data() X_val, y_val = load_validation_data() # Train XGBoost xgb_model = XGBRegressor( n_estimators=200, max_depth=7, learning_rate=0.05, subsample=0.8, colsample_bytree=0.8 ) xgb_model.fit( X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=50, verbose=10 ) # Train Random Forest rf_model = RandomForestRegressor( n_estimators=100, max_depth=15, min_samples_split=5 ) rf_model.fit(X_train, y_train) # Ensemble predictions xgb_pred = xgb_model.predict(X_val) rf_pred = rf_model.predict(X_val) ensemble_pred = (xgb_pred + rf_pred) / 2 # Save models with open('xgb_model.pkl', 'wb') as f: pickle.dump(xgb_model, f) with open('rf_model.pkl', 'wb') as f: pickle.dump(rf_model, f)

Model Monitoring & Drift Detection

Prediction Drift

Monitor actual vs. predicted values:

python
def detect_drift(actual, predicted): """ Check if model predictions are drifting """ mae = mean_absolute_error(actual, predicted) if mae > THRESHOLD: alert("MAE exceeded threshold!") trigger_retraining()

Data Drift

Check for changes in input data distribution:

If recent_data_mean significantly differs from training_data_mean:
    → Data drift detected
    → Retrain model with recent data
    → Update feature statistics

Retraining Strategy

Schedule:

  • Weekly Full Retraining: Use 2+ weeks of recent data
  • Daily Incremental: Update model with latest 24h
  • Monthly Major Version: Full pipeline review

Triggers:

  • Prediction MAE > 7 min (threshold)
  • Data drift detected (KL divergence)
  • New model performance > current by 5%