📋 Project Overview

This project implements an end-to-end machine learning pipeline for predicting crop yield based on environmental, soil, and management factors. The system processes multiple heterogeneous datasets, performs feature engineering, trains multiple regression models, and serves predictions via a REST API.

75,000 Synthetic Samples
22 Features Used
96.27% R² Score

Tech Stack

| Component | Technology | Version |
|---|---|---|
| Language | Python | 3.11.9 |
| Data Processing | pandas, numpy | 2.3.x, 2.3.x |
| ML Framework | scikit-learn | 1.5.x |
| Visualization | matplotlib, seaborn | 3.9.x, 0.13.x |
| API Framework | Flask | 3.1.x |
| Model Serialization | joblib | 1.4.x |

๐Ÿ—๏ธ System Architecture

The system follows a modular architecture with clear separation between data processing, model training, and serving layers.

📁 Raw Data (6 files) → 🔄 ETL Pipeline → 🧹 Feature Engineering → 🤖 Model Training → 🚀 REST API

project_structure.txt
Phase-2/
├── unified_dataset.csv      # 75,000 synthetic records
├── generate_synthetic_dataset.py  # Dataset generator
├── train_model.py           # ML training pipeline
│
├── model/
│   ├── model.pkl            # Trained GradientBoostingRegressor
│   ├── scaler.pkl           # StandardScaler for normalization
│   ├── label_encoders.pkl   # Categorical encoders (Crop, State, etc.)
│   ├── feature_list.pkl     # 22 feature names for inference
│   └── model_info.pkl       # Model metadata and metrics
│
├── api/
│   └── app.py               # Flask REST API
│
└── dashboard/
    ├── index.html           # User-facing prediction UI
    ├── technical.html       # This documentation
    └── style.css, script.js

🔄 Data Pipeline

Because the original heterogeneous datasets had data quality issues (missing values, inconsistent schemas, incomplete coverage), a synthetic dataset of 75,000 records was generated with realistic correlations between agricultural features. The synthetic data covers 22 crops across 20 Indian states for the years 2015 to 2024.

Synthetic Dataset Properties

| Property | Value | Description |
|---|---|---|
| Total Records | 75,000 | Complete records with no missing values |
| Crops | 22 types | Rice, Wheat, Maize, Cotton, Sugarcane, Soybean, etc. |
| States | 20 Indian states | Punjab, Maharashtra, UP, MP, Karnataka, etc. |
| Years | 2015-2024 | 10-year range with seasonal variations |
| Features | 27 columns | 22 used for ML prediction |

Column Mapping Strategy

column_mapping.py
COLUMN_MAPPING = {
    # Soil Nutrients
    'N_SOIL': 'Nitrogen',
    'P_SOIL': 'Phosphorus', 
    'K_SOIL': 'Potassium',
    'Nitrogen (N)': 'Nitrogen',
    
    # Temperature variations
    'TEMPERATURE': 'Temperature_C',
    'Air temperature (C)': 'Temperature_C',
    'Temperatue': 'Temperature_C',  # typo in source
    'Mean Temp': 'Temperature_C',
    
    # Humidity variations
    'HUMIDITY': 'Humidity',
    'Air humidity (%)': 'Humidity',
    'Average Humidity': 'Humidity',
    
    # Target variable mappings
    'Yield_kg_per_hectare': 'Yield_kg_per_hectare',
    'Crop Yield': 'Yield_kg_per_hectare',
    'Yeild (Q/acre)': 'Yield_kg_per_hectare',
    'millet yield': 'Yield_kg_per_hectare',
}
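
A mapping like this is typically applied by renaming each source's columns to the unified names before the sources are combined. The sketch below illustrates the idea on plain dictionaries so that it runs without dependencies; `harmonize_record`, `quintals_per_acre_to_kg_per_hectare`, and the sample records are hypothetical, and with pandas the same renaming is usually written as `df.rename(columns=COLUMN_MAPPING)`. Renaming alone does not convert units, so a column such as `Yeild (Q/acre)` would still hold its original unit afterwards; the conversion shown assumes that Q stands for quintal (100 kg).

```python
# Subset of the mapping shown above; the full table would be used in practice.
COLUMN_MAPPING = {
    "N_SOIL": "Nitrogen",
    "TEMPERATURE": "Temperature_C",
    "Temperatue": "Temperature_C",
    "HUMIDITY": "Humidity",
    "Crop Yield": "Yield_kg_per_hectare",
}

KG_PER_QUINTAL = 100.0
HECTARES_PER_ACRE = 0.40468564224


def harmonize_record(record: dict) -> dict:
    """Rename known source columns to unified names; keep unknown keys as-is."""
    return {COLUMN_MAPPING.get(key, key): value for key, value in record.items()}


def quintals_per_acre_to_kg_per_hectare(value: float) -> float:
    """Convert a yield in quintals/acre to kg/hectare (assumes Q = quintal)."""
    return value * KG_PER_QUINTAL / HECTARES_PER_ACRE


# Two hypothetical records from differently named sources.
source_a = {"TEMPERATURE": 28.0, "HUMIDITY": 70.0, "Crop Yield": 4200.0}
source_b = {"Temperatue": 31.5, "N_SOIL": 45.0}

unified = [harmonize_record(r) for r in (source_a, source_b)]
print(unified[0])
print(unified[1])
print(round(quintals_per_acre_to_kg_per_hectare(10.0), 2))
```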

📊 Data Field Status

With the synthetic dataset, all 75,000 records have complete values for all features, so no imputation is required.

Complete Feature Coverage (75K samples each)

🟢 Weather Features

All weather parameters with 100% coverage

| Feature | Samples | Range |
|---|---|---|
| Rainfall_mm | 75,000 | 200 - 2500 mm |
| Temperature_C | 75,000 | 15 - 42 °C |
| Humidity | 75,000 | 30 - 95% |
| Sunshine_Hours | 75,000 | 4 - 10 hrs |
| GDD | 75,000 | 1000 - 3000 |
| Pressure_KPa | 75,000 | 95 - 105 kPa |
| Wind_Speed_Kmh | 75,000 | 5 - 30 km/h |

🟢 Soil & Nutrient Features

Complete soil parameters for all records

| Feature | Samples | Range |
|---|---|---|
| Soil_Quality | 75,000 | 40 - 100 |
| Nitrogen | 75,000 | 20 - 120 kg/ha |
| Phosphorus | 75,000 | 10 - 80 kg/ha |
| Potassium | 75,000 | 15 - 100 kg/ha |
| Soil_pH | 75,000 | 5.5 - 8.5 |
| OrganicCarbon | 75,000 | 0.3 - 2.5% |
| Soil_Moisture | 75,000 | 20 - 70% |

🟢 Management & Location Features

Categorical and management parameters

| Feature | Samples | Categories/Range |
|---|---|---|
| Crop | 75,000 | 22 crop types |
| State | 75,000 | 20 Indian states |
| Year | 75,000 | 2015 - 2024 |
| Fertilizer_Amount | 75,000 | 50 - 350 kg/ha |
| Irrigation_Type | 75,000 | Drip, Sprinkler, Canal, Rainfed |
| Seed_Variety | 75,000 | Local, Improved, Hybrid |
| Pesticide_Usage | 75,000 | 0 - 20 kg/ha |

✅ Synthetic Data Advantage: Unlike the original datasets, which had significant missing values requiring imputation, the synthetic dataset provides complete coverage with realistic correlations between agricultural factors, so no imputed values enter training.

โš™๏ธ Feature Engineering

The final model uses 22 features across weather, soil, management, and location categories. The dataset is synthetically generated with 75,000 records covering 22 crops across 20 Indian states from 2015-2024, ensuring complete data coverage with realistic correlations.

Feature Schema (22 ML Features)

| Category | Feature | Type | Range | Preprocessing |
|---|---|---|---|---|
| 🌤️ Weather | Rainfall_mm | float64 | 200 - 2500 | StandardScaler |
| | Temperature_C | float64 | 15 - 42 | StandardScaler |
| | Humidity | float64 | 30 - 95 | StandardScaler |
| | Sunshine_Hours | float64 | 4 - 10 | StandardScaler |
| | GDD | float64 | 1000 - 3000 | StandardScaler |
| | Pressure_KPa | float64 | 95 - 105 | StandardScaler |
| | Wind_Speed_Kmh | float64 | 5 - 30 | StandardScaler |
| 🌱 Soil | Soil_Quality | float64 | 40 - 100 | StandardScaler |
| | Soil_Moisture | float64 | 20 - 70 | StandardScaler |
| | Nitrogen | float64 | 20 - 120 | StandardScaler |
| | Phosphorus | float64 | 10 - 80 | StandardScaler |
| | Potassium | float64 | 15 - 100 | StandardScaler |
| | Soil_pH | float64 | 5.5 - 8.5 | StandardScaler |
| | OrganicCarbon | float64 | 0.3 - 2.5 | StandardScaler |
| 🚜 Management | Fertilizer_Amount | float64 | 50 - 350 | StandardScaler |
| | Irrigation_Type | categorical | 4 types | LabelEncoder |
| | Seed_Variety | categorical | 3 types | LabelEncoder |
| | Pesticide_Usage | float64 | 0 - 20 | StandardScaler |
| | Crop | categorical | 22 types | LabelEncoder |
| | State | categorical | 20 states | LabelEncoder |
| 📅 Time | Year | int64 | 2015 - 2024 | StandardScaler |

Preprocessing Pipeline

preprocessing.py
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# 1. Handle categorical features
label_encoders = {}
for col in ['Crop', 'State', 'Irrigation_Type', 'Seed_Variety']:
    le = LabelEncoder()
    df[f'{col}_Encoded'] = le.fit_transform(df[col])
    label_encoders[col] = le  # kept so inference can reuse the same codes

# 2. Standardize numerical features (no imputation needed - synthetic data is complete)
#    X is the 22-column feature matrix and y the Yield_kg_per_hectare target,
#    both assumed to be selected from df before this step
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
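
To make the two transforms concrete, the following dependency-free sketch reproduces what they compute: scikit-learn's `LabelEncoder` assigns integer codes to the sorted unique labels, and `StandardScaler` subtracts the column mean and divides by the population standard deviation. The helper names `label_encode` and `standardize` are illustrative and not part of the project code.

```python
from statistics import fmean, pstdev


def label_encode(values):
    """Map each label to its index among the sorted unique labels."""
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return [index[v] for v in values], classes


def standardize(values):
    """Scale to zero mean and unit variance (population standard deviation)."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


codes, classes = label_encode(["Drip", "Canal", "Rainfed", "Canal"])
print(classes)  # ['Canal', 'Drip', 'Rainfed']
print(codes)    # [1, 0, 2, 0]

# Three rainfall values spanning the documented 200 - 2500 mm range.
scaled = standardize([200.0, 1350.0, 2500.0])
print([round(z, 4) for z in scaled])
```

Because the integer codes depend on the set of labels seen during fitting, the fitted encoders and scaler have to be reused at inference time, which is presumably why they are saved alongside the model.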

🤖 Model Training

Four regression algorithms were trained and evaluated using 5-fold cross-validation. The Gradient Boosting Regressor achieved the best overall performance with an R² of 0.9627 and was selected as the production model based on multiple criteria including generalization, inference speed, and model size.

Algorithms Compared

๐Ÿ† Gradient Boosting Regressor SELECTED
0.9627
Rยฒ Score
1,610
MAE (kg/ha)
3,574
RMSE (kg/ha)
๐ŸŽฏ Why Gradient Boosting? Selected for its superior generalization (CV Rยฒ = 0.9603), faster inference time (~3x faster than Random Forest), smaller model size, and better handling of outliers. The 96.27% Rยฒ on test data with cross-validation consistency makes it ideal for production.
Random Forest Regressor
0.9594
Rยฒ Score
1,782
MAE (kg/ha)
3,728
RMSE (kg/ha)

โš ๏ธ Good performance but slower inference, larger model size, and lower CV consistency compared to GB.

Decision Tree Regressor
0.8834
Rยฒ Score
2,514
MAE (kg/ha)
6,314
RMSE (kg/ha)
Linear Regression
0.7156
Rยฒ Score
5,221
MAE (kg/ha)
9,866
RMSE (kg/ha)
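
A comparison of this kind can be sketched with scikit-learn's `cross_val_score`. The example below is not the project's training script: it uses a small stand-in data set from `make_regression` instead of the 75,000-row feature matrix, and apart from `random_state` the model settings are assumptions chosen to keep the run short. The ranking it prints on stand-in data says nothing about the results reported above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Small stand-in data set; the project would use its own feature matrix.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

models = {
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Linear Regression": LinearRegression(),
}

cv_r2 = {}
for name, model in models.items():
    # With scoring="r2", cross_val_score returns one R² value per fold.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name:18s} mean CV R² = {cv_r2[name]:.4f}")
```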

🎯 Model Selection Justification

Why Gradient Boosting over Random Forest?

| Criteria | Gradient Boosting | Random Forest | Winner |
|---|---|---|---|
| Test R² Score | 0.9627 | 0.9594 | ✅ GB |
| Cross-Val R² (5-fold) | 0.9603 | 0.9512 | ✅ GB |
| Inference Speed | ~15ms | ~45ms | ✅ GB (~3x faster) |
| Model Size | ~2.1 MB | ~8.5 MB | ✅ GB (~4x smaller) |
| Generalization | Sequential boosting reduces bias | Higher variance on new data | ✅ GB |

Conclusion: Gradient Boosting provides better generalization, faster predictions, and a more compact model while achieving higher R² on both test and cross-validation datasets.
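
Inference latency and model size can be measured along the following lines. This is a sketch under assumptions: it trains on stand-in data from `make_regression`, times single-row predictions with `time.perf_counter`, and uses the length of the pickled model as a proxy for the on-disk size (the project saves models with joblib, so actual file sizes may differ). The absolute numbers depend on hardware and data and are not expected to match the figures in the table.

```python
import pickle
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Stand-in data with 22 columns, matching the documented feature count.
X, y = make_regression(n_samples=500, n_features=22, noise=5.0, random_state=42)

candidates = {
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

report = {}
for name, model in candidates.items():
    model.fit(X, y)
    # Time repeated single-row predictions and average them.
    repeats = 20
    start = time.perf_counter()
    for _ in range(repeats):
        model.predict(X[:1])
    latency_ms = (time.perf_counter() - start) / repeats * 1000
    # The pickled byte length approximates the serialized model size.
    size_kb = len(pickle.dumps(model)) / 1024
    report[name] = (latency_ms, size_kb)
    print(f"{name:18s} {latency_ms:7.2f} ms/prediction  {size_kb:9.1f} KB")
```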

Model Configuration

model_config.py
from sklearn.ensemble import GradientBoostingRegressor

# Best Model Configuration
model = GradientBoostingRegressor(
    n_estimators=100,      # Number of boosting stages
    learning_rate=0.1,     # Shrinks contribution of each tree
    max_depth=3,           # Maximum depth of individual trees
    min_samples_split=2,   # Min samples to split internal node
    min_samples_leaf=1,    # Min samples at leaf node
    subsample=1.0,         # Fraction of samples for fitting trees
    random_state=42        # Reproducibility seed
)

# Training
model.fit(X_train, y_train)

# Save model
import joblib
joblib.dump(model, 'model/model.pkl')

📊 Model Evaluation

The model was evaluated using standard regression metrics on the held-out test set (20% of data).

Evaluation Metrics Explained

| Metric | Formula | Value | Interpretation |
|---|---|---|---|
| R² (Coefficient of Determination) | 1 - (SS_res / SS_tot) | 0.9627 | 96.27% variance explained |
| MAE (Mean Absolute Error) | mean(abs(y - ŷ)) | 1,610 kg/ha | Average prediction error |
| RMSE (Root Mean Squared Error) | sqrt(mean((y - ŷ)²)) | 3,574 kg/ha | Penalizes large errors |
| Cross-Validation R² | mean(5-fold R²) | 0.9603 | Generalization performance |

evaluation.py
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# Predictions on test set
y_pred = model.predict(X_test)

# Calculate metrics
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"R² Score: {r2:.4f}")   # 0.9627
print(f"MAE: {mae:.2f}")       # 1610.29
print(f"RMSE: {rmse:.2f}")     # 3573.78
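
As a worked example of the three formulas, the snippet below computes them by hand for four made-up yield values, without scikit-learn. The numbers are illustrative only and unrelated to the project's test set.

```python
from math import sqrt

y_true = [3000.0, 4000.0, 5000.0, 6000.0]  # actual yields (kg/ha), made up
y_pred = [3100.0, 3900.0, 5200.0, 5800.0]  # predicted yields (kg/ha), made up

n = len(y_true)
residuals = [t - p for t, p in zip(y_true, y_pred)]

# MAE: mean of absolute residuals.
mae = sum(abs(r) for r in residuals) / n
# RMSE: square root of the mean squared residual.
rmse = sqrt(sum(r * r for r in residuals) / n)

# R²: 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true.
mean_true = sum(y_true) / n
ss_res = sum(r * r for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.2f}")   # 150.00
print(f"RMSE: {rmse:.2f}")  # 158.11
print(f"R2:   {r2:.4f}")    # 0.9800
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest, which is one way to read the difference between the reported RMSE and MAE.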

Prediction Analysis Visualization

The scatter plot below shows actual vs predicted yields on the test set. Points close to the diagonal line indicate accurate predictions.

Figure: Actual vs Predicted yield showing R² = 0.9627

๐Ÿ” Outlier Analysis

Before creating the synthetic dataset, extensive outlier analysis was performed on the original heterogeneous datasets to understand data quality issues.

Issues Identified in Original Data

| Issue | Affected Records | Resolution |
|---|---|---|
| Temperature > 50 °C (Fahrenheit values) | ~800 records | Identified, led to synthetic data generation |
| Missing Crop + Yield combinations | All 7,109 records | No row had both Crop AND Yield values |
| Inconsistent feature coverage | Variable | 11-99% coverage across features |
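
Checks of this kind can be expressed in a few lines. The sketch below is a hypothetical illustration on four hand-written records, not the project's analysis code: `coverage` computes the share of non-missing values per field, `suspect_temperatures` flags values above the 50 °C threshold named in the table, and the last step counts rows that have both a crop label and a yield.

```python
def coverage(records, field):
    """Fraction of records with a non-missing value for `field`."""
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)


def suspect_temperatures(records, field="Temperature_C", threshold=50.0):
    """Return records whose temperature exceeds a plausible Celsius maximum."""
    return [r for r in records if r.get(field) is not None and r[field] > threshold]


# Hypothetical records in the unified schema; None marks a missing value.
records = [
    {"Crop": "Rice", "Temperature_C": 28.0, "Yield_kg_per_hectare": None},
    {"Crop": None, "Temperature_C": 86.0, "Yield_kg_per_hectare": 4100.0},
    {"Crop": "Wheat", "Temperature_C": None, "Yield_kg_per_hectare": None},
    {"Crop": None, "Temperature_C": 31.0, "Yield_kg_per_hectare": 3900.0},
]

for field in ("Crop", "Temperature_C", "Yield_kg_per_hectare"):
    print(f"{field}: {coverage(records, field):.0%} coverage")

flagged = suspect_temperatures(records)
print(f"{len(flagged)} record(s) above 50 °C")

# Rows usable for supervised training need both a crop label and a yield.
usable = [
    r for r in records
    if r["Crop"] is not None and r["Yield_kg_per_hectare"] is not None
]
print(f"{len(usable)} record(s) with both Crop and Yield")
```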

Outlier Analysis Visualization

Figure: Outlier detection analysis on original dataset

Decision: Synthetic Data Generation

Due to the fundamental data quality issues (no record having both crop type AND yield), the decision was made to generate a synthetic dataset with:

  • 75,000 complete records with realistic correlations
  • 22 crop types with appropriate base yields
  • 20 Indian states with regional climate variations
  • All features populated with domain-appropriate values
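
The generator script itself is not reproduced in this document. As a rough sketch of the idea, the example below draws complete records in which yield depends on a per-crop base value, rainfall, and fertilizer. Every constant in it is an assumption: the base yields, the effect sizes, and the noise level are illustrative placeholders rather than the project's values, and only a few of the documented fields are generated.

```python
import random

# Illustrative placeholders only; not the project's actual base yields.
BASE_YIELD_KG_PER_HA = {"Rice": 4000.0, "Wheat": 3500.0, "Maize": 5000.0}
STATES = ["Punjab", "Maharashtra", "Karnataka"]


def generate_record(rng: random.Random) -> dict:
    """Draw one complete record whose yield depends on crop, rainfall and fertilizer."""
    crop = rng.choice(sorted(BASE_YIELD_KG_PER_HA))
    rainfall = rng.uniform(200, 2500)   # mm, range from the feature tables
    fertilizer = rng.uniform(50, 350)   # kg/ha, range from the feature tables
    # Assumed effect sizes: yield rises with rainfall and fertilizer, plus noise.
    rain_effect = 0.8 + 0.4 * (rainfall - 200) / (2500 - 200)
    fert_effect = 0.9 + 0.2 * (fertilizer - 50) / (350 - 50)
    noise = rng.gauss(1.0, 0.05)
    yield_kg_ha = BASE_YIELD_KG_PER_HA[crop] * rain_effect * fert_effect * noise
    return {
        "Crop": crop,
        "State": rng.choice(STATES),
        "Year": rng.randint(2015, 2024),
        "Rainfall_mm": round(rainfall, 1),
        "Fertilizer_Amount": round(fertilizer, 1),
        "Yield_kg_per_hectare": round(max(0.0, yield_kg_ha), 1),
    }


rng = random.Random(42)  # fixed seed for reproducibility
dataset = [generate_record(rng) for _ in range(1000)]
print(dataset[0])
```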

📈 Feature Importance

Feature importance was extracted from the Gradient Boosting model. The top features driving predictions are crop type, year, rainfall, and fertilizer amount.

Top Feature Importance Rankings

| Feature | Importance |
|---|---|
| Crop_Encoded | ~25% |
| Year | ~18% |
| Rainfall_mm | ~12% |
| Fertilizer_Amount | ~11% |
| Temperature_C | ~9% |
| Soil_Quality | ~7% |
| Other 16 Features | ~18% |

Key Insights

  • Crop Type (~25%) - Most critical factor; different crops have vastly different base yields
  • Year (~18%) - Captures annual variations and agricultural improvements over time
  • Rainfall (~12%) - Water availability is crucial for crop growth
  • Fertilizer Amount (~11%) - Optimal fertilization significantly impacts yield
  • Temperature (~9%) - Each crop has optimal temperature ranges
  • Soil Quality (~7%) - Foundation for healthy crop growth
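
Rankings like these can be read from the `feature_importances_` attribute of a fitted `GradientBoostingRegressor`. The sketch below uses stand-in data and placeholder column names, so its output does not reproduce the percentages above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in data; the column names are placeholders for the project's feature list.
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_regression(n_samples=300, n_features=6, n_informative=3, random_state=42)

model = GradientBoostingRegressor(random_state=42).fit(X, y)

# feature_importances_ holds one impurity-based score per column, summing to 1.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranking:
    print(f"{name:12s} {score:6.1%}")
```

Impurity-based importances describe how the fitted trees use each column rather than a causal effect, so a cross-check such as `sklearn.inspection.permutation_importance` on held-out data can be a useful complement.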

🔌 API Reference

The Flask REST API exposes endpoints for health checks, feature information, and predictions. All responses are JSON formatted.

Base URL

http://localhost:5000

Endpoints

GET /health

Health check endpoint to verify API is running and model is loaded.

GET /features

Returns list of required input features with descriptions and valid ranges.

GET /model-info

Returns model metadata including name, performance metrics, and training date.

POST /predict

Make a single yield prediction. Accepts JSON body with feature values.

POST /predict-batch

Make batch predictions for multiple samples. Accepts array of feature objects.
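
A batch request might be built as follows. This sketch assumes the body is a bare JSON array of feature objects, which is how the description above reads; the response schema is not documented here, so the commented-out call simply prints whatever comes back. `build_batch_request` is a hypothetical helper, and the request is constructed but not sent so the example runs without a server.

```python
import json
import urllib.request

API_URL = "http://localhost:5000"


def build_batch_request(samples: list) -> urllib.request.Request:
    """Build (but do not send) a POST /predict-batch request for the given samples."""
    body = json.dumps(samples).encode("utf-8")
    return urllib.request.Request(
        f"{API_URL}/predict-batch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


samples = [
    {"Crop": "Rice", "State": "Punjab", "Year": 2024, "Rainfall_mm": 1200},
    {"Crop": "Wheat", "State": "Punjab", "Year": 2024, "Rainfall_mm": 600},
]
request = build_batch_request(samples)
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.load(response))
```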

Example: Single Prediction

curl_request.sh
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "Crop": "Rice",
    "State": "Punjab",
    "Year": 2024,
    "Rainfall_mm": 1200,
    "Temperature_C": 28,
    "Humidity": 70,
    "Soil_Quality": 75,
    "Nitrogen": 45,
    "Phosphorus": 35,
    "Potassium": 42,
    "Fertilizer_Amount": 180,
    "Irrigation_Type": "Canal",
    "Seed_Variety": "Hybrid",
    "Pesticide_Usage": 8.5
  }'

Example Response

response.json
{
  "status": "success",
  "predicted_yield_kg_per_hectare": 4256.78,
  "crop": "Rice",
  "state": "Punjab"
}
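
Before sending a request, a client can check its payload against the documented feature ranges. The sketch below is a hypothetical client-side helper, not part of the API: the ranges and categories are copied from the feature tables in this document (only a subset is shown), and whether the server performs similar validation is not stated.

```python
# Ranges and categories copied from the feature tables in this document (subset).
NUMERIC_RANGES = {
    "Rainfall_mm": (200, 2500),
    "Temperature_C": (15, 42),
    "Humidity": (30, 95),
    "Soil_Quality": (40, 100),
    "Fertilizer_Amount": (50, 350),
    "Pesticide_Usage": (0, 20),
    "Year": (2015, 2024),
}
CATEGORIES = {
    "Irrigation_Type": {"Drip", "Sprinkler", "Canal", "Rainfed"},
    "Seed_Variety": {"Local", "Improved", "Hybrid"},
}


def validate_payload(payload: dict) -> list:
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    for name, (low, high) in NUMERIC_RANGES.items():
        if name in payload and not low <= payload[name] <= high:
            problems.append(f"{name}={payload[name]} outside {low}-{high}")
    for name, allowed in CATEGORIES.items():
        if name in payload and payload[name] not in allowed:
            problems.append(f"{name}={payload[name]!r} not in {sorted(allowed)}")
    return problems


ok = {"Crop": "Rice", "Rainfall_mm": 1200, "Irrigation_Type": "Canal", "Year": 2024}
bad = {"Crop": "Rice", "Rainfall_mm": 5000, "Irrigation_Type": "Flood"}

print(validate_payload(ok))
print(validate_payload(bad))
```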

🚀 Deployment

Local Development

run_local.sh
# Activate virtual environment
source .venv/Scripts/activate  # Windows (Git Bash)
# source .venv/bin/activate    # Linux/Mac

# Install dependencies
pip install -r Phase-2/api/requirements.txt

# Run Flask development server
cd Phase-2/api
python app.py

# Server runs on http://localhost:5000

Production Deployment (Gunicorn)

run_production.sh
# Install gunicorn
pip install gunicorn

# Run with gunicorn (4 workers)
cd Phase-2/api
gunicorn -w 4 -b 0.0.0.0:5000 app:app

Docker Deployment

Dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY api/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY api/ ./api/
COPY model/ ./model/

EXPOSE 5000

CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:5000", "api.app:app"]

💻 Code Samples

Python Client

client.py
import requests

API_URL = "http://localhost:5000"

def predict_yield(features: dict) -> float:
    """
    Make a yield prediction via the API.
    
    Args:
        features: Dictionary of input features
        
    Returns:
        Predicted yield in kg/hectare
    """
    response = requests.post(
        f"{API_URL}/predict",
        json=features,
        timeout=10,  # avoid hanging indefinitely if the server is unreachable
    )
    result = response.json()

    if result.get('status') == 'success':
        return result['predicted_yield_kg_per_hectare']
    raise RuntimeError(result.get('error', 'Unknown error'))

# Example usage
features = {
    "Crop": "Rice",
    "State": "Punjab",
    "Year": 2024,
    "Rainfall_mm": 1200,
    "Temperature_C": 28,
    "Humidity": 70,
    "Soil_Quality": 75,
    "Fertilizer_Amount": 180,
    "Irrigation_Type": "Canal",
    "Seed_Variety": "Hybrid"
}

predicted_yield = predict_yield(features)
print(f"Predicted Yield: {predicted_yield:.2f} kg/ha")

JavaScript/Fetch Client

client.js
async function predictYield(features) {
    const response = await fetch('http://localhost:5000/predict', {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
        },
        body: JSON.stringify(features)
    });
    
    const result = await response.json();
    
    if (result.status === 'success') {
        return result.predicted_yield_kg_per_hectare;
    } else {
        throw new Error(result.error || 'Unknown error');
    }
}

// Usage
const features = {
    Crop: "Rice",
    State: "Punjab",
    Year: 2024,
    Rainfall_mm: 1200,
    Temperature_C: 28,
    Humidity: 70,
    Soil_Quality: 75,
    Irrigation_Type: "Canal",
    Seed_Variety: "Hybrid"
};

predictYield(features)
    .then(yieldValue => console.log(`Predicted: ${yieldValue} kg/ha`))
    .catch(err => console.error(err));

Load Model for Local Inference

local_inference.py
import joblib
import numpy as np

# Load artifacts
model = joblib.load('model/model.pkl')
scaler = joblib.load('model/scaler.pkl')
label_encoders = joblib.load('model/label_encoders.pkl')
features = joblib.load('model/feature_list.pkl')

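def predict(input_dict):
    """Make a prediction without the API."""
    row = dict(input_dict)  # work on a copy so the caller's dict is not mutated

    # Encode categorical features with the encoders fitted at training time
    for col in ['Crop', 'State', 'Irrigation_Type', 'Seed_Variety']:
        if col in row:
            row[f'{col}_Encoded'] = label_encoders[col].transform([row[col]])[0]

    # Build the feature vector in training order; features missing from the
    # input default to 0, so supply every feature for a meaningful prediction
    X = np.array([[row.get(f, 0) for f in features]])

    # Scale, predict, and clamp negative predictions to zero
    X_scaled = scaler.transform(X)
    return max(0.0, float(model.predict(X_scaled)[0]))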

# Usage
result = predict({
    "Crop": "Rice", "State": "Punjab", "Year": 2024,
    "Rainfall_mm": 1200, "Temperature_C": 28,
    "Humidity": 70, "Soil_Quality": 75
})
print(f"Yield: {result:.2f} kg/ha")