Project Overview
This project implements an end-to-end machine learning pipeline for predicting crop yield based on environmental, soil, and management factors. The system processes multiple heterogeneous datasets, performs feature engineering, trains multiple regression models, and serves predictions via a REST API.
Tech Stack
| Component | Technology | Version |
|---|---|---|
| Language | Python | 3.11.9 |
| Data Processing | pandas, numpy | 2.3.x, 2.3.x |
| ML Framework | scikit-learn | 1.5.x |
| Visualization | matplotlib, seaborn | 3.9.x, 0.13.x |
| API Framework | Flask | 3.1.x |
| Model Serialization | joblib | 1.4.x |
System Architecture
The system follows a modular architecture with clear separation between data processing, model training, and serving layers.
Figure: Architecture diagram showing the Data Pipeline → Feature Engineering → Model Training → API serving flow.
Phase-2/
├── unified_dataset.csv             # 75,000 synthetic records
├── generate_synthetic_dataset.py   # Dataset generator
├── train_model.py                  # ML training pipeline
│
├── model/
│   ├── model.pkl                   # Trained GradientBoostingRegressor
│   ├── scaler.pkl                  # StandardScaler for normalization
│   ├── label_encoders.pkl          # Categorical encoders (Crop, State, etc.)
│   ├── feature_list.pkl            # 22 feature names for inference
│   └── model_info.pkl              # Model metadata and metrics
│
├── api/
│   └── app.py                      # Flask REST API
│
└── dashboard/
    ├── index.html                  # User-facing prediction UI
    ├── technical.html              # This documentation
    └── style.css, script.js
Data Pipeline
Due to data quality issues in the original heterogeneous datasets (missing values, inconsistent schemas, incomplete coverage), a synthetic dataset of 75,000 records was generated with realistic correlations between agricultural features. The synthetic data covers 22 crops across 20 Indian states from 2015-2024.
Synthetic Dataset Properties
| Property | Value | Description |
|---|---|---|
| Total Records | 75,000 | Complete records with no missing values |
| Crops | 22 types | Rice, Wheat, Maize, Cotton, Sugarcane, Soybean, etc. |
| States | 20 Indian states | Punjab, Maharashtra, UP, MP, Karnataka, etc. |
| Years | 2015-2024 | 10-year range with seasonal variations |
| Features | 27 columns | 22 used for ML prediction |
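generate_synthetic_dataset.py is not reproduced here; the following is a minimal sketch of how records with these correlations could be produced. The BASE_YIELD values and effect weights are illustrative assumptions, not the generator's actual coefficients:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 75_000

# Illustrative base yields in kg/ha (not the actual generator values)
BASE_YIELD = {'Rice': 4000, 'Wheat': 3500, 'Maize': 5500, 'Soybean': 2500}

crops = rng.choice(list(BASE_YIELD), size=n)
rainfall_mm = rng.uniform(200, 2500, size=n)
fertilizer = rng.uniform(50, 350, size=n)

# Yield is correlated with rainfall and fertilizer, plus multiplicative noise
base = np.array([BASE_YIELD[c] for c in crops])
yield_kg = (base
            * (0.7 + 0.3 * rainfall_mm / 2500)   # water availability effect
            * (0.8 + 0.2 * fertilizer / 350)     # fertilization effect
            * rng.normal(1.0, 0.05, size=n))     # residual random variation

df = pd.DataFrame({'Crop': crops, 'Rainfall_mm': rainfall_mm,
                   'Fertilizer_Amount': fertilizer,
                   'Yield_kg_per_hectare': yield_kg})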
Column Mapping Strategy
COLUMN_MAPPING = {
# Soil Nutrients
'N_SOIL': 'Nitrogen',
'P_SOIL': 'Phosphorus',
'K_SOIL': 'Potassium',
'Nitrogen (N)': 'Nitrogen',
# Temperature variations
'TEMPERATURE': 'Temperature_C',
'Air temperature (C)': 'Temperature_C',
'Temperatue': 'Temperature_C', # typo in source
'Mean Temp': 'Temperature_C',
# Humidity variations
'HUMIDITY': 'Humidity',
'Air humidity (%)': 'Humidity',
'Average Humidity': 'Humidity',
# Target variable mappings
'Yield_kg_per_hectare': 'Yield_kg_per_hectare',
'Crop Yield': 'Yield_kg_per_hectare',
'Yeild (Q/acre)': 'Yield_kg_per_hectare',
'millet yield': 'Yield_kg_per_hectare',
}
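Applying the mapping is a plain pandas rename; a minimal sketch (the source file name is hypothetical, and columns absent from the mapping pass through unchanged):

import pandas as pd

raw = pd.read_csv('source_dataset.csv')       # hypothetical source file
df = raw.rename(columns=COLUMN_MAPPING)       # unify column names across sources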
Data Field Status
With the synthetic dataset, all 75,000 records have complete values for all features. No imputation is required, ensuring high-quality predictions across all input combinations.
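This is easy to verify directly; a quick sanity check, assuming unified_dataset.csv is in the working directory:

import pandas as pd

df = pd.read_csv('unified_dataset.csv')
print(df.shape)                     # (75000, 27)
assert df.isna().sum().sum() == 0   # every cell populated, so no imputation step is needed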
Complete Feature Coverage (75K samples each)
Weather Features
All weather parameters with 100% coverage
| Feature | Samples | Range |
|---|---|---|
| Rainfall_mm | 75,000 | 200 - 2500 mm |
| Temperature_C | 75,000 | 15 - 42 °C |
| Humidity | 75,000 | 30 - 95% |
| Sunshine_Hours | 75,000 | 4 - 10 hrs |
| GDD | 75,000 | 1000 - 3000 |
| Pressure_KPa | 75,000 | 95 - 105 kPa |
| Wind_Speed_Kmh | 75,000 | 5 - 30 km/h |
Soil & Nutrient Features
Complete soil parameters for all records
| Feature | Samples | Range |
|---|---|---|
| Soil_Quality | 75,000 | 40 - 100 |
| Nitrogen | 75,000 | 20 - 120 kg/ha |
| Phosphorus | 75,000 | 10 - 80 kg/ha |
| Potassium | 75,000 | 15 - 100 kg/ha |
| Soil_pH | 75,000 | 5.5 - 8.5 |
| OrganicCarbon | 75,000 | 0.3 - 2.5% |
| Soil_Moisture | 75,000 | 20 - 70% |
Management & Location Features
Categorical and management parameters
| Feature | Samples | Categories/Range |
|---|---|---|
| Crop | 75,000 | 22 crop types |
| State | 75,000 | 20 Indian states |
| Year | 75,000 | 2015 - 2024 |
| Fertilizer_Amount | 75,000 | 50 - 350 kg/ha |
| Irrigation_Type | 75,000 | Drip, Sprinkler, Canal, Rainfed |
| Seed_Variety | 75,000 | Local, Improved, Hybrid |
| Pesticide_Usage | 75,000 | 0 - 20 kg/ha |
Feature Engineering
The final model uses 22 features across weather, soil, management, and location categories. The dataset is synthetically generated with 75,000 records covering 22 crops across 20 Indian states from 2015-2024, ensuring complete data coverage with realistic correlations.
Feature Schema (22 ML Features)
| Category | Feature | Type | Range | Preprocessing |
|---|---|---|---|---|
| Weather | Rainfall_mm | float64 | 200 - 2500 | StandardScaler |
| Weather | Temperature_C | float64 | 15 - 42 | StandardScaler |
| Weather | Humidity | float64 | 30 - 95 | StandardScaler |
| Weather | Sunshine_Hours | float64 | 4 - 10 | StandardScaler |
| Weather | GDD | float64 | 1000 - 3000 | StandardScaler |
| Weather | Pressure_KPa | float64 | 95 - 105 | StandardScaler |
| Weather | Wind_Speed_Kmh | float64 | 5 - 30 | StandardScaler |
| Soil | Soil_Quality | float64 | 40 - 100 | StandardScaler |
| Soil | Soil_Moisture | float64 | 20 - 70 | StandardScaler |
| Soil | Nitrogen | float64 | 20 - 120 | StandardScaler |
| Soil | Phosphorus | float64 | 10 - 80 | StandardScaler |
| Soil | Potassium | float64 | 15 - 100 | StandardScaler |
| Soil | Soil_pH | float64 | 5.5 - 8.5 | StandardScaler |
| Soil | OrganicCarbon | float64 | 0.3 - 2.5 | StandardScaler |
| Management | Fertilizer_Amount | float64 | 50 - 350 | StandardScaler |
| Management | Irrigation_Type | categorical | 4 types | LabelEncoder |
| Management | Seed_Variety | categorical | 3 types | LabelEncoder |
| Management | Pesticide_Usage | float64 | 0 - 20 | StandardScaler |
| Management | Crop | categorical | 22 types | LabelEncoder |
| Management | State | categorical | 20 states | LabelEncoder |
| Time | Year | int64 | 2015 - 2024 | StandardScaler |
Preprocessing Pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# 1. Handle categorical features
label_encoders = {}
for col in ['Crop', 'State', 'Irrigation_Type', 'Seed_Variety']:
    le = LabelEncoder()
    df[f'{col}_Encoded'] = le.fit_transform(df[col])
    label_encoders[col] = le

# 2. Standardize numerical features (no imputation needed - synthetic data is complete)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
Model Training
Four regression algorithms were trained and evaluated using 5-fold cross-validation. The Gradient Boosting Regressor achieved the best overall performance with an R² of 0.9627 and was selected as the production model based on multiple criteria including generalization, inference speed, and model size.
Algorithms Compared
⚠️ Random Forest: good performance, but slower inference, a larger model size, and lower CV consistency compared to GB.
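The comparison harness itself is not shown on this page; below is a sketch of how four candidates could be scored with 5-fold CV. Only Gradient Boosting and Random Forest are confirmed above; the other two candidates here are assumptions:

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

candidates = {
    'GradientBoosting': GradientBoostingRegressor(random_state=42),
    'RandomForest': RandomForestRegressor(random_state=42),
    'DecisionTree': DecisionTreeRegressor(random_state=42),   # assumed candidate
    'LinearRegression': LinearRegression(),                   # assumed candidate
}

for name, est in candidates.items():
    scores = cross_val_score(est, X_train, y_train, cv=5, scoring='r2')
    print(f"{name}: mean R² = {scores.mean():.4f} (±{scores.std():.4f})")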
Model Selection Justification
Why Gradient Boosting over Random Forest?
| Criteria | Gradient Boosting | Random Forest | Winner |
|---|---|---|---|
| Test R² Score | 0.9627 | 0.9594 | ✓ GB |
| Cross-Val R² (5-fold) | 0.9603 | 0.9512 | ✓ GB |
| Inference Speed | ~15 ms | ~45 ms | ✓ GB (~3x faster) |
| Model Size | ~2.1 MB | ~8.5 MB | ✓ GB (~4x smaller) |
| Generalization | Sequential boosting reduces bias | Higher variance on new data | ✓ GB |
Conclusion: Gradient Boosting provides better generalization, faster predictions, and a more compact model while achieving higher R² on both the test and cross-validation sets.
Model Configuration
from sklearn.ensemble import GradientBoostingRegressor
# Best Model Configuration
model = GradientBoostingRegressor(
n_estimators=100, # Number of boosting stages
learning_rate=0.1, # Shrinks contribution of each tree
max_depth=3, # Maximum depth of individual trees
min_samples_split=2, # Min samples to split internal node
min_samples_leaf=1, # Min samples at leaf node
subsample=1.0, # Fraction of samples for fitting trees
random_state=42 # Reproducibility seed
)
# Training
model.fit(X_train, y_train)
# Save model
import joblib
joblib.dump(model, 'model/model.pkl')
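model.pkl is only one of the five artifacts listed in the directory tree above. A sketch of persisting the rest alongside it, assuming feature_names holds the 22 training column names; the metadata keys are illustrative:

# Persist preprocessing artifacts next to the model
joblib.dump(scaler, 'model/scaler.pkl')
joblib.dump(label_encoders, 'model/label_encoders.pkl')
joblib.dump(feature_names, 'model/feature_list.pkl')   # 22 names, in training order
joblib.dump({'name': 'GradientBoostingRegressor',      # illustrative metadata keys
             'r2': 0.9627, 'mae': 1610.29, 'rmse': 3573.78},
            'model/model_info.pkl')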
Model Evaluation
The model was evaluated using standard regression metrics on the held-out test set (20% of data).
Evaluation Metrics Explained
| Metric | Formula | Value | Interpretation |
|---|---|---|---|
| R² (Coefficient of Determination) | 1 - (SS_res / SS_tot) | 0.9627 | 96.27% of variance explained |
| MAE (Mean Absolute Error) | mean(\|y - ŷ\|) | 1,610 kg/ha | Average prediction error |
| RMSE (Root Mean Squared Error) | sqrt(mean((y - ŷ)²)) | 3,574 kg/ha | Penalizes large errors |
| Cross-Validation R² | mean(5-fold R²) | 0.9603 | Generalization performance |
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# Predictions on test set
y_pred = model.predict(X_test)

# Calculate metrics
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"R² Score: {r2:.4f}")   # 0.9627
print(f"MAE: {mae:.2f}")       # 1610.29
print(f"RMSE: {rmse:.2f}")     # 3573.78
Prediction Analysis Visualization
The scatter plot below shows actual vs predicted yields on the test set. Points close to the diagonal line indicate accurate predictions.
Figure: Actual vs Predicted yield showing R² = 0.9627
Outlier Analysis
Before creating the synthetic dataset, extensive outlier analysis was performed on the original heterogeneous datasets to understand data quality issues.
Issues Identified in Original Data
| Issue | Affected Records | Resolution |
|---|---|---|
| Temperature > 50 °C (Fahrenheit values) | ~800 records | Identified; led to synthetic data generation |
| Missing Crop + Yield combinations | All 7,109 records | No row had both Crop AND Yield values |
| Inconsistent feature coverage | Variable | 11-99% coverage across features |
Outlier Analysis Visualization
Figure: Outlier detection analysis on original dataset
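The screening code itself is not part of this page; below is a minimal sketch of the kind of rule-based and IQR checks described above. IQR screening is an assumed technique here, the file name is hypothetical, and column names follow the post-mapping schema:

import pandas as pd

df = pd.read_csv('merged_original.csv')  # hypothetical merged original data

# Rule-based screen: Celsius readings above 50 are almost certainly Fahrenheit values
suspect_fahrenheit = df[df['Temperature_C'] > 50]
print(len(suspect_fahrenheit))  # ~800 in the original data

# Generic IQR screen for a numeric column
q1, q3 = df['Temperature_C'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Temperature_C'] < q1 - 1.5 * iqr) |
              (df['Temperature_C'] > q3 + 1.5 * iqr)]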
Decision: Synthetic Data Generation
Due to the fundamental data quality issues (no record having both crop type AND yield), the decision was made to generate a synthetic dataset with:
- 75,000 complete records with realistic correlations
- 22 crop types with appropriate base yields
- 20 Indian states with regional climate variations
- All features populated with domain-appropriate values
Feature Importance
Feature importance was extracted from the trained Gradient Boosting model. The top features driving predictions are crop type, year, and rainfall, followed by management and soil parameters (see the rankings below).
Top Feature Importance Rankings
Key Insights
- Crop Type (~25%) - Most critical factor; different crops have vastly different base yields
- Year (~18%) - Captures annual variations and agricultural improvements over time
- Rainfall (~12%) - Water availability is crucial for crop growth
- Fertilizer Amount (~11%) - Optimal fertilization significantly impacts yield
- Temperature (~9%) - Each crop has optimal temperature ranges
- Soil Quality (~7%) - Foundation for healthy crop growth
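Rankings like these can be read directly off the fitted estimator via its feature_importances_ attribute; a sketch, assuming feature_names holds the 22 training columns in order:

import pandas as pd

importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(6))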
API Reference
The Flask REST API exposes endpoints for health checks, feature information, and predictions. All responses are JSON formatted.
Base URL
http://localhost:5000
Endpoints
- Health check: verifies the API is running and the model is loaded.
- Features: returns the list of required input features with descriptions and valid ranges.
- Model info: returns model metadata including name, performance metrics, and training date.
- Predict (POST /predict): makes a single yield prediction from a JSON body of feature values.
- Batch predict: makes batch predictions for multiple samples from an array of feature objects.
Example: Single Prediction
curl -X POST http://localhost:5000/predict \
-H "Content-Type: application/json" \
-d '{
"Crop": "Rice",
"State": "Punjab",
"Year": 2024,
"Rainfall_mm": 1200,
"Temperature_C": 28,
"Humidity": 70,
"Soil_Quality": 75,
"Nitrogen": 45,
"Phosphorus": 35,
"Potassium": 42,
"Fertilizer_Amount": 180,
"Irrigation_Type": "Canal",
"Seed_Variety": "Hybrid",
"Pesticide_Usage": 8.5
}'
Example Response
{
"status": "success",
"predicted_yield_kg_per_hectare": 4256.78,
"crop": "Rice",
"state": "Punjab"
}
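app.py itself is not reproduced on this page; below is a minimal sketch of what its /predict handler could look like, reusing the predict() helper from the Code Samples section below. The error response shape is an assumption consistent with the client examples:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict_endpoint():
    try:
        payload = request.get_json(force=True)
        yield_pred = predict(payload)  # helper from "Load Model for Local Inference"
        return jsonify({
            'status': 'success',
            'predicted_yield_kg_per_hectare': round(float(yield_pred), 2),
            'crop': payload.get('Crop'),
            'state': payload.get('State'),
        })
    except Exception as exc:
        return jsonify({'status': 'error', 'error': str(exc)}), 400

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)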
Deployment
Local Development
# Activate virtual environment
source .venv/Scripts/activate # Windows
# source .venv/bin/activate # Linux/Mac
# Install dependencies
pip install -r api/requirements.txt
# Run Flask development server
cd Phase-2/api
python app.py
# Server runs on http://localhost:5000
Production Deployment (Gunicorn)
# Install gunicorn
pip install gunicorn
# Run with gunicorn (4 workers)
cd Phase-2/api
gunicorn -w 4 -b 0.0.0.0:5000 app:app
Docker Deployment
FROM python:3.11-slim
WORKDIR /app
COPY api/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY api/ ./api/
COPY model/ ./model/
EXPOSE 5000
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:5000", "api.app:app"]
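Assuming the Dockerfile sits at the project root and gunicorn is listed in api/requirements.txt, build and run are the usual two commands (the image tag is illustrative):

# Build the image
docker build -t crop-yield-api .
# Run the container, publishing the API port
docker run -p 5000:5000 crop-yield-api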
Code Samples
Python Client
import requests
API_URL = "http://localhost:5000"
def predict_yield(features: dict) -> float:
    """
    Make a yield prediction via the API.

    Args:
        features: Dictionary of input features

    Returns:
        Predicted yield in kg/hectare
    """
    response = requests.post(
        f"{API_URL}/predict",
        json=features
    )
    result = response.json()
    if result['status'] == 'success':
        return result['predicted_yield_kg_per_hectare']
    else:
        raise Exception(result.get('error', 'Unknown error'))
# Example usage
features = {
"Crop": "Rice",
"State": "Punjab",
"Year": 2024,
"Rainfall_mm": 1200,
"Temperature_C": 28,
"Humidity": 70,
"Soil_Quality": 75,
"Fertilizer_Amount": 180,
"Irrigation_Type": "Canal",
"Seed_Variety": "Hybrid"
}
predicted_yield = predict_yield(features)
print(f"Predicted Yield: {predicted_yield:.2f} kg/ha")
JavaScript/Fetch Client
async function predictYield(features) {
  const response = await fetch('http://localhost:5000/predict', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(features)
  });
  const result = await response.json();
  if (result.status === 'success') {
    return result.predicted_yield_kg_per_hectare;
  } else {
    throw new Error(result.error || 'Unknown error');
  }
}
// Usage
const features = {
Crop: "Rice",
State: "Punjab",
Year: 2024,
Rainfall_mm: 1200,
Temperature_C: 28,
Humidity: 70,
Soil_Quality: 75,
Irrigation_Type: "Canal",
Seed_Variety: "Hybrid"
};
predictYield(features)
.then(yieldValue => console.log(`Predicted: ${yieldValue} kg/ha`))
.catch(err => console.error(err));
Load Model for Local Inference
import joblib
import numpy as np
# Load artifacts
model = joblib.load('model/model.pkl')
scaler = joblib.load('model/scaler.pkl')
label_encoders = joblib.load('model/label_encoders.pkl')
features = joblib.load('model/feature_list.pkl')
def predict(input_dict):
    """Make a prediction locally, without the API."""
    # Encode categorical features using the saved encoders
    for col in ['Crop', 'State', 'Irrigation_Type', 'Seed_Variety']:
        if col in input_dict:
            input_dict[f'{col}_Encoded'] = label_encoders[col].transform([input_dict[col]])[0]
    # Build feature vector (features missing from input default to 0)
    X = np.array([[input_dict.get(f, 0) for f in features]])
    # Preprocess and predict; negative predictions are clamped to 0
    X_scaled = scaler.transform(X)
    return max(0, model.predict(X_scaled)[0])
# Usage
result = predict({
"Crop": "Rice", "State": "Punjab", "Year": 2024,
"Rainfall_mm": 1200, "Temperature_C": 28,
"Humidity": 70, "Soil_Quality": 75
})
print(f"Yield: {result:.2f} kg/ha")