Project Overview
This project implements an end-to-end machine learning pipeline for predicting crop yield based on environmental, soil, and management factors. The system processes multiple heterogeneous datasets, performs feature engineering, trains multiple regression models, and serves predictions via a REST API.
Tech Stack
| Component | Technology | Version |
|---|---|---|
| Language | Python | 3.11.9 |
| Data Processing | pandas, numpy | 2.3.x, 2.3.x |
| ML Framework | scikit-learn | 1.5.x |
| Visualization | matplotlib, seaborn | 3.9.x, 0.13.x |
| API Framework | Flask | 3.1.x |
| Model Serialization | joblib | 1.4.x |
System Architecture
The system follows a modular architecture with clear separation between data processing, model training, and serving layers.
Figure: Architecture diagram showing the Data Pipeline → Feature Engineering → Model Training → API serving flow.
Phase-2/
├── unified_dataset.csv             # 75,000 synthetic records
├── generate_synthetic_dataset.py   # Dataset generator
├── train_model.py                  # ML training pipeline
│
├── model/
│   ├── model.pkl                   # Trained GradientBoostingRegressor
│   ├── scaler.pkl                  # StandardScaler for normalization
│   ├── label_encoders.pkl          # Categorical encoders (Crop, State, etc.)
│   ├── feature_list.pkl            # 22 feature names for inference
│   └── model_info.pkl              # Model metadata and metrics
│
├── api/
│   └── app.py                      # Flask REST API
│
└── dashboard/
    ├── index.html                  # User-facing prediction UI
    ├── technical.html              # This documentation
    └── style.css, script.js
Data Pipeline
Due to data quality issues in the original heterogeneous datasets (missing values, inconsistent schemas, incomplete coverage), a synthetic dataset of 75,000 records was generated with realistic correlations between agricultural features. The synthetic data covers 22 crops across 20 Indian states from 2015-2024.
Synthetic Dataset Properties
| Property | Value | Description |
|---|---|---|
| Total Records | 75,000 | Complete records with no missing values |
| Crops | 22 types | Rice, Wheat, Maize, Cotton, Sugarcane, Soybean, etc. |
| States | 20 Indian states | Punjab, Maharashtra, UP, MP, Karnataka, etc. |
| Years | 2015-2024 | 10-year range with seasonal variations |
| Features | 27 columns | 22 used for ML prediction |
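generate_synthetic_dataset.py is not reproduced here; the following is a minimal sketch of how records with these correlations could be produced. The BASE_YIELD values and effect weights are illustrative assumptions, not the generator's actual coefficients:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 75_000

# Illustrative base yields in kg/ha (not the actual generator values)
BASE_YIELD = {'Rice': 4000, 'Wheat': 3500, 'Maize': 5500, 'Soybean': 2500}

crops = rng.choice(list(BASE_YIELD), size=n)
rainfall_mm = rng.uniform(200, 2500, size=n)
fertilizer = rng.uniform(50, 350, size=n)

# Yield is correlated with rainfall and fertilizer, plus multiplicative noise
base = np.array([BASE_YIELD[c] for c in crops])
yield_kg = (base
            * (0.7 + 0.3 * rainfall_mm / 2500)   # water availability effect
            * (0.8 + 0.2 * fertilizer / 350)     # fertilization effect
            * rng.normal(1.0, 0.05, size=n))     # residual random variation

df = pd.DataFrame({'Crop': crops, 'Rainfall_mm': rainfall_mm,
                   'Fertilizer_Amount': fertilizer,
                   'Yield_kg_per_hectare': yield_kg})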
Column Mapping Strategy
COLUMN_MAPPING = {
# Soil Nutrients
'N_SOIL': 'Nitrogen',
'P_SOIL': 'Phosphorus',
'K_SOIL': 'Potassium',
'Nitrogen (N)': 'Nitrogen',
# Temperature variations
'TEMPERATURE': 'Temperature_C',
'Air temperature (C)': 'Temperature_C',
'Temperatue': 'Temperature_C', # typo in source
'Mean Temp': 'Temperature_C',
# Humidity variations
'HUMIDITY': 'Humidity',
'Air humidity (%)': 'Humidity',
'Average Humidity': 'Humidity',
# Target variable mappings
'Yield_kg_per_hectare': 'Yield_kg_per_hectare',
'Crop Yield': 'Yield_kg_per_hectare',
'Yeild (Q/acre)': 'Yield_kg_per_hectare',
'millet yield': 'Yield_kg_per_hectare',
}
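Applying the mapping is a plain pandas rename; a minimal sketch (the source file name is hypothetical, and columns absent from the mapping pass through unchanged):

import pandas as pd

raw = pd.read_csv('source_dataset.csv')       # hypothetical source file
df = raw.rename(columns=COLUMN_MAPPING)       # unify column names across sources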
Data Field Status
With the synthetic dataset, all 75,000 records have complete values for all features. No imputation is required, ensuring high-quality predictions across all input combinations.
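This is easy to verify directly; a quick sanity check, assuming unified_dataset.csv is in the working directory:

import pandas as pd

df = pd.read_csv('unified_dataset.csv')
print(df.shape)                     # (75000, 27)
assert df.isna().sum().sum() == 0   # every cell populated, so no imputation step is needed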
Complete Feature Coverage (75K samples each)
Weather Features
All weather parameters with 100% coverage
| Feature | Samples | Range |
|---|---|---|
| Rainfall_mm | 75,000 | 200 - 2500 mm |
| Temperature_C | 75,000 | 15 - 42 °C |
| Humidity | 75,000 | 30 - 95% |
| Sunshine_Hours | 75,000 | 4 - 10 hrs |
| GDD | 75,000 | 1000 - 3000 |
| Pressure_KPa | 75,000 | 95 - 105 kPa |
| Wind_Speed_Kmh | 75,000 | 5 - 30 km/h |
Soil & Nutrient Features
Complete soil parameters for all records
| Feature | Samples | Range |
|---|---|---|
| Soil_Quality | 75,000 | 40 - 100 |
| Nitrogen | 75,000 | 20 - 120 kg/ha |
| Phosphorus | 75,000 | 10 - 80 kg/ha |
| Potassium | 75,000 | 15 - 100 kg/ha |
| Soil_pH | 75,000 | 5.5 - 8.5 |
| OrganicCarbon | 75,000 | 0.3 - 2.5% |
| Soil_Moisture | 75,000 | 20 - 70% |
Management & Location Features
Categorical and management parameters
| Feature | Samples | Categories/Range |
|---|---|---|
| Crop | 75,000 | 22 crop types |
| State | 75,000 | 20 Indian states |
| Year | 75,000 | 2015 - 2024 |
| Fertilizer_Amount | 75,000 | 50 - 350 kg/ha |
| Irrigation_Type | 75,000 | Drip, Sprinkler, Canal, Rainfed |
| Seed_Variety | 75,000 | Local, Improved, Hybrid |
| Pesticide_Usage | 75,000 | 0 - 20 kg/ha |
Feature Engineering
The final model uses 22 features across weather, soil, management, and location categories. The dataset is synthetically generated with 75,000 records covering 22 crops across 20 Indian states from 2015-2024, ensuring complete data coverage with realistic correlations.
Feature Schema (22 ML Features)
| Category | Feature | Type | Range | Preprocessing |
|---|---|---|---|---|
| Weather | Rainfall_mm | float64 | 200 - 2500 | StandardScaler |
| Weather | Temperature_C | float64 | 15 - 42 | StandardScaler |
| Weather | Humidity | float64 | 30 - 95 | StandardScaler |
| Weather | Sunshine_Hours | float64 | 4 - 10 | StandardScaler |
| Weather | GDD | float64 | 1000 - 3000 | StandardScaler |
| Weather | Pressure_KPa | float64 | 95 - 105 | StandardScaler |
| Weather | Wind_Speed_Kmh | float64 | 5 - 30 | StandardScaler |
| Soil | Soil_Quality | float64 | 40 - 100 | StandardScaler |
| Soil | Soil_Moisture | float64 | 20 - 70 | StandardScaler |
| Soil | Nitrogen | float64 | 20 - 120 | StandardScaler |
| Soil | Phosphorus | float64 | 10 - 80 | StandardScaler |
| Soil | Potassium | float64 | 15 - 100 | StandardScaler |
| Soil | Soil_pH | float64 | 5.5 - 8.5 | StandardScaler |
| Soil | OrganicCarbon | float64 | 0.3 - 2.5 | StandardScaler |
| Management | Fertilizer_Amount | float64 | 50 - 350 | StandardScaler |
| Management | Irrigation_Type | categorical | 4 types | LabelEncoder |
| Management | Seed_Variety | categorical | 3 types | LabelEncoder |
| Management | Pesticide_Usage | float64 | 0 - 20 | StandardScaler |
| Management | Crop | categorical | 22 types | LabelEncoder |
| Management | State | categorical | 20 states | LabelEncoder |
| Time | Year | int64 | 2015 - 2024 | StandardScaler |
Preprocessing Pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# 1. Handle categorical features
label_encoders = {}
for col in ['Crop', 'State', 'Irrigation_Type', 'Seed_Variety']:
    le = LabelEncoder()
    df[f'{col}_Encoded'] = le.fit_transform(df[col])
    label_encoders[col] = le

# 2. Standardize numerical features (no imputation needed - synthetic data is complete)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
Model Training
Four regression algorithms were trained and evaluated using 5-fold cross-validation. The Gradient Boosting Regressor achieved the best overall performance with an R² of 0.9627 and was selected as the production model based on multiple criteria including generalization, inference speed, and model size.
Algorithms Compared
⚠️ Random Forest: good performance, but slower inference, a larger model size, and lower CV consistency compared to GB.
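The comparison harness itself is not shown on this page; below is a sketch of how four candidates could be scored with 5-fold CV. Only Gradient Boosting and Random Forest are confirmed above; the other two candidates here are assumptions:

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

candidates = {
    'GradientBoosting': GradientBoostingRegressor(random_state=42),
    'RandomForest': RandomForestRegressor(random_state=42),
    'DecisionTree': DecisionTreeRegressor(random_state=42),   # assumed candidate
    'LinearRegression': LinearRegression(),                   # assumed candidate
}

for name, est in candidates.items():
    scores = cross_val_score(est, X_train, y_train, cv=5, scoring='r2')
    print(f"{name}: mean R² = {scores.mean():.4f} (±{scores.std():.4f})")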
Model Selection Justification
Why Gradient Boosting over Random Forest?
| Criteria | Gradient Boosting | Random Forest | Winner |
|---|---|---|---|
| Test R² Score | 0.9627 | 0.9594 | ✓ GB |
| Cross-Val R² (5-fold) | 0.9603 | 0.9512 | ✓ GB |
| Inference Speed | ~15 ms | ~45 ms | ✓ GB (~3x faster) |
| Model Size | ~2.1 MB | ~8.5 MB | ✓ GB (~4x smaller) |
| Generalization | Sequential boosting reduces bias | Higher variance on new data | ✓ GB |
Conclusion: Gradient Boosting provides better generalization, faster predictions, and a more compact model while achieving higher R² on both the test and cross-validation sets.
Model Configuration
from sklearn.ensemble import GradientBoostingRegressor
# Best Model Configuration
model = GradientBoostingRegressor(
n_estimators=100, # Number of boosting stages
learning_rate=0.1, # Shrinks contribution of each tree
max_depth=3, # Maximum depth of individual trees
min_samples_split=2, # Min samples to split internal node
min_samples_leaf=1, # Min samples at leaf node
subsample=1.0, # Fraction of samples for fitting trees
random_state=42 # Reproducibility seed
)
# Training
model.fit(X_train, y_train)
# Save model
import joblib
joblib.dump(model, 'model/model.pkl')
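model.pkl is only one of the five artifacts listed in the directory tree above. A sketch of persisting the rest alongside it, assuming feature_names holds the 22 training column names; the metadata keys are illustrative:

# Persist preprocessing artifacts next to the model
joblib.dump(scaler, 'model/scaler.pkl')
joblib.dump(label_encoders, 'model/label_encoders.pkl')
joblib.dump(feature_names, 'model/feature_list.pkl')   # 22 names, in training order
joblib.dump({'name': 'GradientBoostingRegressor',      # illustrative metadata keys
             'r2': 0.9627, 'mae': 1610.29, 'rmse': 3573.78},
            'model/model_info.pkl')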
Model Evaluation
The model was evaluated using standard regression metrics on the held-out test set (20% of data).
Evaluation Metrics Explained
| Metric | Formula | Value | Interpretation |
|---|---|---|---|
| R² (Coefficient of Determination) | 1 - (SS_res / SS_tot) | 0.9627 | 96.27% of variance explained |
| MAE (Mean Absolute Error) | mean(\|y - ŷ\|) | 1,610 kg/ha | Average prediction error |
| RMSE (Root Mean Squared Error) | sqrt(mean((y - ŷ)²)) | 3,574 kg/ha | Penalizes large errors |
| Cross-Validation R² | mean(5-fold R²) | 0.9603 | Generalization performance |
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# Predictions on test set
y_pred = model.predict(X_test)

# Calculate metrics
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"R² Score: {r2:.4f}")   # 0.9627
print(f"MAE: {mae:.2f}")       # 1610.29
print(f"RMSE: {rmse:.2f}")     # 3573.78
Prediction Analysis Visualization
The scatter plot below shows actual vs predicted yields on the test set. Points close to the diagonal line indicate accurate predictions.
Figure: Actual vs Predicted yield showing R² = 0.9627
Outlier Analysis
Before creating the synthetic dataset, extensive outlier analysis was performed on the original heterogeneous datasets to understand data quality issues.
Issues Identified in Original Data
| Issue | Affected Records | Resolution |
|---|---|---|
| Temperature > 50 °C (Fahrenheit values) | ~800 records | Identified; led to synthetic data generation |
| Missing Crop + Yield combinations | All 7,109 records | No row had both Crop AND Yield values |
| Inconsistent feature coverage | Variable | 11-99% coverage across features |
Outlier Analysis Visualization
Figure: Outlier detection analysis on original dataset
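The screening code itself is not part of this page; below is a minimal sketch of the kind of rule-based and IQR checks described above. IQR screening is an assumed technique here, the file name is hypothetical, and column names follow the post-mapping schema:

import pandas as pd

df = pd.read_csv('merged_original.csv')  # hypothetical merged original data

# Rule-based screen: Celsius readings above 50 are almost certainly Fahrenheit values
suspect_fahrenheit = df[df['Temperature_C'] > 50]
print(len(suspect_fahrenheit))  # ~800 in the original data

# Generic IQR screen for a numeric column
q1, q3 = df['Temperature_C'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Temperature_C'] < q1 - 1.5 * iqr) |
              (df['Temperature_C'] > q3 + 1.5 * iqr)]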
Decision: Synthetic Data Generation
Due to the fundamental data quality issues (no record having both crop type AND yield), the decision was made to generate a synthetic dataset with:
- 75,000 complete records with realistic correlations
- 22 crop types with appropriate base yields
- 20 Indian states with regional climate variations
- All features populated with domain-appropriate values
Feature Importance
Feature importance was extracted from the trained Gradient Boosting model. The top features driving predictions are crop type, year, and rainfall, followed by management and soil parameters (see the rankings below).
Top Feature Importance Rankings
Key Insights
- Crop Type (~25%) - Most critical factor; different crops have vastly different base yields
- Year (~18%) - Captures annual variations and agricultural improvements over time
- Rainfall (~12%) - Water availability is crucial for crop growth
- Fertilizer Amount (~11%) - Optimal fertilization significantly impacts yield
- Temperature (~9%) - Each crop has optimal temperature ranges
- Soil Quality (~7%) - Foundation for healthy crop growth
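Rankings like these can be read directly off the fitted estimator via its feature_importances_ attribute; a sketch, assuming feature_names holds the 22 training columns in order:

import pandas as pd

importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(6))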
API Reference
The Flask REST API exposes endpoints for health checks, feature information, and predictions. All responses are JSON formatted.
Base URL
http://localhost:5000
Endpoints
- Health check: verifies the API is running and the model is loaded.
- Features: returns the list of required input features with descriptions and valid ranges.
- Model info: returns model metadata including name, performance metrics, and training date.
- Predict (POST /predict): makes a single yield prediction from a JSON body of feature values.
- Batch predict: makes batch predictions for multiple samples from an array of feature objects.
Example: Single Prediction
curl -X POST http://localhost:5000/predict \
-H "Content-Type: application/json" \
-d '{
"Crop": "Rice",
"State": "Punjab",
"Year": 2024,
"Rainfall_mm": 1200,
"Temperature_C": 28,
"Humidity": 70,
"Soil_Quality": 75,
"Nitrogen": 45,
"Phosphorus": 35,
"Potassium": 42,
"Fertilizer_Amount": 180,
"Irrigation_Type": "Canal",
"Seed_Variety": "Hybrid",
"Pesticide_Usage": 8.5
}'
Example Response
{
"status": "success",
"predicted_yield_kg_per_hectare": 4256.78,
"crop": "Rice",
"state": "Punjab"
}
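app.py itself is not reproduced on this page; below is a minimal sketch of what its /predict handler could look like, reusing the predict() helper from the Code Samples section below. The error response shape is an assumption consistent with the client examples:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict_endpoint():
    try:
        payload = request.get_json(force=True)
        yield_pred = predict(payload)  # helper from "Load Model for Local Inference"
        return jsonify({
            'status': 'success',
            'predicted_yield_kg_per_hectare': round(float(yield_pred), 2),
            'crop': payload.get('Crop'),
            'state': payload.get('State'),
        })
    except Exception as exc:
        return jsonify({'status': 'error', 'error': str(exc)}), 400

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)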
Deployment
Local Development
# Activate virtual environment
source .venv/Scripts/activate # Windows
# source .venv/bin/activate # Linux/Mac
# Install dependencies
pip install -r api/requirements.txt
# Run Flask development server
cd Phase-2/api
python app.py
# Server runs on http://localhost:5000
Production Deployment (Gunicorn)
# Install gunicorn
pip install gunicorn
# Run with gunicorn (4 workers)
cd Phase-2/api
gunicorn -w 4 -b 0.0.0.0:5000 app:app
Docker Deployment
FROM python:3.11-slim
WORKDIR /app
COPY api/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY api/ ./api/
COPY model/ ./model/
EXPOSE 5000
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:5000", "api.app:app"]
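Assuming the Dockerfile sits at the project root and gunicorn is listed in api/requirements.txt, build and run are the usual two commands (the image tag is illustrative):

# Build the image
docker build -t crop-yield-api .
# Run the container, publishing the API port
docker run -p 5000:5000 crop-yield-api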
Code Samples
Python Client
import requests
API_URL = "http://localhost:5000"
def predict_yield(features: dict) -> float:
    """
    Make a yield prediction via the API.

    Args:
        features: Dictionary of input features

    Returns:
        Predicted yield in kg/hectare
    """
    response = requests.post(
        f"{API_URL}/predict",
        json=features
    )
    result = response.json()
    if result['status'] == 'success':
        return result['predicted_yield_kg_per_hectare']
    else:
        raise Exception(result.get('error', 'Unknown error'))
# Example usage
features = {
"Crop": "Rice",
"State": "Punjab",
"Year": 2024,
"Rainfall_mm": 1200,
"Temperature_C": 28,
"Humidity": 70,
"Soil_Quality": 75,
"Fertilizer_Amount": 180,
"Irrigation_Type": "Canal",
"Seed_Variety": "Hybrid"
}
predicted_yield = predict_yield(features)
print(f"Predicted Yield: {predicted_yield:.2f} kg/ha")
JavaScript/Fetch Client
async function predictYield(features) {
  const response = await fetch('http://localhost:5000/predict', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(features)
  });
  const result = await response.json();
  if (result.status === 'success') {
    return result.predicted_yield_kg_per_hectare;
  } else {
    throw new Error(result.error || 'Unknown error');
  }
}
// Usage
const features = {
Crop: "Rice",
State: "Punjab",
Year: 2024,
Rainfall_mm: 1200,
Temperature_C: 28,
Humidity: 70,
Soil_Quality: 75,
Irrigation_Type: "Canal",
Seed_Variety: "Hybrid"
};
predictYield(features)
.then(yieldValue => console.log(`Predicted: ${yieldValue} kg/ha`))
.catch(err => console.error(err));
Load Model for Local Inference
import joblib
import numpy as np
# Load artifacts
model = joblib.load('model/model.pkl')
scaler = joblib.load('model/scaler.pkl')
label_encoders = joblib.load('model/label_encoders.pkl')
features = joblib.load('model/feature_list.pkl')
def predict(input_dict):
    """Make a prediction locally, without the API."""
    # Encode categorical features using the saved encoders
    for col in ['Crop', 'State', 'Irrigation_Type', 'Seed_Variety']:
        if col in input_dict:
            input_dict[f'{col}_Encoded'] = label_encoders[col].transform([input_dict[col]])[0]
    # Build feature vector (features missing from input default to 0)
    X = np.array([[input_dict.get(f, 0) for f in features]])
    # Preprocess and predict; negative predictions are clamped to 0
    X_scaled = scaler.transform(X)
    return max(0, model.predict(X_scaled)[0])
# Usage
result = predict({
"Crop": "Rice", "State": "Punjab", "Year": 2024,
"Rainfall_mm": 1200, "Temperature_C": 28,
"Humidity": 70, "Soil_Quality": 75
})
print(f"Yield: {result:.2f} kg/ha")