🌾 Crop Yield Prediction

AI-Powered Agricultural Yield Forecasting System

Machine Learning Project Report - Phase 2

By Pushkarjay Ajay

75,000
Synthetic Samples
22
Features
96.27%
R² Accuracy
4
Models Tested
1

📋 Project Overview

An end-to-end machine learning system that predicts crop yield based on environmental conditions, soil properties, and farm management practices. The system integrates multiple datasets, performs intelligent feature engineering, and serves predictions via a REST API.

📊
75K
Synthetic Records
🔧
22
ML Features
🤖
4
Models Trained
🎯
96.27%
Best R² Score

ML Pipeline Workflow

📥
Data Collection
🔄
Preprocessing
⚙️
Feature Engineering
🤖
Model Training
🚀
API Deployment
2

📈 Yield Distribution Analysis

Distribution of crop yields across the unified dataset shows a right-skewed pattern, with most yields concentrated between 200-800 kg/hectare.

Yield Distribution
Figure 1: Histogram showing the distribution of crop yields. Mean yield: ~636 kg/ha | Median: ~580 kg/ha
3

🔗 Feature Correlations

Correlation analysis reveals key relationships between features and yield. Fertilizer amount and irrigation schedule show strongest positive correlations.

Correlation Matrix
Figure 2: Correlation heatmap of all numerical features. Darker colors indicate stronger correlations.
4

🌾 Yield by Crop Type

Comparison of average yields across different crop types reveals significant variations. Sugarcane shows highest yields while pulses show lowest.

Crop Yield Comparison
Figure 3: Box plot comparing yield distributions across crop types. Sugarcane leads with highest average yield.
5

📍 Yield by State

Regional analysis shows yield variations across Indian states. Punjab and Haryana show consistently higher yields due to better irrigation infrastructure.

State Yield Comparison
Figure 4: Average crop yields by state. Northern states show higher yields on average.
6

🌤️ Weather & Soil Impact

Scatter plots reveal the relationship between key environmental factors and crop yield. Optimal ranges exist for each parameter.

Weather and Soil vs Yield
Figure 5: Multi-panel scatter plots showing how weather and soil parameters influence crop yield.
7

⭐ Feature Importance

Gradient Boosting model reveals the relative importance of each feature in predicting yield. Fertilizer amount and rainfall are the most influential predictors.

Feature Importance
Figure 6: Feature importance scores from the Gradient Boosting model. Top features: Fertilizer, Rainfall, Temperature.
7b

🎯 Model Prediction Analysis

Visualization of actual vs predicted yields shows the model's accuracy across different yield ranges. The tight clustering around the diagonal indicates strong predictive performance.

Prediction Analysis
Figure 7: Actual vs Predicted yield scatter plot with regression line. R² = 0.9627 indicates excellent model fit.
7c

🔍 Data Quality & Outlier Analysis

Before generating the synthetic dataset, the original data was analyzed for outliers. Issues like temperature values in Fahrenheit (>50°C) were identified and addressed.

Outlier Analysis
Figure 8: Outlier detection analysis showing data distribution and identified anomalies in the original dataset.
8

📊 Synthetic Dataset Info

A synthetic dataset of 75,000 records was generated to ensure complete data coverage with realistic correlations between agricultural features.

🟢 Complete Coverage (75K samples)

  • Crop 22 types
  • State 20 states
  • Year 2015-2024
  • Weather (7 features) 100%
  • Soil (7 features) 100%
  • Management (4 features) 100%

📈 Dataset Properties

  • Total Records 75,000
  • Features 27 cols
  • ML Features 22 used
  • Missing Values 0

🎯 Advantages

  • No imputation needed
  • Realistic correlations
  • Balanced distribution

Data Coverage - All Features at 100%

Weather Features 100%
Soil & Nutrient Features 100%
Management Features 100%
Location Features 100%
9

🤖 Model Comparison

Four regression algorithms were trained and evaluated using 5-fold cross-validation. Gradient Boosting achieved the best overall performance and was selected for production.

1
🏆 Gradient Boosting
0.9627
R² Score
1,610
MAE (kg/ha)
3,574
RMSE (kg/ha)

🎯 Why Gradient Boosting? Best CV R² (0.9603), faster inference, smaller model, and better outlier handling.

2
Random Forest
0.9594
R² Score
1,782
MAE (kg/ha)
3,728
RMSE (kg/ha)
3
Decision Tree
0.8834
R² Score
2,514
MAE (kg/ha)
6,314
RMSE (kg/ha)
4
Linear Regression
0.7156
R² Score
5,221
MAE (kg/ha)
9,866
RMSE (kg/ha)
10

🛠️ Technology Stack

Backend & ML

  • Python 3.11 Core
  • scikit-learn ML
  • pandas & numpy Data
  • Flask API
  • joblib Model I/O

Frontend & Viz

  • HTML5 / CSS3 UI
  • JavaScript (ES6+) Logic
  • matplotlib & seaborn Plots
  • Responsive Design Mobile

API Endpoints

Method Endpoint Description
GET / API information
GET /health Health check
GET /features List required features
POST /predict Single prediction
POST /predict-batch Batch predictions
11

✅ Summary & Next Steps

🎯 Achievements

  • Generated 75K synthetic dataset
  • 96.27% prediction accuracy (R²)
  • REST API with CORS support
  • Interactive dashboard
  • 22 crops, 20 states coverage
  • Total production calculator

🚀 Future Enhancements

  • Add crop type as ML feature
  • Include seasonal patterns
  • Geo-spatial predictions
  • Weather API integration
  • Mobile app development
  • Cloud deployment

🌾 Project Complete

Ready for production deployment and further enhancements