Agricultural Crop Recommendation

Predictive Analytics · Recommendation · Machine Learning

ANALYTICS

Damilola Oshungbohun

1/1/2025 · 3 min read

The problem

Farmers making crop selection decisions based on intuition or regional habit frequently plant in suboptimal soil and climate conditions, resulting in reduced yields, wasted fertilizer, and avoidable crop failure. Precision agriculture addresses this by replacing guesswork with sensor data, but raw sensor readings across seven environmental dimensions are not immediately interpretable without a system to translate them into a recommendation.

This project builds that translation layer: an end-to-end machine learning pipeline that ingests soil and climate sensor readings and recommends the optimal crop for a given farming environment, across 22 crop types, with a confidence score attached to every recommendation.

The data

The dataset contains 2,200 records, each representing a unique set of sensor readings: Nitrogen, Phosphorus, and Potassium levels in the soil (mg/kg), ambient temperature (°C), relative humidity (%), soil pH, and annual rainfall (mm). The target variable is the crop type, with 22 classes including rice, cotton, coffee, apple, banana, and various legumes. The dataset is perfectly balanced at exactly 100 records per crop, which removes class imbalance as a source of classifier bias without any resampling.

Exploratory analysis

Before modelling, each sensor's distribution was examined for skew, outliers, and realistic operating ranges. A correlation heat map showed that Phosphorus and Potassium are the most strongly correlated pair (r = 0.74), while all other feature pairs were only weakly correlated, so multicollinearity is not a significant concern. A box plot of Nitrogen across all 22 crops showed clearly different distributions from crop to crop, supporting the core assumption the classifier relies on: that crops are separable by their environmental profiles.
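The correlation check described above can be sketched as follows. This is a minimal illustration, not the project's code: the column names are assumed, and the table is random synthetic data in which K is deliberately made to track P so that one pair stands out.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sensor table. Column names are assumptions
# and the values are random, not the project's data.
rng = np.random.default_rng(0)
cols = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]
df = pd.DataFrame(rng.normal(size=(200, 7)), columns=cols)

# Make K track P so the toy data contains one clearly correlated pair.
df["K"] = 0.8 * df["P"] + 0.2 * rng.normal(size=200)

corr = df.corr()  # Pearson correlation matrix (7 x 7)

# Keep the strict upper triangle so each pair appears once,
# then rank the pairs by absolute correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
pairs = pairs.sort_values(key=np.abs, ascending=False)

strongest_pair = pairs.index[0]
print(strongest_pair, round(float(pairs.iloc[0]), 2))
```

Passing `corr` to seaborn, for example `sns.heatmap(corr, annot=True)`, would render the same matrix as a heat map.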

Feature engineering

Four composite features were engineered from agronomic domain knowledge: total NPK load, N-to-P ratio, N-to-K ratio, and a climate index combining temperature and humidity. A second Random Forest was trained on this expanded 11-feature set and evaluated on an identical stratified holdout split to directly measure the contribution of those engineered features.
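A minimal sketch of the four composite features is shown below. The column names, the small epsilon used to avoid division by zero, and especially the climate index formula are assumptions: the write-up says only that the index combines temperature and humidity, so the product used here is a placeholder. The sample row is illustrative, not a record from the dataset.

```python
import pandas as pd

def add_composite_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with four composite columns appended."""
    out = df.copy()
    eps = 1e-6  # assumption: guards against division by zero in the ratios

    out["npk_total"] = out["N"] + out["P"] + out["K"]  # total NPK load
    out["n_to_p"] = out["N"] / (out["P"] + eps)        # N-to-P ratio
    out["n_to_k"] = out["N"] / (out["K"] + eps)        # N-to-K ratio

    # ASSUMPTION: placeholder formula. The original index is not specified
    # beyond "combining temperature and humidity".
    out["climate_index"] = out["temperature"] * out["humidity"] / 100.0
    return out

# Illustrative reading only; not taken from the dataset.
sample = pd.DataFrame([{
    "N": 90, "P": 42, "K": 43,
    "temperature": 20.0, "humidity": 80.0, "ph": 6.5, "rainfall": 200.0,
}])
engineered = add_composite_features(sample)
print(engineered.shape)  # 7 raw columns plus 4 composites
```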

The baseline model on the 7 raw sensors (99.55% accuracy, 99.59% CV mean) edged out the engineered model (99.32% accuracy, 99.45% CV mean). On the 440-record holdout (20% of 2,200) that accuracy gap corresponds to a single prediction, so the practical conclusion is that the engineered features added no measurable benefit. This is reported as a finding, not a failure: the raw sensor readings already contain sufficient signal to separate all 22 crop classes without transformation. Feature engineering decisions should be validated with data rather than assumed to help, and here the simpler model is the correct deployment choice.

Model evaluation

Three models were trained and compared on identical stratified 80/20 splits: a Random Forest on raw features, a Random Forest on engineered features, and a Gradient Boosting classifier on engineered features. All three were also evaluated with 10-fold stratified cross-validation, which yields a standard deviation as well as a mean accuracy. A small standard deviation indicates that each model's performance is stable across different subsets of the data rather than the product of a fortunate random split.
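The evaluation protocol can be sketched with scikit-learn as below. The data is synthetic (`make_blobs` clusters standing in for the 22 crops), the random seeds are arbitrary, and running cross-validation on the full dataset rather than on the training portion alone is an assumption, since the write-up does not specify it. The scores it prints say nothing about the project's results.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, train_test_split,
)

# Synthetic stand-in: 22 clusters, 7 features, 30 samples per class.
X, y = make_blobs(n_samples=660, centers=22, n_features=7, random_state=0)

# Stratified 80/20 holdout keeps every class equally represented in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# 10-fold stratified cross-validation gives a spread as well as a mean.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=42), X, y, cv=cv
)

print(f"holdout accuracy: {test_accuracy:.4f}")
print(f"CV mean: {cv_scores.mean():.4f}  CV std: {cv_scores.std():.4f}")
```

Reusing the same `random_state` for the split is what makes the comparison between models fair: every candidate is scored on exactly the same held-out records.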

The selected model, the baseline Random Forest with 200 estimators, achieved 99.55% test accuracy and 99.59% mean cross-validation accuracy with a standard deviation of just 0.38%. A full 22-class confusion matrix confirmed that only jute and maize showed any misclassification, with precision of 0.95 for both. All other 20 crop types were classified perfectly.

What drives the recommendation

Feature importance analysis on the Random Forest revealed that rainfall and humidity are the two most influential predictors, together accounting for approximately 43% of the model's total feature importance. This has a clear agronomic interpretation: while soil nutrients like Nitrogen and Potassium can be adjusted through fertilisation, the natural water availability and ambient moisture of a region set the fundamental ceiling on which crops can realistically thrive. Among soil nutrients, Potassium and Phosphorus outrank Nitrogen as discriminators across the 22 crop profiles.
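A ranking of this kind can be read from a fitted Random Forest as sketched below. The sketch assumes scikit-learn's built-in impurity-based `feature_importances_` was the measure used, which the write-up does not state. The data is synthetic, so the ranking it produces has no agronomic meaning; only the mechanics carry over.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumed column names, attached to synthetic features for illustration.
feature_names = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]
X, y = make_classification(
    n_samples=400, n_features=7, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Impurity-based importances are normalised to sum to 1, so a sum over
# any subset can be read directly as a share of the total.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

top_two_share = importances.iloc[:2].sum()
print(importances.round(3))
print(f"top two features: {top_two_share:.1%} of total importance")
```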

Deployment

A recommend_crop() function wraps the trained model and accepts the 7 raw sensor readings as inputs, returning the top recommended crop, its confidence percentage, and the top 3 ranked alternatives with individual probabilities. Because the selected model was trained on the raw readings, the function needs no feature engineering step at inference time; it only has to supply the readings in the same order used during training. It is designed to be integrated directly into a farm management dashboard or IoT sensor pipeline. An interactive Tableau dashboard was built alongside the Python pipeline to visualise sensor profiles, model performance, and a live recommendation panel driven by parameter sliders.
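One way such a wrapper could look is sketched below. The signature, the return format, and the feature order are hypothetical. The sketch assumes the deployed model is the baseline trained on raw readings, so no engineered features are computed, and it interprets the ranked output as the three highest-probability crops. The toy model is trained on synthetic clusters with placeholder labels, not on real crop data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def recommend_crop(model, N, P, K, temperature, humidity, ph, rainfall, top_n=3):
    """Hypothetical wrapper: rank crops by predicted class probability."""
    # The reading order must match the column order used at training time.
    row = np.array([[N, P, K, temperature, humidity, ph, rainfall]], dtype=float)
    proba = model.predict_proba(row)[0]

    # Indices of the top_n classes, highest probability first.
    order = np.argsort(proba)[::-1][:top_n]
    ranked = [(str(model.classes_[i]), round(float(proba[i]) * 100, 2)) for i in order]
    return {
        "crop": ranked[0][0],
        "confidence_pct": ranked[0][1],
        "top_candidates": ranked,
    }

# Toy model on synthetic, well-separated clusters with placeholder labels.
rng = np.random.default_rng(0)
labels = ["crop_a", "crop_b", "crop_c", "crop_d"]
centers = np.array([[10.0] * 7, [40.0] * 7, [70.0] * 7, [100.0] * 7])
X = np.vstack([c + rng.normal(scale=2.0, size=(50, 7)) for c in centers])
y = np.repeat(labels, 50)
toy_model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

result = recommend_crop(toy_model, *centers[0])
print(result["crop"], result["confidence_pct"])
```

The confidence value is the share of probability mass the forest assigns to the top class, so it is only as trustworthy as the model's calibration on readings that resemble the training data.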

Tools Used: Python, scikit-learn, pandas, NumPy, matplotlib, seaborn, Tableau
