From Curiosity to Capability

Ten progressive data science projects that bridge soil science domain expertise with modern ML, analytics, and engineering. Each project targets the skills most in demand for junior Data Scientist roles.

Projects: 01 EDA, 02 SQL, 03 Regression, 04 Classification, 05 Time Series, 06 NLP, 07 Clustering, 08 Deep Learning, 09 Dashboard, 10 MLOps
Status legend: Completed, In Progress, Planned
01
Completed
EDA & Visualization

Soil Health Indicator Explorer

Exploratory data analysis of soil property datasets covering pH, organic matter, nutrient content, and texture. Uncover patterns, distributions, and correlations through statistical summaries and visualizations.

Key Skills Developed
  • Data cleaning, handling missing values & outliers
  • Statistical summaries & hypothesis testing
  • Publication-quality visualizations (matplotlib, seaborn)
Python, pandas, NumPy, matplotlib, seaborn, Jupyter
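A first pass over a dataset like this might look like the sketch below. The column names (`ph`, `organic_matter_pct`, `nitrogen_mg_kg`) and all values are made up for illustration and are not taken from the project's data; median imputation is just one simple way to handle the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical soil samples; column names and values are illustrative only.
soil = pd.DataFrame({
    "ph": [6.1, 6.8, 5.4, 7.2, np.nan, 6.5],
    "organic_matter_pct": [2.1, 3.4, 1.2, 4.0, 2.8, 3.1],
    "nitrogen_mg_kg": [14.0, 22.0, 9.0, 27.0, 18.0, np.nan],
})

# Count missing values per column before deciding how to handle them.
missing_counts = soil.isna().sum()

# Median imputation: a simple choice that is robust to outliers.
soil_clean = soil.fillna(soil.median(numeric_only=True))

# Statistical summary and pairwise Pearson correlations.
summary = soil_clean.describe()
correlations = soil_clean.corr()

print(missing_counts)
print(summary.loc[["mean", "std"]])
print(correlations.round(2))
```

From here, the correlation matrix could be passed to a seaborn heatmap and the individual columns to histograms.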
02
Completed
SQL & Data Engineering

Agricultural Research Database

Design and populate a relational database for agricultural field experiments. Practice writing complex analytical queries (JOINs, CTEs, window functions, and aggregations) on multi-table research data.

Key Skills Developed
  • Relational schema design & normalization
  • Advanced SQL: window functions, CTEs, subqueries
  • ETL basics: extracting, transforming, loading data
PostgreSQL, SQL, pandas, Database Design, ETL
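The kind of query this project practices can be sketched as follows. To keep the example runnable without a database server, it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL (window functions need SQLite 3.25 or newer). The `fields` and `yields` tables and their contents are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fields (field_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE yields (
    field_id INTEGER NOT NULL REFERENCES fields(field_id),
    year INTEGER NOT NULL,
    yield_t_ha REAL NOT NULL
);
INSERT INTO fields VALUES (1, 'North'), (2, 'South');
INSERT INTO yields VALUES
    (1, 2021, 4.0), (1, 2022, 5.0),
    (2, 2021, 3.0), (2, 2022, 2.5);
""")

query = """
WITH field_yields AS (            -- CTE: join measurements to field names
    SELECT f.name, y.year, y.yield_t_ha
    FROM yields AS y
    JOIN fields AS f ON f.field_id = y.field_id
)
SELECT
    name,
    year,
    yield_t_ha,
    -- Window function: rank fields within each year by yield.
    RANK() OVER (PARTITION BY year ORDER BY yield_t_ha DESC) AS yield_rank
FROM field_yields
ORDER BY year, yield_rank;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

The same CTE and window-function syntax should carry over to PostgreSQL; only the connection code would change.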
03
Completed
Supervised ML — Regression

Soil Organic Carbon Prediction

Build regression models to predict soil organic carbon (SOC) content from physical and chemical soil properties. Compare Linear Regression, Random Forest, and XGBoost using cross-validated metrics, and analyze feature importance.

Key Skills Developed
  • Feature engineering & feature selection
  • Model comparison, cross-validation, hyperparameter tuning
  • Interpretability: SHAP values & feature importance
Python, scikit-learn, XGBoost, SHAP, pandas
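A minimal sketch of the model comparison step, assuming synthetic data in place of real SOC measurements. The three features and the target relationship are invented for illustration; XGBoost and SHAP are left out to keep the example dependency-light.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples, 3 hypothetical soil features.
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validated R^2 for each model on the same data.
cv_r2 = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in cv_r2.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Scoring every model with the same folds and the same metric keeps the comparison fair; hyperparameter tuning would wrap each model in a search before this step.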
04
In Progress
Supervised ML — Classification

Crop Suitability Classifier

Classification model predicting optimal crop types from combined soil and climate features. Implement proper train/test splitting, handle class imbalance, and evaluate with precision, recall, F1, and ROC-AUC metrics.

Key Skills Developed
  • Binary & multiclass classification workflows
  • Handling imbalanced datasets (SMOTE, class weights)
  • Confusion matrix, ROC curves, threshold tuning
Python, scikit-learn, imbalanced-learn, matplotlib
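To make the evaluation metrics concrete, the sketch below computes precision, recall, and F1 from confusion-matrix counts in plain Python and shows how moving the decision threshold trades one for the other. The labels and scores are invented; in the project itself, scikit-learn's metric functions would normally do this work.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def predict_at(scores, threshold):
    """Turn predicted probabilities into 0/1 labels at a given threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical imbalanced sample: 3 positives out of 10.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.1, 0.45, 0.3, 0.05, 0.55, 0.2, 0.7]

for threshold in (0.5, 0.4):
    precision, recall, f1 = binary_metrics(y_true, predict_at(scores, threshold))
    print(f"threshold={threshold}: precision={precision:.2f}, "
          f"recall={recall:.2f}, f1={f1:.2f}")
```

Lowering the threshold from 0.5 to 0.4 recovers the one missed positive in this toy sample, which is the kind of effect threshold tuning looks for on imbalanced data.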
05
Planned
Time Series & Forecasting

Commodity Price Forecasting

Time series analysis and forecasting of agricultural commodity prices (wheat, corn, sunflower). Explore seasonality and trend decomposition, and compare ARIMA, SARIMA, and Prophet for multi-step-ahead forecasts.

Key Skills Developed
  • Stationarity testing (ADF), differencing, seasonal decomposition
  • ARIMA/SARIMA model selection (AIC, BIC, ACF/PACF)
  • Forecast evaluation: MAPE, RMSE, prediction intervals
Python, statsmodels, Prophet, pandas, matplotlib
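Two of the building blocks listed above, differencing and forecast error metrics, can be sketched in plain Python. The price series is invented, and the naive "repeat the last value" forecast is only a baseline that ARIMA, SARIMA, or Prophet would be expected to beat.

```python
import math


def difference(series, lag=1):
    """Lagged differences, a common step toward stationarity."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def rmse(actual, forecast):
    """Root mean squared error, in the units of the series."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, forecast):
    """Mean absolute percentage error; undefined if any actual value is 0."""
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical monthly prices; the last two points are held out.
prices = [200.0, 210.0, 205.0, 220.0, 230.0, 225.0]
train, test = prices[:4], prices[4:]

# Naive baseline: repeat the last training value for every step ahead.
naive_forecast = [train[-1]] * len(test)

print("first differences:", difference(prices))
print(f"naive RMSE: {rmse(test, naive_forecast):.2f}")
print(f"naive MAPE: {mape(test, naive_forecast):.2f}%")
```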
06
Planned
Natural Language Processing

Research Paper Topic Modeling

NLP pipeline for topic modeling on agricultural research abstracts. Build a corpus from open-access papers, preprocess text, apply TF-IDF vectorization, and discover latent themes using Latent Dirichlet Allocation (LDA).

Key Skills Developed
  • Text preprocessing: tokenization, stemming, lemmatization
  • TF-IDF, bag-of-words, word embeddings
  • Topic modeling (LDA), interactive pyLDAvis visualization
Python, NLTK, spaCy, scikit-learn, Gensim, pyLDAvis
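To show what TF-IDF weighting does before handing the work to a library, here is a plain-Python sketch. It assumes the textbook variant (term frequency times `log(N / df)`), which differs from the smoothed formula scikit-learn uses by default, and a deliberately naive whitespace tokenizer. The three "abstracts" are invented.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase, split on whitespace, keep alphabetic tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def tf_idf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical mini-corpus standing in for research abstracts.
docs = [
    "soil carbon storage in cropland soil",
    "nitrogen uptake in wheat",
    "soil microbial diversity",
]

weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

Terms that appear in fewer documents get a larger weight, so "carbon" outranks "soil" in the first document even though "soil" occurs twice there.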
07
Planned
Unsupervised Learning

Microbial Community Clustering

Apply unsupervised learning to soil microbiome data to discover natural community patterns. Use K-Means, hierarchical clustering, and DBSCAN, then visualize high-dimensional structure with PCA and t-SNE embeddings.

Key Skills Developed
  • Clustering algorithms: K-Means, DBSCAN, hierarchical
  • Dimensionality reduction: PCA, t-SNE, UMAP
  • Silhouette analysis, elbow method, cluster validation
Python, scikit-learn, SciPy, UMAP, seaborn
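A small sketch of cluster validation with silhouette scores, assuming synthetic data in place of real microbiome abundance tables. The two Gaussian groups are invented so that the expected answer is known; real community data would be far less clean.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Synthetic stand-in: two well-separated groups of 50 samples in 10 dimensions.
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 10))
group_b = rng.normal(loc=6.0, scale=1.0, size=(50, 10))
X = np.vstack([group_a, group_b])

# Fit K-Means for several k and record the mean silhouette score of each.
silhouettes = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouettes[k] = silhouette_score(X, labels)

best_k = max(silhouettes, key=silhouettes.get)

# Project to 2-D with PCA for plotting.
coords_2d = PCA(n_components=2).fit_transform(X)

for k, score in silhouettes.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

The 2-D coordinates can then be passed to a scatter plot colored by cluster label; t-SNE or UMAP would slot in where PCA is used here.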
08
Planned
Deep Learning & Computer Vision

Plant Stress Detection with CNN

Image classification model using Convolutional Neural Networks to detect plant stress and disease from leaf photographs. Leverage transfer learning with pre-trained models (ResNet, EfficientNet) and data augmentation techniques.

Key Skills Developed
  • CNN architecture, transfer learning, fine-tuning
  • Image preprocessing & data augmentation pipelines
  • GPU training, model checkpointing, early stopping
Python, TensorFlow, Keras, OpenCV, NumPy
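The early-stopping logic mentioned above can be illustrated without a deep learning framework. The sketch below is plain Python written for this illustration, not the Keras `EarlyStopping` callback itself, although its `patience` and `min_delta` parameters follow the same idea. The validation losses are invented.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: this is where a model checkpoint would be saved.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the stop condition.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.59]

stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best checkpoint from epoch {best_epoch}")
```

With a patience of 3 the run stops before the late improvement at the final epoch, which shows why patience is worth tuning alongside the learning rate.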
09
Planned
Dashboard & Web Application

Interactive Soil Health Dashboard

Build an interactive web dashboard with Streamlit and Plotly for exploring soil health data. Include filters, dynamic charts, geographic map visualizations, and responsive layouts that non-technical stakeholders can use.

Key Skills Developed
  • Streamlit app architecture & widget system
  • Interactive Plotly charts & geographic maps
  • Cloud deployment (Streamlit Community Cloud / Render)
Python, Streamlit, Plotly, pandas, Folium
10
Planned
MLOps & End-to-End Pipeline

Drought Prediction — Full ML Pipeline

Complete production-grade ML pipeline for predicting drought conditions: automated data ingestion, feature engineering, model training with MLflow tracking, Docker containerization, and REST API deployment with FastAPI.

Key Skills Developed
  • MLflow experiment tracking & model registry
  • Docker containerization & FastAPI serving
  • Git versioning, CI/CD basics, pipeline orchestration
Python, MLflow, Docker, FastAPI, Git, scikit-learn