Housing Pricing Prediction
A model to predict resale prices of HDB flats in Singapore.
DataAnalysis
MachineLearning
SciKitLearn
PyTorch
TensorFlow
Pandas
Numpy
TimeSeries
XGBoost
CatBoost

This project develops time series forecasting models for predicting housing prices in Singapore's HDB market using transaction data and geographical information.
Project Overview
The notebooks are numbered sequentially (00-15): notebooks 00-10 are in the src directory and 11-15 are in the root directory. The workflow runs from data collection and cleaning, through feature engineering, to model building and evaluation.

Data Cleaning & Processing
- Scraped MRT stations, shopping malls, and primary schools with their coordinates and opening dates
- Cleaned and merged HDB resale flat prices datasets
- Geocoded each HDB flat to latitude and longitude coordinates using the OneMapSG API
- Standardized dates and values across multiple datasets
- Processed SORA (Singapore Overnight Rate Average) dataset for financial context
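The date standardization and SORA merge described above could look roughly like the following. This is a minimal sketch, not the project's actual code: the column names (`month`, `resale_price`, `date`, `sora`), the sample values, and the choice of a monthly mean are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical resale records; "month" is assumed to hold YYYY-MM strings.
resale = pd.DataFrame({
    "month": ["2024-01", "2024-01", "2024-02"],
    "town": ["BEDOK", "YISHUN", "BEDOK"],
    "resale_price": [520000.0, 480000.0, 535000.0],
})

# Hypothetical daily SORA observations (percent per annum).
sora = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-03", "2024-02-01", "2024-02-02"],
    "sora": [3.60, 3.70, 3.50, 3.70],
})

# Standardize both date columns to proper timestamps.
resale["month"] = pd.to_datetime(resale["month"], format="%Y-%m")
sora["date"] = pd.to_datetime(sora["date"], format="%Y-%m-%d")

# Aggregate the daily rates to one mean value per calendar month,
# indexed by the first day of the month ("MS" = month start).
sora_monthly = (
    sora.set_index("date")["sora"]
    .resample("MS")
    .mean()
    .rename("sora_monthly_mean")
    .reset_index()
    .rename(columns={"date": "month"})
)

# A left join keeps every transaction even if a month has no SORA value;
# validate guards against accidental duplicate months on the right side.
merged = resale.merge(sora_monthly, on="month", how="left", validate="many_to_one")
print(merged)
```

Aggregating to a monthly mean is only one option; a month-end rate or a lagged rate may suit a forecasting setup better.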
Feature Engineering
- Identified MRT stations within 1km radius of each HDB flat
- Calculated Build-To-Order (BTO) flat supply within 4km radius for each transaction
- Identified nearby malls within 1km radius
- Measured distance to Central Business District
- Created POI density vectors using word embeddings
- Applied scaling for numerical features and one-hot encoding for categorical features
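The radius-based features above (stations within 1km, distance to the Central Business District) can be sketched with a haversine distance. The snippet is illustrative only: the coordinates are made up, the CBD reference point is an assumed location near Raffles Place, and the project may have used a different distance method such as projected coordinates in GeoPandas.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs are decimal degrees and broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical (lat, lon) pairs, for illustration only.
flats = np.array([[1.3000, 103.8000],
                  [1.3500, 103.9000]])
stations = np.array([[1.3000, 103.8050],   # about 0.56 km east of flat 0
                     [1.3080, 103.8000],   # about 0.89 km north of flat 0
                     [1.3500, 103.9200]])  # about 2.2 km east of flat 1

# Pairwise distance matrix of shape (n_flats, n_stations) via broadcasting.
dist = haversine_km(flats[:, None, 0], flats[:, None, 1],
                    stations[None, :, 0], stations[None, :, 1])

# Count of stations within a 1 km radius of each flat.
mrt_within_1km = (dist <= 1.0).sum(axis=1)

# Assumed CBD reference point (roughly Raffles Place); not taken from the project.
CBD = (1.2840, 103.8510)
dist_to_cbd_km = haversine_km(flats[:, 0], flats[:, 1], CBD[0], CBD[1])

print(mrt_within_1km, dist_to_cbd_km.round(2))
```

The same distance matrix works for malls or BTO projects by swapping the second coordinate array and the radius. For the full transaction set, a spatial index would scale better than a dense matrix.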
Model Building & Comparison
[Figures: XGBoost Regressor Performance; Random Forest Using OOB and Cross-Validation; CatBoost Gradient Boosting Analysis]
Several machine learning models were implemented and compared:
- XGBoost Regressor: Trained on working dataset and evaluated on 2024 resale prices
- Random Forest: Evaluated using both the out-of-bag (OOB) estimate and 10-fold cross-validation
- CatBoost: Gradient boosting algorithm trained on cleaned and normalized data
- GNNWR (Geographically Neural Network Weighted Regression): Neural-network-based geospatial model that incorporates each flat's latitude and longitude
- LSTM: Three-layered neural network with dropout regularization for capturing temporal dependencies
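To illustrate the two Random Forest evaluation routes listed above, the sketch below compares the out-of-bag R² with a 10-fold cross-validated R² in scikit-learn. It runs on synthetic data from `make_regression`, not the HDB dataset, and the hyperparameters are placeholder assumptions rather than the project's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the engineered features and resale prices.
X, y = make_regression(n_samples=300, n_features=5, n_informative=5,
                       noise=10.0, random_state=0)

# OOB estimate: each row is scored only by trees whose bootstrap sample
# did not contain it, so no separate validation set is needed.
rf_oob = RandomForestRegressor(n_estimators=100, oob_score=True,
                               bootstrap=True, random_state=0)
rf_oob.fit(X, y)
oob_r2 = rf_oob.oob_score_

# 10-fold cross-validation: ten separate fits, each scored on a held-out fold.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X, y, cv=cv, scoring="r2",
)
cv_r2 = cv_scores.mean()

print(f"OOB R^2: {oob_r2:.3f}  10-fold CV R^2: {cv_r2:.3f}")
```

The two estimates are usually close, and OOB needs only a single fit. Note that shuffled folds ignore time order; when the goal is to evaluate on a later period such as 2024 transactions, a chronological hold-out is the more faithful check.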
Technology Stack
- Python with Pandas, NumPy, and GeoPandas for data manipulation
- Scikit-Learn, PyTorch, and TensorFlow for machine learning models
- XGBoost, CatBoost, and custom neural networks for advanced modeling
- spaCy for natural language processing and word embeddings
- Data visualization libraries for interactive charts and model evaluation