Cine ML

A movie-rating regression system that combines TMDB metadata with verified IMDb labels and separates pre-release from post-release modeling.

01 / Overview

Cine ML is an end-to-end machine learning project that estimates IMDb movie ratings using TMDB movie metadata and verified IMDb labels collected from OMDb. The project includes resumable API data collection, feature engineering, model comparison, hyperparameter tuning, holdout evaluation, saved model artifacts, automated tests, and an interactive Streamlit dashboard.

02 / Problem

Movie rating prediction can be misleading when post-release audience signals are mixed with pre-release metadata and the scores are reported without that distinction. This project trains and evaluates pre-release feature models separately from post-release audience-signal models, so each reported result states what information the model had access to.
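One way to make this separation concrete is to declare the two feature sets explicitly and filter records by modeling stage. The sketch below is illustrative only: the column names, the assignment of `popularity` to the post-release group, and the `select_features` helper are assumptions, not the project's actual schema.

```python
# Hypothetical column names; the project's real schema may differ.
PRE_RELEASE_FEATURES = [
    "budget",
    "runtime",
    "release_year",
    "release_month",
    "original_language",
    "genres",
]

# Audience signals assumed to exist only after a film is released.
POST_RELEASE_ONLY = ["popularity", "revenue", "vote_average", "vote_count"]

POST_RELEASE_FEATURES = PRE_RELEASE_FEATURES + POST_RELEASE_ONLY


def select_features(record: dict, stage: str) -> dict:
    """Return only the fields allowed for the given modeling stage."""
    if stage == "pre_release":
        allowed = PRE_RELEASE_FEATURES
    elif stage == "post_release":
        allowed = POST_RELEASE_FEATURES
    else:
        raise ValueError(f"unknown stage: {stage!r}")
    return {name: record[name] for name in allowed if name in record}
```

Keeping the lists in one place means a pre-release model cannot silently pick up `vote_average` or `revenue` through a shared preprocessing step.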

03 / What I built

Built a complete Python regression pipeline using 4,803 cleaned TMDB movie records and 491 verified IMDb labels from OMDb.
Developed a resumable OMDb API collector with local caching, duplicate handling, missing-rating handling, and optional future dataset expansion.
Engineered movie features from budget, runtime, release date, language, genre, popularity, revenue, and audience-voting metadata.
Implemented genre multi-hot encoding, numeric feature transformations, train/test splitting, and model artifact generation.
Compared baseline, pre-release, engineered pre-release, post-release, and tuned audience-signal models.
Tuned an Extra Trees Regressor with 5-fold GridSearchCV, run only on the training partition.
Evaluated models on the untouched 20% holdout from an 80/20 train/test split using MAE, MSE, and R2.
Built an interactive Streamlit dashboard for predictions, error analysis, model results, and feature-importance visualization.
Added automated feature tests, saved predictions, metrics, and trained models as artifacts, and documented the work with model and data cards.
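The resumable collection step in the list above can be sketched as a loop that skips IDs already in a local cache and writes the cache after every new result, so an interrupted run picks up where it left off. This is a minimal sketch under stated assumptions, not the project's collector: `collect_ratings`, the JSON cache layout, and the injected `fetch` callable are hypothetical, and the only OMDb detail relied on is that a missing rating arrives as a non-numeric string such as "N/A" in the `imdbRating` field.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def collect_ratings(
    imdb_ids: Iterable[str],
    fetch: Callable[[str], dict],
    cache_path: str,
) -> Dict[str, Optional[float]]:
    """Fetch ratings for IDs not yet cached and persist after each one.

    `fetch` is a hypothetical callable that takes an IMDb ID and returns
    the decoded JSON response as a dict.
    """
    path = Path(cache_path)
    # Resume from a previous run if a cache file already exists.
    cache = json.loads(path.read_text()) if path.exists() else {}

    # dict.fromkeys drops duplicate IDs while preserving input order.
    for imdb_id in dict.fromkeys(imdb_ids):
        if imdb_id in cache:
            continue  # already collected, no API call needed
        payload = fetch(imdb_id)
        try:
            rating = float(payload.get("imdbRating", "N/A"))
        except (TypeError, ValueError):
            rating = None  # missing or non-numeric rating
        cache[imdb_id] = rating
        # Write after every record so an interruption loses at most one call.
        path.write_text(json.dumps(cache))
    return cache
```

Passing `fetch` in as a parameter keeps the loop testable without network access; records stored as `None` can be filtered out before they are used as labels.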

04 / Key results

The tuned audience-signal model achieved 0.239 MAE, 0.148 MSE, and 0.835 R2 on the held-out test set.
The model reduced MAE by 68.5% compared with the mean-prediction baseline.
Because the champion model uses TMDB vote_average, it should be interpreted as a post-release cross-platform rating estimator rather than a pre-release movie-quality forecast.
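For reference, the three metrics and the baseline comparison can be written out in a few lines. The sketch below uses plain Python and toy numbers rather than the project's data; the function names are illustrative, and it assumes the mean-prediction baseline predicts the training-set mean for every test movie.

```python
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mae_reduction(model_mae: float, baseline_mae: float) -> float:
    """Fractional MAE improvement over a baseline."""
    return 1 - model_mae / baseline_mae


# Toy example, not project data.
train_mean = 7.0  # assumed: baseline predicts the training mean
y_test = [6.0, 7.0, 8.0]
y_model = [6.5, 7.0, 7.5]
y_baseline = [train_mean] * len(y_test)

reduction = mae_reduction(mae(y_test, y_model), mae(y_test, y_baseline))
```

Under this definition, the reported 0.239 MAE and 68.5% reduction would correspond to a baseline MAE of roughly 0.76 (0.239 / 0.315).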

05 / Technical focus

This project demonstrates practical ML engineering beyond fitting a model. It includes reproducible data collection, feature engineering, model comparison, hyperparameter tuning, artifact management, automated testing, dashboard development, and honest interpretation of model limitations.

06 / Tech stack

Python, Pandas, NumPy, scikit-learn, Extra Trees Regression, Ridge Regression, GridSearchCV, Streamlit, Matplotlib, Seaborn, Altair, OMDb API, TMDB data, pytest, joblib, Jupyter Notebook, Feature engineering, Cross-validation, MAE / MSE / R2
