
Decoding Movie Genres with Vision

A Machine Learning Approach to Predicting Movie Genres from Posters

Computer Vision

Role: Data Engineer

Duration: Spring 2025

Overview

This project explores whether we can predict a movie’s genre using only its poster art, without relying on text metadata. Movie posters are rich in visual cues but present challenges: multiple genre tags, class imbalance, and inconsistent resolutions. Our goal was to build and compare models that classify posters into their dominant genre, balancing accuracy, speed, and implementation complexity.

Challenge

Posters come in wildly varying resolutions, often carry multiple genre tags, and exhibit severe class imbalance, making single-label, purely visual prediction especially difficult.

Goal 

Develop a robust classifier that distills each poster down to its dominant genre, balancing performance, speed, and implementation complexity across simple models (Logistic/SVM) and fine-tuned CNNs.

Application 

Enable smarter content recommendation and automated cataloging by inferring genre information directly from poster art, with no text metadata required.

Data & Labels 

Preparation

  • 53,000+ posters (2000–2024, IMDb rating > 7.0)

  • Curated down to 29,265 single-labeled posters

  • Cosine similarity between semantic embeddings of movie plots and the genre clusters was used to assign each film its closest dominant genre (see the sketch after this list)

  • 21 genres consolidated into clusters (e.g., Thriller | Horror | Mystery, Action | Adventure | Animation)

  • Data cleaned to remove non-posters, missing labels, and duplicates
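
As a hedged illustration of the labeling step, the sketch below assigns each film a dominant genre cluster by cosine similarity between its plot embedding and the cluster descriptions. It assumes the sentence-transformers library; the model name and cluster strings are illustrative placeholders, not the project's exact setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative subset of the consolidated genre clusters.
GENRE_CLUSTERS = [
    "Thriller | Horror | Mystery",
    "Action | Adventure | Animation",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Embed the cluster descriptions once; unit-norm so dot product = cosine.
genre_emb = model.encode(GENRE_CLUSTERS, normalize_embeddings=True)

def dominant_genre(plot: str) -> str:
    """Return the genre cluster most cosine-similar to the plot summary."""
    plot_emb = model.encode([plot], normalize_embeddings=True)
    sims = plot_emb @ genre_emb.T
    return GENRE_CLUSTERS[int(np.argmax(sims))]

print(dominant_genre("A masked killer stalks a small town during a blackout."))
```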


Extracting Features

We combined classical descriptors and deep embeddings (a short extraction sketch for the classical descriptors follows the list):

  • HSV Color Histograms – captured overall palette (e.g., Sci-Fi posters with vibrant tones vs. War posters with muted colors)

  • HOG (Histogram of Oriented Gradients) – highlighted silhouettes and text edges

  • ResNet50 CNN – focused on fine textures and shapes

  • Vision Transformers (ViT) – learned global contextual relationships across poster patches
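
The two classical descriptors are cheap to reproduce. Below is a minimal sketch of the HSV-histogram and HOG extraction, assuming OpenCV and scikit-image; the bin count, resize dimensions, and HOG parameters are illustrative defaults, not the project's tuned values.

```python
import cv2
import numpy as np
from skimage.feature import hog

def hsv_histogram(image_bgr: np.ndarray, bins: int = 32) -> np.ndarray:
    """Concatenated, L1-normalized per-channel HSV histograms (palette cue)."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    ranges = [(0, 180), (0, 256), (0, 256)]   # OpenCV hue runs 0-179
    feats = [
        cv2.calcHist([hsv], [c], None, [bins], list(ranges[c])).ravel()
        for c in range(3)
    ]
    h = np.concatenate(feats).astype(np.float64)
    return h / (h.sum() + 1e-8)

def hog_features(image_bgr: np.ndarray) -> np.ndarray:
    """HOG over a fixed-size grayscale poster (silhouettes and text edges)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (128, 256))       # width x height; normalizes resolution
    return hog(gray, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2), feature_vector=True)
```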


Modeling

The baseline model, which predicted all classes equally, achieved only 10% accuracy, establishing a low benchmark for comparison. Among the traditional pipelines, a support vector classifier (SVC) produced the best results when combined with Vision Transformer (ViT) features and HSV histograms. After balancing classes, applying PCA, and tuning via GridSearch with stratified cross-validation, this approach achieved a validation accuracy of 30.4% and a test accuracy of 25.7%. Logistic Regression performed more modestly, with HSV features yielding its best outcome of 17.4% validation accuracy and 15.3% test accuracy.
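
For concreteness, the supervised pipeline can be sketched in scikit-learn as below. The feature matrix is a random stand-in for the concatenated ViT + HSV vectors, and the PCA size and grid values are illustrative, not the tuned settings behind the numbers above.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Stand-in for the per-poster [ViT | HSV] feature matrix and genre labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 800))
y_train = rng.integers(0, 10, size=200)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=64)),             # compress before the kernel SVM
    ("svc", SVC(class_weight="balanced")),     # reweights rare genres
])

grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__kernel": ["rbf", "linear"]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```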

By contrast, the CLIP Zero-shot model proved to be the strongest overall. Using the ViT-B/32 variant, it generated genre-specific text prompts (e.g., “a movie poster for a <genre> film”), tokenized them, and compared embeddings via cosine similarity, achieving the highest performance at 39% test accuracy. Finally, a ResNet50 CNN was also evaluated by resizing poster images to 224×224 pixels, applying a preprocessing and downsampling pipeline, and training with a GlobalAveragePooling layer followed by dropout. After hyperparameter tuning, this model achieved a test accuracy of 24.5%.
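
The zero-shot procedure maps directly onto OpenAI's CLIP package, sketched below; the genre names and the poster path are placeholders, and the prompt template follows the description above.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

genres = ["thriller", "action", "comedy", "drama"]   # illustrative subset
prompts = clip.tokenize(
    [f"a movie poster for a {g} film" for g in genres]
).to(device)

image = preprocess(Image.open("poster.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(prompts)
    img_emb /= img_emb.norm(dim=-1, keepdim=True)     # unit-normalize so the
    txt_emb /= txt_emb.norm(dim=-1, keepdim=True)     # dot product is cosine
    sims = (img_emb @ txt_emb.T).squeeze(0)

print(genres[int(sims.argmax())])                     # most similar genre prompt
```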

Overall, the results highlight CLIP Zero-shot as the top-performing model, with the ViT + HSV pipeline offering the most competitive results among traditional supervised approaches. Logistic Regression and ResNet50 provided moderate improvements over the baseline, but neither matched the strength of CLIP.


Results

Best Performing Model: CLIP Zero-shot (~39% test accuracy)

Fastest Models: HSV/HOG baselines (millisecond extraction, ~15% accuracy)

Key Challenges:

  • Many posters belong to multiple genres → single-label training limited performance

  • Minority genres under-represented → class imbalance reduced recall


Computer Vision: Next Steps

  • Multi-Label Training

    • Align the objective with real poster semantics by predicting all applicable genres, not just one (see the sketch after this list).

  • End-to-End Transformer Fine-Tuning

    • Unfreeze and train larger ViT variants (e.g., ViT/16) on poster data for richer, domain-specific representations.

  • Genre-Aware Augmentation

    • Use adaptive color jitter, random erasing, or GAN-based style transfers to simulate varied poster styles.

  • Domain Adaptation & Metric Learning

    • Emphasize genre-specific visual cues via adversarial domain alignment or triplet-loss training.

  • Multimodal Fusion

    • Combine poster imagery with plot embeddings, keywords, or trailer frames to push accuracy beyond 30–40%.
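
To make the first item concrete: multi-label training replaces the single-label softmax with per-genre sigmoid outputs trained under binary cross-entropy, so a poster can activate several genres at once. A minimal PyTorch sketch, with the embedding size, batch, and threshold as illustrative placeholders:

```python
import torch
import torch.nn as nn

num_genres = 21
head = nn.Linear(768, num_genres)       # e.g., a head over ViT poster embeddings

# Stand-in batch: 8 posters, each allowed to carry several genre targets.
features = torch.randn(8, 768)
targets = torch.zeros(8, num_genres)
targets[:, [0, 3]] = 1.0                # genres 0 and 3 active for each poster

criterion = nn.BCEWithLogitsLoss()      # one independent binary task per genre
logits = head(features)
loss = criterion(logits, targets)
loss.backward()

preds = torch.sigmoid(logits) > 0.5     # per-genre threshold at inference
```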

The bottom line: deep embeddings unlock gains, but breaking the 30% barrier on single-genre poster classification will demand richer labels, smarter augmentations, and seamless integration of multiple modalities.

This is a portfolio of Averine Sanduku's work. Please attribute my work if you are inspired by the material. Thank you!

© 2025 by Averine Sanduku. All rights reserved.
