
Steps & Screens: Predicting Problematic Patterns


Predicting Problematic Internet Use in Children and Adolescents

Machine Learning

Role: Data Engineer

Duration: Fall 2024

U.C. Berkeley School of Information 


Highlights

  • Skills Demonstrated: Exploratory Data Analysis (EDA), Data Cleaning & Imputation (KNN, SMOTE), Feature Engineering, Supervised Machine Learning, Hyperparameter Tuning, Model Evaluation

  • Tools & Libraries: Python, scikit-learn, XGBoost, Random Forest, Gradient Boosting, Pandas, NumPy, Matplotlib/Seaborn

  • Key Contributions: Built predictive models using sparse multi-source health data; applied advanced imputation and oversampling to address data quality challenges; led model development and tuning for Random Forest and regression models

  • Impact: Showed that physical activity and behavioral patterns can help predict internet addiction severity in youth, highlighting the role of machine learning in supporting early intervention for adolescent mental health

Overview


This project focused on applying data science to a pressing public health challenge: problematic internet use (PIU) among youth. Our team investigated whether physical activity and behavioral data could predict the Severity Impairment Index (SII), a measure of internet addiction severity, using the Healthy Brain Network (HBN) dataset from the Child Mind Institute.



The project combined exploratory data analysis (EDA), feature engineering, imputation strategies, and multiple machine learning models to uncover patterns and develop predictive tools. By translating raw, messy, and sparse data into actionable insights, we demonstrated the potential for data-driven early interventions in adolescent mental health.

Problem

The widespread use of technology and social media has raised concerns about excessive and compulsive internet use, particularly in children and adolescents.

 

Problematic internet use has been linked to:

  • Depression and anxiety

  • Poor academic performance

  • Social withdrawal and sleep disruption

  • Heightened risk of addictive behaviors

Despite its prevalence, predicting who is most at risk remains difficult. Traditional assessments rely on self-reported surveys such as the Parent-Child Internet Addiction Test (PCIAT), which can be biased and inconsistent.  

 

Our project asked: 

Can we use objective behavioral and physical activity data to predict internet addiction severity, and in doing so, create a scalable early-warning system for youth mental health?

Dataset & Challenges

We used the Healthy Brain Network (HBN) dataset, which includes ~5,000 participants ages 5–22.

 

It integrates information from nine different sources, including:

  1. Demographic data (age, sex)

  2. Physical health tests (e.g., FitnessGram)

  3. Biometric measures

  4. Actigraphy and physical activity tracking

  5. Internet usage and behavioral surveys


Key Challenges

    1. Data Sparsity: Only basic demographic features were complete, with 40.6% of overall feature values missing. No single participant had full coverage across all sources.



    2. Feature Redundancy: Many metrics measured similar outcomes (e.g., fitness scores vs. performance zones). We streamlined features to retain raw, high-value variables while removing derived fields.



    3. Outliers: Implausible values (e.g., max fitness scores above 80,000) were flagged and replaced with NaN to avoid skewing scales.


     4. Class Imbalance: Most participants scored “None” or “Mild” on the SII, while “Moderate” and “Severe” cases were rare. This imbalance risked models defaulting to majority-class predictions.

These challenges shaped our preprocessing pipeline and informed the techniques we applied to clean, impute, and balance the dataset.
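These data issues are easy to surface during EDA. Below is a minimal sketch, assuming the merged HBN table is loaded as a Pandas DataFrame with an `sii` target column (the file path and column name are illustrative, not the exact ones we used):

```python
import pandas as pd

# Load the merged HBN training table (path and column name are illustrative).
df = pd.read_csv("hbn_train.csv")

# Share of missing values per feature, sorted from most to least sparse.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share.head(20))

# Overall missingness across the full feature matrix (~40% in our data).
print(f"Overall missing: {df.isna().mean().mean():.1%}")

# Class distribution of the Severity Impairment Index target.
print(df["sii"].value_counts(normalize=True, dropna=False))
```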

Preprocessing & Feature Engineering

Our preprocessing steps transformed the raw HBN dataset into a structured, machine-learning-ready format.

 

  • Imputation:

    • Basic field imputation using domain definitions (e.g., filling missing values with defaults where medically appropriate).

    • KNN imputation, where missing values for one participant were estimated using similar “neighbors” with overlapping data. This approach preserved correlations across tests.

  • Outlier handling: Out-of-range or nonsensical values were flagged and replaced with NaN.

  • Feature Reduction:

    • Removed redundant or low-correlation fields to simplify the dataset and reduce noise.

    • Performed correlation analysis to drop features with minimal predictive relationship to the target.

  • Encoding & Expansion:

    • Converted categorical fields into dummy variables.

    • Introduced polynomial features to capture higher-order interactions between predictors.

  • Scaling & Normalization: Ensured features were on comparable scales for models sensitive to feature magnitude.

  • Imbalance Correction: Applied Synthetic Minority Oversampling Technique (SMOTE) to generate additional samples of “Moderate” and “Severe” cases, improving representation across classes.
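A condensed sketch of this preprocessing pipeline, using scikit-learn and imbalanced-learn; the column names, thresholds, and parameter values are illustrative rather than the exact settings we used:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from imblearn.over_sampling import SMOTE

# df is the merged HBN table from the EDA sketch above; "sii" is the target.
X = pd.get_dummies(df.drop(columns=["sii"]), drop_first=True)  # encode categoricals as dummies
y = df["sii"]

# Flag implausible values as missing instead of letting them skew scales
# ("fitness_max_score" is a hypothetical column name).
X.loc[X["fitness_max_score"] > 80_000, "fitness_max_score"] = np.nan

# KNN imputation: estimate each missing value from participants with similar profiles.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)

# Scale features so magnitude-sensitive models are not dominated by large units.
X_scaled = StandardScaler().fit_transform(X_imputed)

# Polynomial interaction terms to capture higher-order relationships between predictors.
X_poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X_scaled)

# SMOTE: synthesize extra "Moderate" and "Severe" samples (applied to the training split in practice).
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_poly, y)
```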


Feature Selection: Correlation Analysis

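A minimal sketch of the correlation screen, continuing from the imputed matrix in the preprocessing sketch above (the 0.05 cutoff is illustrative):

```python
import pandas as pd

# Rebuild a DataFrame from the imputed matrix so columns keep their names.
X_num = pd.DataFrame(X_imputed, columns=X.columns)

# Absolute correlation of each feature with the SII target.
corr_with_target = X_num.corrwith(y).abs().sort_values(ascending=False)

# Keep features that clear a small correlation threshold; drop the rest as low-signal noise.
selected = corr_with_target[corr_with_target > 0.05].index
X_selected = X_num[selected]
print(f"Kept {len(selected)} of {X_num.shape[1]} features")
```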

Final dataset dimensions: 2,736 participants × 320 features.

Modeling Approach

We benchmarked multiple supervised learning algorithms against a baseline majority-class predictor. The target variable was the Severity Impairment Index (SII), categorized into:

0 = None

1 = Mild

2 = Moderate

3 = Severe

 

Baseline

  • Method: Always predicted “None.”

  • Accuracy: ~51% validation.

  • Purpose: Provided a minimum benchmark to assess whether more complex models added value.
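The baseline can be reproduced with scikit-learn's DummyClassifier; the split and variable names follow the sketches above and are illustrative:

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Hold out a validation split, stratified so rare SII classes appear in both sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_selected, y, test_size=0.2, stratify=y, random_state=42
)

# Always predict the most frequent class ("None").
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(f"Baseline validation accuracy: {baseline.score(X_val, y_val):.2f}")  # ~0.51 for our data
```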

 

Models & Results

 

1. Logistic Regression

 

  • Configured as a binary classifier (low vs. high SII).

  • Accuracy: 72% training, 65% validation.

  • Showed predictive value but was limited in handling class imbalance.
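A sketch of that binary setup, collapsing SII into low (None/Mild) versus high (Moderate/Severe); the cutoff and solver settings are illustrative:

```python
from sklearn.linear_model import LogisticRegression

# Collapse the four SII levels into a binary target: 0-1 -> low, 2-3 -> high.
y_train_bin = (y_train >= 2).astype(int)
y_val_bin = (y_val >= 2).astype(int)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train_bin)
print(f"Train: {logreg.score(X_train, y_train_bin):.2f}  Val: {logreg.score(X_val, y_val_bin):.2f}")
```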

2. XGBoost Classifier

 

  • Optimized with 200 estimators, max depth = 5, learning rate = 0.01.

  • Accuracy: 88% training, 62% validation.

  • Overfit the training data despite parameter tuning.
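The reported configuration maps onto the XGBoost scikit-learn API roughly as follows (remaining parameters left at their defaults):

```python
from xgboost import XGBClassifier

# Multiclass SII prediction; XGBClassifier infers the number of classes from y.
xgb = XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.01, eval_metric="mlogloss")
xgb.fit(X_train, y_train)
print(f"Train: {xgb.score(X_train, y_train):.2f}  Val: {xgb.score(X_val, y_val):.2f}")
```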

 

3. Random Forest Classifier

  • Hyperparameter tuning explored depths, estimators, and leaf sizes.

  • Accuracy: 97% training, 57% validation.

  • Mean cross-validation accuracy: 82.6%.

  • Best-performing model overall, balancing generalization with interpretability.
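A sketch of the tuning loop using scikit-learn's GridSearchCV; the grid values shown are illustrative, not the exact ranges we searched:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Search over tree depth, number of estimators, and minimum leaf size.
param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

# Report mean cross-validation accuracy for the best configuration, then check the held-out split.
best_rf = search.best_estimator_
cv_scores = cross_val_score(best_rf, X_train, y_train, cv=5)
print(f"Best params: {search.best_params_}")
print(f"Mean CV accuracy: {cv_scores.mean():.3f}  Val: {best_rf.score(X_val, y_val):.2f}")
```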

 

4. Gradient Boosting Classifier

  • High training accuracy (99%) but validation accuracy ~57%.

  • Sensitive to overfitting.
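For completeness, the scikit-learn gradient boosting setup looks roughly like this (hyperparameters are illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=42)
gb.fit(X_train, y_train)
print(f"Train: {gb.score(X_train, y_train):.2f}  Val: {gb.score(X_val, y_val):.2f}")
```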

Results figures: Baseline Model, Random Forest, and Gradient Boosting.

Findings

  • Random Forest emerged as the strongest model, outperforming logistic regression and gradient boosting in terms of cross-validation accuracy.

  • Physical activity data carries predictive signals that correlate with problematic internet use severity.

  • Validation accuracy remained lower than training, suggesting model generalization remains a challenge.

  • The study highlights both the potential and limitations of predictive modeling in mental health: data quality and completeness are critical.

This is a portfolio of Averine Sanduku's work. Please attribute my work if you are inspired by the material. Thank you!

© 2025 by Averine Sanduku. All rights reserved.
