003 — Risk Factor Analysis

Cancer Patient Risk Factor Analysis

Multi-variable patient risk profiling across 24 environmental, behavioral, and clinical variables — stratified into Low, Medium, and High severity tiers using correlation analysis and an interactive dashboard.

Correlation Analysis · EDA · Python · Pandas · SQLite · Chart.js · Kaggle Dataset

2026

Data Analyst

Solo Project

Healthcare

Kaggle — Cancer Patient Dataset

Patient Records: 1,000
Variables Analyzed: 24
Top Predictor Correlation: r=0.78
Risk Tier Classes: 3
Process
Workflow
01
Data Ingestion
Loaded Kaggle Cancer Patient Dataset — 1,000 synthetic records across 24 ordinal variables (scale 1–9) and 3 target classes.
02
EDA
Computed descriptive stats, grouped means by risk level, segmented by age and gender using Pandas and NumPy.
03
Risk Scoring
Encoded Low=1, Medium=2, High=3 ordinally. Computed average exposure scores per risk tier across all 24 variables.
04
Correlation Analysis
Ran Pearson correlations for all 18 risk and symptom variables against ordinal risk level to rank discriminating power.
05
Dashboard
Delivered a three-tab interactive HTML dashboard using Chart.js — risk factors, demographics, and correlation ranking.
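Steps 02 to 04 above can be sketched in Pandas roughly as follows. This is a minimal illustration, not the project's actual code: the six rows are made up, and the column names ("Level", "Alcohol use", "Coughing of Blood", "Wheezing") are assumptions about how the Kaggle file is laid out.

```python
import pandas as pd

# Made-up sample rows standing in for the Kaggle data (column names are assumed).
df = pd.DataFrame({
    "Level": ["Low", "Low", "Medium", "Medium", "High", "High"],
    "Alcohol use": [2, 3, 4, 5, 7, 8],
    "Coughing of Blood": [1, 2, 4, 5, 8, 9],
    "Wheezing": [3, 5, 2, 6, 4, 5],
})

# Step 03: encode the target ordinally (Low=1, Medium=2, High=3).
LEVEL_CODES = {"Low": 1, "Medium": 2, "High": 3}
df["level_code"] = df["Level"].map(LEVEL_CODES)

predictors = [c for c in df.columns if c not in ("Level", "level_code")]

# Steps 02-03: average score per risk tier for every predictor.
tier_order = ["Low", "Medium", "High"]
tier_means = df.groupby("Level")[predictors].mean().reindex(tier_order)

# Step 04: Pearson r of each predictor against the encoded level, strongest first.
ranking = df[predictors].corrwith(df["level_code"]).sort_values(ascending=False)

print(tier_means)
print(ranking.round(2))
```

Because the target has only three ordinal values, Pearson r here measures how linearly a variable climbs across tiers; a rank-based measure such as Spearman would be a reasonable alternative check.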
Visualization
Cancer Risk Dashboard
Cancer Patient Risk Analysis
Source: Kaggle — Cancer Patient Dataset · 1,000 records · 24 risk & symptom variables · Levels: Low / Medium / High
Total Patients: 1,000
High Risk: 365
Medium Risk: 332
Top Predictor: Coughing of Blood
Tabs: Risk Factors · Demographics · Correlations
Chart: Average risk factor score among high-risk patients (scale 1–8; axis: high-risk average score out of 8)
Alcohol Use (Strongest Gap): High-risk patients average 6.83 vs 2.23 for low-risk, roughly a 3× ratio
Genetic Risk: High-risk average of 6.38 suggests hereditary factors play a major role
Smoking Anomaly: Medium-risk patients score lower (2.45) than low-risk (3.02), so smoking scores do not rise steadily from tier to tier
Chart: Age group distribution by risk level (series: High, Medium, Low)
Gender breakdown
Male (598): High 252 · Medium 197 · Low 149
Female (402): High 113 · Medium 135 · Low 154
Male patients skew high-risk: 42% of males are high-risk vs 28% of females, a 14-point gap
Female patients skew low-risk: 38% of females are low-risk vs only 25% of males
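The percentages above follow directly from the gender breakdown counts. A short check in plain Python, using only the counts shown on the dashboard (the function name `tier_share` is just an illustrative helper):

```python
# Counts as reported in the dashboard's gender breakdown.
counts = {
    "Male": {"High": 252, "Medium": 197, "Low": 149},
    "Female": {"High": 113, "Medium": 135, "Low": 154},
}

def tier_share(gender: str, tier: str) -> float:
    """Percentage of one gender's patients that fall in the given tier."""
    total = sum(counts[gender].values())
    return 100 * counts[gender][tier] / total

for gender, tiers in counts.items():
    shares = {tier: round(tier_share(gender, tier), 1) for tier in tiers}
    print(gender, sum(tiers.values()), shares)

# Gap in high-risk share between male and female patients, in percentage points.
gap = tier_share("Male", "High") - tier_share("Female", "High")
print(f"High-risk gap: {gap:.1f} points")
```

Note that these are shares within each gender (for example, 252 of 598 males), not each gender's share of the high-risk tier.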
Pearson correlation with cancer risk level
Interpretation
Coughing of Blood (r=0.78): Strongest single predictor, nearly 3× the correlation of lower-ranked symptoms
Lifestyle cluster: Alcohol, dust allergy, passive smoking & genetic risk all cluster at r≈0.70
Wheezing & nail clubbing: Weakest predictors (r<0.28), with low discriminating power
What I Found
Key Findings
Correlation
Coughing of Blood is the Dominant Predictor
Pearson correlation of r=0.78 against ordinal risk level: the single strongest predictor across all 18 tested variables, ahead of the next cluster at r≈0.70.
r = 0.78 (Pearson vs. risk level)
Gap Analysis
Alcohol Use Shows Steepest Exposure Gap
High-risk patients average 7.9 vs. 2.05 for low-risk, a ratio of more than 3x and the widest gap across all environmental and behavioral variables in the dataset.
3x+ exposure gap, low vs. high risk
Demographics
Males 30–39 are the Largest High-Risk Group
42% of male patients fall in the high-risk tier vs. 28% of female patients. The 30–39 age bracket is the single largest concentration, with 141 high-risk patients.
141 high-risk patients aged 30–39
Stack
Tools & Data
Data Source
  • Kaggle — Cancer Patient Data Sets
  • 1,000 synthetic patient records
  • 24 risk factor & symptom variables (1–9 ordinal)
  • 3 target classes: Low, Medium, High risk
Software Used
  • Python (Pandas, NumPy)
  • Excel
  • SQLite
  • Chart.js
  • HTML / CSS
Project Lifecycle
From Problem to Delivery
01
Problem
Define the Question
Which risk factors most strongly predict a patient's cancer severity level, and are the relationships linear across the low, medium, and high tiers?
02
Collect
Load the Data
Loaded the Kaggle Cancer Patient Dataset. 1,000 records, 24 variables across environmental risks, lifestyle factors, genetic indicators, and clinical symptoms.
03
Explore
EDA & Segmentation
Ran EDA with Pandas. Computed descriptive stats, grouped means by risk level, and segmented the cohort by age and gender to surface demographic patterns.
04
Analyze
Correlation Ranking
Encoded risk levels ordinally (Low=1, Medium=2, High=3) and ran Pearson correlations across all 18 predictor variables to rank by discriminating power.
05
Ship
Interactive Dashboard
Delivered a three-tab interactive HTML dashboard displaying risk factors, demographics, and correlation ranking.
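The Explore and Ship steps above can be illustrated with a small sketch: bucket patients into age brackets, cross-tabulate against risk level, and serialize the counts in the labels/datasets shape that Chart.js charts consume. Everything here is an assumption for illustration only: the ten rows are made up, and the column names ("Age", "Gender", "Level") and the 10-year bracket edges are not confirmed by the project.

```python
import json
import pandas as pd

# Made-up rows for illustration; column names are assumed.
df = pd.DataFrame({
    "Age":    [24, 33, 35, 38, 41, 47, 52, 36, 29, 63],
    "Gender": ["M", "M", "F", "M", "F", "M", "F", "M", "F", "M"],
    "Level":  ["Low", "High", "Medium", "High", "Low",
               "Medium", "High", "High", "Low", "Medium"],
})

# Assumed 10-year brackets; right=False makes each bin [lower, upper).
bins = [20, 30, 40, 50, 60, 70]
labels = ["20-29", "30-39", "40-49", "50-59", "60-69"]
df["age_group"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)

# Patient counts per age bracket and risk tier, in a fixed row/column order.
tiers = ["Low", "Medium", "High"]
by_age = (
    pd.crosstab(df["age_group"], df["Level"])
    .reindex(index=labels, columns=tiers, fill_value=0)
)

# The same segmentation by gender.
by_gender = pd.crosstab(df["Gender"], df["Level"]).reindex(columns=tiers, fill_value=0)

# One dataset per tier, matching the labels/datasets layout of a Chart.js data object.
chart_data = {
    "labels": labels,
    "datasets": [{"label": t, "data": by_age[t].tolist()} for t in tiers],
}

print(by_age)
print(by_gender)
print(json.dumps(chart_data, indent=2))
```

Writing the JSON to a file or embedding it in the page would let the dashboard's charts read pre-aggregated counts rather than raw patient rows.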