Gene Expression from Chromatin (GM12878)
A two-step ML pipeline predicts gene activity in GM12878 from chromatin-derived features, first classifying genes as ON or OFF and then estimating expression level.
- Problem: Predict the ON/OFF status and expression level of genes from chromatin features around the TSS (CAGE-based), validating a simplified version of the Dong et al. 2012 pipeline.
- Approach: BigWig histone/DNase signals are summarized into TSS-centric bins; subset D1 is used for bin selection and D2 for training. Two-step models pair a classifier (LR, RF, or SVM) with a regressor (Lasso, RF, MARS, or SVR), evaluated with 10-fold CV.
- Result: Classification AUC of about 0.9 or higher; the best two-step combination was RF classifier + RF regressor, which gave strong RMSElog and Pearson correlation on chr1.
- Repro: R (tidyverse, randomForest/MARS); fixed seeds; `renv::init()` sets up a project library and lockfile that record package versions; the Quarto report renders end-to-end.
Overview
Brief description of GM12878, feature construction, and the two-step pipeline.
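
The two-step idea can be illustrated with a minimal sketch. It is written in Python for brevity, although the project itself is in R, and it swaps in toy stand-ins for the real models: a midpoint threshold in place of the classifier (LR/RF/SVM) and a one-feature least-squares line in place of the regressor (Lasso/RF/MARS/SVR). All function names and the example data are hypothetical.

```python
from statistics import mean


def fit_threshold_classifier(x, is_on):
    """Toy classifier: midpoint between the mean signal of ON and OFF genes."""
    on = [xi for xi, flag in zip(x, is_on) if flag]
    off = [xi for xi, flag in zip(x, is_on) if not flag]
    return (mean(on) + mean(off)) / 2.0


def fit_line(x, y):
    """One-feature ordinary least squares; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_two_step(x, expr):
    """Step 1: learn ON/OFF. Step 2: fit the regressor on ON genes only."""
    is_on = [e > 0 for e in expr]  # assumption: OFF means zero expression
    threshold = fit_threshold_classifier(x, is_on)
    x_on = [xi for xi, flag in zip(x, is_on) if flag]
    y_on = [e for e, flag in zip(expr, is_on) if flag]
    slope, intercept = fit_line(x_on, y_on)
    return threshold, slope, intercept


def predict_two_step(model, x):
    """Genes classified OFF get 0; genes classified ON get the regression value."""
    threshold, slope, intercept = model
    return [slope * xi + intercept if xi > threshold else 0.0 for xi in x]


# Hypothetical data: one chromatin signal value per gene and its expression.
signal = [0.1, 0.2, 0.3, 2.0, 3.0, 4.0]
expression = [0.0, 0.0, 0.0, 1.0, 1.5, 2.0]

model = fit_two_step(signal, expression)
predictions = predict_two_step(model, [0.15, 3.5])
print(predictions)
```

The point of the composition is that the regressor never has to model the large mass of silent genes; the classifier gates them out first.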
Methods
- Signal extraction, binning strategy, model families, CV scheme.
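
As a rough illustration of the binning and CV steps, the sketch below averages a per-base signal into bins centred on a TSS and builds shuffled k-fold splits with a fixed seed. It is a Python sketch under stated assumptions, not the project's R code: the bin width, the number of flanking bins, the strand handling, and the function names `tss_bins` and `kfold_indices` are all hypothetical.

```python
import random


def tss_bins(signal, tss, strand="+", bin_width=100, n_flank=2):
    """Mean signal in 2 * n_flank bins around tss, ordered 5' to 3'.

    signal is a per-base list for one chromosome. Positions that fall
    outside the chromosome contribute 0 to the bin mean.
    """
    start = tss - n_flank * bin_width
    bins = []
    for i in range(2 * n_flank):
        lo = start + i * bin_width
        hi = lo + bin_width
        window = signal[max(lo, 0):max(min(hi, len(signal)), 0)]
        bins.append(sum(window) / bin_width)
    # Reverse for minus-strand genes so bin 0 is always the most upstream.
    return bins[::-1] if strand == "-" else bins


def kfold_indices(n, k=10, seed=1):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed keeps the splits reproducible
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


# Hypothetical signal: silent for 200 bp, then a flat plateau of height 1.
toy_signal = [0] * 200 + [1] * 200
plus_bins = tss_bins(toy_signal, tss=200, strand="+")
minus_bins = tss_bins(toy_signal, tss=200, strand="-")
splits = list(kfold_indices(25, k=10, seed=1))
```

Each gene then contributes one feature per mark per bin, and every gene index lands in exactly one test fold.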
Results
- AUC/RMSE/Correlation; plots (regularization path, feature importance), subgroup checks.
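
For reference, the three headline metrics can be computed as in the sketch below. It is a standard-library Python sketch rather than the project's R code. AUC is computed as the fraction of (positive, negative) pairs ranked correctly, with ties counted as half. RMSE is assumed here to be taken on values that are already log-transformed, which is one reading of "RMSElog". The example numbers are made up for illustration.

```python
import math


def auc(labels, scores):
    """Probability that a random ON gene scores above a random OFF gene."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rmse(y_true, y_pred):
    """Root mean squared error; pass log-scale values for a log-scale RMSE."""
    sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sq / len(y_true))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)


# Made-up example values, not results from the report.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The pairwise AUC is quadratic in the number of genes, which is fine for a sanity check but slower than a rank-based implementation on a full gene set.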
Links: GitHub Repo · Report (PDF)