Chuting Xu

Gene Expression from Chromatin (GM12878)

A two-stage ML pipeline (classification, then regression) predicts GM12878 gene activity from chromatin-derived features, reaching a classification AUC of roughly 0.9 or higher.

r · genomics · machine-learning

  • Problem: Predict the ON/OFF status and expression level of genes from chromatin features around the TSS (CAGE-based), validating a simplified version of the Dong et al. 2012 pipeline.
  • Approach: BigWig histone/DNase signals → TSS-centric bins; D1 for bin selection, D2 for training; two-step models, a classifier {LR/RF/SVM} followed by a regressor {Lasso/RF/MARS/SVR}; 10-fold CV.
  • Result: Classification AUC of roughly 0.9 or higher; the best two-step combination was RF classifier + RF regressor, judged by log-scale RMSE and Pearson correlation on chr1.
  • Repro: R (tidyverse, randomForest/MARS); fixed seeds; `renv::init()` to capture versions; Quarto report renders end-to-end.
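
The two-step idea in the Approach bullet can be sketched as follows. This is an illustration in Python, not the project's R code: the 0.5 threshold, the choice to assign 0.0 to genes called OFF, and the toy stand-in models are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

Features = Sequence[float]

def two_step_predict(
    rows: Sequence[Features],
    classify: Callable[[Features], float],  # returns an ON probability or score
    regress: Callable[[Features], float],   # returns a predicted expression level
    threshold: float = 0.5,                 # assumed cutoff, not taken from the project
) -> List[float]:
    """Return 0.0 for genes called OFF, otherwise the regressor's output."""
    predictions = []
    for x in rows:
        is_on = classify(x) >= threshold
        predictions.append(regress(x) if is_on else 0.0)
    return predictions

# Hypothetical stand-ins for fitted models, for illustration only.
def toy_classifier(x: Features) -> float:
    return 1.0 if sum(x) > 1.0 else 0.0

def toy_regressor(x: Features) -> float:
    return 2.0 * sum(x)

toy_rows = [[0.1, 0.2], [0.9, 0.8], [0.5, 0.6]]
toy_predictions = two_step_predict(toy_rows, toy_classifier, toy_regressor)
```

Passing the models as plain callables keeps the composition independent of the model family, so any classifier from {LR/RF/SVM} could be paired with any regressor from {Lasso/RF/MARS/SVR}.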

Overview

Brief description of GM12878, feature construction, and the two-step pipeline.

Methods

  • Signal extraction, binning strategy, model families, CV scheme.

Results

  • AUC/RMSE/Correlation; plots (regularization path, feature importance), subgroup checks.
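
For reference, the three metrics named above can be computed as in the following Python sketch (the project itself is in R). The log transform with a pseudocount of 1 is an assumption about how the log-scale RMSE is defined, and the pairwise AUC is a simple O(n²) formulation chosen for readability.

```python
import math
from typing import Sequence

def rmse_log(observed: Sequence[float], predicted: Sequence[float],
             pseudocount: float = 1.0) -> float:
    """RMSE between log(x + pseudocount) values; the pseudocount is an assumption."""
    squared = [
        (math.log(o + pseudocount) - math.log(p + pseudocount)) ** 2
        for o, p in zip(observed, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random ON gene outscores a random OFF gene (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```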

Links: GitHub Repo · Report (PDF)