Paper Title
Fast and Robust Algorithm for Learning a Causal Graph of Genes
Abstract
Graphical models provide a powerful mathematical framework to represent different types of dependencies that
may be interpreted as causality. Directed and undirected graphs have been used extensively in genomics to capture the
relationships among genes. Directed Acyclic Graphs (DAGs), also known as Bayesian networks, are a class of graphical
models with directed edges. The PC algorithm (Spirtes et al., 2000) is one of the first algorithms proposed for the inference
of DAGs through a series of statistical tests of marginal and conditional independence. This algorithm involves two main
steps: the first step (undirected skeleton) starts with a fully connected graph and infers an undirected graph, which is the
skeleton of the DAG. The second step determines the direction of the edges in the skeleton. Several improvements of the
original PC algorithm have been developed, such as those implemented in theR packagepcalg (Kalischet al., 2012).
However, challenges remain when these algorithms are applied to noise genomic data. Here we present MPC, a novel
algorithm, which can be applied to genotype and gene expression data and efficiently learn a causal graph of genes. The
inferred causal graph is mostly directed but may also contain undirected edges: whereas a directed edge from gene A to gene
B may be interpreted as gene A regulates the expression of gene B, an undirected edge is equivalent to a bidirected edge,
indicating that although we have evidence that the two genes are not independent, we do not have enough information to
determine which gene is the regulator and which the target. Our MPC algorithm further tackles several challenges. Firstly,
the gene expression data often have outliers, even after normalization, which may drastically alter the topology of the
inferred graph. We take a robust approach and calculate the robust correlation matrix on which the series of hypothesis
testing is performed. Secondly, the number of statistical tests is unknown beforehand in all existing variants of PC
algorithms. It is unclear how these algorithms control the false discovery rate (FDR). We adopt a sequential method
(Javanmard and Montanari, 2015) that controls the FDR or marginal FDR (mFDR) in an online manner. Thirdly, genotype
data at genetic variants (e.g., SNPs and copy number variation) provide additional information that helps to distinguish the
casual direction between two genes; this is the rationale behind the Mendelian Randomization (MR). We incorporate MR
into the second main step of the PC algorithm in determining edge direction in the graph skeleton. MR can greatly reduce
the space of possible graphs and increase the inference efficiency for inference of genomic data. Applying MPC to synthetic
and real data, we demonstrate that our algorithm has superior performance compared to existing PC algorithms. Our
algorithm is implemented in the R package MPCALGO.
Key words- Gene regulatory networks, Graphical models, Causal inference, Marginal and conditional independence,
Multiple hypotheses testing, Mendelian randomization.