Fast and Robust Algorithm for Learning a Causal Graph of Genes

Graphical models provide a powerful mathematical framework to represent different types of dependencies that may be interpreted as causality. Directed and undirected graphs have been used extensively in genomics to capture the relationships among genes. Directed Acyclic Graphs (DAGs), also known as Bayesian networks, are a class of graphical models with directed edges. The PC algorithm (Spirtes et al., 2000) is one of the first algorithms proposed for the inference of DAGs through a series of statistical tests of marginal and conditional independence. This algorithm involves two main steps: the first step (undirected skeleton) starts with a fully connected graph and infers an undirected graph, which is the skeleton of the DAG. The second step determines the direction of the edges in the skeleton. Several improvements of the original PC algorithm have been developed, such as those implemented in theR packagepcalg (Kalischet al., 2012). However, challenges remain when these algorithms are applied to noise genomic data. Here we present MPC, a novel algorithm, which can be applied to genotype and gene expression data and efficiently learn a causal graph of genes. The inferred causal graph is mostly directed but may also contain undirected edges: whereas a directed edge from gene A to gene B may be interpreted as gene A regulates the expression of gene B, an undirected edge is equivalent to a bidirected edge, indicating that although we have evidence that the two genes are not independent, we do not have enough information to determine which gene is the regulator and which the target. Our MPC algorithm further tackles several challenges. Firstly, the gene expression data often have outliers, even after normalization, which may drastically alter the topology of the inferred graph. We take a robust approach and calculate the robust correlation matrix on which the series of hypothesis testing is performed. Secondly, the number of statistical tests is unknown beforehand in all existing variants of PC algorithms. It is unclear how these algorithms control the false discovery rate (FDR). We adopt a sequential method (Javanmard and Montanari, 2015) that controls the FDR or marginal FDR (mFDR) in an online manner. Thirdly, genotype data at genetic variants (e.g., SNPs and copy number variation) provide additional information that helps to distinguish the casual direction between two genes; this is the rationale behind the Mendelian Randomization (MR). We incorporate MR into the second main step of the PC algorithm in determining edge direction in the graph skeleton. MR can greatly reduce the space of possible graphs and increase the inference efficiency for inference of genomic data. Applying MPC to synthetic and real data, we demonstrate that our algorithm has superior performance compared to existing PC algorithms. Our algorithm is implemented in the R package MPCALGO. Key words- Gene regulatory networks, Graphical models, Causal inference, Marginal and conditional independence, Multiple hypotheses testing, Mendelian randomization.