ActiveDriver is a computational method for identifying post-translational modification (PTM) sites (i.e., active sites) in proteins that are significantly mutated in cancer genomes. ActiveDriver provides signalling-related interpretation of single nucleotide variants (SNVs) identified in cancer genome sequencing.
ActiveDriver is based on a gene-centric logistic regression model that considers multiple factors in estimating significance of mutation enrichment in PTM sites. The factors include mutation frequency, distribution of active sites in protein sequence, their position with respect to mutations (directly in the PTM-associated amino acid or near the PTM site), and structured and disordered regions of proteins.
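The idea of testing whether PTM sites carry more mutations than expected can be sketched with a likelihood-ratio comparison of two logistic regression models. The following is a toy illustration under stated assumptions, not the ActiveDriver implementation: the 200-residue protein, the annotation vectors (`disorder`, `in_site`), the mutated positions, and the two model formulas are all hypothetical, and the actual package considers further factors such as the position of a mutation relative to the site.

```r
# Toy illustration only: NOT the ActiveDriver implementation.
# One hypothetical protein of 200 residues, one row per residue.
n_res <- 200
pos <- seq_len(n_res)

# Hypothetical per-residue annotations
disorder <- as.integer(pos <= 100)                  # 1 = disordered region
in_site  <- as.integer(pos %in% c(41:60, 141:160))  # 1 = in or near a PTM site

# Hypothetical mutated positions: six inside PTM regions, two outside
mutated <- as.integer(pos %in% c(45, 50, 55, 145, 150, 155, 10, 120))

residues <- data.frame(mutated, disorder, in_site)

# Null model: mutation probability depends on disorder only
h0 <- glm(mutated ~ disorder, family = binomial, data = residues)

# Alternative model: PTM site membership is an additional predictor
h1 <- glm(mutated ~ disorder + in_site, family = binomial, data = residues)

# Likelihood-ratio test: the drop in deviance is compared with a
# chi-square distribution with one degree of freedom
lr_stat <- deviance(h0) - deviance(h1)
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

print(c(lr_stat = lr_stat, p_value = p_value))
```

A small p-value here means that adding site membership explains the observed mutations better than disorder alone, which is the kind of signal a gene-centric enrichment test looks for.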
Please refer to the following publications:
- Jüri Reimand, Gary D. Bader: Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. (2013) Molecular Systems Biology, 9:637. doi:10.1038/msb.2012.68 [PDF].
See also a Research Highlight in Genome Medicine.
Supplementary data (tables S1-S9) [ZIP].
- Jüri Reimand, Omar Wagih, Gary D. Bader: The mutational landscape of phosphorylation signaling in cancer. (2013) Scientific Reports, 3:2651. doi:10.1038/srep02651 [PDF].
Supplementary data (tables S1-S8) [ZIP].
Supplementary data (Synapse at syn2237931).
Original pan-cancer 12 mutations from TCGA (Synapse at syn1729383).
The following code shows an example analysis with ActiveDriver comprising seven genes with mutations in the TCGA pancancer project. The required input files can be found here. Uncompress the ZIP file as folder “pancan12_example” into the working directory of R.
    # load library
    library(ActiveDriver)

    # load required datasets
    muts = read.delim("pancan12_example/mutations.txt")
    sites = read.delim("pancan12_example/phosphosites.txt")
    seqs = read_fasta("pancan12_example/sequences.fa")
    disorder = read_fasta("pancan12_example/sequence_disorder.fa")

    # run ActiveDriver
    psnv_info = ActiveDriver(seqs, disorder, muts, sites)

    # save gene-based p-values and merged report as CSV files
    write.csv(psnv_info$all_gene_based_fdr, "pancan12_results_pvals.csv")
    write.csv(psnv_info$merged_report, "pancan12_results_merged.csv")

    # look at first few lines of every table in results
    lapply(psnv_info, head)
The above example produces an R list with six tables:
- all_active_mutations – table of PTM-related mutations in sites or regions. The column active_region identifies the mutated protein region (see below), and the column status gives the mutation type relative to the closest PTM site (DI = direct, N1 = close flanking, N2 = distant flanking).
- all_active_sites – table of all PTM sites in proteins, identified by the field active_region. Position indicates first residue in protein sequence.
- all_region_based_pval – table of site-based significance tests. Sites are identified by the column region. The fields med, low, high show expected mutation counts (+/- s.d.) and obs shows observed mutation counts.
- all_gene_based_fdr – gene-based significance scores before and after FDR multiple testing correction.
- all_active_regions – sequence coordinates of PTM-related sequence regions in proteins. The field ‘reg’ corresponds to region ID as shown in table all_region_based_pval.
- merged_report – table with each mutated PTM-associated sequence region, PTM sites in the region, and corresponding mutations.
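As a rough illustration of the FDR correction mentioned for all_gene_based_fdr, the sketch below adjusts a handful of made-up gene-level p-values with base R's `p.adjust`. It assumes the Benjamini-Hochberg procedure; the gene names, p-values, and the column names `gene`, `p`, and `fdr` are hypothetical and may not match the columns ActiveDriver actually writes.

```r
# Hypothetical gene-level p-values (names and values are made up)
gene_p <- c(GENE_A = 0.001, GENE_B = 0.01, GENE_C = 0.02, GENE_D = 0.5)

# Benjamini-Hochberg adjustment ("fdr" is an alias for "BH" in p.adjust)
gene_fdr <- p.adjust(gene_p, method = "fdr")

# Collect into a table with hypothetical column names
results <- data.frame(gene = names(gene_p), p = gene_p, fdr = gene_fdr,
                      row.names = NULL)

# Keep genes below a 5% FDR cutoff (the cutoff is an arbitrary example)
significant <- results[results$fdr < 0.05, ]
print(significant)
```

The same filtering pattern could be applied to the real table after checking its column names with `head(psnv_info$all_gene_based_fdr)`.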
Example data [Reimand et al, Mol Sys Biol 2013]
ActiveDriver requires the following four types of input. Example data originate from our first phosphosite paper.
- non-synonymous point mutations [example: SNVs of 797 cancer genomes];
- active sites in protein sequences [example: 73,873 phosphosites] OR
- active regions [example: kinase domains];
- protein sequences [example: longest isoforms for 18,422 human genes from CCDS];
- predicted disorder of protein sequences [example: DISOPRED2 predictions for above protein sequences].
- Phosphorylated and mutated protein sequences [CSV].
- List of protein isoforms used in the study (mapping of gene symbols and Consensus CDS (CCDS) IDs) [TXT].
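Because mutations, sites, and sequences must all refer to the same protein isoforms, it can help to check that the reference residue recorded for each mutation matches the protein sequence before running an analysis. The sketch below shows one way to do this in base R. Everything in it is an assumption made for illustration: the toy sequences, the representation of sequences as a named character vector, and the column names `gene`, `position`, `wt_residue`, and `mut_residue` should be checked against the actual input files.

```r
# Hypothetical toy inputs; column names and data layout are assumptions
seqs <- c(GENE_A = "MSTPLKRSY", GENE_B = "MAAYTSE")

muts <- data.frame(
  gene = c("GENE_A", "GENE_A", "GENE_B"),
  position = c(3, 8, 4),
  wt_residue = c("T", "S", "K"),  # the last entry is deliberately wrong
  mut_residue = c("A", "L", "N"),
  stringsAsFactors = FALSE
)

# Residue found in the protein sequence at each mutated position
ref_residue <- substr(seqs[muts$gene], muts$position, muts$position)

# Flag mutations whose recorded reference residue disagrees with the sequence
muts$matches_sequence <- unname(ref_residue == muts$wt_residue)
print(muts)
```

Rows with `matches_sequence` equal to `FALSE` usually point to a mismatch between the isoform used for mutation annotation and the isoform in the sequence file.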
TCGA pan-cancer data [Reimand et al, Nat Sci Rep 2013]
The following files relate to our phosphosite analysis of TCGA pan-cancer mutations (published Oct. 2013).
- 241,701 non-synonymous point mutations in 3,185 tumor samples from TCGA pancancer12 project [ZIP]. NB! These were re-mapped using Annovar to RefSeq sequences, see below. The original mutation files can be found in Synapse at syn1729383;
- 87,898 phosphosites in protein sequences [ZIP];
- protein sequences – longest isoforms for 18,671 human genes [ZIP];
- Predicted disorder of protein sequences (from Disopred2) [ZIP].
- Map of gene symbols and corresponding protein isoforms (RefSeq IDs) [ZIP].
Phosphosites and mutations mapped to human proteins in Ensembl 70 [28.09.15]
Zipped archive with protein sequences, sequence disorder, and phosphorylation sites for human proteins in Ensembl 70 can be downloaded from here.
ActiveDriver input files for HG38 [06.04.2015]
Zipped archive with protein sequences, sequence disorder, four types of PTM sites (phosphorylation, ubiquitination, acetylation, methylation), and pancan12 mutations converted to HG38 using LiftOver can be downloaded from here.
We have identified a mistake in one of the formulas describing the ActiveDriver model in the original publication (Reimand and Bader, Mol Sys Biol 2013). In the image below, the first flawed formula occurs in the paper and the second, corrected formula should be referred to instead. Thanks to Xiaohe Li @ NUS.EDU for pointing this out!