SignS is a web-based tool for gene selection and signature finding in problems where the dependent variable is patient survival or, more generally, a right-censored variable. Four approaches have been implemented: the threshold gradient descent method (TGD) of Gui & Li (2005); a method that uses a combination of gene filtering, clustering, and survival model building (FCMS), very similar to the one used in Dave et al. (2004); a method that uses random forests with conditional inference trees, by Hothorn et al. (2006a); and a method that uses boosting with component-wise univariate Cox models (Hothorn et al., 2006a).
There are other approaches that can analyze gene expression (in general, high-dimensional) data when the response is a right-censored variable. However, few other methods explicitly attempt to recover subsets of the original genes, preserving relevant genes even when these are highly correlated. Other approaches (e.g., PLS) will return predictive models, but offer little guidance as to the selection of genes (i.e., prediction will be based on the complete set of original variables ---genes), or are intrinsically limited as to the maximum number of genes that can be used for fitting (such as the lasso).
The "Filter, cluster, and stepwise model selection" method uses a series of steps very similar to those in Dave et al.
This method, developed by Gui and Li, uses penalized likelihood, a general approach (other examples include the lasso and ridge regression) for obtaining estimates of the regression coefficients in high-dimensional spaces such as these (few subjects, thousands of features). This approach has close connections to the lasso, but should overcome some of its limitations; for instance, the number of genes selected by the lasso can be no larger than the number of cases, and the lasso tends to return only one gene from a set of highly correlated genes.
Two parameters have to be set for this method: a threshold parameter τ (tau) and an infinitesimal increment ∆ν (Delta nu). The threshold parameter (0 ≤ τ ≤ 1) is related to the amount of penalization: larger values lead to more coefficients being set to 0. Delta nu is an infinitesimal increment that controls how much the coefficients change at each iteration of the algorithm.
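For intuition, one iteration of the algorithm looks roughly like the following R sketch (an illustration, not SignS's actual code; cox_gradient is a hypothetical helper that would return the gradient of the Cox log partial likelihood at the current coefficients):

    ## One threshold gradient descent update (sketch, not SignS's code).
    ## 'cox_gradient' is a hypothetical helper: the gradient of the Cox
    ## log partial likelihood at the current coefficients.
    tgd.step <- function(beta, x, time, status, tau, delta.nu) {
        g <- cox_gradient(beta, x, time, status)
        ## only coefficients whose gradient is within a fraction tau of
        ## the largest absolute gradient are updated; larger tau means
        ## fewer coefficients move away from 0
        f <- as.numeric(abs(g) >= tau * max(abs(g)))
        beta + delta.nu * g * f
    }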
In their paper, Gui and Li suggest using cross-validation to choose τ and ν. In practice, an operational approach is to decide in advance on a set of τ to examine (in our case, 0, 0.2, 0.4, 0.6, 0.8, 1), and use cross-validation to select the ν for each of the pre-specified τ. For each τ, we choose the ν that minimizes the cross-validated partial likelihood (CVPL), a function of the difference between the log Cox partial likelihoods for the models with and without the ith group of subjects. Once we find the ν for a given τ, it is immediate to obtain the fitted coefficients for the genes and the scores for subjects not in the sample.
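In symbols, and consistent with the description above, the CVPL for a given τ can be written (up to sign and scaling conventions, which vary between papers) as

    \mathrm{CVPL}(\nu) = -\sum_{i=1}^{K} \left[ \ell\left(\hat\beta_{(-i)}(\nu)\right) - \ell_{(-i)}\left(\hat\beta_{(-i)}(\nu)\right) \right]

where ℓ is the log Cox partial likelihood computed on all subjects, ℓ_{(-i)} is the same quantity computed without the ith group, and β̂_{(-i)}(ν) are the coefficients estimated, at the given τ and ν, leaving out the ith group; we choose the ν that minimizes this quantity.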
From the original paper, it seems that the authors used the CVPL to find the ν for a given τ, but it is not clear that this is the criterion ultimately used to select among τ. Rather, in the example discussed in their paper, the authors choose the final model based on the area under the ROC curve, and they use a set of data that includes both training and testing subsets. We have (so far) only implemented selection based on the CVPL. Selection based on ROC curves has not yet been implemented, because of the additional computational demands, but we are working on the implementation of this (and another) approach for selecting the final τ, which can be used when we have no external testing data set.
Note that we scale each gene (subtract the mean and divide by the standard deviation) to prevent numerical problems that arise with some data sets when exponentiating the score function. This is in contrast to the R code from the authors, which performed no scaling or centering. In the future, this scaling of the data will be optional, possibly depending on the existence of numerical problems. Note that when using scaled data the interpretation of the coefficients does not depend on the variance of a gene: all coefficients refer to the effect of a change of 1 s.d. unit of the gene.
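In R, this per-gene scaling amounts to something like the following (a minimal sketch; x is the genes-by-subjects expression matrix):

    ## scale() works column-wise, so transpose, scale, transpose back;
    ## each gene (row) ends up with mean 0 and standard deviation 1
    x.scaled <- t(scale(t(x)))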
NEW We have re-enabled this method. As this method is slow, following a suggestion from Jiang Gui, we no longer examine the CVPL over a range of threshold parameters τ (tau), but instead use a fixed one. By default, it is set at 1. You can vary it, but you are likely to want large values (between 0.9 and 1). The main reason is that this method can add genes but not remove them, so small values of the threshold will result in hundreds or thousands of genes being selected in the very first steps of the method and then carried along all the way.
The approach of Hothorn et al. for using random forests utilizes conditional inference trees as base learners. A conditional inference tree is built by recursively splitting the data into two groups according to the values of the covariates. First, test the global null hypothesis of independence between any of the variables and the response. If this hypothesis is rejected, select the variable with the strongest association with the response, and implement a binary split on the selected input variable. Continue recursively (in each branch of the tree) until the global null is no longer rejected. The tests are carried out in a conditional inference framework (for details see Hothorn et al., 2006b). A forest is built by fitting a conditional inference tree to each of many (in our case, 500) bootstrap samples of the data. The type of response returned by these forests is Kaplan-Meier curves. In contrast to the previous two methods, no variable (gene) selection is performed here, although, implicitly, the construction of each tree involves choosing the best variable for a split.
Following van Wieringen et al. (2007), we fit the forests to the best genes, as determined from gene-wise Cox models. The actual number of genes used is the parameter "Number of genes", which can be chosen by the user of the application. The default is 200, as in the reference above.
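A minimal sketch of these two steps, using the survival and party packages (this parallels, but is not, the actual SignS code; x is the genes-by-subjects matrix, and time and status hold the survival data):

    library(survival)
    library(party)

    ## rank genes by the p-value of gene-wise Cox models
    pvals <- apply(x, 1, function(g)
        summary(coxph(Surv(time, status) ~ g))$logtest["pvalue"])
    top <- order(pvals)[1:200]  ## the "Number of genes" parameter

    ## fit a forest of 500 conditional inference trees to the best genes
    dat <- data.frame(time = time, status = status, t(x[top, ]))
    forest <- cforest(Surv(time, status) ~ ., data = dat,
                      controls = cforest_unbiased(ntree = 500))
    km <- treeresponse(forest, OOB = TRUE)  ## Kaplan-Meier curve per subject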
Boosting is a general iterative optimization method that, at each iteration, fits the predictor variable that best reduces the appropriate loss criterion (Bühlmann and Hothorn, 2008) and updates the estimates; the algorithm used is equivalent to refitting residuals multiple times.
For survival data, Hothorn et al. (2006a) and Bühlmann and Hothorn (2008) use component-wise univariate Cox models. Boosting requires choosing the number of boosting iterations but, as shown in those references, we can use cross-validation to select this parameter. This method performs variable selection, similarly to the lasso and other L1-penalty methods. For the final results we select the variables (genes) with non-zero coefficients at the optimal number of iterations, as chosen via cross-validation.
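A minimal sketch with the mboost package (again, an illustration rather than the actual SignS code; dat is as in the forest sketch above):

    library(mboost)
    library(survival)

    ## component-wise linear boosting under the Cox partial likelihood loss
    fit <- glmboost(Surv(time, status) ~ ., data = dat, family = CoxPH(),
                    control = boost_control(mstop = 500))
    cv  <- cvrisk(fit)      ## cross-validate over the number of iterations
    fit <- fit[mstop(cv)]   ## stop at the CV-optimal iteration
    coef(fit)               ## the genes with non-zero coefficients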
Assessing predictive performance with censored data is more complex than with, say, classification data, and several alternative approaches have been suggested. Here, we provide a simple graphical approach, popular in the gene expression literature: for each case, we obtain its out-of-bag cross-validated prediction (i.e., the prediction from a model in which that subject was not used at all in any of the steps of the model building). Using the out-of-bag predictions we divide all subjects into two groups, those with high scores and those with low scores (i.e., we split them into two equally sized groups). Then, we plot the survival curves and compare the survival of the two groups with a log-rank test.
(Note that with random forests, we use the expected survival time, not the linear predictor scores).
It must be emphasized that these are out-of-bag predictions: predictions for each subject obtained from a complete model building process in which that subject was not involved. For comparison, we also provide the "overfit" plots, where the scores used to split the subjects are "resubstitution" scores.
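In outline, this is all these plots do (a sketch; scores are the out-of-bag cross-validated predictions described above):

    library(survival)
    ## split subjects into two equally sized groups by their score
    grp <- ifelse(scores > median(scores), "high", "low")
    plot(survfit(Surv(time, status) ~ grp))  ## Kaplan-Meier curve per group
    survdiff(Surv(time, status) ~ grp)       ## log-rank test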
Be very careful when using the above plots!! That way of measuring or assessing predictive ability is extremely limited, which is a soft way of saying that it is wrong: those p-values mean nothing, and a small p-value does not mean you really can predict survival well, etc., etc., etc. I do not have time to write about this in any detail right now, but do yourself a favor and look, for instance, at rules 36 to 38 of the paper "Critical Review of Published Microarray Studies for Cancer Outcome and Guidelines on Statistical Analysis and Reporting" by Alain Dupuy and Richard M. Simon, published in the JNCI (January 17, 2007).
How should we measure predictive ability, then? There is a range of approaches. I like (and have used in published papers ---see, e.g., Montes et al., 2011, "miRNA expression in diffuse large B-cell lymphoma treated with chemoimmunotherapy", Blood, 118: 1034--40, and go to the supplementary material) what are basically extensions to the survival data case of things such as Brier's score and the concordance index (of course, from cross-validated/bootstrapped runs). These indices can really address the question of the predictive quality of your model. The above plots don't.
Why aren't Brier's and the concordance index in SignS/will they ever be in SignS? They aren't because I started working in SignS when I was a lot younger and, thus, a lot un-wiser. They might make their way to SignS when/if I get the time to work on SignS again.
Why don't you remove these plots then? Ah, great question. I think they give you a (bad) hint of what is going on. And taking them out would require me to rewrite a lot of the help and change the examples, etc, and I just don't have the time now. However, I've warned you not to use them for anything serious. Oh, and I use this in class as an example of how NOT to do things.
Be very careful when looking at the legend!! This I just noticed (April 2015) when a friend of mine at CNIO (Gonzalo Gomez) asked me an innocent question: what does this "high/low" score in the legend really mean? Get ready, because this is a confusing mess I should some day fix. If you are using something like FCMS (which I recommend you don't use, but whatever), the TGD method, or boosting with component-wise Cox models, we get the linear scores from the model (remember, the higher the linear score, the higher the hazard) and we split them into two groups; the blue ones are those from the individuals with low linear scores (yes, low), and thus those from individuals that should die later (those that have smaller hazards). In contrast, if you use something like random forests, the scores are predicted survival times, so the individuals in the blue group are those with smaller predicted survival times. Sorry, I am aware this is very confusing.
Stability is assessed by showing the parameters chosen in each cross-validation run (for TGD), the signatures, their coefficients, and the composing genes (for FCMS), or the selected genes (random forests and boosting). For all methods we show the frequency with which each selected gene appears in the output, and the percentage of coincidence between the different runs. (See the output section, below.)
The file with the covariates; generally the gene expression data. In this file, rows represent variables (generally genes), and columns represent subjects, or samples, or arrays.
The file for the covariates should be formatted as:
#Name	ge1	ge2	ge1	ge1	ge2
gene1	23.4	45.6	44	76	85.6
genW@	3	34	23	56	13
geneX#	23	25.6	29.4	13.2	1.98
A tab-separated text file with the survival or duration time (e.g., time to death, time to relapse). These data can be right-censored. Separate values by tab (\t), and finish the file with a carriage return or newline. No missing values are allowed here.
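For instance, a valid file for five subjects could look like this (values made up; the unit of time is whatever your study uses, e.g., months):

    23.4	5.1	60	8.9	14.2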
Please note that we do not allow any data set with 10 or fewer cases. In terms of statistical analysis, the "true sample size" of right-censored data is often much smaller than the number of cases, and even 20 cases are way too few.
It is your responsibility to make sure the assumptions for these data hold. The models work for right-censored data. See the usual caveats/discussions on other types of censoring, informative censoring, etc.
A tab-separated text file indicating if the event (e.g., death, relapse) was actually observed (this is indicated with a 1) or not (a 0); cases for which the event was not observed are the right-censored cases.
Same format as above: Separate values by tab (\t), and finish the file with a carriage return or newline. No missing values are allowed.
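For instance, for the five subjects of the previous example, a valid censoring indicator file could be (here, subjects 2 and 5 are right-censored):

    1	0	1	1	0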
If you have validation data, you can enter them here. It is rarely a good idea to keep a fraction of the data for validation; statistics has shown repeatedly that cross-validation and the bootstrap are much better procedures (especially in terms of variance of estimates) than setting apart a chunk of the data. (A quick, and not quite fully correct, explanation: setting apart a chunk of the data is, if you think about it, like a "poor man's cross-validation", because you do once what 10-fold CV does 10 times; and means of 10 samples have smaller variances than one sample.) Thus, unless you have a very good reason, do not use the validation data option.
But there are situations where you might have validation data. Suppose you have a sample of 200 from a study, and another sample of, say, 100 from a similar study but with samples from a different hospital or different country. It is probably a bad idea to put together all 300 cases as if they were a homogeneous sample. You can, instead, train your model with the 200 samples and validate it on the 100. This also provides some indication of whether what you find in the first hospital/country is useful for the second hospital/country.
However, DO NOT (I repeat, DO NOT) keep playing around with the parameters until things look nice with the validation data. That would be a serious case of overfitting. Why? Well, obviously, because you would be using your validation sample to choose the "best parameters", and thus the validation sample would not be validating anything at all.
Of course, you can play, to your heart's content, with the parameters using the training data (ONLY the training data). You can do this as many times as you wish (or have patience for). Then, once you are satisfied with the parameters found with your training data, run SignS, using training and validation data, with the very same parameters you just found.
Please remember that the validation data are just that, validation data, and should be used to validate a model (not to find good settings of the parameters). Of course, we cannot prevent you from doing this, and the problem with this tool is that it is open to abuse. But if you do abuse it, you must be fully aware that this is a clear case of overfitting and cannot be called validation.
If you use any of the currently standard identifiers for your gene IDs for either human, mouse, or rat genomes, you can obtain additional information by clicking on the names in the output figures. This information is returned from IDClight based on that provided by our IDConverter tool.
The plots depend on the method used. For all methods we show the survival plots as explained above.
For FCMS we also show the dendrograms from the clustering of the positive and negative coefficient genes (after the p-value filtering), indicating with colors the groups that meet the restrictions of size and minimum correlation. Again, these are the clusters that survive the filtering on p-value, and we use color to show those that survive the filtering in terms of maximum and minimum cluster size. All, some, or none of these might be finally selected by the Cox model.
Additionally, no particular meaning should be assigned to the cluster names. We simply name all clusters of genes with positive coefficients starting with a "P", and those of genes with negative coefficients starting with an "N". The numbers themselves mean nothing in particular (they have to do with depth and layout in the plot). Please do not try to interpret or assign any meaning to them; they are just names.
Few "automatically generated" and static dendrograms are to everybody's satisfaction; how well things look depends on the dendrogram, whether or not we annotate with labels, the size of the labels, etc. NEW We now provide dendrograms with all nodes labeled or only the selected clusters labeled. In addition, you can get additional information about each node by clicking on it, if you provided information about the organism and the type of id (and those identifiers are among the ones used by IDConverter). Moreover, you can choose among three different sizes for the dendrograms. The names of the cluster/signature in the labels of the figures correspond to the names in the textual output.
For TGD we show a plot of the CVPL (cross-validated partial likelihood) versus ν for different values of τ. You would want to see six U-shaped lines, with the bottom of the U far from the maximum you set for "Maximum iterations"; note that this maximum might not be shown in the figure. You might see that different lines finish at different ν: one algorithm we have implemented (and which might be running when you use the program) allows for "early stopping" when it is obvious we are getting away from the best ν, and this is likely to happen at different ν for different τ. For TGD we also show, following the original paper of Gui and Li, the values of the coefficients for different settings of τ, to give a quick idea of the amount of regularization and of the sets of genes likely to be important.
NEW We now provide several tables (ordered using different criteria) that show the p-values and coefficients for the single-gene (or gene-by-gene) Cox models that are the first part of the FCMS method.
For random forests, the statistics shown are those corresponding to single-gene Cox models. No FDR information is shown, to emphasize that we choose, by decree, and regardless of p-value, the best K genes (K is 200 by default).
Likewise, for boosting, only the coefficients have any meaningful value.
For each sample, its out-of-bag score. This is just the matrix product of the coefficients and the values of the covariates. It is unlikely to provide any insight by itself, but it allows one to order cases, see which are very far away from others, etc. Again, we emphasize that these are out-of-bag predictions: predictions for each subject obtained from a complete model building process in which that subject was not involved. In other words, as in Tnasas and GeneSrF, we do what is sometimes called "double or full cross-validation".
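That is, the scores are nothing more than (a one-line sketch; X, the cases-by-covariates matrix, and beta, the fitted coefficients, are hypothetical names for illustration):

    scores <- as.vector(X %*% beta)  ## one linear score per case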
Each method has somewhat different output, showing the results of the models fitted, the parameters used, etc. The output should be self-explanatory. (If it is not, please let me know, and I'll do my best to fix it.)
Some explanatory notes for FCMS:
For TGD, the complete list of all the genes selected for the six different values of the threshold (τ) is in the file "genes.all.out", which you can download with the rest of the output (see below).
We provide the linear scores; these are just the matrix product of the coefficients times the values of the covariates (i.e., the selected clusters). They are unlikely to provide any insight by themselves, but they allow one to order cases, see which are very far away from others, etc.
For random forests, the values provided are not linear scores but expected survival time.
You can download a single compressed file with all the figures ---in both the png format shown in the web results and pdf format, which gives better quality for printing--- and the results (a single text file). The format is tar.gz, understood by GNU tar, standard in GNU/Linux distributions, and also understood by the common uncompressors/unzippers in other operating systems.
Examples of several runs, one with fully commented results, are available here.
This program was developed by Ramón Díaz-Uriarte. This tool uses Python for the CGI and R for the computations. The code for the FCMS approach was written from scratch by Ramón Díaz-Uriarte, with collaboration from David Casado de Lucas. For TGD, I started with the R code provided by Li and Gui, and rewrote large parts of it to increase its speed and parallelize it (using Rmpi and snow, see below), and to add additional functionality. For random forests I use the party package by Torsten Hothorn, Kurt Hornik, and Achim Zeileis. For boosting I use mboost by Torsten Hothorn and Peter Bühlmann, with contributions by Thomas Kneib and Matthias Schmid. I have also used the following R packages: CGIwithR by David Firth; snow by Luke Tierney, A. J. Rossini, Na Li, and H. Sevcikova; Rmpi by Hao Yu; rsprng by Na (Michael) Li; class by W. Venables and B. Ripley; survival (S original by Terry Therneau, ported by Thomas Lumley); GDD by Simon Urbanek; imagemap by Barry Rowlingson; R2HTML by Eric Lecoutre; papply by D. Currie; and combinat by S. Chasalow. We have also used the JavaScript aqtree code by Stuart Langridge for the expandable lists.
This application is running on a cluster of machines using Debian GNU/Linux as operating system and Apache as web server.
We want to thank the authors and contributors of these great (and open source) tools that they have made available for all to use. If you find this useful, and since R and Bioconductor are developed by a team of volunteers, we suggest you consider making a donation to the R Foundation for Statistical Computing.
Funding partially provided by Project TIC2003-09331-C02-02 of the Spanish Ministry of Education and Science. This application is running on a cluster of machines purchased with funds from MINECO project BIO2009-12458.
Uploaded data sets are saved in temporary directories on the server and are accessible through the web until they are erased after some time. Anybody can access those directories; nevertheless, the names of the directories are not trivial, so it is not easy for a third person to access your data.
In any case, you should keep in mind that communications between the client (your computer) and the server are not encrypted at all, so it is also possible for somebody else to look at your data while you are uploading or downloading them.
This software is experimental in nature and is supplied "AS IS", without
obligation by the authors or the IIB to provide accompanying services or
support. The entire risk as to the quality and performance of the software is
with you. The authors expressly disclaim any and all warranties regarding the
software, whether express or implied, including but not limited to warranties
pertaining to merchantability or fitness for a particular purpose.
Bühlmann P, Hothorn T. (2008). Boosting Algorithms: Regularization, Prediction and Model Fitting. Statistical Science, in press.
Dave SS, Wright G, Tan B, Rosenwald A, Gascoyne RD, et al. (2004). Prediction of survival in follicular lymphoma based on molecular features of tumor-infiltrating immune cells. New England Journal of Medicine 351: 2159--2169.
Diaz-Uriarte R. (2008). SignS: a parallelized, open-source, freely available, web-based tool for gene selection and molecular signatures for survival and censored data. BMC Bioinformatics 9:30.
Gui J, Li H. (2005). Threshold Gradient Descent Method for Censored Data Regression with Applications in Pharmacogenomics. Pacific Symposium on Biocomputing.
Hothorn T, Bühlmann P, Dudoit S, Molinaro A, van der Laan MJ. (2006a). Survival Ensembles. Biostatistics 7:355--373.
Hothorn T, Hornik K, Zeileis A. (2006b). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15:651--674.
Simon, R. M., E. L. Korn, L. M. McShane, M. D. Radmacher, G. W. Wright, and Y. Zhao (2003). Design and analysis of DNA microarray investigations. New York: Springer.
van Wieringen WN, Kun D, Hampel R, Boulesteix AL: (2007). Survival prediction using gene expression data: a review and comparison. In review.