puma: an R/Bioconductor package for Propagating Uncertainty in Microarray Analysis

Most analyses of Affymetrix GeneChip data are based on point estimates of expression levels and ignore the uncertainty of such estimates. By propagating uncertainty to downstream analyses we can improve results from microarray analyses. For the first time, the puma package makes a suite of uncertainty propagation methods available to a general audience. puma also offers improvements in terms of scope and speed of execution over previously available uncertainty propagation methods. Included are summarisation, differential expression detection, clustering and PCA methods, together with useful plotting and data manipulation functions. It is a part of the PUMA project.

Why is puma different from other Affymetrix analysis packages?

puma incorporates the methods mmgmos, pplr and pumaclust. The following sections explain how each of these methods differs from the alternatives.

Why is mmgmos different from other Affy probe-level analysis methods?

Affymetrix microarrays use multiple probes to measure the abundance of each transcript, so various statistical and probabilistic methods can be applied to obtain reliable gene expression estimates. The most popular probe-level analysis methods are statistical models that calculate gene expression levels accurately. However, these methods do not report how credible each expression value is, information that can be very useful in further statistical analyses. mmgmos is specifically designed to address this limitation.

Two versions of gMOS are implemented in this package: modified gMOS (mgMOS) and multi-chip modified gMOS (multi-mgMOS). The original gMOS uses two gamma distributions to model Perfect Match and Mismatch intensities, with shared scale parameters on each chip. mgMOS turns the scale parameters into latent variables to reflect the different binding affinities of probes within a probe-set. This modified distribution accurately captures the correlated changes in the binding affinity of probe-pairs within the probe-set. Both gMOS and mgMOS are single-chip models. multi-mgMOS extends gMOS and mgMOS: it shares the scale parameters of the gamma distributions across all chips, reflecting the intrinsic characteristics of the probe sequences on a given chip type, and it allows a fraction of the true signal to bind to the Mismatch probe. The likelihood function of all versions of gMOS can be written in closed form, so computation is very fast compared with other probabilistic models.
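As a rough illustration of the single-chip idea described above, the following base R sketch simulates Perfect Match and Mismatch intensities for one probe-set from two gamma distributions that share a scale parameter. The helper `simulate_probeset`, the parameter names `a`, `alpha` and `b`, and all numeric values are assumptions made for this example; they are not taken from the puma source code, which fits such parameters by maximising the likelihood rather than simulating data.

```r
# Hypothetical helper: simulate one probe-set under a gMOS-style model.
# a     : shape attributed to background / non-specific binding (assumed name)
# alpha : extra shape attributed to the specific signal (assumed name)
# b     : scale parameter shared by the PM and MM distributions
simulate_probeset <- function(n_pairs, a, alpha, b) {
  stopifnot(n_pairs > 0, a > 0, alpha >= 0, b > 0)
  pm <- rgamma(n_pairs, shape = a + alpha, scale = b)  # Perfect Match intensities
  mm <- rgamma(n_pairs, shape = a, scale = b)          # Mismatch intensities
  data.frame(pm = pm, mm = mm)
}

set.seed(1)
probes <- simulate_probeset(n_pairs = 11, a = 2, alpha = 5, b = 100)

# Under these assumptions the expected PM - MM difference is alpha * b.
expected_signal <- 5 * 100
observed_signal <- mean(probes$pm) - mean(probes$mm)
```

With only a handful of probe-pairs, `observed_signal` can differ noticeably from `expected_signal`, which is one way to see why an uncertainty estimate alongside the point estimate is useful.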

The puma package implements mgMOS in the function mgmos and multi-mgMOS in the function mmgmos. The fast C program donlp2 is used to optimise the parameters. Both mgmos and mmgmos output the mean, median and standard deviation of the expression level for each gene, together with its 5%, 25%, 75% and 95% percentiles.
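A minimal sketch of how the expression estimates and their standard errors might be extracted is given below. It assumes that puma and Biobase are installed and that the input is an AffyBatch object; the wrapper name `summarise_with_uncertainty` is hypothetical, and only `mmgmos`, `exprs` and `se.exprs` are package functions.

```r
# Hypothetical wrapper around puma::mmgmos(); not part of puma itself.
# affybatch: an AffyBatch object holding the raw probe-level data.
summarise_with_uncertainty <- function(affybatch) {
  if (!requireNamespace("puma", quietly = TRUE)) {
    stop("The Bioconductor package 'puma' is required for this sketch.")
  }
  # Fit multi-mgMOS with its default settings.
  eset <- puma::mmgmos(affybatch)
  list(
    expression = Biobase::exprs(eset),     # point estimates per gene and chip
    std_error  = Biobase::se.exprs(eset)   # matching standard errors
  )
}
```

Returning the standard errors next to the point estimates keeps the uncertainty available for the downstream steps described in the following sections.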

Why is pplr different from other methods for detecting differential gene expression?

There are two main reasons why detecting differential gene expression is difficult. One is that the noisy nature of microarray data requires a reasonable probabilistic model to characterise the variability in probe data (within-chip variance). The other is that the small number of replicates makes it difficult to obtain an accurate variance estimate for each gene across replicates (between-replicate variance). Many approaches have been devised to address the second difficulty and obtain accurate between-replicate variance, but most of them are based on single point estimates of gene expression values, and few include within-chip variance when looking for differential gene expression. pplr includes probe-level measurement error in the variance estimate of gene expression levels and uses this improved variance to detect down- and up-regulated genes by calculating the probability of positive log-ratio (PPLR). The probe-level measurement error is calculated by the function mmgmos.

pplr uses a Bayesian hierarchical model to combine probe-level measurement error and between-replicate variance, and adopts a variational method to estimate the parameters. Following a fully Bayesian approach, pplr calculates the probability of positive log-ratio (PPLR) to detect up-regulated genes rather than calculating a p-value. Down-regulated genes can be found in the same way by calculating the probability of negative log-ratio.
pplr is implemented in the pumaDE function.
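To make the PPLR statistic concrete, here is a small base R sketch. It assumes that the posterior of each condition's mean log expression is approximated by an independent Gaussian, so that the log-ratio is also Gaussian; the helper name `pplr_gaussian` and the numbers are made up for illustration and are not the internals of pumaDE.

```r
# Hypothetical helper: probability that condition A is expressed more highly
# than condition B, assuming independent Gaussian posteriors for both means.
pplr_gaussian <- function(mean_a, sd_a, mean_b, sd_b) {
  stopifnot(all(sd_a > 0), all(sd_b > 0))
  # The difference of two independent Gaussians is Gaussian with
  # mean (mean_a - mean_b) and variance (sd_a^2 + sd_b^2).
  pnorm((mean_a - mean_b) / sqrt(sd_a^2 + sd_b^2))
}

# Three illustrative genes (made-up values on the log2 scale).
treated_mean <- c(8.0, 6.5, 7.2)
treated_sd   <- c(0.2, 0.6, 0.3)
control_mean <- c(7.0, 6.5, 7.9)
control_sd   <- c(0.2, 0.6, 0.3)

pplr_up   <- pplr_gaussian(treated_mean, treated_sd, control_mean, control_sd)
pplr_down <- 1 - pplr_up   # probability of negative log-ratio
```

Values of `pplr_up` near 1 suggest up-regulation, values near 0 suggest down-regulation, and values near 0.5 indicate no evidence either way.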

Why is pumaclust different from other clustering methods?

Clustering is an important analysis of microarray gene expression data because it groups genes with similar expression patterns and enables the exploration of unknown gene functions. Because microarray experiments involve many complicated steps, the resulting gene expression data are very noisy. Many heuristic and model-based clustering approaches have been developed to cluster these noisy data. However, few of them take account of probe-level measurement error, which provides rich information about technical variability. We augment a standard model-based clustering method to incorporate probe-level measurement error. Using probe-level measurements from a recently developed Affymetrix probe-level model, multi-mgMOS, we include the probe-level measurement error directly in the standard Gaussian mixture model. Including probe-level measurement error improves the performance of model-based clustering of gene expression data and gives more biologically reasonable clustering results. The probe-level measurement error is calculated by the function mmgmos.
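One simple way to picture a Gaussian mixture that includes measurement error is to add each data point's squared standard error to the cluster variance when evaluating its density. The base R sketch below does this for one-dimensional data with fixed cluster parameters; the helper `responsibilities` and all values are assumptions for illustration, and the actual pumaclust model and its fitting procedure may differ.

```r
# Hypothetical helper: cluster membership probabilities for 1-D data under a
# Gaussian mixture where each point's variance is inflated by its own
# squared measurement error.
responsibilities <- function(x, se, means, sds, weights) {
  stopifnot(length(x) == length(se),
            length(means) == length(sds),
            length(means) == length(weights))
  dens <- sapply(seq_along(means), function(k) {
    weights[k] * dnorm(x, mean = means[k], sd = sqrt(sds[k]^2 + se^2))
  })
  dens <- matrix(dens, nrow = length(x))  # keep a matrix even for a single point
  dens / rowSums(dens)                    # normalise each row to sum to 1
}

x  <- c(1.0, 1.0)   # two genes with the same expression estimate
se <- c(0.1, 2.0)   # the first is measured precisely, the second is noisy
r  <- responsibilities(x, se, means = c(0, 3), sds = c(1, 1),
                       weights = c(0.5, 0.5))
```

In this sketch the noisy gene is assigned to its nearest cluster with less confidence than the precisely measured one, even though both have the same point estimate.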

Download

puma is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY. We do appreciate your citation of our publications or website.

It is recommended that puma is downloaded from Bioconductor using biocLite. The latest version of puma can be found here. A copy of the user guide can be downloaded here. It is also included in the distribution.

The following table gives access to historic versions of puma. In general, these are not recommended as they are not supported. Please use the latest version from Bioconductor instead.

Version	Linux add-on package	Windows add-on package	Mac OS X add-on package	R version requirement	Description
1.2.1	puma_1.2.1.tar.gz	puma_1.2.1.zip	puma_1.2.1.tgz	2.5.0	New DEResults class output from pumaDE. Changed default normalisation in mmgmos and mgmos to "median" (from "none"). New functions calcAUC, numFP, removeUninformativeFactors. createContrastMatrix now creates "1-vs-others" contrasts. Various other minor changes and bug fixes (see svn for full details).
1.2.0	puma_1.2.0.tar.gz	puma_1.2.0.zip	puma_1.2.0.tgz	2.5.0	Original Bioconductor release of puma.

FAQ and bug report

1. What are the requirements for installing puma?

In order to install puma, you need to have R 2.5.0 and Bioconductor 2.0 installed. For the installation of R and Bioconductor, please refer to the R project and Bioconductor.org respectively.

2. How do I install puma?

Start R. Type
>source("http://bioconductor.org/biocLite.R")
>biocLite("puma")
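On current R and Bioconductor releases, where the BiocManager package has replaced the biocLite script, the equivalent installation would look roughly like the sketch below. It assumes a recent R installation with network access to CRAN and Bioconductor.

```r
# Assumes a current R installation; BiocManager is installed from CRAN.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
# Install puma and its dependencies from Bioconductor.
BiocManager::install("puma")
```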

To install a historic version of puma (not recommended), download the add-on package from the links above and save it to your local disk. For Linux and Mac OS X users, in the directory where it is saved type

>R CMD INSTALL puma_x.x.x.tar.gz

to install it. For Windows users, use the 'Install package(s) from local zip files ...' item in the 'Packages' menu to install.

3. What if I spot a fault in puma?

We are keen for feedback on puma. If you experience a problem or bug, please report it by email to richard.pearson@postgrad.manchester.ac.uk, or send a message to the main Bioconductor mailing list. Any suggestions and comments are welcome.
