Bioinformatics, Vol. 25, No. 22. (15 November 2009), pp. 3026-3027. Summary: Saint is a web application that provides a lightweight annotation integration environment for quantitative biological models. The system enables modellers to rapidly mark up models with biological information derived from a range of data sources. Availability and Implementation: Saint is freely available for use on the web at http://www.cisban.ac.uk/saint. The web application is implemented in Google Web Toolkit and Tomcat, with all major browsers supported. The Java source code is freely available for download at http://saint-annotate.sourceforge.net. The Saint web server requires an installation of libSBML and has been tested on Linux (32-bit Ubuntu 8.10 and 9.04). Contact: helpdesk@cisban.ac.uk; a.l.lister@ncl.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online. doi:10.1093/bioinformatics/btp523. Allyson Lister, Matthew Pocock, Morgan Taschuk, Anil Wipat
- Daniel Swan
BMC Bioinformatics, Vol. 10, No. 1. (2009), 354. BACKGROUND: The microarray data analysis realm is ever growing through the development of various tools, open source and commercial. However, there is an absence of predefined rational algorithmic analysis workflows or batch standardized processing to incorporate all steps, from raw data import up to the derivation of significantly differentially expressed gene lists. This absence obfuscates the analytical procedure and obstructs the massive comparative processing of genomic microarray datasets. Moreover, the solutions provided depend heavily on the programming skills of the user, whereas GUI-embedded solutions do not provide direct support for various raw image analysis formats or a versatile and flexible combination of signal processing methods. RESULTS: We describe here Gene ARMADA (Automated Robust MicroArray Data Analysis), a MATLAB implemented platform with a Graphical User Interface. This suite...
- Daniel Swan
BMC Bioinformatics, Vol. 10, No. 1. (2009), 330. BACKGROUND: Microarray experiments are increasing in size and samples are collected asynchronously over long periods of time. Available data are re-analysed as more samples are hybridized. Systematic use of collected data requires tracking of biomaterials, array information, raw data, and assembly of annotations. To meet the information tracking and data analysis challenges in microarray experiments we reimplemented and improved BASE version 1.2. RESULTS: The new BASE presented in this report is a comprehensive annotatable local microarray data repository and analysis application providing researchers with an efficient information management and analysis tool. The information management system tracks all material from biosource, via sample and through extraction and labelling to raw data and analysis. All items in BASE can be annotated and the annotations can be used as experimental factors in downstream analysis. BASE stores all microarray experiment...
- Daniel Swan
ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization - http://www.citeulike.org/user...
BMC Bioinformatics, Vol. 10, No. 1. (2009), 358. BACKGROUND: Statistical analysis of DNA microarray data provides a valuable diagnostic tool for the investigation of genetic components of diseases. To take advantage of the multitude of available data sets and analysis methods, it is desirable to combine both different algorithms and data from different studies. Applying ensemble learning, consensus clustering and cross-study normalization methods for this purpose in an almost fully automated process and linking different analysis modules together under a single interface would simplify many microarray analysis tasks. RESULTS: We present ArrayMining.net, a web-application for microarray analysis that provides easy access to a wide choice of feature selection, clustering, prediction, gene set analysis and cross-study normalization methods. In contrast to other microarray-related web-tools, multiple algorithms and data sets for an analysis task can be combined using ensemble feature...
- Daniel Swan
Bioinformatics, Vol. 25, No. 21. (1 November 2009), pp. 2872-2877. Motivation: Whole transcriptome shotgun sequencing data from non-normalized samples offer unique opportunities to study the metabolic states of organisms. One can deduce gene expression levels using sequence coverage as a surrogate, identify coding changes or discover novel isoforms or transcripts. Especially for discovery of novel events, de novo assembly of transcriptomes is desirable. Results: Transcriptome from tumor tissue of a patient with follicular lymphoma was sequenced with 36 base pair (bp) single- and paired-end reads on the Illumina Genome Analyzer II platform. We assembled ~194 million reads using ABySS into 66 921 contigs 100 bp or longer, with a maximum contig length of 10 951 bp, representing over 30 million base pairs of unique transcriptome sequence, or roughly 1% of the genome. Availability and Implementation: Source code and binaries of ABySS are freely available for download at...
- Daniel Swan
Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis - http://www.citeulike.org/user...
Bioinformatics, Vol. 25, No. 22. (15 November 2009), pp. 2906-2912. Motivation: The molecular complexity of a tumor manifests itself at the genomic, epigenomic, transcriptomic and proteomic levels. Genomic profiling at these multiple levels should allow an integrated characterization of tumor etiology. However, there is a shortage of effective statistical and bioinformatic tools for truly integrative data analysis. The standard approach to integrative clustering is separate clustering followed by manual integration. A more statistically powerful approach would incorporate all data types simultaneously and generate a single integrated cluster assignment. Methods: We developed a joint latent variable model for integrative clustering. We call the resulting methodology iCluster. iCluster incorporates flexible modeling of the associations between different data types and the variance-covariance structure within data types in a single framework, while simultaneously reducing the...
- Daniel Swan
Bioinformatics, Vol. 25, No. 21. (1 November 2009), pp. 2855-2856. Motivation: With the availability of many 'omics' data, such as transcriptomics, proteomics or metabolomics, the integrative or joint analysis of multiple datasets from different technology platforms is becoming crucial to unravel the relationships between different biological functional levels. However, the development of such an analysis is a major computational and technical challenge as most approaches suffer from high data dimensionality. New methodologies need to be developed and validated. Results: integrOmics efficiently performs integrative analyses of two types of 'omics' variables that are measured on the same samples. It includes a regularized version of canonical correlation analysis to highlight correlations between two datasets, and a sparse version of partial least squares (PLS) regression that includes simultaneous variable selection in both datasets. The usefulness of both approaches has been...
- Daniel Swan
Wolfram|Alpha releases a webservice API
- Daniel Swan
It looks a little overpriced at the low end to me. I would have thought the way to do it would be to give a limited number of requests away for free to stimulate development, then charge more for the high-volume, because that probably has a revenue model behind it.
- Mr. Gunn
Nature, Vol. 461, No. 7266. (14 October 2009), p. 881. Google Wave is the kind of open-source online collaboration tool that should drive scientists to wire their research and publications into an interactive data web, says Cameron Neylon. Cameron Neylon
- Daniel Swan
Nature, Vol. advance online publication (14 October 2009). DNA cytosine methylation is a central epigenetic modification that has essential roles in cellular processes including genome regulation, development and disease. Here we present the first genome-wide, single-base-resolution maps of methylated cytosines in a mammalian genome, from both human embryonic stem cells and fetal fibroblasts, along with comparative analysis of messenger RNA and small RNA components of the transcriptome, several histone modifications, and sites of DNA–protein interaction for several key regulatory factors. Widespread differences were identified in the composition and patterning of cytosine methylation between the two genomes. Nearly one-quarter of all methylation identified in embryonic stem cells was in a non-CG context, suggesting that embryonic stem cells may use different methylation mechanisms to affect gene regulation. Methylation in non-CG contexts showed enrichment in gene bodies and depletion...
- Daniel Swan
Molecular Systems Biology, Vol. 5 (13 October 2009). The advent of cost-effective genotyping and sequencing methods has recently made it possible to ask questions that address the genetic basis of phenotypic diversity and how natural variants interact with the environment. We developed Camelot (CAusal Modelling with Expression Linkage for cOmplex Traits), a statistical method that integrates genotype, gene expression and phenotype data to automatically build models that both predict complex quantitative phenotypes and identify genes that actively influence these traits. Camelot integrates genotype and gene expression data, both generated under a reference condition, to predict the response to entirely different conditions. We systematically applied our algorithm to data generated from a collection of yeast segregants, using genotype and gene expression data generated under drug-free conditions to predict the response to 94 drugs and experimentally confirmed 14 novel gene–drug...
- Daniel Swan
BMC Bioinformatics, Vol. 10, No. 1. (22 September 2009), 305. BACKGROUND: Chromatin immunoprecipitation on tiling arrays (ChIP-chip) has been employed to examine features such as protein binding and histone modifications on a genome-wide scale in a variety of cell types. Array data from the latter studies typically have a high proportion of enriched probes whose signals vary considerably (due to heterogeneity in the cell population), and this makes their normalization and downstream analysis difficult. RESULTS: Here we present strategies for analyzing such experiments, focusing our discussion on the analysis of Bromodeoxyuridine (BrdU) immunoprecipitation on tiling array (BrdU-IP-chip) datasets. BrdU-IP-chip experiments map large, recently replicated genomic regions and have similar characteristics to histone modification/location data. To prepare such data for downstream analysis we employ a dynamic programming algorithm that identifies a set of putative unenriched probes, which we use...
- Daniel Swan
BMC Genomics, Vol. 10, No. 1. (2009), 439. BACKGROUND: With the increasing number of expression profiling technologies, researchers today are confronted with choosing the technology that has sufficient power with minimal sample size, in order to reduce cost and time. These depend on data variability, partly determined by sample type, preparation and processing. Objective measures that help experimental design, given one's own pilot data, are thus fundamental. RESULTS: Relative power and sample size analyses were performed on two distinct data sets. The first set consisted of Affymetrix array data derived from a nutrigenomics experiment in which weak, intermediate and strong PPARalpha agonists were administered to wild-type and PPARalpha-null mice. Our analysis confirms the hierarchy of PPARalpha-activating compounds previously reported and the general idea that larger effect sizes positively contribute to the average power of the experiment. A simulation experiment was performed that mimicked...
- Daniel Swan
Nature, Vol. advance online publication (07 October 2009) Structural variations of DNA greater than 1 kilobase in size account for most bases that vary among human genomes, but are still relatively under-ascertained. Here we use tiling oligonucleotide microarrays, comprising 42 million probes, to generate a comprehensive map of 11,700 copy number variations (CNVs) greater than 443 base pairs, of which most (8,599) have been validated independently. For 4,978 of these CNVs, we generated reference genotypes from 450 individuals of European, African or East Asian ancestry. The predominant mutational mechanisms differ among CNV size classes. Retrotransposition has duplicated and inserted some coding and non-coding DNA segments randomly around the genome. Furthermore, by correlation with known trait-associated single nucleotide polymorphisms (SNPs), we identified 30 loci with CNVs that are candidates for influencing disease susceptibility. Despite this, having assessed the completeness of...
- Daniel Swan
Bioinformatics, Vol. 25, No. 20. (15 October 2009), pp. 2692-2699. Motivation: With the proliferation of microarray experiments and their availability in the public domain, the use of meta-analysis methods to combine results from different studies is increasing. In microarray experiments, where the sample size is often limited, meta-analysis offers the possibility to considerably increase the statistical power and give more accurate results. Results: A moderated effect size combination method was proposed and compared with other meta-analysis approaches. All methods were applied to real publicly available datasets on prostate cancer, and were compared in an extensive simulation study for various amounts of inter-study variability. Although the proposed moderated effect size combination improved on existing effect size approaches, the P-value combination was found to provide a better sensitivity and a better gene ranking than the other meta-analysis methods, while effect size methods...
- Daniel Swan
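The meta-analysis abstract above refers to a "P-value combination" without saying, in the excerpt shown, which rule was used. As a rough illustration of what combining P-values across studies can look like, here is a minimal sketch that swaps in Fisher's method, a familiar textbook rule and not necessarily the one in the paper. The function name `fisher_combine` and the example P-values are hypothetical, made up for this sketch.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (an assumed stand-in).

    Under the null, X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom, where k is the number of p-values.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2

    # For even degrees of freedom 2k the chi-square survival function has a
    # closed form: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one gene in three independent studies.
study_p = [0.04, 0.10, 0.03]
print(fisher_combine(study_p))
```

The closed-form survival function keeps the sketch free of external dependencies; a statistics library would normally be used instead, and the rule assumes the studies are independent.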
BMC Bioinformatics, Vol. 10, No. 1. (2009), 292. BACKGROUND: The disparate results from the methods commonly used to determine differential expression in Affymetrix microarray experiments may well result from the wide variety of probe set and probe level models employed. Here we take the approach of making the fewest assumptions about the structure of the microarray data. Specifically, we only require that, under the null hypothesis that a gene is not differentially expressed for specified conditions, for any probe position in the gene's probe set: a) the probe amplitudes are independent and identically distributed over the conditions, and b) the distributions of the replicated probe amplitudes are amenable to classical analysis of variance (ANOVA). Log-amplitudes that have been standardized within-chip meet these conditions well enough for our approach, which is to perform ANOVA across conditions for each probe position, and then take the median of the resulting (1 - p) values as a...
- Daniel Swan
TargetMiner: microRNA target prediction with systematic identification of tissue-specific negative examples - http://www.citeulike.org/user...
Bioinformatics, Vol. 25, No. 20. (15 October 2009), pp. 2625-2631. Motivation: Prediction of microRNA (miRNA) target mRNAs using machine learning approaches is an important area of research. However, most of the methods suffer from either high false positive or false negative rates. One reason for this is the marked deficiency of negative examples or miRNA non-target pairs. Systematic identification of non-target mRNAs is still not addressed properly, and therefore, current machine learning approaches are compelled to rely on artificially generated negative examples for training. Results: In this article, we have identified ~300 tissue-specific negative examples using a novel approach that involves expression profiling of both miRNAs and mRNAs, miRNA-mRNA structural interactions and seed-site conservation. The newly generated negative examples are validated with a pSILAC dataset, which confirms that the identified non-targets are indeed non-targets. These high-throughput...
- Daniel Swan
BMC Bioinformatics, Vol. 10, No. 1. (2009), 299. BACKGROUND: High-throughput sequencing technology has become popular and widely used to study protein and DNA interactions. Chromatin immunoprecipitation, followed by sequencing of the resulting samples, produces large amounts of data that can be used to map transcription factor binding sites and histone modifications. METHOD: Our proposed statistical algorithm, BayesPeak, uses a fully Bayesian hidden Markov model to detect enriched locations in the genome. The structure accommodates the natural features of the Solexa/Illumina sequencing data and allows for overdispersion in the abundance of reads in different regions. Moreover, a control sample can be incorporated in the analysis to account for experimental and sequence biases. Markov chain Monte Carlo algorithms are applied to estimate the posterior distributions of the model parameters, and posterior probabilities are used to detect the sites of interest. CONCLUSIONS: We have presented a...
- Daniel Swan
Bioinformatics, Vol. 25, No. 20. (15 October 2009), pp. 2685-2691. Motivation: Microarray normalization is a fundamental step in removing systematic bias and noise variability caused by technical and experimental artefacts. Several approaches, suitable for large-scale genome arrays, have been proposed and shown to be effective in the reduction of systematic errors. Most of these methodologies are based on specific assumptions that are reasonable for whole-genome arrays, but possibly unsuitable for small microRNA (miRNA) platforms. In this work, we propose a novel normalization (loessM), and we investigate, through simulated and real datasets, the influence that normalizations for two-colour miRNA arrays have on the identification of differentially expressed genes. Results: We show that normalizations usually applied to large-scale arrays, in several cases, modify the actual structure of miRNA data, leading to large portions of false positives and false negatives. Nevertheless, loessM...
- Daniel Swan
Bioinformatics, Vol. 25, No. 20. (15 October 2009), pp. 2737-2738. Summary: Microorganisms are ubiquitous in nature and constitute intrinsic parts of almost every ecosystem. A culture-independent and powerful way to study microbial communities is metagenomics. In such studies, functional analysis is performed on fragmented genetic material from multiple species in the community. The recent advances in high-throughput sequencing have greatly increased the amount of data in metagenomic projects. At present, there is an urgent need for efficient statistical tools to analyse these data. We have created ShotgunFunctionalizeR, an R-package for functional comparison of metagenomes. The package contains tools for importing, annotating and visualizing metagenomic data produced by shotgun high-throughput sequencing. ShotgunFunctionalizeR contains several statistical procedures for assessing functional differences between samples, both for individual genes and for entire pathways. In addition to...
- Daniel Swan
Social tagging in the life sciences: characterizing a new metadata resource for bioinformatics - http://www.citeulike.org/user...
BMC Bioinformatics, Vol. 10, No. 1. (2009), 313. BACKGROUND: Academic social tagging systems, such as Connotea and CiteULike, provide researchers with a means to organize personal collections of online references with keywords (tags) and to share these collections with others. One of the side-effects of the operation of these systems is the generation of large, publicly accessible metadata repositories describing the resources in the collections. In light of the well-known expansion of information in the life sciences and the need for metadata to enhance its value, these repositories present a potentially valuable new resource. Here we characterize the current and prospective contents of two scientifically relevant metadata repositories created through social tagging. This investigation helps to establish how such socially constructed metadata might be used as it stands currently and to suggest ways that new social tagging systems might be designed that would yield better aggregate...
- Daniel Swan
Genomic analysis reveals a tight link between transcription factor dynamics and regulatory network architecture - http://www.citeulike.org/user...
Molecular Systems Biology, Vol. 5 (18 August 2009) Although several studies have provided important insights into the general principles of biological networks, the link between network organization and the genome-scale dynamics of the underlying entities (genes, mRNAs, and proteins) and its role in systems behavior remain unclear. Here we show that transcription factor (TF) dynamics and regulatory network organization are tightly linked. By classifying TFs in the yeast regulatory network into three hierarchical layers (top, core, and bottom) and integrating diverse genome-scale datasets, we find that the TFs have static and dynamic properties that are similar within a layer and different across layers. At the protein level, the top-layer TFs are relatively abundant, long-lived, and noisy compared with the core- and bottom-layer TFs. Although variability in expression of top-layer TFs might confer a selective advantage, as this permits at least some members in a clonal cell population...
- Daniel Swan