Bioinformatics (Oxford, England) (2 February 2012) SUMMARY: pymzML is an extension to Python that offers a) easy access to mass spectrometry (MS) data which allows the rapid development of tools, b) a very fast parser for mzML data, the standard data format in mass spectrometry and c) a set of functions to compare or handle spectra. AVAILABILITY AND IMPLEMENTATION: pymzML requires Python 2.6.5+ and is fully compatible with Python 3. The module is freely available on http://pymzml.github.com or PyPI, is published under the LGPL license and requires no additional modules to be installed. CONTACT: christian@fufezan.net. Till Bald, Johannes Barth, Anna Niehues, Michael Specht, Michael Hippler, Christian Fufezan
- Simon Cockell
Nucleic acids research (28 January 2012) A flexible statistical framework is developed for the analysis of read counts from RNA-Seq gene expression studies. It provides the ability to analyse complex experiments involving multiple treatment conditions and blocking variables while still taking full account of biological variation. Biological variation between RNA samples is estimated separately from the technical variation associated with sequencing technologies. Novel empirical Bayes methods allow each gene to have its own specific variability, even when there are relatively few biological replicates from which to estimate such variability. The pipeline is implemented in the edgeR package of the Bioconductor project. A case study analysis of carcinoma data demonstrates the ability of generalized linear model methods (GLMs) to detect differential expression in a paired design, and even to detect tumour-specific expression changes. The case study demonstrates the need to allow for...
- Simon Cockell
Nucleic Acids Research, Vol. 38, No. 3. (01 January 2010), pp. e17-e17. Illumina BeadArrays are among the most popular and reliable platforms for gene expression profiling. However, little external scrutiny has been given to the design, selection and annotation of BeadArray probes, which is a fundamental issue in data quality and interpretation. Here we present a pipeline for the complete genomic and transcriptomic re-annotation of Illumina probe sequences, also applicable to other platforms, with its output available through a Web interface and incorporated into Bioconductor packages. We have identified several problems with the design of individual probes and we show the benefits of probe re-annotation on the analysis of BeadArray gene expression data sets. We discuss the importance of aspects such as probe coverage of individual transcripts, alternative messenger RNA splicing, single-nucleotide polymorphisms, repeat sequences, RNA degradation biases and probes targeting genomic...
- Simon Cockell
Molecular ecology, Vol. 19 Suppl 1, No. s1. (March 2010), pp. 212-227. Differences in gene expression are thought to be an important source of phenotypic diversity, so dissecting the genetic components of natural variation in gene expression is important for understanding the evolutionary mechanisms that lead to adaptation. Gene expression is a complex trait that, in diploid organisms, results from transcription of both maternal and paternal alleles. Directly measuring allelic expression rather than total gene expression offers greater insight into regulatory variation. The recent emergence of high-throughput sequencing offers an unprecedented opportunity to study allelic transcription at a genomic scale for virtually any species. By sequencing transcript pools derived from heterozygous individuals, estimates of allelic expression can be directly obtained. The statistical power of this approach is influenced by the number of transcripts sequenced and the ability to unambiguously...
- Simon Cockell
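The Molecular Ecology abstract above notes that the power to detect allelic imbalance depends on how many transcripts are sequenced. A minimal sketch of that point, assuming reads at a heterozygous site are tested against a 50:50 expectation with an exact binomial test; the function name and read counts are hypothetical, and this is not the method used in the paper:

```python
from math import comb


def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial P value.

    Sums the probability of every outcome that is no more likely than
    observing k successes in n trials under success probability p.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    # Small relative tolerance so that symmetric ties are counted.
    threshold = pmf[k] * (1 + 1e-9)
    return min(1.0, sum(q for q in pmf if q <= threshold))


# Same 75:25 allelic ratio at two hypothetical sequencing depths.
shallow = binom_two_sided(15, 20)    # 15 of 20 reads from one allele
deep = binom_two_sided(150, 200)     # 150 of 200 reads from one allele
print(f"20 reads: P = {shallow:.4f}; 200 reads: P = {deep:.2e}")
```

With the same ratio, the deeper site gives a far smaller P value, which is the sense in which the number of sequenced transcripts drives statistical power.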
Science, Vol. 318, No. 5853. (16 November 2007), pp. 1136-1140. Monoallelic expression with random choice between the maternal and paternal alleles defines an unusual class of genes comprising X-inactivated genes and a few autosomal gene families. Using a genome-wide approach, we assessed allele-specific transcription of about 4000 human genes in clonal cell lines and found that more than 300 were subject to random monoallelic expression. For a majority of monoallelic genes, we also observed some clonal lines displaying biallelic expression. Clonal cell lines reflect an independent choice to express the maternal, the paternal, or both alleles for each of these genes. This can lead to differences in expressed protein sequence and to differences in levels of gene expression. Unexpectedly widespread monoallelic expression suggests a mechanism that generates diversity in individual cells and their clonal descendants. Alexander Gimelbrant, John Hutchinson, Benjamin Thompson, Andrew Chess
- Simon Cockell
SeqGene: a comprehensive software solution for mining exome- and transcriptome- sequencing data - http://www.citeulike.org/user...
BMC Bioinformatics, Vol. 12, No. 1. (29 June 2011), 267. BACKGROUND: The popularity of massively parallel exome and transcriptome sequencing projects demands new data mining tools with a comprehensive set of features to support a wide range of analysis tasks. RESULTS: SeqGene, a new data mining tool, supports mutation detection and annotation, dbSNP and 1000 Genome data integration, RNA-Seq expression quantification, mutation and coverage visualization, allele specific expression (ASE), differentially expressed genes (DEGs) identification, copy number variation (CNV) analysis, and gene expression quantitative trait loci (eQTLs) detection. We also developed novel methods for testing the association between SNP and expression and identifying genotype-controlled DEGs. We showed that the results generated from SeqGene compare favourably to other existing methods in our case studies. CONCLUSION: SeqGene is designed as a general-purpose software package. It supports both paired-end reads and...
- Simon Cockell
Mantel statistics to correlate gene expression levels from microarrays with clinical covariates - http://www.citeulike.org/user...
Genet. Epidemiol., Vol. 23, No. 1. (June 2002), pp. 87-96. Mantel statistics provide an additional step to standard approaches in the analysis of gene expression and covariate data, allow the calculation of standard statistics such as correlation, partial correlation, and regression coefficients, and, with permutation tests, provide P values for these statistics to relate the sample covariates to the expression levels. In this article we describe the Mantel statistics and illustrate their use and interpretation with data from a study of seven human oligodendrogliomas (brain tumors) where expression levels of 1,013 genes and five covariates were previously analyzed using standard approaches. In the previous analysis of these data, qualitative relationships were found between gene expressions and two of the clinical covariates. We show in this article how the Mantel statistics are able to formally quantify and provide P values to determine statistical significance of these...
- Simon Cockell
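The Mantel abstract above describes correlating two sets of pairwise distances and using permutation tests to obtain P values. A minimal sketch of a simple Mantel test, assuming two symmetric distance matrices over the same samples (one from expression profiles, one from a covariate); the function names and the one-sided test are illustrative choices, not the article's implementation:

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def upper_triangle(d, order):
    """Off-diagonal entries of d with rows and columns taken in `order`."""
    n = len(order)
    return [d[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(d_expr, d_cov, n_perm=999, seed=0):
    """Return (Mantel correlation, one-sided permutation P value)."""
    rng = random.Random(seed)
    identity = list(range(len(d_expr)))
    x = upper_triangle(d_expr, identity)
    r_obs = pearson(x, upper_triangle(d_cov, identity))
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # relabel samples in one matrix only
        if pearson(x, upper_triangle(d_cov, perm)) >= r_obs:
            hits += 1
    # Count the observed arrangement itself so P is never zero.
    return r_obs, (hits + 1) / (n_perm + 1)
```

Permuting sample labels, rather than individual distances, keeps the dependence among the entries of each matrix intact, which is why the test is done on whole rows and columns.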
Nat Biotech, Vol. 30, No. 1. (18 January 2012), pp. 78-82. Whole-genome sequencing is becoming commonplace, but the accuracy and completeness of variant calling by the most widely used platforms from Illumina and Complete Genomics have not been reported. Here we sequenced the genome of an individual with both technologies to a high average coverage of ∼76×, and compared their performance with respect to sequence coverage and calling of single-nucleotide variants (SNVs), insertions and deletions (indels). Although 88.1% of the ∼3.7 million unique SNVs were concordant between platforms, there were tens of thousands of platform-specific calls located in genes and other genomic regions. In contrast, 26.5% of indels were concordant between platforms. Target enrichment validated 92.7% of the concordant SNVs, whereas validation by genotyping array revealed a sensitivity of 99.3%. The validation experiments also suggested that >60% of the platform-specific variants were indeed present in the...
- Simon Cockell
Nature, Vol. 478, No. 7370. (27 October 2011), pp. 483-489. Brain development and function depend on the precise regulation of gene expression. However, our understanding of the complexity and dynamics of the transcriptome of the human brain is incomplete. Here we report the generation and analysis of exon-level transcriptome and associated genotyping data, representing males and females of different ethnicities, from multiple brain regions and neocortical areas of developing and adult post-mortem human brains. We found that 86 per cent of the genes analysed were expressed, and that 90 per cent of these were differentially regulated at the whole-transcript or exon level across brain regions and/or time. The majority of these spatio-temporal differences were detected before birth, with subsequent increases in the similarity among regional transcriptomes. The transcriptome is organized into distinct co-expression networks, and shows sex-biased gene expression and exon usage. We also...
- Simon Cockell
Genome Research, Vol. 21, No. 9. (01 September 2011), pp. 1506-1511. Shahar Alon, Francois Vigneault, Seda Eminaga, Danos Christodoulou, Jonathan Seidman, George Church, Eli Eisenberg
- Simon Cockell
Artemis: An integrated platform for visualisation and analysis of high-throughput sequence-based experimental data. - http://www.citeulike.org/user...
Bioinformatics (Oxford, England) (22 December 2011) MOTIVATION: High-Throughput Sequencing (HTS) technologies have made low-cost sequencing of large numbers of samples commonplace. An explosion in the type, not just number, of sequencing experiments has also taken place including genome re-sequencing, population-scale variation detection, whole transcriptome sequencing and genome-wide analysis of protein-bound nucleic acids. RESULTS: We present Artemis as a tool for integrated visualisation and computational analysis of different types of HTS datasets in the context of a reference genome and its corresponding annotation. AVAILABILITY: Artemis is freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute web sites: http://www.sanger.ac.uk/resourc.... CONTACT: Further information is available by joining the Artemis mailing list or by emailing artemis@sanger.ac.uk. We welcome any comments, suggestions and...
- Simon Cockell
Briefings in Functional Genomics (30 December 2011) RNA-sequencing (RNA-seq) technologies have not only pushed the boundaries of science, but also pushed the computational and analytic capacities of many laboratories. With respect to mapping and quantifying transcriptomes, RNA-seq has certainly established itself as the approach of choice. However, as the complexities of experiments continue to grow, there is still no standard practice that allows for design, processing, normalization, efficient dimension reduction and/or statistical analysis. With this in mind, we provide a brief review of some of the key challenges that are general to all RNA-seq experiments, namely experimental design, statistical analysis and dimensionality reduction. Paul Auer, Sanvesh Srivastava, RW Doerge
- Simon Cockell
Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. - http://www.citeulike.org/user...
Briefings in functional genomics (19 December 2011) Since the completion of the cucumber and panda genome projects using Illumina sequencing in 2009, the global scientific community has had to pay much more attention to this new cost-effective approach to generate the draft sequence of large genomes. To allow new users to more easily understand the assembly algorithms and the optimum software packages for their projects, we make a detailed comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph, from how they match the Lander-Waterman model, to the required sequencing depth and reads length. We also discuss the computational efficiency of each class of algorithm, the influence of repeats and heterozygosity and points of note in the subsequent scaffold linkage and gap closure steps. We hope this review can help further promote the application of second-generation de novo sequencing, as well as aid the future development of assembly...
- Simon Cockell
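The review above contrasts overlap-layout-consensus assemblers with de Bruijn graph assemblers. A minimal sketch of the de Bruijn idea, assuming error-free reads and a repeat-free toy sequence; the function names and reads are hypothetical, and real assemblers add error correction, repeat resolution and scaffolding on top of this:

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # one edge per k-mer
    return graph


def walk(graph, start):
    """Extend a contig from `start` while the next node is unambiguous."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle, i.e. a repeat
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # overlapping toy reads
graph = de_bruijn(reads, k=4)
print(walk(graph, "ATG"))  # ATGGCGTGCA
```

Reads are never compared pairwise here: each read is broken into k-mers and overlaps emerge from shared (k-1)-mers, which is the main computational difference from the overlap-layout-consensus approach.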
Nucleic acids research (22 December 2011) The next-generation sequencing (NGS)-based transcriptomic profiling method often called RNA-seq has been widely used to study global gene expression, alternative exon usage, new exon discovery, novel transcriptional isoforms and genomic sequence variations. However, this technique also poses many biological and informatics challenges to extracting meaningful biological information. RNA-seq data analysis is built on the foundation of high quality initial genome localization and alignment information for RNA-seq sequences. Toward this goal, we have developed RNASEQR to accurately and effectively map millions of RNA-seq sequences. We have systematically compared RNASEQR with four of the most widely used tools using a simulated data set created from the Consensus CDS project and two experimental RNA-seq data sets generated from a human glioblastoma patient. Our results showed that RNASEQR yields more accurate estimates for gene...
- Simon Cockell
Briefings in functional genomics (19 December 2011) Genome sequencing has been revolutionized by next-generation technologies, which can rapidly produce vast quantities of data at relatively low cost. With data production now no longer being limited, there is a huge challenge to analyse the data flood and interpret biological meaning. Bioinformatics scientists have risen to the challenge and a large number of software tools and databases have been produced and these continue to evolve with this rapidly advancing field. Here, we outline some of the tools and databases commonly used for the analysis of next-generation sequence data with comment on their utility. Hong Lee, Kaitao Lai, Michal Tadeusz Lorenc, Michael Imelfort, Chris Duran, David Edwards
- Simon Cockell
Bioinformatics (Oxford, England) (22 December 2011) SUMMARY: We provide a Bioconductor package with quality assessment, processing and visualization tools for high-throughput sequencing data, with emphasis on ChIP-seq and RNA-seq studies. It includes detection of outliers and biases, inefficient immuno-precipitation and overamplification artifacts, de novo identification of read-rich genomic regions and visualization of the location and coverage of genomic region lists. AVAILABILITY: www.bioconductor.org CONTACT: david.rossell@irbbarcelona.org SUPPLEMENTARY INFORMATION: Supplementary data available at Bioinformatics online. Evarist Planet, Camille Stephan-Otto Attolini, Oscar Reina, Oscar Flores, David Rossell
- Simon Cockell
Bioinformatics (Oxford, England) (23 December 2011) SUMMARY: ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly, and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa, and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles. AVAILABILITY: Both source and binary software packages are available at http://www.niehs.nih.gov/researc.... CONTACT: WH at weichun.huang@nih.gov or GTM at...
- Simon Cockell
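The ART abstract above describes generating synthetic reads from a reference using position-specific error models. A toy sketch of that general idea, assuming substitution errors only and a hypothetical per-position error profile; it does not reproduce ART's actual error models, quality profiles or command-line interface:

```python
import random

BASES = "ACGT"


def simulate_reads(reference, n_reads, read_len, error_profile, seed=0):
    """Sample reads from `reference` and add position-dependent substitutions.

    error_profile[i] is the (assumed) probability that base i of a read is
    replaced by a different base. Returns (start, read) pairs so that each
    read can be checked against its true origin.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        read = []
        for pos, base in enumerate(reference[start:start + read_len]):
            if rng.random() < error_profile[pos]:
                # Substitute with one of the three other bases.
                base = rng.choice([b for b in BASES if b != base])
            read.append(base)
        reads.append((start, "".join(read)))
    return reads


# Hypothetical profile: error rate rising towards the 3' end of the read.
profile = [0.001, 0.002, 0.005, 0.01, 0.05]
for start, read in simulate_reads("ACGTTGCAAGCTTAGC", 3, 5, profile):
    print(start, read)
```

Keeping the true start position alongside each read is what makes simulated data useful for benchmarking: an aligner's output can be scored against known origins.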
Bioinformatics (Oxford, England) (16 December 2011) miRNAs are small regulatory molecules that act by mRNA degradation or via translational repression. Although many miRNAs are ubiquitously expressed, a small subset have differential expression patterns that may give rise to tissue-specific complexes. We find that, when a pair of miRNAs are not expressed in the same tissues, there is a higher tendency for them to target the direct partners of the same hub proteins. At the same time, they also avoid targeting the same set of hub spokes. Moreover, the complexes corresponding to these hub spokes tend to be specific and non-overlapping. This suggests that the effect of miRNAs on the formation of complexes is specific. CONTACT: wongls@comp.nus.edu.sg. Wilson Wen Bin Goh, Hirotaka Oikawa, Judy Chia Ghee Sng, Marek Sergot, Limsoon Wong
- Simon Cockell
Nucleic acids research (15 November 2011) To support transcriptional regulation studies, we have constructed DBTSS (DataBase of Transcriptional Start Sites), which contains exact positions of transcriptional start sites (TSSs), determined with our own technique named TSS-seq, in the genomes of various species. In its latest version, DBTSS covers the data of the majority of human adult and embryonic tissues: it now contains 418 million TSS tag sequences from 28 tissues/cell cultures. Moreover, we integrated a series of our own transcriptomic data, such as the RNA-seq data of subcellular-fractionated RNAs as well as the ChIP-seq data of histone modifications and the binding of RNA polymerase II/several transcription factors in cultured cell lines into our original TSS information. We also included several external epigenomic data, such as the chromatin map of the ENCODE project. We further associated our TSS information with public or original single-nucleotide variation (SNV) data, in...
- Simon Cockell
Human genetics, Vol. 130, No. 4. (23 October 2011), pp. 505-516. Next-generation sequencing (NGS) will likely facilitate a better understanding of the causes and consequences of human genetic variability. In this context, the validity of NGS-inferred single-nucleotide variants (SNVs) is of paramount importance. We therefore developed a statistical framework to assess the fidelity of three common NGS platforms. Using aligned DNA sequence data from two completely sequenced HapMap samples as included in the 1000 Genomes Project, we unraveled remarkably different error profiles for the three platforms. Compared to confirmed HapMap variants, newly identified SNVs included a substantial proportion of false positives (3-17%). Consensus calling by more than one platform yielded significantly lower error rates (1-4%). This implies that the use of multiple NGS platforms may be more cost-efficient than relying upon a single technology alone, particularly in physically localized sequencing...
- Simon Cockell
Analytical chemistry, Vol. 83, No. 12. (15 June 2011), pp. 4327-4341. Thomas Niedringhaus, Denitsa Milanova, Matthew Kerby, Michael Snyder, Annelise Barron
- Simon Cockell