Michael Kuhn
Does anyone know how to take a GFF3 file (gene prediction) and a nucleotide fasta and get the translated coding sequences (protein fasta)?
I ran GlimmerHMM on the nucleotide FASTA file and now have its GFF3 output, but what I actually want is the protein sequences. It looks like this should be possible with BioPerl, but I got stuck. - Michael Kuhn
Have you tried using the Ensembl API? http://www.ensembl.org/info... - Manuel
I've looked at the Ensembl API, but since my genome is not in Ensembl, I couldn't figure out how to apply it to my problem - Michael Kuhn
Michael, I could put together a script with Biopython. Could you post a sample of the GFF to Pastebin (http://pastebin.com/) to give a sense of how the output looks? - Brad Chapman
Brad, that would be really nice. Here's the GFF3: http://pastebin.com/HFQAEqiC and here the corresponding contig: http://pastebin.com/tcyaZdZp - Michael Kuhn
Thanks Michael: here's a script that does this: http://github.com/chapman.... You need Biopython (http://biopython.org/) installed, and the output is: http://gist.github.com/raw.... The first prediction is strange -- just a stop codon -- and the last one appeared truncated. Does the documentation call that GlimmerHMM file GFF3? Unfortunately, it looks like an invented format. - Brad Chapman
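The two oddities mentioned above (a prediction that is only a stop codon, and one that looks truncated) are the kind of thing that can be flagged automatically once the proteins are translated. The following is a minimal sketch, not part of the linked script; the function name `flag_prediction` and the exact checks are assumptions chosen for illustration, and they take `*` to mark a stop codon in the translated string.

```python
def flag_prediction(protein):
    """Return a list of reasons a translated prediction looks suspicious.

    Assumes '*' marks a stop codon, as in most translation tools.
    """
    if not protein:
        return ["empty translation"]
    problems = []
    body = protein.rstrip("*")  # residues before the trailing stop(s)
    if not body:
        problems.append("only a stop codon")
    else:
        if not protein.startswith("M"):
            problems.append("no start methionine")
        if not protein.endswith("*"):
            problems.append("no terminal stop codon")
    if "*" in body:
        problems.append("internal stop codon")
    return problems
```

A complete prediction such as `"MK*"` returns an empty list; anything else returns the reasons it deserves a second look. Gene predictors can legitimately emit partial genes at contig ends, so a flag here is a prompt to inspect, not proof of an error.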
Do you mind if we use a bit of the Glimmer output and Fasta for an example? This would be a good cookbook entry: http://biopython.org/wiki... - Brad Chapman
Convert the GFF3 to GenBank (no clue how), then run coderet from EMBOSS? - Darek Kedra
You can definitely do this in BioPerl. If none of the above options work for you, let me know and I can walk you through it. - Morgan Langille
Brad, I pasted the wrong file. Sorry! This was indeed the custom Glimmer format, here's the correct GFF version: http://pastebin.com/fzeWHA6d - Michael Kuhn
Yes, you can use this in a cookbook example. This is from Trichinella spiralis: http://genomeold.wustl.edu/genome... - one of those genome projects that apparently stalled at the assembly stage (contigs from 2006) - Michael Kuhn
Michael, here is the GFF3 version of that script: http://github.com/chapman.... I kept it structured identically to the custom output one, which highlights how nice it is to deal with standard formats using standard libraries. The code is much more general now as well, and could handle predictions for multiple contigs. In addition to Biopython, you also need the in-progress GFF parsing library: http://github.com/chapman.... Output is here: http://gist.github.com/321721 - Brad Chapman
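For readers who cannot follow the links, the general approach can be sketched with the Python standard library alone. This is not Brad's script, which uses Biopython and the GFF parsing library. It is a simplified illustration that assumes each CDS row carries a Parent (or ID) attribute, that all CDS rows of one prediction sit on the same contig and strand, that the first CDS starts in phase 0, and that the standard genetic code applies. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI table 1) in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def read_fasta(lines):
    """Parse FASTA lines into {sequence_id: uppercase sequence}."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first word of the header
            seqs[name] = []
        elif line and name is not None:
            seqs[name].append(line.upper())
    return {key: "".join(parts) for key, parts in seqs.items()}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons become 'X', and a
    trailing partial codon is dropped."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def cds_proteins(gff_lines, genome):
    """Yield (parent_id, protein) for each group of CDS rows sharing a Parent."""
    groups = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent", attrs.get("ID"))
        groups[(cols[0], parent, cols[6])].append((int(cols[3]), int(cols[4])))
    for (seqid, parent, strand), parts in groups.items():
        # GFF3 coordinates are 1-based and inclusive; join exons in
        # genomic order, then flip the whole CDS for minus-strand genes.
        dna = "".join(genome[seqid][start - 1:end] for start, end in sorted(parts))
        if strand == "-":
            dna = reverse_complement(dna)
        yield parent, translate(dna)
```

Called as `cds_proteins(open("predictions.gff3"), read_fasta(open("contig.fa")))` (file names are placeholders), this yields one (ID, protein) pair per predicted transcript. A real parser also handles the phase column, multi-valued Parent attributes, and URL-escaped characters, which is a good argument for using a standard library when one is available.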
Morgan, it would be cool to see a BioPerl version to compare and contrast. Here is a cookbook entry: http://biopython.org/wiki... - Brad Chapman
Well, I'm totally fine with the Biopython version, so I won't solicit another version :) - Michael Kuhn
The simplest way to do this would be to load the data in Artemis, Select->All CDS Features, and then File->Write->Amino Acids of Selected Features. - Morgan Langille