HL14: David Pollock - Visualizing Genomic Dark Matter: Repeat Probability Clouds in the Human Genome
Distribution of sequence types in the human genome: 1.5% exons; 44% repetitive sequences; 54.5% unknown (dark matter). - Gabriele Sales
RepeatMasker does not identify repeats reliably; for example, it misses ~50% of MIR fragments in the human genome. - Gabriele Sales
The dominant approach to repeat identification searches the genome against consensus sequences for repeat families (e.g. RepBase). Drawbacks: slow; not very sensitive to fragmentary matches; unable to identify novel repeats. - Gabriele Sales
De novo methods: RECON, PILER, RepeatFinder, RepeatScout. - Gabriele Sales
Evolutionary de novo repeat analysis: better detection of imperfect repeats and of fragments left over from extensive duplication and divergence. - Gabriele Sales
Consensus-based methods reduce multiple sequences to a single one; P-clouds group similar, high-copy L-mers together (16-mers for mammals, so that a given L-mer is expected less than once per genome by chance, E<1). - Gabriele Sales
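The grouping idea can be illustrated with a toy sketch. This is not the P-clouds implementation: the function names, the two thresholds (`core_min`, `member_min`) and the rule of joining L-mers that differ by one substitution are all assumptions made for illustration, and the example uses 4-mers on a short string rather than 16-mers on a genome.

```python
from collections import Counter

def count_lmers(seq, L):
    """Count every overlapping L-mer in seq."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))

def neighbors(lmer):
    """Yield all L-mers that differ from lmer by exactly one substitution."""
    for i, base in enumerate(lmer):
        for alt in "ACGT":
            if alt != base:
                yield lmer[:i] + alt + lmer[i + 1:]

def build_clouds(counts, core_min, member_min):
    """Greedy grouping (hypothetical rule, not the published algorithm).

    Each unassigned L-mer seen at least core_min times seeds a cloud; the
    cloud then absorbs unassigned one-substitution neighbours seen at least
    member_min times.
    """
    assigned = {}   # L-mer -> index of the cloud it belongs to
    clouds = []     # list of sets of L-mers
    for lmer, n in counts.most_common():
        if n < core_min:
            break   # remaining L-mers are too rare to seed a cloud
        if lmer in assigned:
            continue
        cloud_id = len(clouds)
        cloud = {lmer}
        assigned[lmer] = cloud_id
        for nb in neighbors(lmer):
            if nb not in assigned and counts.get(nb, 0) >= member_min:
                cloud.add(nb)
                assigned[nb] = cloud_id
        clouds.append(cloud)
    return clouds, assigned

# Toy input: a 4-mer repeated five times plus a diverged copy repeated twice.
toy_seq = "ACGT" * 5 + "ACGA" * 2
toy_counts = count_lmers(toy_seq, 4)
toy_clouds, toy_assigned = build_clouds(toy_counts, core_min=3, member_min=2)
```

In the toy input the diverged copy `ACGA` is too rare to seed a cloud of its own but is pulled into the cloud seeded by `ACGT`, which is the behaviour a single consensus sequence cannot represent.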
Next step: mapping P-clouds onto the genome using a sliding window. - Gabriele Sales
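A minimal sketch of what a sliding-window mapping step could look like, assuming the clouds are available as a plain set of L-mers. The function name `map_clouds`, the `window` and `min_hits` parameters and their default values are hypothetical; the talk notes do not give the actual criteria used.

```python
def map_clouds(seq, cloud_lmers, L, window=10, min_hits=8):
    """Return half-open (start, end) intervals of seq judged repetitive.

    A window of `window` consecutive L-mer start positions is accepted when
    at least `min_hits` of those L-mers belong to some cloud; the bases under
    the matching L-mers in accepted windows are marked and then merged.
    """
    n = len(seq) - L + 1
    if n < window:
        return []   # sequence too short to hold a single window
    hits = [seq[i:i + L] in cloud_lmers for i in range(n)]
    covered = [False] * len(seq)
    running = sum(hits[:window])   # cloud hits in the current window
    for start in range(n - window + 1):
        if start > 0:
            # Slide by one position: add the new L-mer, drop the old one.
            running += hits[start + window - 1] - hits[start - 1]
        if running >= min_hits:
            for i in range(start, start + window):
                if hits[i]:
                    for j in range(i, i + L):
                        covered[j] = True
    # Merge consecutive covered bases into intervals.
    intervals = []
    begin = None
    for pos, flag in enumerate(covered):
        if flag and begin is None:
            begin = pos
        elif not flag and begin is not None:
            intervals.append((begin, pos))
            begin = None
    if begin is not None:
        intervals.append((begin, len(seq)))
    return intervals

# Toy input: a repetitive stretch followed by unique sequence.
demo_seq = "ACGTACGTACGT" + "TTGCCAGGATCC"
demo_cloud = {"ACGT", "CGTA", "GTAC", "TACG"}
demo_regions = map_clouds(demo_seq, demo_cloud, L=4, window=3, min_hits=3)
```

Requiring several cloud hits per window, rather than a single matching L-mer, is one way to keep isolated chance matches from being annotated as repeats.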
Example: human chromosome X. Time required on a single workstation: 45 min (vs. >10 days for RECON and 8 hrs for RepeatScout). - Gabriele Sales
Application to the whole human genome: 38% repetitive according to both P-clouds (PC) and RepeatMasker (RM), 7% RM only, 27% PC only, 28% non-repetitive. - Gabriele Sales
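Reading the four figures as a partition of the genome (an assumption, though they do sum to 100%), the per-method totals follow by simple addition. The variable names below are illustrative only.

```python
# Percentages of the human genome, as reported in the talk notes.
both, rm_only, pc_only, neither = 38, 7, 27, 28
assert both + rm_only + pc_only + neither == 100

rm_total = both + rm_only            # annotated by RepeatMasker
pc_total = both + pc_only            # annotated by P-clouds
either = both + rm_only + pc_only    # annotated by at least one method

print(f"RepeatMasker: {rm_total}%  P-clouds: {pc_total}%  either: {either}%")
```

The RepeatMasker total of 45% is close to the 44% repetitive fraction quoted at the start of the talk, while the union of the two methods covers 72% of the genome.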