Chris Miller

Bioinformatics Grad student at Baylor College of Medicine. My online home is at
C: How to randomly sample a subset of lines from a bed file -
You're not wrong, but if you really want to maintain the proportion with high fidelity, then you have to randomly sample an appropriate number from each chromosome. Whether random sampling from a shuffled file is 'good enough' will depend on the specific application. - Chris Miller
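A minimal sketch of the per-chromosome approach, assuming the BED lines fit in memory and that the first whitespace-separated field is the chromosome name. The function name `sample_bed_by_chrom` and its parameters are hypothetical, made up for illustration:

```python
import random
from collections import defaultdict

def sample_bed_by_chrom(lines, fraction, seed=None):
    """Sample roughly `fraction` of the BED lines from each chromosome separately.

    Sampling within each chromosome keeps the per-chromosome proportions
    of the original file (up to rounding).
    """
    rng = random.Random(seed)
    by_chrom = defaultdict(list)
    for line in lines:
        # Skip blank lines and the usual BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split(None, 1)[0]
        by_chrom[chrom].append(line)

    sampled = []
    for chrom_lines in by_chrom.values():
        k = round(len(chrom_lines) * fraction)
        # rng.sample draws without replacement; output order is random,
        # so re-sort afterwards if downstream tools need sorted input.
        sampled.extend(rng.sample(chrom_lines, k))
    return sampled
```

Usage would look something like `sample_bed_by_chrom(open("regions.bed"), 0.1, seed=42)`, where `regions.bed` is a placeholder file name. Very small chromosomes can round down to zero sampled lines, so check that case if it matters for your application.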
A: Sequencing your own genome! How can it be done? Has anyone here done it? -
You can get quotes for custom sequencing through the Illumina sequencing network: I know Illumina has done fee-for-service type sequencing before. You might also look into one of the small sequencing companies that are popping up all over the place these days. - Chris Miller
C: Zero-read depth in samtools mpileup causing problems in VarScan2 with Somatic module -
Dan Koboldt, the author of Varscan, is away at a conference right now, but I passed this along to him - he may be able to offer more insight or a bug fix when he returns. - Chris Miller
C: Visualization tools for NGS analysis results, suitable for biologists -
The visualizations that will be most useful depend to a huge extent upon the design of the study. There are many, many things you might want to explore in this data, so you're going to have to narrow it down to get reasonable recommendations. - Chris Miller
C: genome error rate calulation for mutation -
This sounds like a homework question. If it is, then pasting your assignment here is not the way to get help. If it is not homework, please explain why you're trying to figure this out and we may be able to provide assistance. - Chris Miller
C: Can anyone suggest me a script based pipeline for exome sequencing with paired end reads generated by Illumina for tumor samples. -
This would probably be better off posted as a new question. It's likely that only the two or three people involved in this thread will notice that you've posted it here. That said, this previous question may help: What is the default quality encoding expected by BWA? - Chris Miller
A: Can anyone suggest me a script based pipeline for exome sequencing with paired end reads generated by Illumina for tumor samples. -
What you're asking here is probably beyond the scope of a Q/A site. To properly review all of these steps and provide feedback and suggestions would take hours. If you really need that level of support, then you're going to want to pay someone a consulting fee to help you get your pipeline set up. If you have specific questions about individual steps or commands, then Biostar can be a great resource, and please do feel free to ask questions. I'd encourage you to look through old posts first, as many of these topics have been addressed individually in the past. - Chris Miller
A: BreakDancer + SquareDancer -
The squaredancer code is buried in one of the other repos. Here's a direct link to the Perl script: - Chris Miller
A: Asking the developers before asking Biostars -
We actually discussed this with Istvan about a year ago, and have been pointing users of our software here for support. We then have a tool that monitors the RSS feed for specific keywords or tags and then notifies the appropriate people to come answer it. (rss2jira) We do try to be clear that bug reports don't belong here and patches and such should go on github. This has had a number of positive effects. The answers to commonly asked questions about our tools are now publicly accessible, indexed, and on a site where lots of people can find them (as opposed to buried away in some obscure mailing list archives). It also drives traffic to Biostar and makes lots of users aware of the community. Finally, several people who have started out asking questions about Breakdancer or Somatic Sniper here have gone on to become regular contributors to the site. - Chris Miller
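To illustrate the general idea of watching a feed for keywords (this is not the rss2jira code, just a rough sketch of the technique), something like the following would pull matching items out of an RSS 2.0 document. The function name `matching_items` and the choice to search only title and description are assumptions:

```python
import xml.etree.ElementTree as ET

def matching_items(rss_xml, keywords):
    """Return (title, link) pairs for RSS items that mention any keyword.

    Matching is a case-insensitive substring search over the title and
    description of each <item>.
    """
    root = ET.fromstring(rss_xml)
    wanted = [k.lower() for k in keywords]
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        text = (title + " " + description).lower()
        if any(k in text for k in wanted):
            matches.append((title, item.findtext("link", default="")))
    return matches
```

A real monitor would also need to fetch the feed on a schedule, remember which items it has already seen, and hand the matches off to whatever notifies people.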
A: to open file from dbGAP -
The first result for a search on "ncbi_enc" is this page: It says: "The data files distributed through the dbGaP are all encrypted by NCBI’s special encryption algorithm. These files have a file suffix “.ncbi_enc”, indicating that they are NCBI encrypted files." That page also contains a link to the archive and encryption utilities. - Chris Miller
C: Epigenomics Contest: How Many Flaws Can You Find in this Paper? -
You're posting this article hoping to pick some sort of fight, as you've done previously (Why Does Biostar Cover Questions on Epigenetics, but not Intelligent Design?). That's not cool. There is probably a good post to be made that outlines specific criticisms of the paper and starts a healthy discussion about the merits of the science. This, unfortunately, is not that post. - Chris Miller
C: My NCBI Curriculum Vitae Web Application: SciENcv -
Aha - I missed the biosketch export at the bottom. That seems fairly useful. - Chris Miller
It's just a convention. In most applications normalizing for higher coverage that correlates to GC regions is essentially the same as normalizing for lower coverage in AT regions. The point is that base composition correlates with some parameter that's being adjusted. - Chris Miller
C: Epigenomics Contest: How Many Flaws Can You Find in this Paper? -
This kind of axe-grinding post doesn't have any place here. If you'd like to offer critiques of this science and start a discussion, that's fine, but just posting a link and calling it "junk" isn't really constructive. - Chris Miller
A: My NCBI Curriculum Vitae Web Application: SciENcv -
Anybody tried it yet? Is this going to be worth filling out, or would we be better off just pointing people to a Google Scholar page? - Chris Miller
A: What does the term low pass mean? -
That doesn't really make sense. "Low-pass" generally refers to a genome that's sequenced to a depth under 10x. With this data, you can call germline SNPs, find structural variants, etc. It's not particularly useful for cancer sequencing though, as somatic variants are difficult to discern, and you can forget about finding subclonal variants. - Chris Miller
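For a rough sense of where a run falls, expected mean depth is commonly estimated as (number of reads × read length) / genome size. A small sketch, with hypothetical function names and the 10x cutoff taken from the rule of thumb above rather than from any formal definition:

```python
def mean_depth(num_reads, read_length, genome_size):
    """Expected mean depth: total sequenced bases divided by genome size.

    This ignores unmapped reads, duplicates, and uneven coverage, so the
    usable depth will be somewhat lower in practice.
    """
    return num_reads * read_length / genome_size

def is_low_pass(depth, threshold=10.0):
    """True if depth is under the (assumed) 10x rule-of-thumb cutoff."""
    return depth < threshold

# Example: 100 million 100 bp reads against a ~3.1 Gb genome.
depth = mean_depth(100_000_000, 100, 3_100_000_000)
print(f"{depth:.1f}x, low-pass: {is_low_pass(depth)}")
```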
A: liftOver bam file -
This is a bad idea. Since the genome assemblies that the reads were mapped to are different, you really need to realign your data. There will undoubtedly be many places where reads map to different places than where liftover would place them, due to the differences between the assemblies. Convert the bam back to a fastq with picard, then redo the mapping with the aligner of your choice. - Chris Miller
C: What is a typical salary range for bioinformatician in Singapore -
Removed the second link - funny, but probably inappropriate :) - Chris Miller
A: Retreiving data from TCGA database -
There is no webpage that currently displays the exact information you're looking for. That said, it would be pretty straightforward to do it like this: 1) for each cancer type, download all of the MAF files that describe the somatic mutations in that cancer 2) combine them into one big list and pull out per-gene counts. Ideally, you'd do this with some scripting, but you could even use a spreadsheet program (but be careful!). Alternately, find someone with basic scripting skills and get them to help you. Many bioinformaticians would be happy to help you out for money, authorship, or booze (but not necessarily in that order!). Edit: You may also find information worth exploring on Synapse, which the TCGA Pan-cancer project is using to track files:!Synap... - Chris Miller
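The "combine and count" step might be scripted roughly like this. It's a sketch that assumes tab-delimited MAF files with a header row containing a `Hugo_Symbol` column and optional `#` comment lines at the top; the function name `count_mutations_per_gene` is made up for illustration:

```python
import csv
from collections import Counter

def count_mutations_per_gene(maf_paths, gene_column="Hugo_Symbol"):
    """Tally mutation records per gene across several MAF files."""
    counts = Counter()
    for path in maf_paths:
        with open(path, newline="") as handle:
            # Drop '#' comment lines so the first remaining line is the header.
            rows = (line for line in handle if not line.startswith("#"))
            # QUOTE_NONE: treat quote characters in annotation fields literally.
            reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
            for record in reader:
                gene = record.get(gene_column)
                if gene:
                    counts[gene] += 1
    return counts
```

Calling `count_mutations_per_gene(paths).most_common(20)` would then list the most frequently mutated genes. Note that this counts mutation records, not samples: a gene hit twice in the same tumor is counted twice, and long genes will float to the top, so interpret the raw numbers with care.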
C: Mapping of Bisulfite Sequencing reads -
Do you have any sense of how bismark, bison, and BSmap stack up against each other, in terms of accuracy/runtime/general performance? - Chris Miller
A: Mapping of Bisulfite Sequencing reads -
BSmap is another bisulfite aligner, as is Methylcode - Chris Miller
A: Somatic Mutations in microRNA genes -
You should probably use a somatic mutation caller, rather than calling tumor and normal separately, then trying to filter out germline events. Varscan, Somatic Sniper, GATK, Strelka, and many other tools will do this. - Chris Miller
C: Question: Question about in Varscan 2 -
Do a search on Biostar for "mergeSegment". I'm almost positive that this has been asked and answered before. - Chris Miller
A: Staff Scientist (Cancer Genomics) - The Genome Institute, Washington University, STL -
Just chiming in to say that WashU is a great place to work. We've got some fantastically interesting projects going on that will help to shape the future of genomic medicine. - Chris Miller
C: Intersect gene annotation with specific position or genomic interval -
I'm not immediately familiar with that format, but it probably contains lines labelled "intron", "cds_exon", "rna", etc. Grep out the lines you want, do your intersection, then collapse by gene name if necessary - Chris Miller
A: Intersect gene annotation with specific position or genomic interval -
Instead of intersecting with some monolithic gene track containing everything, you're going to want to intersect with a track containing only exons. Since I don't know exactly what your data looks like, I can't tell you exactly how to accomplish this. If you don't have coordinates for specific exons in your data, you can download such tracks from UCSC genome browser or Ensembl easily enough. - Chris Miller
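As a toy illustration of intersecting against an exon-only track and then collapsing by gene name, here is a sketch that assumes exons are already loaded as `(chrom, start, end, gene)` tuples in BED-style half-open coordinates. The function names and the gene labels in the example are hypothetical:

```python
from collections import defaultdict

def build_exon_index(exons):
    """Group (chrom, start, end, gene) exon records by chromosome."""
    index = defaultdict(list)
    for chrom, start, end, gene in exons:
        index[chrom].append((start, end, gene))
    return index

def genes_overlapping(index, chrom, start, end):
    """Return sorted gene names with at least one exon overlapping [start, end).

    Two half-open intervals overlap when each one starts before the other ends.
    Using a set collapses multiple exon hits down to one entry per gene.
    """
    hits = {
        gene
        for ex_start, ex_end, gene in index.get(chrom, [])
        if ex_start < end and start < ex_end
    }
    return sorted(hits)

exons = [
    ("chr1", 100, 200, "GENE_A"),
    ("chr1", 300, 400, "GENE_A"),
    ("chr1", 350, 450, "GENE_B"),
]
index = build_exon_index(exons)
print(genes_overlapping(index, "chr1", 390, 395))
```

This scans every exon on the chromosome for each query, which is fine for a handful of positions. For genome-scale intersections, a dedicated tool such as bedtools is the more practical choice.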
C: Lolliplot link on the TVAP website -
Glad to hear that you got it working, Christian. If you have a stand-alone version that you'd like to contribute back to the community, we'd be happy to put it up on our site. (with proper credit given to you, naturally). One of our motivations for open-sourcing all of our code is to enable people to use our tools - even the ones we haven't had time to package up neatly yet! Even rough code might give someone else a cleaner place to start. - Chris Miller
C: Where do I start to make career in bioinformatics? -
Yes. You should not have to pay for a PhD in the hard sciences. In fact, you will get a modest stipend to cover your living expenses and such. One more suggestion - your English seems passable in writing, but brushing up on speaking that language certainly won't hurt your chances if granted an interview. - Chris Miller
C: CNVNator deletion calls all based on mapping quality zero reads? -
I'm afraid I don't quite understand what you're asking. Can you try rephrasing a little and giving a more thorough example? Defining the columns in the output you pasted would help as well. - Chris Miller
C: Retreiving data from TCGA database -
I think you misread my comment. Several years ago, I was not in a TCGA group and still got access and published using TCGA data. No one is hoarding data - it is freely accessible via the TCGA data portal and CGHub. (Here, for example, are all somatic mutations found in the AML cohort: When such data contains information that is potentially identifying (like raw sequence reads and germline variant calls), the NIH requires that you fill out a short form so that they can verify you're using the data for research. This is not a difficult hurdle. - Chris Miller