Dawei lin
Discussion Session
Portable Virtual Machine OVF http://www.vmware.com/applian... - Dawei lin
Amazon provides firewall controls at several levels of granularity. - Dawei lin
Responsibility: Amazon: hardware, building, security, network. User: policy - Dawei lin
AWS hosts insurance operations, which means it complies with all the requirements - Dawei lin
AWS provides department level billing that can differentiate individual users. - Dawei lin
Dawei lin
Ntino Krampis, J. Craig Venter Institute: From AMIs Running Monolithic Software, to Real Distributed Bioinformatics Applications in the Cloud
JCVI has 400 employees - Dawei lin
Most of the computing clusters use SGE, mainly for BLAST, genome assemblies and script pipelines - Dawei lin
FOG: Flexible organizational Grid - Dawei lin
Bio-Linux: VM with 100+ tools - Dawei lin
VM available for download on SourceForge, runs on Eucalyptus - Dawei lin
Calculation for one genome takes 6 hrs and costs about $5 - Dawei lin
How to leverage the capability of cloud computing? - Dawei lin
Use seeds as keys in SimpleDB. Seeds are usually 11 bases, so there are 4^11 possible seeds, but in reality there are only 2000 major seeds. - Dawei lin
Use SimpleDB for parallel seed lookup, taking advantage of the unlimited scalability of SimpleDB - Dawei lin
A suggestion to add joins to SimpleDB. Answer: People are working on it. - Dawei lin
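As a rough illustration of the seed-lookup idea in these notes, the sketch below counts the possible 11-base seeds and builds an in-memory dictionary keyed by seed. The dictionary only stands in for a key-value store such as SimpleDB; the function names and the toy sequences are assumptions made for illustration, not JCVI's implementation.

```python
from collections import defaultdict

SEED_LENGTH = 11  # seed size quoted in the notes
ALPHABET = "ACGT"


def seed_space_size(k=SEED_LENGTH):
    """Number of distinct k-base seeds over a 4-letter alphabet."""
    return len(ALPHABET) ** k


def build_seed_index(reference, k=SEED_LENGTH):
    """Map each seed to the positions where it occurs in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def lookup_read(read, index, k=SEED_LENGTH):
    """Return (read_offset, reference_position) pairs for every seed hit."""
    hits = []
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], []):
            hits.append((offset, ref_pos))
    return hits


if __name__ == "__main__":
    print(seed_space_size())  # 4194304
    reference = "ACGTACGTTAGCCGATAGGCTTAACG"  # toy sequence, not real data
    index = build_seed_index(reference)
    print(lookup_read("TAGCCGATAGGC", index))
```

Each seed lookup is independent of the others, which is what makes the lookups easy to run in parallel against a shared store.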
Dawei lin
Brad Chapman, Massachusetts General Hospital: Developing an Open Source Community for Cloud Bioinformatics
a blogger at bcbio.wordpress.com - Dawei lin
A new term: Cloud Bioinformatics - Dawei lin
It is hard to go from 90% to 100% automation - Dawei lin
Recognizing contributions is a problem for bioinformaticians (in the middle of many authors) - Dawei lin
Open Source efforts: OpenBio, Biopython, but not reused very much - Dawei lin
Common theme: Aimed at developers. Biologists benefit indirectly - Dawei lin
Lowering activation energy: complex Gbrowse vs. easy Galaxy installation - Dawei lin
Cloud computing supports building blocks for scaling - Dawei lin
JCVI Cloud BioLinux - Dawei lin
Top level YAML configuration - Dawei lin
www.open-bio.org/wiki/Codefest_2010 - Dawei lin
Dawei lin
John Hogenesch and Angel Pizarro, University of Pennsylvania: How we went from "'omics data, cool" to "'omics data, uh oh"
The coolest title so far. - Dawei lin
20 transcription factors regulate the biological clock - Dawei lin
Bowtie maps the easy reads and BLAT maps the spliced reads - Dawei lin
Bowtie does not do well as the number of mismatches increases. - Dawei lin
With Illumina read lengths of 115bp, 40-50% of reads map - Dawei lin
Cost $25/lane, 25M reads, 115bp - Dawei lin
RUM pipeline: Bowtie 40-60% -> BLAT 99% (but with multiple mappings) -> uniquely mapped reads 85% (two ends map to the same genomic region) - Dawei lin
RUM maps 10% more than BWA - Dawei lin
6 hours for 25M, 115bp reads and costs $25 - Dawei lin
It found stable, spliced intronic sequences (SSIS) - Dawei lin
Simple job distribution systems: CloudCrowd, disco, Resque are good for simple jobs - Dawei lin
Hadoop and Cycle Computing are advanced options - Dawei lin
A Bioinformatician is both a researcher and a service person - Dawei lin
The pipeline is actually two short pieces of Ruby code. - Dawei lin
If that can be done, why learn Pig or Hive? - Dawei lin
Penn data center caught fire! - Dawei lin
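The two-stage mapping strategy in these notes (a fast aligner for the easy reads, a slower and more tolerant one for the leftovers) can be sketched as below. The aligners are passed in as plain callables; `toy_exact` and `toy_tolerant` are hypothetical stand-ins written for this example and do not call Bowtie, BLAT, or the RUM pipeline.

```python
def two_stage_map(reads, fast_aligner, sensitive_aligner):
    """Try the fast aligner first; send only unmapped reads to the slow one.

    Each aligner is a callable that takes a read and returns a hit
    (any value other than None) or None when the read does not map.
    """
    mapped, unmapped = [], []
    for read in reads:
        hit = fast_aligner(read)
        if hit is None:
            hit = sensitive_aligner(read)
        if hit is None:
            unmapped.append(read)
        else:
            mapped.append((read, hit))
    return mapped, unmapped


def mapping_rate(mapped, unmapped):
    """Fraction of reads that mapped in either stage."""
    total = len(mapped) + len(unmapped)
    return len(mapped) / total if total else 0.0


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTTAACG"  # toy sequence, not real data

    def toy_exact(read):
        # Exact substring match only.
        pos = reference.find(read)
        return pos if pos >= 0 else None

    def toy_tolerant(read):
        # Accept the first window with at most one mismatch.
        for pos in range(len(reference) - len(read) + 1):
            window = reference[pos:pos + len(read)]
            if sum(a != b for a, b in zip(read, window)) <= 1:
                return pos
        return None

    reads = ["TAGCCGAT", "TAGCTGAT", "GGGGGGGG"]
    mapped, unmapped = two_stage_map(reads, toy_exact, toy_tolerant)
    print(mapped, unmapped, mapping_rate(mapped, unmapped))
```

Because the second aligner only sees reads the first one rejected, the expensive step runs on a fraction of the input.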
Dawei lin
James Taylor, Emory, and Anton Nekrutenko, Penn State University: Dynamically Scalable Accessible Analysis for Next Generation Sequence Data
NGS adds complexity at the data level - Dawei lin
Mapping Pipeline for Illumina, 454, and SOLiD - Dawei lin
Tools are the basic unit of analysis in Galaxy - Dawei lin
Bioinformatics tools are diverse in programming languages and implementations - Dawei lin
Most of them are command line tools - Dawei lin
Workflows can be constructed from scratch or extracted from existing analyses - Dawei lin
Facilitates sharing and publishing - Dawei lin
smaller analyses: http://usegalaxy.org, larger analysis: http://getgalaxy.org - Dawei lin
NSF funds a pilot project on using Galaxy on a generic computing resource - Dawei lin
VM images for starting Galaxy on EC2 have been available for several months - Dawei lin
Persistence: metadata store and datasets live on persistent volumes - Dawei lin
Architecture: Galaxy master VM + Galaxy worker VMs - Dawei lin
Main challenge: all users need to provide their credentials - Dawei lin
Mitochondrial heteroplasmy: 100s of mitochondria, each with 10s of copies of the genome - Dawei lin
Use the UCSC browser to visualize data - Dawei lin
Scaling up and down based on usage, models for scaling within tools - Dawei lin
It is hard to estimate the cost when the true cost is hidden in an academic environment - Dawei lin
Dawei lin
Dione Bailey, Complete Genomics: Expanding Sequencing Service to the Cloud
Complete Genomics is doing "Cloud Sequencing" - Dawei lin
Complete Human Genome Sequencing Service - Dawei lin
CGI uploaded data to S3 - Dawei lin
Entire data set is 500GB - Dawei lin
File types: summary file, variant file, gene annotation, dbSNP annotation, gene summary, evidence files, coverage, and reference score, reads (sequence & quality score) and mapping - Dawei lin
Current delivery mechanism: S3 -> burn a drive for the customer - Dawei lin
Have a way to bypass the 5GB limitation - Dawei lin
Plan to add Download and Bucket transfer options to our service - Dawei lin
S3 to hard drive now takes 5 days - Dawei lin
Pharma companies are interested in using AWS for long-term storage - Dawei lin
Producing 500 genomes/month this year - Dawei lin
CGA Tools to launch as open source next month - Dawei lin
Compare SNP calls to CGI result, two CGI results, multiple genomes comparison - Dawei lin
Map2SAM and Evidence2SAM conversion. - Dawei lin
QC, filtering, visualization tool - Dawei lin
Dawei lin
Jon Sorenson, Pacific Biosciences: Cloud Computing Strategies for Next Generation of Sequence Analysis
Began by acknowledging that PacBio is part of the effort creating data problems. - Dawei lin
1-3 bases incorporated per second, 80K ZMWs monitored in parallel - Dawei lin
Dataflow: 30 min = 4TB -> Movie2Trace -> Trace 100GB -> Trace2Pulse -> Pulse (height, width, interbase time) -> 30 min = 1GB - Dawei lin
Filter -> Mapping (De novo assembly, reference alignment) -> consensus (Simple, Bayesian, HMM) -> identify variants -> finished human genome 150GB (for 30X human) - Dawei lin
Because the reaction time is quick, it makes sense to do the analysis in real time too - Dawei lin
It has 96 wells. Steps: design job, monitor jobs, view data. - Dawei lin
SMRT View: PacBio's genome browser - Dawei lin
customers: Genome Centers, Service labs, Genomics institutes, Core labs, individual PIs, Clinical lab - Dawei lin
It aims to have data analysis results back in 15 minutes - Dawei lin
Software in the cloud is more maintainable and budgetable - Dawei lin
Circular consensus is a way to make high quality sequences - Dawei lin
Strobe sequencing: mu=620, sigma=40 - Dawei lin
Event-based information model - Dawei lin
visualization and standardization make complexity manageable - Dawei lin
10,000 genomes is ~2PB of data. - Dawei lin
Hadoop: wasn't good for generic customer deployment, did not play well with existing schedulers, but these have all changed. - Dawei lin
PacBio still uses a lot of structured binary files. - Dawei lin
Recreational and commercial genomics. - Dawei lin
Cloud computing infrastructure, application stack, niche genomics SaaS - Dawei lin
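The data volumes quoted in these notes imply large reduction factors at each stage. The arithmetic below is only a back-of-the-envelope check; it assumes decimal units (1 TB = 1000 GB, 1 PB = 1000 TB), which the notes do not specify.

```python
GB_PER_TB = 1000  # decimal units, an assumption
GB_PER_PB = 1000 * GB_PER_TB


def reduction_factor(before_gb, after_gb):
    """How many times smaller the data is after a processing stage."""
    return before_gb / after_gb


if __name__ == "__main__":
    movie_gb = 4 * GB_PER_TB  # 30 minutes of raw movie
    trace_gb = 100            # after Movie2Trace
    pulse_gb = 1              # after Trace2Pulse

    print(reduction_factor(movie_gb, trace_gb))  # 40.0
    print(reduction_factor(trace_gb, pulse_gb))  # 100.0
    print(reduction_factor(movie_gb, pulse_gb))  # 4000.0

    # 10,000 genomes at ~2PB works out to a per-genome figure.
    print(2 * GB_PER_PB / 10_000)  # 200.0 (GB per genome)
```

The roughly 200GB per genome implied by the 2PB figure is in the same range as the 150GB quoted for a finished 30X human genome.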
Dawei lin
Andreas Sundquist, DNAnexus: A Web 2.0 Platform to Store, Visualize, and Analyze Your Sequence Data in the Cloud
Raising the level of abstraction, DNAnexus and technical changes. - Dawei lin
How much information is in a 200G HiSeq run? - Dawei lin
Useful information is about 10 - 20 Megabytes - Dawei lin
So abstraction is needed - Dawei lin
Gap: Data storage, IT, Cluster, Bioinformatics support - Dawei lin
Core facility and research labs are the target users - Dawei lin
Interactive manipulation and visualization are important - Dawei lin
Free for 3 samples at DNAnexus - Dawei lin
Web -> Relational database (data consistency) + S3 (large files) - Dawei lin
EC2 node may fail, disk storage lost, No S3 distributed consistency guarantees - Dawei lin
EBS for persistent node storage, relational db (ACID), implement operations on S3 so that they can be rerun; any operation can be restarted. - Dawei lin
Transient failures: failure allocating EC2 node, long delay allocating EC2 node, S3 service outage. Solutions: rely on retry logic and have a good contact - Dawei lin
Each S3 request takes 20-100ms - Dawei lin
Getting data into AWS. 5GB limitation is important for Bio applications - Dawei lin
Web browser limits (2GB) - Dawei lin
A script uploads data directly to the Amazon cloud - Dawei lin
Question: how can DNAnexus make its infrastructure standard? - Dawei lin
Answer: no plan yet, but would love to take suggestions - Dawei lin
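Two points from these notes, working around a per-object size limit and relying on retry logic for transient failures, fit together naturally. The sketch below splits a transfer into chunks and retries each chunk with exponential backoff. It is not DNAnexus code: `send_chunk` is a hypothetical callable supplied by the caller, and no real AWS API is used.

```python
import time


def split_into_chunks(total_size, chunk_size):
    """Yield (offset, length) pairs that together cover total_size bytes."""
    offset = 0
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        yield offset, length
        offset += length


def with_retries(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:  # broad on purpose: which errors are transient is app-specific
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


def upload_file(total_size, chunk_size, send_chunk, **retry_options):
    """Send every chunk through the caller-supplied send_chunk(offset, length)."""
    for offset, length in split_into_chunks(total_size, chunk_size):
        with_retries(lambda: send_chunk(offset, length), **retry_options)


if __name__ == "__main__":
    def send_chunk(offset, length):  # hypothetical uploader that always succeeds
        print(f"sent bytes {offset}..{offset + length - 1}")

    upload_file(total_size=10, chunk_size=4, send_chunk=send_chunk)
```

Because each chunk is retried on its own, a transient failure costs one chunk rather than the whole transfer, which matches the "any operation can be restarted" goal in the notes.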
Dawei lin
Deepak Singh: Hadoop and Amazon Elastic MapReduce
Hadoop is good at text processing - Dawei lin
Align a set of reads, shuffle to group and sort - Dawei lin
CROSSBOW: Genome Biology, 10, R134 - Dawei lin
Myrna: RNA-seq application - Dawei lin
MapReduce thinks about parallelism for you - Dawei lin
Aligning and aggregating are MapReduce's work - Dawei lin
The scripts need to handle: parallel reads, genome bins, samples, genes and statistics - Dawei lin
Bioinformatics does not use much Hive or Pig; it is mostly Java right now - Dawei lin
Amazon is working with NCBI to get data into the cloud. Will have nr soon and Ensembl later this year - Dawei lin
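The "align, shuffle to group and sort, aggregate" pattern from these notes can be modelled in a few lines of single-process Python. This is only a toy model of the MapReduce data flow, not Hadoop, Crossbow, or Myrna; the bin width and the (chromosome, position) input format are assumptions.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical genome bin width in bases


def map_alignment(alignment):
    """Map step: emit ((chromosome, bin), 1) for one aligned read."""
    chrom, position = alignment
    yield (chrom, position // BIN_SIZE), 1


def shuffle(pairs):
    """Shuffle step: group values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(key, values):
    """Reduce step: aggregate all values that share a key."""
    return key, sum(values)


def run_job(alignments):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = (pair for a in alignments for pair in map_alignment(a))
    return [reduce_counts(key, values) for key, values in shuffle(pairs)]


if __name__ == "__main__":
    alignments = [("chr1", 10), ("chr1", 999), ("chr1", 1000), ("chr2", 5)]
    print(run_job(alignments))
```

In a real cluster the map and reduce calls run on different machines and the framework performs the shuffle; the user only writes the two functions.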
Dawei lin
Peter Sirota, AWS Sr. Manager of Software Development: Hadoop and Amazon Elastic MapReduce
A lot of people in the audience already knew MapReduce - Dawei lin
MapReduce is a restricted parallel programming model meant for large clusters - Dawei lin
It takes care of fault tolerance and load balancing - Dawei lin
Useful for data analysis, image/file processing, Bioinformatics, statistical modeling, Web indexing - Dawei lin
Large data: user data (behavior, images, etc.), server/application data (log files), monitoring data - Dawei lin
Useful public data sets (genome, economic census data, ...) - Dawei lin
Elastic MapReduce removes the muck from large-scale data processing - Dawei lin
The muck: managing compute clusters, tuning Hadoop, monitoring job flows, and Hadoop issues that prevent smooth operation in the cloud. - Dawei lin
eHarmony is using the Amazon cloud - Dawei lin
On average 236 members marry every day - Dawei lin
Allocate cluster -> Verify cluster -> Push application to cluster -> Run a control script on the master -> Kick off each job step on the master -> Create and detect job completion -> Shut the cluster down. - Dawei lin
EMR replaces the above operations with one line of code - Dawei lin
Razorfish example: 500B records, 100 clients, 100 markets, 25 data sources, 13TB data. - Dawei lin
Razorfish does targeted advertising. - Dawei lin
It uploaded 100GB per day. - Dawei lin
3.5 billion records, 71M unique cookies a day. 100-machine cluster created on demand. Cost reduced 5x. - Dawei lin
Use MapReduce: 1. Upload data and application to S3, 2. Create a job flow on Amazon Elastic MapReduce, 3. Get the data - Dawei lin
Wizard-based monitoring and management - Dawei lin
Jobs that fail because of bugs can be troubleshot with the new debug function - Dawei lin
The function lists each step of the operation. - Dawei lin
People want to run arbitrary scripts before a job flow begins, for example to install an old version of R or a different package - Dawei lin
Cascading (Java API), Apache Hive (SQL-like language, stores schema in S3, JDBC/ODBC support for SQL tools), Apache Pig (scripting language), Hadoop Streaming (takes in different programming languages) - Dawei lin
Supports Hadoop 0.20 + custom patches - Dawei lin
Improvements for small file processing, multiple input and output formats, improvements to S3N file system, support for native compression libraries (LZO) - Dawei lin
Now supports Hive 0.5 and Pig 0.6, and continues to support Hadoop 0.18/Hive 0.3/Pig 0.4 - Dawei lin
Karmasphere Studio: IDE for development and debugging of EMR jobs, Amazon S3 file system browser. It has a desktop interface. - Dawei lin
Datameer: a spreadsheet application. It makes transparent use of MapReduce to process spreadsheets. - Dawei lin
System integrators can go from concept to actual implementation in weeks. - Dawei lin
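Hadoop Streaming, mentioned in these notes, lets the mapper and reducer be ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below shows that contract as two Python generators and simulates the sort between them locally. The input format (a chromosome name as the first field of each line) is an assumption for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'key<TAB>1' for the first whitespace-separated field of each line."""
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            yield f"{fields[0]}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__":
    # Local simulation: Hadoop would run the sort between the two steps.
    sample = ["chr1 100", "chr2 5", "chr1 7"]
    for line in reducer(sorted(mapper(sample))):
        print(line)
```

In an actual streaming job each function would be wrapped in its own script that iterates over `sys.stdin`, and the framework would handle the sorting and distribution.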
Dawei lin
Matt Tavis, AWS Solutions architect: Architecting for the AWS Cloud
Has a German Literature degree - Dawei lin
Works with customers to discuss what works and what does not - Dawei lin
The Cloud Best Practices white paper is online - Dawei lin
Scalability: a scalable architecture to take advantage of scalable infrastructure (Amazon) - Dawei lin
1. Design for failure (HW/SW) - Dawei lin
"Everything fails, all the time" (Werner Vogels, CTO, Amazon.com) - Dawei lin
Avoid single points of failure, assume everything fails, and design backwards. The application should run even if HW fails - Dawei lin
2. Loose coupling sets you free - Dawei lin
Decoupling layers can make the architecture more resilient - Dawei lin
It is a function of Amazon SNS. - Dawei lin
3. Implement Elasticity, a fundamental property of the cloud - Dawei lin
Configuration overhead is significant in cloud computing because the time to start up a server is short - Dawei lin
Self configuration is important - Dawei lin
4. Build security in every layer - Dawei lin
each machine can be locked down in cloud - Dawei lin
With the cloud, physical control is reduced but not ownership - Dawei lin
Three tier approach: Web, App, DB security group - Dawei lin
5. Do not fear constraints - Dawei lin
Need more RAM? Can it be split? - Dawei lin
That may drive good architectural change - Dawei lin
Better IOPS on my database? Multiple read-only/sharding/DB - Dawei lin
6. Think Parallel (serial and sequential is now history) - Dawei lin
Run parallel MapReduce jobs, use Elastic Load Balancing to distribute load across multiple servers. - Dawei lin
7. Leverage many storage options - Dawei lin
S3: large static objects, CloudFront: content distribution, SimpleDB: simple data indexing/querying, EC2 local disk drive: transient data - Dawei lin
EBS for off-instance persistent storage - Dawei lin
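The three-tier security group approach in these notes (Web, App, DB) can be pictured as a small rule table in which each tier only accepts traffic from the tier in front of it. The sketch below is a plain-Python model, not an AWS API call; the port numbers and the rule format are assumptions chosen for illustration.

```python
# Hypothetical rule table: each tier lists who may reach it and on which port.
SECURITY_GROUPS = {
    "web": [
        {"source": "anywhere", "port": 80},
        {"source": "anywhere", "port": 443},
    ],
    "app": [{"source": "web", "port": 8080}],
    "db": [{"source": "app", "port": 3306}],
}


def is_allowed(source, target, port, groups=SECURITY_GROUPS):
    """Return True when some rule on the target tier admits this source and port."""
    for rule in groups.get(target, []):
        if rule["port"] == port and rule["source"] in ("anywhere", source):
            return True
    return False


if __name__ == "__main__":
    print(is_allowed("internet", "web", 443))  # True
    print(is_allowed("internet", "db", 3306))  # False
    print(is_allowed("app", "db", 3306))       # True
```

The database tier is reachable only from the application tier, so a compromised web server cannot talk to the database directly.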
Dawei lin
David Dooling, Washington University – St. Louis: Architecture for the Next Challenges in Genomics
From batch to steady state, from a large-data perspective - Dawei lin
Get a free song of the day from Amazon - Dawei lin
From a free song to buying an $8 album -> $4.00 credit to buy video - Dawei lin
The interface makes spending money easy - Dawei lin
HMP (Human Microbiome Project): nasal, oral, skin, gastro-intestinal, urogenital - Dawei lin
Metagenomic shotgun: sequence 300 samples, generating 10Gb per sample -> 3TB reads - Dawei lin
Get rid of contamination by aligning to human, then align to bacterial reference strains, to known viruses, and in protein space to nr - Dawei lin
At the Genome Center, it is 3c/core/hr - Dawei lin
The local data center and electricity do not cost individual researchers anything - Dawei lin
The reagent cost is $300K. - Dawei lin
5000 cores and 2100 hours will cost > $3M in AWS - Dawei lin
19 hrs to calculate one lane of data (18M reads) - Dawei lin
Again advocates that a hybrid cloud solution is the way to go - Dawei lin
Public resources: Open Science Grid, TeraGrid, GPGPU, RENCI, Amazon - Dawei lin
Bioinformatics is locked into batch operation. - Dawei lin
When scale increases (more servers and operations), it becomes similar to steady-state operation. - Dawei lin
People are working on architecture that can leverage different resources. - Dawei lin
How to submit data to AWS? Streaming is the answer. If the data is large, it can be divided into small chunks and then transferred. - Dawei lin
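The cost figures in these notes can be checked with simple arithmetic. The sketch below uses only numbers quoted above (5000 cores, 2100 hours, 3c/core/hr locally, more than $3M on AWS); the implied AWS rate is derived from those numbers and is not an actual AWS price.

```python
def core_hours(cores, hours):
    """Total core-hours consumed by a job."""
    return cores * hours


def cost(core_hours_used, rate_per_core_hour):
    """Cost in dollars at a flat per-core-hour rate."""
    return core_hours_used * rate_per_core_hour


def implied_rate(total_cost, core_hours_used):
    """Per-core-hour rate implied by a total bill."""
    return total_cost / core_hours_used


if __name__ == "__main__":
    used = core_hours(5000, 2100)
    print(f"{used:,} core-hours")                         # 10,500,000 core-hours
    print(f"local: ${cost(used, 0.03):,.0f}")             # local: $315,000
    print(f"implied AWS rate: ${implied_rate(3_000_000, used):.2f}/core-hour")
```

At 3c/core/hr the local compute bill is close to the $300K reagent cost, while the quoted AWS total implies a rate roughly ten times higher.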
Dawei lin
Chris Dagdigian, The Bioteam: Scriptable Infrastructures for Scientific Computing
A slide begins with apologies - Dawei lin
Google buys Intel computers that can run at >95 degrees - Dawei lin
A lot of power-efficiency solutions are kept secret for business reasons - Dawei lin
He speaks really fast - Dawei lin
Private clouds can be just a name change, not a real change, for some people and operations - Dawei lin
Infrastructure like open source - Dawei lin
Cloud computing will rewrite many of our roles and job descriptions. - Dawei lin
Cloud computing is very relevant to money - Dawei lin
5GB auto-managed MySQL database in the Amazon cloud for 11c per hour - Dawei lin
Servers, storage, OS, Network, Provision, Management, monitoring, scaling, accounting are scriptable - Dawei lin
SSH through iPad - Dawei lin
It is possible to have infrastructure that is 100% virtual and entirely controllable via scripts and APIs - Dawei lin
AWS enables us to orchestrate vast arrays of complex systems, pipelines, workflows and applications - Dawei lin
Opscode Chef: http://www.opscode.com for system orchestration - Dawei lin
Chef is a lib for configuration management - Dawei lin
Different reservation IDs and index IDs can be managed by Chef. - Dawei lin
A training class is provided. - Dawei lin
AWS gives on-demand resources, scripting provides agility, Chef orchestrates and delivers capability - Dawei lin
Cloud computing lets IT professionals spend less time "keeping the lights running" and more time enabling/assisting with the actual goals of the organization. - Dawei lin
IT people will become the system orchestrators and architects. - Dawei lin
Money: Amazon, Microsoft and Google are at extreme scale operations - Dawei lin
Need to analyze where the cost is for IaaS. How to leverage economies of scale - Dawei lin
The Amylin Pharmaceuticals example is a good one for understanding where the costs are and how to improve them. - Dawei lin
Dawei lin
James Hamilton, AWS Vice President & Distinguished Engineer
Has a blog - Dawei lin
Trace: where the money goes in a big data center - Dawei lin
Power is the most important expense - Dawei lin
Economies of Scale: discovered a few years ago. Learned from high-scale buyers and medium buyers - Dawei lin
Network: $13/Mb/s/month; Medium: $95/Mb/s/month - Dawei lin
Storage: $4.6/GB/year (2x in 2 data centers); Medium: $26/GB/year - Dawei lin
Admin: large service: over 1000 servers/admin; enterprise: 140 servers/admin - Dawei lin
Medium size operations pay much more for the same thing - Dawei lin
Power Related costs [Will] Dominate: Estimate for 45,000 server center - Dawei lin
~$88M for an 8MW facility; server power draw at 30% load: 80%; commercial power: 7c/kWhr - Dawei lin
3-yr server, 4-yr net gear, 10-yr infrastructure amortization. - Dawei lin
Monthly cost is used to normalize the different data - Dawei lin
34% of costs are functionally related to power; networking is 8% of overall costs and 19% of total server cost - Dawei lin
Costs of high-voltage transformers and power are going up; the cost of computing is going down. - Dawei lin
Increasing server efficiency is very important since servers make up 54% of the cost - Dawei lin
Shutting services down does not reduce the cost very much (13%) - Dawei lin
Should keep the services all up. - Dawei lin
Keeping the workload flat is an advantage of cloud computing - Dawei lin
Bringing in diverse, unrelated load from various geographies is important - Dawei lin
People begin to think about workflow when it is moved somewhere - Dawei lin
Where does the power go? Used servers use a lot of power - Dawei lin
A good data center has PUE ~1.5 - Dawei lin
each server watt loses ~0.5W to power distribution - Dawei lin
Most centers are measured by availability not cost - Dawei lin
Power: 115kV HV utility distribution (0.3% loss) -> 13.2kV -> rotary or battery (6% loss) -> 13.2kV -> transformer (2% loss) -> 480V. Total loss 11%: not a big component - Dawei lin
480V -> 208V -> 12V (server) -> 1.x V - Dawei lin
12V to 1.x V is not well known; there is room to increase efficiency - Dawei lin
Power supply: often <80% efficient at typical load; on-board step-down (VRM/VRD) - Dawei lin
The mechanical and cooling solution has been unchanged for 30 yrs - Dawei lin
Air flow is expensive: - Dawei lin
Cooling will end up as water cooling. - Dawei lin
There are five conversions in the cooling process. - Dawei lin
Air-side economization -> just open the window - Dawei lin
Bring the cold air to the servers. It is cheaper even on a hot day (105F) - Dawei lin
Servers can run well at 90F. - Dawei lin
use cooling towers rather than A/C - Dawei lin
Deep automation only affordable when amortized over large user base - Dawei lin
If you have 1K or 10K servers, automation becomes a necessity - Dawei lin
High-scale systems make software investment worthwhile - Dawei lin
Special skills with deep focus, such as power efficiency experts - Dawei lin
Average PUE over 2.0 - Dawei lin
Cloud computing allows servers to be placed in different datacenters. - Dawei lin
This allows low latency (close to customer) and redundancy - Dawei lin
Hardware Cost can be reduced by bulk purchase - Dawei lin
Utilization: 30% usage is good, 10% - 20% common - Dawei lin
Solution: pool a number of heterogeneous services - Dawei lin
Pay as you grow model - Dawei lin
Give Analysts access to analyze your data. - Dawei lin
AWS Pace of Innovation: release new product nearly every month - Dawei lin
Slides will be posted to http://mvdirona.com/jrh/work (this week) - Dawei lin
Hybrid, which combines public and private clouds, is going to be the dominant model - Dawei lin
Private connectivity and being close to customers are good things to try - Dawei lin
Renting resources does not mean you can use them efficiently. - Dawei lin
Monthly bills and resource usage reports will increase visibility and help people work efficiently - Dawei lin
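Two of the power figures in these notes lend themselves to a quick calculation: PUE (total facility power divided by IT power) and the compounded loss of a distribution chain. The sketch below assumes the "0.3" loss in the notes means 0.3%, and it uses only the three stage losses that were written down.

```python
import math


def overhead_per_it_watt(pue):
    """PUE = total facility power / IT power, so overhead per IT watt is PUE - 1."""
    return pue - 1.0


def chain_efficiency(stage_losses):
    """Multiply the per-stage efficiencies (1 - loss) of a distribution chain."""
    return math.prod(1.0 - loss for loss in stage_losses)


if __name__ == "__main__":
    print(overhead_per_it_watt(1.5))  # 0.5 W of overhead per server watt

    # Losses as noted: utility step-down, rotary/battery stage, transformer.
    noted_losses = [0.003, 0.06, 0.02]
    delivered = chain_efficiency(noted_losses)
    print(f"delivered: {delivered:.1%}, lost: {1 - delivered:.1%}")
```

The three noted stages compound to a loss of about 8%, so the 11% total quoted in the talk presumably includes further stages that the notes did not capture.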
Dawei lin
Welcome: Dr. Deepak Singh, AWS Sr. Business Developer
Most people think AWS is EC2 (Amazon Elastic Compute Cloud) - Dawei lin
Storage: S3, AWS Import/Export - Dawei lin
Database: RDS (MySQL): the main challenge is managing databases. SimpleDB (for scalability) - Dawei lin
On-Demand Workforce: Trained people behind - Dawei lin
SNS: real time response - Dawei lin
AWS principles: reliable, scalable, low latency, easy to use - Dawei lin
Monthly costs: 54% server, 8% network equipment, 21% power distribution and cooling, 13% power, 5% other - Dawei lin