Web -> Relational database (data consistency) + S3 (large files)
- Dawei lin
An EC2 node may fail and its local disk storage is then lost; S3 gives no distributed consistency guarantees
- Dawei lin
EBS for persistent node storage, relational DB (ACID), implement operations on S3 so that they can be rerun; every operation should be restartable.
- Dawei lin
Transient failures: failure allocating an EC2 node, long delay allocating an EC2 node, S3 service outage. Solutions: rely on retry logic and have a good contact
- Dawei lin
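The retry logic mentioned above could look roughly like the sketch below. It is only an illustration: `TransientError`, `retry`, and `allocate_node` are hypothetical names, not part of any AWS library, and the approach assumes the operation is safe to rerun, as the earlier note on restartable operations requires.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a node allocation timeout)."""


def retry(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    The operation must be safe to rerun (restartable).
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # give up and surface the last failure
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a fake allocation that fails twice before succeeding.
calls = {"n": 0}


def allocate_node():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("allocation failed")
    return "node-ready"


# Skip real sleeping in the demo by passing a no-op sleep function.
result = retry(allocate_node, sleep=lambda seconds: None)
```

Backing off between attempts avoids hammering a service that is already struggling, which matters for the long-delay and outage cases listed in the note.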
On average 236 members married every year
- Dawei lin
Allocate cluster -> Verify cluster -> Push application to cluster -> Run a control script on the master -> Kick off each job step on the master -> Detect job completion -> Shut the cluster down.
- Dawei lin
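The manual workflow above can be sketched as a single function. This is an assumption-laden illustration: `FakeCluster` and `run_job_flow` are hypothetical names that only record the steps, and no real EC2 or EMR calls are made.

```python
class FakeCluster:
    """Hypothetical stand-in for a real cluster; it only records what was done."""

    def __init__(self):
        self.log = []

    def do(self, action):
        self.log.append(action)


def run_job_flow(cluster, steps):
    """Walk through the manual steps in order and always shut the cluster down."""
    cluster.do("allocate")
    try:
        cluster.do("verify")
        cluster.do("push application")
        cluster.do("run control script on master")
        for step in steps:
            cluster.do(f"run step: {step}")
        cluster.do("wait for completion")
    finally:
        # Shut down even if a step raised, so idle nodes are not left running.
        cluster.do("shut down")
    return cluster.log


log = run_job_flow(FakeCluster(), ["parse", "aggregate"])
```

The `try`/`finally` is the one design choice worth noting: with on-demand billing, a failed step should not leave the cluster allocated.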
EMR replaces the above operation with one line of code
- Dawei lin
Razorfish example: 500B records, 100 clients, 100 markets, 25 data sources, 13 TB of data.
- Dawei lin
3.5 billion records, 71M unique cookies a day. A 100-machine cluster is created on demand. Cost reduced 5x.
- Dawei lin
Use MapReduce: 1. Upload data and application to S3, 2. Create a job flow on Amazon Elastic MapReduce, 3. Get the data
- Dawei lin
Wizard based monitoring and management:
- Dawei lin
Jobs that fail because of bugs can be troubleshot with the new debug function
- Dawei lin
The function lists each step of the operation.
- Dawei lin
People want to run arbitrary scripts before the job flow begins, for example to install an older version of R or a different package
- Dawei lin
Cascading (Java API), Apache Hive (SQL-like language, stores schema in S3, JDBC/ODBC support for SQL tools), Apache Pig (scripting language), Hadoop Streaming (accepts programs written in different programming languages)
- Dawei lin
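As an illustration of the Hadoop Streaming model, where the mapper and reducer exchange tab-separated lines of text, here is a minimal word-count sketch. The `mapper` and `reducer` functions are hypothetical and are run locally on a small list; in an actual streaming job each would be a separate script reading standard input and printing to standard output.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per key; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce on two input lines.
records = ["a b a", "b a"]
output = list(reducer(sorted(mapper(records))))
```

The `sorted` call stands in for the shuffle/sort phase that the framework performs between the map and reduce stages.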
Improvements for small file processing, multiple input and output formats, improvements to the S3N file system, support for native compression libraries (LZO)
- Dawei lin
Now supports Hive 0.5 and Pig 0.6, and continues to support Hadoop 0.18/Hive 0.3/Pig 0.4
- Dawei lin
Karmasphere Studio: IDE for development and debugging of EMR jobs, with an Amazon S3 file system browser. It has a desktop interface.
- Dawei lin
Datameer: a spreadsheet application. It makes transparent use of MapReduce to process spreadsheets.
- Dawei lin
System integrators can go from concept to actual implementation in weeks.
- Dawei lin
Metagenomics shotgun: sequence 300 samples at 10 Gb per sample -> 3 Tb of reads
- Dawei lin
Remove contamination by aligning to human, then align to bacterial reference strains, then to known viruses, and finally align in protein space to nr
- Dawei lin
At the Genome Center, the cost is 3 cents per core-hour
- Dawei lin
Local data center space and electricity are not charged to individual researchers
- Dawei lin
AWS gives on-demand resources, scripting provides agility, Chef orchestrates and delivers capability
- Dawei lin
Cloud computing lets IT professionals spend less time "keeping the lights running" and more time enabling and assisting with the actual goals of the organization.
- Dawei lin
IT people will become the system orchestrators and architects.
- Dawei lin
Money: Amazon, Microsoft and Google operate at extreme scale
- Dawei lin
Need to analyze where the costs are for IaaS, and how to leverage economies of scale
- Dawei lin
The Amylin Pharmaceuticals example is a good one for understanding how to find where the costs are and how to reduce them.
- Dawei lin
Each server watt loses ~0.5 W to power distribution
- Dawei lin
Most centers are measured by availability not cost
- Dawei lin
Power: 115 kV HV utility distribution (0.3% loss) -> 13.2 kV -> rotary or battery UPS (6% loss) -> 13.2 kV -> transformer (2% loss) -> 480 V. Total loss 11%, not a big component
- Dawei lin
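The stage losses compound multiplicatively, which a short calculation makes concrete. The sketch below uses only the three loss figures written down in the note; the stage labels and the `delivered_fraction` helper are illustrative. These three stages alone come to roughly 8%, so the 11% total presumably includes further stages that the note does not list.

```python
# Loss fractions for the stages listed in the note.
stage_losses = {
    "HV utility distribution (115 kV -> 13.2 kV)": 0.003,
    "UPS (rotary or battery)": 0.06,
    "transformer (13.2 kV -> 480 V)": 0.02,
}


def delivered_fraction(losses):
    """Multiply per-stage efficiencies to get the fraction of power delivered."""
    fraction = 1.0
    for loss in losses:
        fraction *= 1.0 - loss
    return fraction


delivered = delivered_fraction(stage_losses.values())
total_loss = 1.0 - delivered
print(f"delivered: {delivered:.1%}, lost: {total_loss:.1%}")
```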