Hadoop World NYC is Oct 2: http://www.cloudera.com/hadoop-.... Learn how enterprises like JP Morgan, Visa, eBay, IBM, Booz Allen, and more are using Hadoop.
if you get tired of your other stuff, feel free to come express your opinions about scalable data storage and processing over at Cloudera some time!
- jeff hammerbacher
Yes, yes. I have been thinking about a social experiment related to this, to answer the question: which is more efficient, closed lab research or an open collaboration? We should engineer a race.
- Matthew Todd
Wasn't the genome project kinda like this - open academic labs working on sequencing vs. Venter's private efforts?
- Mr. Gunn
Not sure. The open project was still specifically funded, no? Was it open to all comers? Wasn't there a technological barrier to entry, in that you had to have specific, new equipment? A competition in an area that's more mainstream would be better.
- Matthew Todd
I think about this stuff all the time; in fact, my last presentation (at the WWW conference) references research from the sociology of science. I am fascinated by some of the open science movements, but I think significant reform in the criteria used for professional advancement in academia will be required before we see a complete shift to openness.
- jeff hammerbacher
Right, but no need to plead for change if we can demonstrate unequivocally that open research is faster/more efficient, and leads to what would normally be termed high impact publications.
- Matthew Todd
I agree with Mat: no need to change the criteria. Open Science is an efficient way to find new collaborators and does not prevent traditional peer-reviewed publication; you just have to pick publishers more carefully.
- Jean-Claude Bradley
deepak, was thinking it would be cool to have emr integration from eclipse. drop me a line if you guys would be interested in hacking it up!
- jeff hammerbacher
Jeff .. agreed. Let me ping the team on this
- Deepak Singh
Greenplum claims very fast load speeds, and Fox still throws away most of its MySpace data | DBMS2 -- DataBase Management System Services - http://www.dbms2.com/2009...
Seriously. When Thrift was going into Apache, we asked to use git, but they were having none of it. Glad to see the tide is turning...
- jeff hammerbacher
Deepak, I thought you were a bioinformatics guy? Are you secretly getting into data warehousing? It's not as interesting as it looks, I promise.
- jeff hammerbacher
anyone have a petabyte-scale solution at their organization and care to comment on what vendor/technology they use?
- Andrew Su
I guess the solution is in distributed databases... However for handling a large amount of data I've tried to use BerkeleyDB. It works fine but it tends to be slower for inserting data. Note: At #biohackathon2009, Jessica Severin introduced her technology named EdgeExpressDB (about to be published)
- Pierre Lindenbaum
distributed database or filesystem, something like hadoop or tahoe should work, right? 6pb is "only" about $600,000 in drives at today's prices
- Mike Chelen
i've seen several multi-petabyte data processing solutions built on hadoop: yahoo, facebook, quantcast are the public ones.
- jeff hammerbacher
600,000 for the drives, but what about the controllers and all the other infrastructure you might need? We were quoting 1 PB for our storage and it was close to 1,000,000.
- Paulo Nuin
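To put the two figures above side by side, here is a rough back-of-the-envelope sketch. It uses only the numbers quoted in the thread; treating both amounts as US dollars and using decimal units (1 PB = 1000 TB) are assumptions made for illustration.

```python
# Figures quoted in the thread; both are informal estimates.
DRIVES_ONLY_COST = 600_000    # Mike's estimate for raw drives
DRIVES_ONLY_PB = 6
FULL_SYSTEM_COST = 1_000_000  # Paulo's quote, including infrastructure (currency assumed)
FULL_SYSTEM_PB = 1

TB_PER_PB = 1000  # assumption: decimal units, as drive vendors count them


def cost_per_tb(total_cost, petabytes):
    """Return the cost of one terabyte given a total cost and a capacity in PB."""
    return total_cost / (petabytes * TB_PER_PB)


drives_rate = cost_per_tb(DRIVES_ONLY_COST, DRIVES_ONLY_PB)
system_rate = cost_per_tb(FULL_SYSTEM_COST, FULL_SYSTEM_PB)
ratio = system_rate / drives_rate

print(f"drives only: {drives_rate:.0f} per TB")
print(f"full system: {system_rate:.0f} per TB")
print(f"full system is {ratio:.0f}x the raw drive cost")
```

On these numbers the gap between raw drives and a quoted, fully built system is about an order of magnitude, which is the point Paulo is making.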
yes, surrounding infrastructure is also needed, but they have 2 more years for prices to come down, right? facebook also uses a p2p distributed storage system they developed called cassandra http://code.google.com/p...
- Mike Chelen
@Egon: bittorrent distribution would be a great thing for them to consider, as it has advantages over ftp including bandwidth savings and integrated error checking
- Mike Chelen
An interesting feature of Disco is that it does not handle the mapping of a set of files to a set of input splits (i.e., the unit of work for each task). Also, in Disco, input splits are specified using a URI. So to make it run over HDFS, you'd have to grab the output of a getSplits() call and turn those block locations into URIs. If someone hacks this up, please post it!
- jeff hammerbacher
Jeff, that's a really interesting point, which reminded me of another implementation I've come across in the past: http://code.google.com/p... I'd have to think it through, but the HTTP model certainly has a lot of appeal. For one, you could imagine taking advantage of all the load balancing software / hardware to distribute the work (as well as failover).
- Ilya Grigorik
an often undersold component of the google technology stack is their sophisticated library of compression algorithms. here's one they presented at vldb in 2007. - http://www.vldb.org/conf...
interesting. in the database community, management of uncertain data (mud) has become quite popular, e.g. http://mud.cs.utwente.nl. there's a marriage of the two threads of research here somewhere...
- jeff hammerbacher
The image above is used by Google to implement a common technique known as CSS spriting. This technique is used by most major websites to minimize HTTP requests and hence improve page load times. As an example, Facebook loads all of the News Feed icons in a single PNG file, displayed below. And yes, reblogging this post was an excuse for me to... - http://www.alistapart.com/article...
Neil - This example came up (at a meetup in Waterloo, where both Ilya and I live) when Ilya pointed out an extremely nice feature this code has: to distribute the mapper function across the network, the code actually _sends the function_, as opposed to sending a text file containing the function. This is very nice because it allows you to use dynamically defined mappers. I'll use a similar idea in my distributed PageRank / MapReduce implementation.
- Michael Nielsen
(Calling it MapReduce was certainly a bit... undescriptive, though, on the part of the post author.)
- Michael Nielsen
I spoke too soon. Looks like one can't pickle function objects in Python, which was how I intended to send the function over the network. Oh well.
- Michael Nielsen
Thanks, jeff. From what's written at that page it's not 100% clear to me that it'll work for dynamically defined functions, but I'll try it out and see.
- Michael Nielsen
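A note on the pickling problem discussed above: Python's pickle stores a function by its qualified name rather than by its code, so it only round-trips functions the receiving side can import. Dynamically defined ones (a lambda, for instance) fail. One common workaround, sketched here and not necessarily what anyone in this thread ended up using, is to ship the function's code object with the standard marshal module. The helper names are made up for illustration. This assumes the same Python version on both ends and a function with no closure variables.

```python
import marshal
import pickle
import types


def serialize_function(func):
    """Turn a plain function into bytes by marshalling its code object."""
    return marshal.dumps(func.__code__)


def deserialize_function(payload, name="remote_mapper"):
    """Rebuild a callable from marshalled code, using this module's globals."""
    code = marshal.loads(payload)
    return types.FunctionType(code, globals(), name)


# A dynamically defined mapper, the case pickle cannot handle.
mapper = lambda line: [(word, 1) for word in line.split()]

try:
    pickle.dumps(mapper)
    pickled_ok = True
except (pickle.PicklingError, AttributeError):
    pickled_ok = False

# The bytes in `payload` are what would travel over the network.
payload = serialize_function(mapper)
restored = deserialize_function(payload)

print("pickle worked:", pickled_ok)
print(restored("to be or not to be"))
```

The catch is that only the code travels: any globals or imports the mapper relies on must already exist on the worker, and marshal's format is not guaranteed to be stable across Python versions.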
"I've been impressed by the sustained dedication the Yoga Bear team has demonstrated over the past two years. Watching the organization grow from an idea to a cross-country collective of yoga studios,…"
- jeff hammerbacher
The company, with its remaining 20 employees, will focus on BitTorrent DNA, a content delivery network that helps media and video game companies distribute their products cheaply over the Internet.
- Amr Awadallah
no wonder we haven't heard from ping in a while...
- jeff hammerbacher
I have been working with Chris Wensel to try out cascading, but Streaming + Python has been good enough.
- Elias Torres
yeah, that was most of what we needed at facebook. hive let us hand the keys to business analysts, which is great. i never really got which problems were appropriate for cascading; it's basically a language for writing query plans, and so is pig. i guess it could serve as the basis for an etl framework, but there are a lot of those that exist already.
- jeff hammerbacher
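For readers who haven't seen the "Streaming + Python" approach Elias mentions: Hadoop Streaming runs any executable as a mapper or reducer, feeding it lines on stdin and reading tab-separated key/value lines from stdout. Below is a minimal word-count sketch of that contract. The function names and sample data are illustrative, and the sort step is simulated locally here, not run on a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation: Hadoop sorts mapper output by key between the two phases.
sample = ["the quick fox", "the lazy dog"]
mapped = sorted(map_lines(sample))
result = list(reduce_lines(mapped))

for row in result:
    print(row)
```

In a real job, each function would sit in its own script reading sys.stdin and printing to stdout, and the scripts would be passed to the streaming jar via its -mapper and -reducer options.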