Sign in or Join FriendFeed
FriendFeed is the easiest way to share online. Learn more »
Paul Buchheit
A practical scalable distributed B-tree - http://www.hpl.hp.com/techrep...
"Moreover, our algorithm is conceptually simple: we use transactions to manipulate B-tree nodes so that clients need not use complicated concurrency and locking protocols used in prior work. To execute these transactions quickly, we rely on three techniques: (1) We use optimistic concurrency control, so that B-tree nodes are not locked during transaction execution, only during commit. This well-known technique works well because B-trees have little contention on update. (2) We replicate inner nodes at clients. These replicas are lazy, and hence lightweight, and they are very helpful to re- duce client-server communication while traversing the B-tree. (3) We replicate version numbers of inner nodes across servers, so that clients can validate their transactions efficiently, without creating bottlenecks at the root node and other upper levels in the tree." - Paul Buchheit
Paul, I think many of us are going to trust your opinion on this white paper. All Greek to me. - Jon-Paul Bussoli
All I understand is that it is in my best interests to cheer for the way you access B-tree nodes in order to continue to enjoy friendfeed reliably. Go friendfeed algorithm go! - Jon-Paul Bussoli
What, no comparison to BigTable? - ⓞnor
@nor It's really not the same thing, unless somehow you're using a distributed B-tree on hash collision, however, if you're getting that many collisions, then the hash algorithm is probably wrong or your key width is too small. Then again, I really don't know what I'm talking about. - Eric Florenzano
Curious as to what problem Paul is looking at... My default data toolkit these days would probably include sqlite for in-memory data, sharded bdb's for btrees that are too big for memory, and hbase/hypertable for a distributed store. I wonder where this fits in... - DeWitt Clinton
Ok this is a really *nerdy* post! :*) - Susan Beebe
DeWitt, I just thought that it looked like an interesting paper. As for the several solutions you mention, I don't know that any of them have distributed transactions (maybe bdb, but that doesn't really work). - Paul Buchheit
B-Trees and Prof. Bayer http://wwwbayer.informatik.tu-... - would be interesting to know what he'd say, unfortunately he's retired a few years ago. Used to be fairly approachable in all matters B-Tree. - Mustafa K. Isik
@DeWitt - no room for a traditional SQL based database except as an in memory database? - Nick Lothian
we had designed and implemented distributed tree control, but transactions were considered "too much" for near-real-time, and they were already in protocol... the rest you know as xGSN boxes in GPRS/3G/HSDPA - dynamic routing for mobile packet networks. I'd left team in 2003... - A.T.
@paul - I'll readily admit to being out of my depth, but it depends on what the definition of "distribution transaction" is. With bdb a combination of local transactions and guaranteed consistent replication you can approximate a distributed transaction at the cost of speed. See http://www.oracle.com/technol... and http://www.oracle.com/technol.... But those won't work across bdb shards. - DeWitt Clinton
@paul - A table-based distributed store can do this via a lock on entity groups, where entity groups are defined by relationship formed by instances of similar models that belong to the same parent-based ancestry chain. This is how App Engine transactions work -- see http://code.google.com/appengi... and http://code.google.com/appengi.... Ping ryan for some background there. Not sure if hbase or hypertable support this via their api. - DeWitt Clinton
@nlothian - I dunno. Offline maybe? - DeWitt Clinton
DeWitt: have you ever successfully used BDB with millions of newly written entries and transaction support turned on? We kept getting transaction logs with millions of entries that were never consumed, so restarts would take hours as it replayed the logs. Configuring BDB to work for large databases is insanely esoteric to say the least, and it may be impossible to get it to work acceptably in some cases. - Bret Taylor
@bret -- no, definitely not with large databases. We used bdb's heavily at my last company, though. Aggressive sharding is the key if you want to support either transactions or replication, which matches intuition about how it is implemented. - DeWitt Clinton
But your comment about millions of entries makes me wonder about which data is getting written to which place. I suspect a lot of problems like this end up with the bulk of the data being written transactionless + replicated to a table-based store (or a transactionless bdb), and only a small subset of the data gets transaction support. So multiple datastores. But you guys know this better than I do, so why am I rambling? : ) - DeWitt Clinton
DeWitt, you can also look into all the trouble that Gaia had with bdb - I simply wouldn't trust any fancy bdb functionality. - Paul Buchheit
Also, AppEngine transactions are limited to a single "entity group", which I assume means a single BigTable tablet. Essentially, they solved distributed transactions by not having them -- all transactions must be local to a single tablet. From the docs: "Every entity belongs to an entity group, a set of one or more entities that can be manipulated in a single transaction. Entity group relationships tell App Engine to store several entities in the same part of the distributed network." - Paul Buchheit
@paul - yup, that's the trade-off. Entity groups ensure locality, locality makes transactions fast(er). Same old lever problem -- speed of consistency vs. scope of the transactions. - DeWitt Clinton
DeWitt, there's nothing wrong with having local transactions -- I'm just pointing out that they aren't distributed transactions. - Paul Buchheit
Point taken. I got way off-topic regarding your original post anyway. - DeWitt Clinton
The design seems reasonable. The only part that is under-specified is the way they switch from a master node to a slave. I'm curious why they don't use transactions to maintain replicas but instead rely on some unspecified master/slave replication scheme. - Private Sanjeev