At Twitter we are significantly ramping up usage of Hadoop to help us analyze the massive amounts of data that our platform generates each day. We are happy users of Cloudera’s free distribution of Hadoop; we’re currently running Hadoop 0.20.1 with Pig 0.4. In this first of a small series of posts about our architecture and the open source software we’re working on around it, we’d like to focus on an infrastructure-level solution we use to make our cluster more efficient: splittable LZO for Hadoop. Using LZO compression in Hadoop allows for reduced data size and shorter disk read times, and LZO’s block-based structure allows it to be split into chunks for parallel processing in Hadoop. Taken together, these characteristics make LZO an excellent compression format to use in your cluster.
- Blake Matheny
Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive from Ashish Thusoo » Cloudera Hadoop & Big Data Blog - http://www.cloudera.com/blog...
First, Facebook developers are encouraged to push code often and quickly. Pushes are never delayed and are applied directly to parts of the infrastructure. The idea is to quickly find issues and their impact on the rest of the system, and to fix any bugs that result from these frequent small changes. Second, there are few QA (quality assurance) teams at Facebook but lots of peer review of code. Since the Facebook engineering team is relatively small, all team members are in frequent communication. The team uses various staging and deployment tools as well as strategies such as A/B testing and gradual, geographically targeted launches.
- Blake Matheny
An interesting review of a few of the popular NoSQL solutions available. In particular it seems to point out how far behind everyone else Voldemort seems to be. Voldemort can't add machines live and is key/value (as opposed to column family), both of which drive me crazy.
- Blake Matheny
“Only one who devotes himself to a cause with his whole strength and soul can be a true master. For this reason mastery demands all of a person.” - Albert Einstein
- Blake Matheny
Why an article on managing people? And one written by someone with training in computer science rather than business administration? There are thousands of books on the best ways to manage people. Many of these books are excellent, having been written by people who've devoted their lives to the discipline. Software engineering is different. Software engineering is different because only the best people significantly contribute to achievement.
- Blake Matheny
Resque is our Redis-backed library for creating background jobs, placing those jobs on multiple queues, and processing them later. Background jobs can be any Ruby class or module that responds to perform. Your existing classes can easily be converted to background jobs or you can create new classes specifically to do work. Or, you can do both. All the details are in the README. We've used it to process over 10m jobs since our move to Rackspace and are extremely happy with it.
- Blake Matheny
Recently, there has been a lot of chitchat about the eventual consistency model as illustrated in the famous Amazon Dynamo paper, and today employed by several non-relational databases such as Voldemort or Cassandra. Everything starts with this blog post by the Facebook Infrastructure Lead, claiming: "Dynamo: A flawed architecture", where he makes a few points against the eventual consistency model and the related "sloppy" quorum approach. However, his points seem to be based on a few misconceptions which the Amazon paper doesn't help to clarify, so let's try to shed some light by first giving a few definitions, and then a simple example.
- Blake Matheny
Mongo (from "humongous") is a high-performance, open source, schema-free, document-oriented database. It is also used by some serious companies. Features include:
* Collection oriented storage: easy storage of object/JSON-style data
* Dynamic queries
* Full index support, including on inner objects and embedded arrays
* Query profiling
* Replication and fail-over support
* Efficient storage of binary data including large objects (e.g. photos and videos)
* Auto-sharding for cloud-level scalability (currently in early alpha)
- Blake Matheny
ComparingProtocols - pubsubhubbub - Comparison of PubSubHubbub to light-pinging protocols - Project Hosting on Google Code - http://code.google.com/p...
People want a comparison of the concrete differences between fat pinging (PubSubHubbub, XMPP pubsub) and light pinging (rssCloud, XML-RPC pings, changes.xml, SUP, SLAP). This document aims to construct and convey an evaluation of these protocols that's easy to understand.
- Blake Matheny
Distributed, scalable databases are desperately needed these days. From building massive data warehouses at a social media startup, to protein folding analysis at a biotech company, “Big Data” is becoming more important every day. While Hadoop has emerged as the de facto standard for handling big data problems, there are still quite a few distributed databases out there and each has its unique strengths.
- Blake Matheny
Mobile has been a hard problem space for testing: a humongous combination of browsers, phones, and capabilities that is changing fast as the underlying technology evolves. Add to this poor tool support for the mobile platform and the rapid evolution of the device and you'll understand why I am so interested in advice on how to do better test design. We've literally tried everything, from checking screenshots of Google's properties on mobile phones to treating the phone like a collection of client apps and automating them in the traditional UI button-clicking way.
- Blake Matheny
Someone at work recently asked how he should go about studying machine learning on his own. So I’m putting together a little guide. This post will be a living document…I’ll keep adding to it, so please suggest additions and make comments.
- Blake Matheny
Most Haskell tutorials on the web seem to take a language-reference-manual approach to teaching. They show you the syntax of the language, a few language constructs, and then have you construct a few simple functions at the interactive prompt. The "hard stuff" of how to write a functioning, useful program is left to the end, or sometimes omitted entirely. This tutorial takes a different tack. You'll start off with command-line arguments and parsing, and progress to writing a fully-functional Scheme interpreter that implements a good-sized subset of R5RS Scheme. Along the way, you'll learn Haskell's I/O, mutable state, dynamic typing, error handling, and parsing features. By the time you finish, you should be fairly fluent in both Haskell and Scheme.
- Blake Matheny
In my investigations into string metrics, similarity metrics and the like I have developed an open source library of similarity metrics called SimMetrics. SimMetrics is an open source Java library of similarity or distance metrics, e.g. Levenshtein distance, that provide float-based similarity measures between string data. All metrics return consistent measures rather than unbounded similarity scores.
- Blake Matheny
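To illustrate what a bounded, float-based string similarity measure can look like, here is a minimal Python sketch built on Levenshtein distance. It is not SimMetrics code: the function names and the normalization (dividing the distance by the length of the longer string) are assumptions chosen for illustration, and the excerpt does not say how SimMetrics itself normalizes its scores.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical strings, 0.0 for
    strings with nothing in common, always within [0.0, 1.0]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

Because the distance can never exceed the length of the longer string, the score stays between 0 and 1, which is what makes results comparable across string pairs of different lengths.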
In games or educational programs you often don't want real randomness. Unsurprisingly, real randomness is often a bit too random. You may get some items several times in a row and others too rarely. It's either too easy or too hard. Balancing the weights is hard as well, because each run is too different. And as if that weren't bad enough, it's also difficult to verify the results. Note: The continuous shuffled sequence algorithm which I'll explain in the second half of the article can be easily ported to other languages. All you need are arrays (and a copy function or resizable arrays) and a random number generator. Before I dive into shuffled sequences I'll outline the other issues of the obvious approaches. Because that's apparently more fun to do.
- Blake Matheny
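The excerpt does not spell out the continuous shuffled sequence algorithm, so the following Python sketch shows a simpler, related idea instead: a "shuffle bag" that copies the item list, shuffles the copy, and hands items out until the bag is empty before refilling it. The class name `ShuffleBag` and its interface are hypothetical. It uses only the ingredients the excerpt mentions (arrays, a copy, and a random number generator), but it should not be read as the article's own algorithm.

```python
import random


class ShuffleBag:
    """Draws items in random order; every item appears exactly once
    per pass before any item can appear again."""

    def __init__(self, items, rng=None):
        self._items = list(items)
        if not self._items:
            raise ValueError("ShuffleBag needs at least one item")
        # Accepting an rng makes runs reproducible, which helps verification.
        self._rng = rng if rng is not None else random.Random()
        self._bag = []

    def draw(self):
        if not self._bag:
            # Refill: copy the original items and shuffle the copy.
            self._bag = self._items[:]
            self._rng.shuffle(self._bag)
        return self._bag.pop()
```

One limitation of this simple version is that the same item can still appear twice in a row across a refill boundary (last draw of one pass, first draw of the next).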
Most websites don’t need to do complicated things like store polygon data. They just need to store points on a map and then retrieve those points. They also need to be able to ask the database for all points within a rectangle. So I’m going to run you through schema creation, inserting data, getting your lat/lon data out of the database again, and querying the database for all points within a rectangle. We’re also going to deal with the nasty little issue of asking the database for points in a rectangle that crosses the 180 degree boundary (or the International Date Line).
- Blake Matheny
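The 180 degree boundary problem mentioned above can be sketched as a small Python predicate. This is an illustration under stated assumptions rather than the article's schema or query: the function name and parameter order are hypothetical, and the rectangle is assumed to be given by its south/north latitudes and west/east longitudes in degrees, with longitudes in the range -180 to 180.

```python
def in_rectangle(lat, lon, south, west, north, east):
    """Return True if the point (lat, lon) lies inside the rectangle.

    If west > east, the rectangle is assumed to wrap across the
    180 degree meridian (for example west=170, east=-170).
    """
    if not (south <= lat <= north):
        return False
    if west <= east:
        # Ordinary rectangle: longitude must fall between the two edges.
        return west <= lon <= east
    # Wrapped rectangle: the point may sit on either side of the boundary.
    return lon >= west or lon <= east
```

The same condition carries over to a database query: the ordinary case is a pair of range comparisons on longitude, while the wrapped case becomes an OR of two half-ranges.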
But deadlines are deadlines. If you have a project due Tuesday, then there’s no choice but to complete it by Tuesday. Sure, you can try to push the deadline but there’s still one! To help with the situation and motivate me to clear the work that is currently pending, I’ve come up with 6 great (and free!) tools, categorized into 3 groups.
- Blake Matheny
Sorry everyone. Some kind of bug bit me on Twitter. Weird.