Scalability and Memory Limits

Scalability questions may seem intimidating, but they can be some of the most straightforward and easy questions to answer.

Step 1: Solve the problem without worrying about memory.

Step 2: How much data can you fit on one machine?  What problems are there with splitting up the data?

Step 3: Solve the problems from step 2.  You may find that your solutions introduce new problems.  Just keep solving until you have resolved all the issues.


Find all documents that contain a list of words

First, we need to clarify our requirements by finding out if we will find the words in our document store multiple times or whether it is just one time.  Let’s assume for this exercise that we are calling it multiple times.

Step 1: Don’t worry about the massive amount of textual data we are searching through, just solve the problem.

Do you have an answer? Yes?


One way is to create a hash table to map from word to all the documents that contain that word.  To search for two words we simply do an intersection of the two words and return back any documents where both words are present.

Step 2: Now see what changes when we introduce millions of documents.  We may start by dividing the documents among multiple computers.  Also, depending on the number of possible words and repetition of words in a document we might not be able to fit the hash table on one machine.

We introduce at least three concerns when we divide up the data:

  1. How will we divide the hash table?  We could divide it up by words so that one machine contains all the documents for a set of particular words, or we could divide it up by documents so that one machine contains a subset of the documents.
  2. We may have to process a document on one machine and push the result off to other machines.
  3. We will need a method of discovering which machine holds a piece of data.  What does the lookup table look like and how is it stored?

Step 3: Now find solutions to the issues discovered in step 2.  We could divide up the data by keywords.  We would iterate through the keywords storing only enough to fill up a given machine and moving on to the next machine.  This allows the lookup table to be small because we simply list a range of values for each machine.  The disadvantage is that we will have to shift values every time new data is added.