Scalability questions may seem intimidating, but they can be some of the most straightforward questions to answer.
Step 1: Solve the problem without worrying about memory.
Step 2: How much data can you fit on one machine? What problems are there with splitting up the data?
Step 3: Solve the problems from step 2. You may find that your solutions introduce new problems. Just keep solving until you have resolved all the issues.
Find all documents that contain a list of words
First, we need to clarify the requirements: will this search be run against our document store many times, or just once? If it is a one-off search, heavy preprocessing may not be worth it. Let’s assume for this exercise that we will be running it many times.
Step 1: Don’t worry about the massive amount of textual data we are searching through; just solve the problem.
Do you have an answer? Yes?
One way is to create a hash table that maps each word to all the documents containing that word. To search for two words, we simply intersect the two words’ document lists and return any documents in which both words are present.
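To make that concrete, here is a minimal single-machine sketch in Python. The documents dictionary, whitespace tokenization, and integer document ids are assumptions made purely for illustration.

```python
from collections import defaultdict

def build_index(documents):
    """Build an inverted index: each word maps to the set of doc ids containing it.
    `documents` is a hypothetical {doc_id: text} mapping."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, words):
    """Return the doc ids that contain every word in `words` (set intersection)."""
    if not words:
        return set()
    result = index.get(words[0].lower(), set()).copy()
    for word in words[1:]:
        result &= index.get(word.lower(), set())
    return result

# Example usage
docs = {1: "the quick brown fox", 2: "the lazy brown dog", 3: "a quick brown dog"}
index = build_index(docs)
print(search(index, ["brown", "dog"]))  # {2, 3}
```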
Step 2: Now see what changes when we introduce millions of documents. We may start by dividing the documents among multiple computers. Also, depending on the number of possible words and the repetition of words across documents, we might not be able to fit the whole hash table on one machine.
We introduce at least three concerns when we divide up the data: how to split the hash table across machines, how to know which machine holds a given word, and how to combine results that come back from multiple machines.
Step 3: Now find solutions to the issues discovered in step 2. One option is to divide up the data by keyword: iterate through the keywords in sorted order, storing only as many as will fit on a given machine before moving on to the next machine. This keeps the lookup table small, because we only need to record the range of keywords each machine holds. The disadvantage is that we will have to shift values between machines every time new data is added.
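Here is a rough sketch of that range-based scheme. The split points, the machine contents, and the in-memory dictionaries standing in for remote machines are all assumptions for illustration; in practice the per-machine lookups would be remote calls.

```python
import bisect

# Hypothetical lookup table: each machine owns a contiguous, sorted range of keywords,
# so we only need to record the first keyword of each machine's range.
RANGE_STARTS = ["a", "h", "q"]                      # assumed split points
MACHINES = [                                        # stand-ins for remote machines
    {"brown": {1, 2, 3}, "dog": {2, 3}},            # machine 0: words in [a, h)
    {"lazy": {2}},                                  # machine 1: words in [h, q)
    {"quick": {1, 3}},                              # machine 2: words from q onward
]

def machine_for(word):
    """Find which machine's keyword range contains this word."""
    return max(bisect.bisect_right(RANGE_STARTS, word.lower()) - 1, 0)

def search(words):
    """Route each word to its machine, fetch its doc-id set, and intersect the results."""
    result = None
    for word in words:
        doc_ids = MACHINES[machine_for(word)].get(word.lower(), set())
        result = doc_ids if result is None else result & doc_ids
    return result or set()

print(search(["brown", "dog"]))  # {2, 3}
```

Because the lookup table stores only one boundary keyword per machine, it stays tiny even with billions of indexed words; the trade-off, as noted above, is the cost of rebalancing ranges when new data arrives.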