Recently, a friend of mine asked me a Facebook system design question: design a web crawler. The requirements are:
- One billion URLs from 1 seed URL,
- No duplicated crawling,
- Distribute the load evenly,
- 10k hacked machines,
- Minimize the communication between machines.
Before we start to answer this question, let's see how a standard crawler is supposed to work. Usually, crawler systems are centralized, which means some components are shared by all crawler machines. For example, a simplified crawler system looks like this:
The URL Frontier is usually implemented with a message queue, and a centralized database is used for URL de-duplication. The above system is a typical centralized crawler.
Let's revisit the requirements: #1 and #2 are clear and easy to understand. What do requirements #4 and #5 really mean? Sometimes requirement #4 is expressed as:
4a. All machines are deployed with the same software.
Requirement #4 actually implies that you are not supposed to design a centralized crawler system. With hacked machines, you are not able to set up a centralized database or a message queue. With this in mind, we need to design a decentralized crawler system. Requirement 4a makes it even clearer: every machine runs the same software, which means every machine does exactly the same job. You won't be able to have a dedicated database or message queue.
What does requirement #5 mean? It hints at which URL assignment strategy to use. A centralized crawler system can use a central component (the URL Frontier in the diagram above) to assign URLs based on some predefined strategy. In contrast, a decentralized crawler system has no central point of control: each crawler knows the assignment strategy and forwards the URLs it discovers to the crawlers responsible for crawling them.
One of the criteria for evaluating assignment strategies is communication overhead [1]. It is defined as follows:
Communication overhead is the total number of URLs that the crawlers have exchanged divided by the number of pages downloaded by the system. The smaller the communication overhead, the better the assignment strategy.
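For example, if the crawlers collectively download 1,000,000 pages while exchanging 2,000,000 URLs among themselves, the communication overhead is 2 URLs per page.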
With the centralized crawler, one URL is transferred a few times before its page is downloaded: the crawler first sends the URL to the database to check whether it has been crawled; if not, the crawler sends it to the message queue; then a crawler consumes it from the queue and starts downloading. So to download one page, the same URL is transferred at least 3 times within the system.
Another criterion is balancing: each crawler should be assigned approximately the same number of URLs, which is essentially requirement #3.
With the above information, we know we will design a decentralized crawler system. The issue then becomes finding an appropriate assignment strategy. The simplest assignment strategy is to hash each URL and mod it by the number of machines, assuming all 10k machines are always up. Below is the design:
The key point of the above design is that each machine does exactly the same job.
Here is the flow:
- The URL Frontier receives the URLs. It checks whether the URLs are in the cache. If not, it adds them to the queue and the cache.
- The crawler consumes the messages, downloads the HTML, extracts URLs from the documents, hashes those URLs, and assigns them to the corresponding machines using mod (see the sketch after this list).
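Below is a minimal Python sketch of what each machine might run under this design. It assumes a fixed fleet of 10k machines; `download`, `extract_links`, and `send_to` are hypothetical placeholders for the HTTP fetcher, the HTML link extractor, and the machine-to-machine RPC, none of which is specified here.

```python
import hashlib

NUM_MACHINES = 10_000   # assumption: all 10k machines stay up
MY_ID = 42              # this machine's index in [0, NUM_MACHINES); hypothetical value

seen = set()            # in-memory dedup cache for URLs this machine owns
queue = []              # local URL frontier; a plain list stands in for a message queue


def owner_of(url: str) -> int:
    """Map a URL to the machine responsible for crawling it (hash, then mod)."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_MACHINES


def receive(url: str) -> None:
    """Accept a URL assigned to this machine; skip it if already seen."""
    if url not in seen:
        seen.add(url)
        queue.append(url)


def crawl_step(download, extract_links, send_to) -> None:
    """One iteration of the crawler loop on this machine."""
    if not queue:
        return
    url = queue.pop(0)
    html = download(url)              # fetch the page
    for link in extract_links(html):  # discover new URLs
        target = owner_of(link)
        if target == MY_ID:
            receive(link)             # stays local: no network traffic
        else:
            send_to(target, link)     # one URL transfer per newly discovered URL
```

Because every machine runs exactly the same code, the only inter-machine traffic is the `send_to` call for URLs that belong to other machines.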
With the above design, URLs are evenly distributed across the 10k machines, and each page download needs only one transfer of its URL between machines, so the communication between machines is minimized. The in-memory HashSet guarantees no duplicate crawls. So all requirements are met.
The above design transfers URLs immediately after they are discovered. Instead, each crawler can wait for a while and send URLs in a batch, which further reduces the communication overhead. For example, duplicates within a batch can be removed locally before the URLs are assigned and sent.
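Such batching might look like the sketch below; `send_batch` is a hypothetical RPC that ships a whole list of URLs to another machine in one message, and the batch size is an arbitrary choice.

```python
from collections import defaultdict

BATCH_SIZE = 500              # assumption: flush after 500 URLs per destination

outgoing = defaultdict(list)  # destination machine id -> URLs waiting to be sent


def enqueue_remote(target: int, url: str, send_batch) -> None:
    """Buffer URLs per destination machine and ship them one batch at a time."""
    outgoing[target].append(url)
    if len(outgoing[target]) >= BATCH_SIZE:
        # drop duplicates within the batch before it leaves the machine
        batch = list(dict.fromkeys(outgoing[target]))
        outgoing[target].clear()
        send_batch(target, batch)
```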
One assumption we made is that the 10k machines are always up, so we can simply mod the hashed URLs. What if this assumption doesn't hold? The above design still works, but we need a different hashing scheme: consistent hashing. If you don't know what it is, please see the Wikipedia article on consistent hashing.
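For illustration, here is a toy consistent-hash ring with virtual nodes. The class name and parameters are made up for this sketch; a real system would use a well-tested library instead.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, machines, vnodes: int = 100):
        # place each machine at several points on the ring to smooth the load
        self._ring = sorted(
            (_hash(f"{machine}#{i}"), machine)
            for machine in machines
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def owner_of(self, url: str) -> str:
        """Return the machine that owns this URL: the first ring point
        clockwise from the URL's hash."""
        idx = bisect.bisect(self._keys, _hash(url)) % len(self._ring)
        return self._ring[idx][1]


# usage:
# ring = ConsistentHashRing([f"machine-{i}" for i in range(10_000)])
# ring.owner_of("https://example.com/some/page")
```

When a machine leaves the ring, only the URLs that hashed to its segment are reassigned; the rest of the assignments stay where they were.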
However, with consistent hashing, the no-duplicate-crawl guarantee can no longer be maintained: when a machine goes down, the in-memory record of the URLs it has already crawled is lost, and its URLs are reassigned to machines that have no such record. You need to discuss this trade-off with your interviewer.
References:
[1] Junghoo Cho and Hector Garcia-Molina. Parallel Crawlers. WWW 2002, May 7-11, 2002, Honolulu, Hawaii, USA.