Posted by Matt Mazur (@mhmazur)
I've been spending a lot of time lately trying to improve the quality of Lean Domain Search by decreasing the number of registered domain names that appear in the available search results. After all, there's nothing more frustrating than getting your hopes up because you found a great name only to find out that it has already been registered.
The causes of these false positives (registered domains that appear in the available search results) are complicated, but there are things I can do to mitigate them. The primary mechanism I have for improving the quality of the results is a script that runs continuously, double-checking that the available search results are accurate. If the script comes across an available domain name that is actually registered, it notifies Lean Domain Search so that the domain is excluded from future search results.
The problem, though, is that the script is slow because it has to perform a WHOIS query for every domain it double-checks. This lends itself to an interesting optimization problem: given that the script can only double-check so many results per day, which results do I check?
I could, for example, get a list of all the searches performed in the last few hours and then just go one by one through them and double-check the results. A better approach is to focus on the queries that people search for the most because inaccuracies in those results are going to affect more people than something that's rarely searched for.
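The popularity-first approach can be sketched in a few lines of Python. This is a minimal illustration, not the actual Lean Domain Search script: `query_counts` is a hypothetical mapping from each distinct query to how many times it was searched, and the daily WHOIS budget is an assumed parameter.

```python
def queries_to_check(query_counts, daily_budget):
    """Return the queries to double-check today, most popular first.

    With a fixed daily budget of WHOIS lookups, verifying the
    most-searched queries first fixes the inaccuracies that affect
    the most people.
    """
    ranked = sorted(query_counts, key=query_counts.get, reverse=True)
    return ranked[:daily_budget]

# Illustrative counts, not real search data:
counts = {"app": 50, "tech": 30, "shop": 10, "blog": 5}
print(queries_to_check(counts, 2))  # the two most popular queries
```

The real script would then run its WHOIS double-check over the results of each returned query; the ranking step is the only part shown here.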
The question then becomes how many of the most popular queries should I have the script check?
By performing a little kung fu with the analytics data I can get a much better idea of how to allocate my resources:
The x-axis shows the percentage of queries taken into account; the y-axis shows what percentage of the overall searches those queries account for.
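A curve like this can be computed directly from a table of query counts. Here is a minimal sketch in Python, with made-up counts standing in for the real analytics data: sort the queries by popularity, then accumulate each query's share of the total searches.

```python
def cumulative_share(query_counts):
    """Fraction of all searches covered by the top 1, 2, ... queries.

    Sort query counts from most to least popular, then take a running
    sum over the total; shares[i] is the fraction of searches covered
    by the top i+1 queries.
    """
    counts = sorted(query_counts.values(), reverse=True)
    total = sum(counts)
    shares = []
    running = 0
    for c in counts:
        running += c
        shares.append(running / total)
    return shares

# Illustrative counts (100 searches over 5 queries), not real data:
counts = {"app": 50, "tech": 30, "shop": 10, "blog": 5, "misc": 5}
shares = cumulative_share(counts)
# The top 1 of 5 queries (20% of queries) covers 50/100 = 50% of searches.
```

Plotting the query rank (as a percentage of all queries) against these cumulative shares reproduces the chart above.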
Some interesting results:
- The top 10 queries (or 0.11% of all the queries) account for more than 4% of the searches performed
- The top 1% of the queries account for 42% of the searches performed
- The top 10% of queries account for 70% of the searches performed
- The top 20% of the queries account for 77% of the searches performed (noted by the red lines in the chart)
This last result is particularly interesting because it conforms to the Pareto Principle, also known as the 80-20 rule, which says that for many events 80% of the effects come from 20% of the causes. Examples include 80% of the land in an area being owned by 20% of the population, 80% of a company's sales coming from 20% of its customers, and so on. Here the distribution follows suit: 77% of the searches come from 20% of the queries.
Using this information, I can focus the double-checking script on the queries that affect the most people, which in turn should create a better experience and, hopefully, higher conversion rates.
Data is fun (and profitable!). :)