A Google patent, originally filed in 2003 and granted today (with Matt Cutts as one of the listed inventors), describes this problem in more detail. It also describes ways that Google could lessen the value of keywords in domain names: recognizing when a query is commercial in nature, and ranking those queries with a different algorithm that gives less weight to keyword-laden domains.
The patent describes a number of possible steps that Google might take to identify commercial queries.
The first step may be to obtain a list of user queries, possibly trimmed to keep it manageable. An example from the patent tells us that it might “retrieve those stored search queries that occur at least once per 100 million queries.” That could potentially limit the list to a few million or billion queries.
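As a rough illustration of that frequency cutoff, here is a minimal Python sketch. The function name, its parameters, and the idea of holding the query log in memory are assumptions made for illustration; the patent gives only the "once per 100 million queries" example and no implementation.

```python
from collections import Counter

def frequent_queries(query_log, per=100_000_000, min_occurrences=1):
    """Keep queries seen at least `min_occurrences` times per `per` logged queries.

    `query_log` is a hypothetical iterable of raw query strings.
    """
    counts = Counter(query_log)
    total = sum(counts.values())
    # Scale the "once per N queries" rule to the size of this particular log.
    threshold = min_occurrences * total / per
    return {query for query, count in counts.items() if count >= threshold}

# Toy log with a much smaller `per`, so the cutoff is visible on six queries.
log = ["credit cards"] * 5 + ["purple left-handed widgets"]
print(frequent_queries(log, per=3))  # {'credit cards'}
```

With a real log of many billions of queries, the default `per` value would drop only the rarest queries, which fits the patent's stated goal of keeping the list manageable.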
The next step might be to collect a list of phrases or keywords of interest to advertisers or webmasters or both. That can include phrases and keywords used in advertising or phrases/keywords used in meta tags.
A list of domain names that contain 2 or more hyphens might be gathered as well. We’re told in the patent that:
It is very common to see domain names that include a single hyphen, but when two, three, or more hyphens are present, this is often an indication that these domain names are associated with companies that are attempting to trick search engines into ranking their web pages more highly.
Similarly, Google might create a list of host names (subdomains) that contain more than a certain number of hyphens.
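The hyphen test for domain names and host names could be as simple as counting characters. The sketch below is only an assumption about how such a filter might look: the function name is hypothetical, the default of 2 comes from the "2 or more hyphens" wording above, and the patent leaves the host name threshold unspecified.

```python
def hyphen_heavy(names, min_hyphens=2):
    """Return the names that contain at least `min_hyphens` hyphens.

    `names` is a hypothetical list of domain names or host names.
    """
    return [name for name in names if name.count("-") >= min_hyphens]

names = [
    "example.com",                  # no hyphens
    "credit-cards.com",             # one hyphen, common and unremarkable
    "buy-credit-cards-online.com",  # three hyphens
]
print(hyphen_heavy(names))  # ['buy-credit-cards-online.com']
```

The same function could be run over host names with a different `min_hyphens` value, since the patent treats that cutoff as adjustable.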
The lists of user queries, domain names, and host names might be processed in a number of ways, such as:
Removing stop words, digits, punctuation, etc. The patent gives this example: “for the domain name ‘buy-credit-cards-online.com,’ server may remove the hyphens and ‘.com’ portion to leave the following phrase ‘buy credit cards online.’” In a query such as “where can I find low apr credit cards,” the words “where can I find” might be removed to leave the phrase “low apr credit cards.”
An n-gram analysis of the list of domain names and host names might be performed to find combinations of words found in that list that tend to show up frequently.
For example, assume that the domain name list includes the domain name “buy-cheap-credit-cards-online.com.” Server may form the following exemplary n-grams for this domain name: “credit cards,” “buy cards,” “cheap cards,” “buy credit cards,” “cheap credit cards,” “buy cheap cards,” “buy card online,” “cheap cards online,” “credit cards online,” “buy credit cards online,” “buy cheap credit cards,” “buy cheap credit cards online.” Other n-grams may also be formed.
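The clean-up and n-gram steps above can be sketched in a few lines of Python. Everything named here (the functions, the stop-word list, the `min_count` cutoff) is a hypothetical stand-in, since the patent does not spell out an implementation. One detail worth noticing: the patent's example includes pairs such as "buy cards" whose words are not adjacent in the domain name, so this sketch assumes the "n-grams" are order-preserving word combinations rather than strictly contiguous runs.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; the patent does not enumerate one.
STOP_WORDS = {"where", "can", "i", "find", "the", "a", "an", "to", "how", "do"}

def normalize_domain(domain):
    """Drop the last dot-separated label (e.g. ".com"), hyphens, and digits."""
    name = domain.lower().rsplit(".", 1)[0]
    return " ".join(re.findall(r"[a-z]+", name))

def normalize_query(query):
    """Lowercase a query and drop punctuation, digits, and stop words."""
    words = re.findall(r"[a-z]+", query.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)

def word_combinations(phrase, min_len=2):
    """All combinations of min_len or more words, kept in their original order."""
    words = phrase.split()
    return [
        " ".join(combo)
        for n in range(min_len, len(words) + 1)
        for combo in combinations(words, n)
    ]

def frequent_combinations(phrases, min_count=2):
    """Combinations that appear in at least `min_count` different phrases."""
    counts = Counter()
    for phrase in phrases:
        counts.update(set(word_combinations(phrase)))  # count once per phrase
    return {combo for combo, n in counts.items() if n >= min_count}

domains = ["buy-cheap-credit-cards-online.com", "low-apr-credit-cards.com"]
phrases = [normalize_domain(d) for d in domains]
print(frequent_combinations(phrases))  # {'credit cards'}
```

Note that `normalize_domain` is deliberately naive: it treats only the final label as the suffix, so a host name such as "www.example.co.uk" would need extra handling in anything beyond a toy example.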