Detecting Trademark Infringement Domains using Levenshtein Distance

I just finished reading The Domain Game by David Kesmodel and it was a very fun read (a huge thanks goes to Tim Davids, who sent me a copy of this book!). I am in awe of how Kesmodel found enough material to write so much. It was published around 2007, which made it especially interesting because not much has changed since he wrote it. The biggest changes that came to mind were the closing of the domain tasting loophole and the arrival of the new gTLDs. Other than that, it's still relevant.

The same issues plague us, mainly surrounding trademark infringement and cybersquatting. I kept reading statements from these big portfolio holders, over and over, about how hard it is to actually filter an entire portfolio and keep it clean.

Is it really? I built a proof of concept many years ago that was a list-based check. It uses a very small database of trademarks; try typing in ''. That's a very simple way to approach it, but it doesn't handle typos.

Two nights ago, when I finished reading The Domain Game, I recalled an interesting algorithm which would be more effective: Levenshtein distance. It counts the minimum number of single-character changes (insertions, deletions, or substitutions) needed to turn one string into another.

rabbit -> rabbits (1 change. add s)
rabbit -> ribbit (1 change. substitute i for a)
rabbit -> ribbits (2 changes. substitute i for a. add s)
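
To make the counting concrete, here is a minimal sketch of the textbook dynamic-programming version of Levenshtein distance. The function name `levenshtein` is my own choice for illustration, and it uses the standard definition in which replacing one letter with another counts as a single change.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # At the start of each outer iteration, previous[j] is the distance
    # between the first i-1 characters of `a` and the first j of `b`.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep a match)
            ))
        previous = current
    return previous[-1]


for word in ("rabbits", "ribbit", "ribbits"):
    print("rabbit ->", word, levenshtein("rabbit", word))
```

Only two rows of the table are kept at a time, so memory grows with the length of one string rather than with the product of both lengths.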

I have written a typo generator before and there are a few primary kinds of typos:

  • keyboard slips - hitting an adjacent key (hoogle vs google)
  • flipped letters - typed in wrong order (goolge vs google)
  • double letters - typing the same letter twice (ggoogle vs google)
  • phonetics - phonetic mistakes (gugle vs google)
  • missing letters - not typing a letter (gogle vs google)
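
Four of these categories are mechanical enough to generate. The following is a rough sketch, not my original typo generator: the function names are made up for this example, the neighbour map is an assumption that only covers left and right neighbours on the three letter rows of a QWERTY keyboard, and phonetic typos are left out because they need language knowledge rather than string surgery.

```python
# Assumed, partial QWERTY neighbour map: same-row left/right neighbours only.
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
NEIGHBOURS = {}
for row in ROWS:
    for i, key in enumerate(row):
        NEIGHBOURS[key] = row[max(i - 1, 0):i] + row[i + 1:i + 2]


def missing_letters(word):
    """Drop each letter in turn (google -> gogle)."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def double_letters(word):
    """Type each letter twice in turn (google -> ggoogle)."""
    return {word[:i] + word[i] + word[i:] for i in range(len(word))}


def flipped_letters(word):
    """Swap each adjacent pair (google -> goolge); swaps that change nothing are dropped."""
    swaps = {word[:i] + word[i + 1] + word[i] + word[i + 2:]
             for i in range(len(word) - 1)}
    return swaps - {word}


def keyboard_slips(word):
    """Replace each letter with a same-row neighbour (google -> hoogle)."""
    return {word[:i] + n + word[i + 1:]
            for i, c in enumerate(word)
            for n in NEIGHBOURS.get(c, "")}


for make in (keyboard_slips, flipped_letters, double_letters, missing_letters):
    print(make.__name__, sorted(make("google")))
```

A fuller map would also include the keys above and below each letter, which this sketch ignores.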

The only category of typos that I am aware of that isn't included is keyboard shifts, where the user's hands are misaligned and they type part or all of the word one key off: hpph;r would be google with a full right keyboard shift, and fiifkw with a full left keyboard shift.
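
A full keyboard shift is easy to reproduce in code. This is a small sketch that assumes a US QWERTY layout; `ROWS` and `keyboard_shift` are names invented for the example.

```python
# US QWERTY rows, including the punctuation keys a right shift can land on.
ROWS = ("qwertyuiop[]", "asdfghjkl;'", "zxcvbnm,./")


def keyboard_shift(word, offset):
    """Shift every key `offset` positions along its row (+1 right, -1 left).

    Returns None if a character is not in ROWS or would fall off its row.
    """
    shifted = []
    for ch in word:
        for row in ROWS:
            pos = row.find(ch)
            if pos != -1:
                break
        else:
            return None  # character is not on any known row
        new_pos = pos + offset
        if not 0 <= new_pos < len(row):
            return None  # shifted past the edge of the row
        shifted.append(row[new_pos])
    return "".join(shifted)


print(keyboard_shift("google", +1))  # hpph;r
print(keyboard_shift("google", -1))  # fiifkw
```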

Levenshtein distance for each kind of typo:

  • keyboard slips - 1 (one substitution)
  • flipped letters - 2 (two substitutions)
  • double letters - 1
  • phonetics - ???
  • missing letters - 1
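
Flipped letters cost 2 because plain Levenshtein has no "swap" operation. As an aside, and not something the rest of this post relies on, a common variant usually called optimal string alignment (a restricted form of Damerau-Levenshtein) also counts a swap of two adjacent letters as one change. A sketch, with `osa_distance` as a name chosen for illustration:

```python
def osa_distance(a: str, b: str) -> int:
    """Levenshtein distance plus one extra operation:
    swapping two adjacent characters costs 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete everything
    for j in range(cols):
        d[0][j] = j  # insert everything
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: the last two characters are swapped.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("goolge", "google"))  # 1, versus 2 with plain Levenshtein
```

With this variant every single-error typo in the list above, other than phonetics, would score 1.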

With the exception of phonetics (and keyboard shifts), all single-error typos would theoretically be within a Levenshtein distance of 2. You could easily create software to do phonetics and check against a straightforward trademark database (or even run it through Levenshtein). I am not sure full keyboard shifts would hold up as TM infringement, as they look so wildly different from the original that no confusion could occur in a consumer's mind when reading them. (I am not a lawyer and this is NOT legal advice.) So for the purposes of this post I will ignore both of them.
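
For the phonetic side, one possible approach (my assumption, since the post does not commit to a particular method) is a phonetic key such as American Soundex: two strings that sound alike tend to get the same key, so the key can be looked up directly in a trademark table. A compact sketch:

```python
# Soundex digit for each consonant group; vowels, h, w, and y have no digit.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word: str) -> str:
    """American Soundex key: first letter plus three digits, e.g. R163."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    last = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate two equal codes
        code = CODES.get(c, "")
        if code and code != last:
            key += code
        last = code  # a vowel resets `last`, so a repeat across a vowel is kept
    return (key + "000")[:4]


print(soundex("google"), soundex("gugle"))
```

Soundex was designed for English surnames, so it is coarse; it is shown here only as one example of a phonetic key.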

There is also the issue of multiple typos in one domain. In short strings, these start to look like entirely different words, so unless the domain is very long and unique, a double typo is hard to demonstrate. I also think such domains are far less frequent than single-typo domains.

Let's take a look at some actual data. Here are the 2007 Overture type-in scores for one domain and its typos, with the kind of typo in parentheses (the first entry is the correctly spelled domain):

293430 (actual), 12535 (missing), 9766 (missing), 1938 (missing), 1780 (missing), 1190 (missing), 835 (double), 766 (missing), 718 (flip), 673 (missing), 661 (double), 630 (missing), 466 (missing), 393 (flip), 353 (flip), 352 (missing), 325 (double), 294 (missing), 168 (flip), 142 (slip), 137 (double), 137 (slip), 130 (slip), 115 (double), 107 (missing), 93 (double), 92 (flip), 89 (flip), 88 (double), 75 (phonetic), 73 (slip), 65 (double), 64 (double), 63 (double), 52 (double), 50 (slip), 49 (slip), 46 (double), 45 (slip), 45 (flip), 44 (slip), 44 (slip), 43 (slip), 40 (slip), 38 (slip), 37 (flip), 36 (slip), 34 (slip), 34 (slip), 32 (slip), 31 (slip), 30 (flip), 0, 0, 0, 0, 0, 0

Missing letters are definitely the most common. Certain doubles and flips are definitely more common than others (any time there is a double letter, typing a triple instead is far more probable than doubling a random letter). Keyboard slips seem to be the least common, oddly enough. I looked at multiple months of data from multiple sources and the relative volumes hold across the timespan. I have no reason to think people's typing has improved in some categories more than others.
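
To put numbers on that, the scores above can be totalled per typo category. This sketch hard-codes the listed scores, leaving out the correctly spelled domain and the zero-score entries; `SCORES` is just a name for this example.

```python
from collections import Counter

# Type-in scores from the list above, grouped by typo category.
SCORES = {
    "missing": [12535, 9766, 1938, 1780, 1190, 766, 673, 630, 466, 352, 294, 107],
    "double": [835, 661, 325, 137, 115, 93, 88, 65, 64, 63, 52, 46],
    "flip": [718, 393, 353, 168, 92, 89, 45, 37, 30],
    "slip": [142, 137, 130, 73, 50, 49, 45, 44, 44, 43, 40, 38, 36, 34, 34, 32, 31],
    "phonetic": [75],
}

totals = Counter({kind: sum(values) for kind, values in SCORES.items()})
grand_total = sum(totals.values())

# Print categories from most to least type-in volume, with their share.
for kind, total in totals.most_common():
    share = 100 * total / grand_total
    print(f"{kind:9}{total:7}  {share:5.1f}%  ({len(SCORES[kind])} typos)")
```

Slips have the most distinct variants in the list but the lowest volume per variant, which is the oddity mentioned above.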

All of these typos, including the one phonetic typo found, are within a Levenshtein distance of 2. Of course, I used a very long domain which has no confusing typos. Some judgement has to be used when virtually every word is trademarked by someone. However, this approach would identify nearly all potential trademark conflicts in any portfolio, given a list of trademarks to check against.
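
A portfolio check along these lines could look like the sketch below. Everything in it is illustrative: the function names, the sample domains, and the one-entry trademark list are made up, and the label extraction naively takes whatever comes before the first dot. The distance check stops early once the limit can no longer be met, which matters when every domain is compared against every mark.

```python
def within_distance(a, b, limit):
    """True if the Levenshtein distance between a and b is <= limit."""
    if abs(len(a) - len(b)) > limit:
        return False  # the length gap alone already exceeds the limit
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        if min(current) > limit:
            return False  # row minimums never decrease, so give up early
        previous = current
    return previous[-1] <= limit


def flag_domains(domains, trademarks, limit=2):
    """Return (domain, trademark) pairs whose labels are within `limit` edits."""
    hits = []
    for domain in domains:
        label = domain.lower().split(".")[0]  # naive: text before the first dot
        for mark in trademarks:
            if within_distance(label, mark.lower(), limit):
                hits.append((domain, mark))
    return hits


# Made-up sample data, for illustration only.
portfolio = ["gogle.com", "goolge.net", "rabbits.org"]
print(flag_domains(portfolio, ["google"]))
```

For short trademarks a limit of 2 will match a lot of unrelated words, so the hits are candidates for a human to review rather than verdicts.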

That list is a different problem; if anyone has a solution for it, we would be one step closer to a complete solution. My initial thought would be to take the Fortune 100/500/1000 and find all their brands. Protect yourself from the biggest companies down.

So, if you've got a portfolio too big to check manually, or you run a parking company that wants to clean up or alert you to potential issues, this strategy will hopefully help. Contact me if you need help with this.