Spamcalc

Last modified on 1 April 2002.

A dns spam calculation program.
Created by Joost "Garion" Vunderink.

This document is a short introduction to the script. It deals with the following topics:

* What is dns spam?
* What does the script do?
* What are the weaknesses of the script?
* How does the script work? Which algorithms does it use?
* How did you determine the penalty scores for the words?
* Which score marks the transition from no spam to spam?
* Is there a maximum score? Which hostname causes it?
* Why do I get negative scores?
* Hey, your script did not detect this hostname as spam!
* How can I help with improving the script?


* What is dns spam?

According to RFC1178, hostnames should be constructed in a hierarchical way, for example computername.subdomain.domain.country. Making 'cool' hostnames like i.am.the.coolest.person.at.domain.country contradicts this RFC and is a bad thing for several reasons. More information on what dns spam is can be found at http://www.dnsspam.nl/ (this page is in English).

Hostnames that are considered dns spam are hostnames that contain (part of) a sentence (master.of.the.world.net), swearwords (shittywhore.armaster.roadkill.net) or other forms of unwanted textual data (666666666666666666666666666666666.sixtysix.org, 0-1-2-3-4-5.blah.com).


* What does the script do?

The script takes a hostname or a list of hostnames and determines a dns spam score for each hostname. This score indicates how spammy the hostname is: the higher the score, the higher the chance that the hostname is actually a dns spam hostname.

Note that this is just an indication. The script uses a few very simple algorithms to determine the spam score, and it is therefore not foolproof. It is not meant as a replacement for human judgement, because that would be unrealistic. It is only meant as support in finding dns spam.
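
As a rough illustration of what "determining a score for each hostname" can look like, here is a toy sketch in Python. It is not the script's real algorithm (read the file 'algorithms' for that). The function names, the word lists and every penalty value below are invented for this example only.

```python
# Toy illustration only: these word lists and point values are made up.
# They are NOT taken from Spamcalc's datafiles or its 'algorithms' file.
HYPOTHETICAL_PENALTIES = {"the": 15, "of": 15, "master": 20, "world": 20}
HYPOTHETICAL_BONUSES = {"adsl": -30, "ppp": -30}  # hints of a legitimate host


def toy_spam_score(hostname):
    """Return an illustrative score; higher means more likely dns spam."""
    score = 0
    # Split the hostname into its dot-separated fields and score each one.
    for field in hostname.lower().strip(".").split("."):
        # Exact match against the (hypothetical) list of penalised words.
        score += HYPOTHETICAL_PENALTIES.get(field, 0)
        # Substring match, so fields like 'ppp12' also earn the bonus.
        for word, bonus in HYPOTHETICAL_BONUSES.items():
            if word in field:
                score += bonus
    return score


def score_hostnames(hostnames):
    """Score a list of hostnames and return a {hostname: score} mapping."""
    return {name: toy_spam_score(name) for name in hostnames}


if __name__ == "__main__":
    examples = ["master.of.the.world.net", "ppp12.adsl.example.net"]
    for name, score in score_hostnames(examples).items():
        print(name, score)
```

With these invented values, master.of.the.world.net gets 70 points and ppp12.adsl.example.net gets -60. The numbers mean nothing by themselves; the point is only that each field of the hostname contributes to one total, and that the total is an indication, not a verdict.
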
Because the script is not even close to the dns spam detection capabilities of a human being, there is always the chance that so-called false positives and false negatives occur. False positives are non-spam hostnames that are assigned a score that is too high, and false negatives are spam hostnames that are scored too low.


* What are the weaknesses of the script?

There is one major weakness: the script is not a human being. This means that some hostnames will be misjudged. Most often this happens because a hostname uses a way of spamming, or a certain word, that is not present in the datafiles of the script.


* How does the script work? Which algorithms does it use?

Read the file 'algorithms' for this information.


* How did you determine the penalty scores for the words?

Read the file 'feedback' for a short answer to this question.


* Which score marks the transition from no spam to spam?

There is no single score that marks the boundary between no spam and spam. The script does not decide whether a hostname is dns spam; it only gives an indication of the chance that it is a dns spam hostname. However, experience shows that hostnames with a score of over 50 are usually dns spam hostnames, and a score of over 100 means that it is almost certainly a dns spam hostname.


* Is there a maximum score? Which hostname causes it?

Yes, actually there is a maximum score. So it's no use making funny domain names to see if you can beat it, because you can't. The domain consisting of only "a."'s, ending in .tv and consisting of 255 characters scores a whopping 1,918,514 points.


* Why do I get negative scores?

Congratulations, you have such a good hostname that it is considered to be ultra-non-spam. For certain properties of the hostname (for example, containing 'adsl' or 'ppp') a negative number of points is awarded. This is because legitimate hostnames with many fields sometimes get too many points.
If they contain the words 'ppp' or 'adsl', indicating that the hostname is actually non-spammy, some points are subtracted, hopefully bringing the score back to a value that does not indicate a spam hostname.


* Hey, your script did not detect this hostname as spam!

I am not surprised. The word lists are still in a very embryonic state; they are far from complete. Your spammy hostname probably uses certain words that are not recognized by the script yet. This is why feedback is very important: the lists of spamwords must grow for the script to work better.


* How can I help with improving the script?

The best way to help is to create datafiles and send them to me. Suggestions about the script itself are very welcome too, of course. More information about this can be found in the "feedback" file.


I hope you enjoy the script. Any feedback is welcome; you can email it to joost@carnique.nl or talk to me on IRCNet (my nick there is Garion).

Joost Vunderink.