Abstract
With more than 350 million active domain names and at least 200,000 newly registered domains per day, it is technically and economically challenging for Internet intermediaries involved in domain registration and hosting to monitor them and accurately assess whether they are benign, likely registered with malicious intent, or have been compromised. This observation motivates the design and deployment of automated approaches to support investigators in preventing or effectively mitigating security threats. However, building a domain name classification system suitable for deployment in an operational environment requires meticulous design: from feature engineering and acquiring the underlying data to handling missing values resulting from, for example, data collection errors. The design flaws in some of the existing systems make them unsuitable for such usage despite their high theoretical accuracy. Even worse, they may lead to erroneous decisions, for example, by registrars, such as suspending a benign domain name that has been compromised at the website level, causing collateral damage to the legitimate registrant and website visitors.
In this paper, we propose novel approaches to designing domain name classifiers that overcome the shortcomings of some existing systems. We validate these approaches with a prototype based on the COMAR (COmpromised versus MAliciously Registered domains) system focusing on its careful design, automated and reliable ground truth generation, feature selection, and the analysis of the extent of missing values. First, our classifier takes advantage of automatically generated ground truth based on publicly available domain name registration data. We then generate a large number of machine-learning models, each dedicated to handling a set of missing features: if we need to classify a domain name with a given set of missing values, we use the model without the missing feature set, thus allowing classification based on all other features. We estimate the importance of features using scatter plots and analyze the extent of missing values due to measurement errors.
Finally, we apply the COMAR classifier to unlabeled phishing URLs and find, among other things, that 73% of corresponding domain names are maliciously registered. In comparison, only 27% are benign domains hosting malicious websites. The proposed system has been deployed at two ccTLD registry operators to support their anti-fraud practices.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
References
Alowaisheq, E., et al.: Cracking the wall of confinement: understanding and analyzing malicious domain take-downs. In: Proceedings of NDSS (2019)
Amazon: Alexa: SEO and Competitive Analysis Software (2022). https://www.alexa.com/
Anti-Phishing Working Group: Global phishing survey: Trends and domain name use in 2016 (2016). https://docs.apwg.org/reports/APWG_Global_Phishing_Report_2015-2016.pdf
Antonakakis, M., Perdisci, R., Dagon, D., Lee, W., Feamster, N.: Building a dynamic reputation system for DNS. In: Proceedings of USENIX Security, p. 18 (2010)
Bayer, J., et al.: Study on domain name system (DNS) abuse: technical report. arXiv preprint arXiv:2212.08879 (2022)
Bilge, L., Kirda, E., Kruegel, C., Balduzzi, M.: EXPOSURE: finding malicious domains using passive DNS analysis. In: Proceedings of 18th NDSS (2011)
Bilge, L., Sen, S., Balzarotti, D., Kirda, E., Kruegel, C.: Exposure: a passive DNS analysis service to detect and report malicious domains. ACM Trans. Inf. Syst. Secur. 16(4) (2014)
Corona, I., et al.: DeltaPhish: detecting phishing webpages in compromised websites. arXiv:1707.00317 (2017)
Daigle, L.: Whois protocol specification. Technical report, RFC Editor (2004)
De Silva, R., Nabeel, M., Elvitigala, C., Khalil, I., Yu, T., Keppitiyagama, C.: Compromised or attacker-owned: a large scale classification and study of hosting domains of malicious URLs. In: Proceedings of USENIX Security, pp. 3721–3738 (2021)
DNS Abuse Framework. https://dnsabuseframework.org/
Donders, A.R.T., van der Heijden, G.J., Stijnen, T., Moons, K.G.: Review: a gentle introduction to imputation of missing values. J. Clin. Epidemiol. 59(10), 1087–1091 (2006)
Emmanuel, T., Maupong, T., Mpoeleng, D., Semong, T., Mphago, B., Tabona, O.: A survey on missing data in machine learning. J. Big Data 8 (2021)
Farsight Security: Passive DNS Historical Internet Database: Farsight DNSDB (2022). https://www.farsightsecurity.com/solutions/dnsdb/
Felegyhazi, M., Kreibich, C., Paxson, V.: On the potential of proactive domain blacklisting. In: Proceedings of 3rd USENIX LEET (2010)
Frosch, T., Kührer, M., Holz, T.: Predentifier: detecting botnet C &C domains from passive DNS data. In: Zeilinger, M., Schoo, P., Hermann, E. (eds.) Advances in IT Early Warning, pp. 78–90. AISEC (2013)
Google: Certificate Transparency. https://certificate.transparency.dev/
Google Safe Browsing. https://safebrowsing.google.com/
Halvorson, T., Der, M.F., Foster, I., Savage, S., Saul, L.K., Voelker, G.M.: From. academy to.zone: an analysis of the new TLD land rush. In: Proceedings of IMC, pp. 381–394 (2015)
Hao, S., Kantchelian, A., Miller, B., Paxson, V., Feamster, N.: PREDATOR: proactive recognition and elimination of domain abuse at time-of-registration. In: Proceedings of ACM SIGSAC, pp. 1568–1579 (2016)
Hollenbeck, S.: Extensible Provisioning Protocol (EPP) Domain Name Mapping. RFC 3731, RFC Editor (2004)
ICANN: EPP Status Codes | What Do They Mean, and Why Should I Know? https://www.icann.org/resources/pages/epp-status-codes-2014-06-16-en
Internet Archive: Wayback Machine. https://archive.org/web/
Kheir, N., Tran, F., Caron, P., Deschamps, N.: Mentor: positive DNS reputation to skim-off benign domains in botnet C &C blacklists. In: Cuppens-Boulahia, N., Cuppens, F., Jajodia, S., Abou El Kalam, A., Sans, T. (eds.) SEC 2014. IAICT, vol. 428, pp. 1–14. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-55415-5_1
Kintis, P., et al.: Hiding in plain sight. In: Proceedings of ACM SIGSAC (2017)
Kohavi, R.: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In: Proceedings of 14th IJCAI, vol. 2, pp. 1137–1143 (1995)
Korczyński, M., Tajalizadehkhoob, S., Noroozian, A., Wullink, M., Hesselman, C., van Eeten, M.: Reputation metrics design to improve intermediary incentives for security of TLDs. In: Proceedings of IEEE Euro SP (2017)
Korczyński, M., et al.: Cybercrime after the sunrise: a statistical analysis of DNS abuse in new gTLDs. In: Proceedings of ACM ASIACCS (2018)
Le Page, S., Jourdan, G.-V., Bochmann, G.V., Onut, I.-V., Flood, J.: Domain classifier: compromised machines versus malicious registrations. In: Bakaev, M., Frasincar, F., Ko, I.-Y. (eds.) ICWE 2019. LNCS, vol. 11496, pp. 265–279. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-19274-7_20
Le Pochat, V., Van Goethem, T., Tajalizadehkhoob, S., Korczyński, M., Joosen, W.: Tranco: a research-oriented top sites ranking hardened against manipulation. In: Proceedings of NDSS. Internet Society (2019)
Le Pochat, V., et al.: A practical approach for taking down avalanche botnets under real-world constraints. In: Proceedings of 27th NDSS (2020)
Liu, S., Foster, I., Savage, S., Voelker, G.M., Saul, L.K.: Who is.Com? learning to parse WHOIS records. In: Proceedings of IMC, pp. 369–380 (2015)
Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In: Proceeding of 15th ACM SIGKDD ICKDDM, pp. 1245–1254. KDD (2009)
Maroofi, S., Korczyński, M., Hesselman, C., Ampeau, B., Duda, A.: COMAR: classification of compromised versus maliciously registered domains. In: Proceedings of IEEE EuroS &P, pp. 607–623 (2020)
Matthews, B.: Comparison of the predicted and observed secondary structure of T4 Phage Lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Struct. 405(2), 442–451 (1975)
Moura, G.C.M., Müller, M., Davids, M., Wullink, M., Hesselman, C.: Domain names abuse and TLDs: from monetization towards mitigation. In: Proceedings of IFIP/IEEE, pp. 1077–1082 (2017)
Namecheap. https://www.namecheap.com/
Newton, A., Hollenbeck, S.: Registration data access protocol (RDAP) query format. Technical report, RFC Editor (2015)
OpenPhish. https://openphish.com/
PhishLabs: Abuse of HTTPS on Nearly Three-Fourths of all Phishing Sites (2020). https://www.phishlabs.com/blog/abuse-of-https-on-nearly-three-fourths-of-all-phishing-sites/
PhisLabs: https://www.phishlabs.com/
Sectigo Limited: Sectigo®Official - SSL Certificate Authority & PKI Solutions. https://sectigo.com/
SiteAdvisor, M.: https://www.siteadvisor.com/
Spamhaus. https://www.spamhaus.org/
Spooren, J., Vissers, T., Janssen, P., Joosen, W., Desmet, L.: Premadoma: an operational solution for DNS registries to prevent malicious domain registrations. In: 35th ACSAC, pp. 557–567 (2019)
SURBL. https://surbl.org/
Tajalizadehkhoob, S., Böhme, R., Gañán, C., Korczyński, M., Eeten, M.V.: Rotten apples or bad harvest? what we are measuring when we are measuring abuse. ACM Trans. Internet Technol. 18(4) (2018)
Tajalizadehkhoob, S., et al.: Herding vulnerable cats: a statistical approach to disentangle joint responsibility for web security in shared hosting. In: Proceedings of ACM SIGSAC, pp. 553–567 (2017)
Ulevitch, D.: PhishTank Join the fight Against Phishing (2006). https://phishtank.org/
URIBL. https://www.uribl.com/
Wang, Y.M., Beck, D., Wang, J., Verbowski, C., Daniels, B.: Strider typo-patrol: discovery and analysis of systematic typo-squatting. In: Proceedings of USENIX Association, vol. 2, p. 5 (2006)
Zhang, P., et al.: CrawlPhish: large-scale analysis of client-side cloaking techniques in phishing. In: Proceedings of IEEE S &P, pp. 1109–1124 (2021)
Acknowledgments
We thank Benoît Ampeau, Marc van der Wal (AFNIC) and the anonymous reviewers for their valuable feedback, Anti-Phishing Working Group, OpenPhish, and PhishTank for providing access to their URL blacklists. This work has been carried out in the framework of the COMAR project funded by SIDN, the .NL Registry and AFNIC, the .FR Registry. It was partially supported by the Grenoble Alpes Cybersecurity Institute (under contract ANR-15-IDEX-02), and the French Ministry of Research (PERSYVAL-Lab project under contract ANR-11-LABX-0025-01, and DiNS project under contract ANR-19-CE25-0009-01).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Appendices
Appendix
A Machine Learning Metrics
where TN, TP, FN, and FP represent the numbers of true negative, true positive, false negative, and false positive, respectively. We refer to compromised domains as positives and to maliciously registered ones as negatives. Accuracy is the proportion of correctly predicted labels among all samples. We also make use of a Matthews Correlation Coefficient (MCC) as defined in Eq. 2 [35]. This metric was developed to evaluate the quality of a binary classification and its values vary between –1 and +1, where +1 means perfect prediction (the best score), 0 is equivalent to random results, and –1 shows that all samples were misclassified (the worst score). In contrast to accuracy, MCC provides a more realistic metric for imbalanced datasets such as ours.
B Scatter Plots of Probability Changes
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Bayer, J. et al. (2023). Operational Domain Name Classification: From Automatic Ground Truth Generation to Adaptation to Missing Values. In: Brunstrom, A., Flores, M., Fiore, M. (eds) Passive and Active Measurement. PAM 2023. Lecture Notes in Computer Science, vol 13882. Springer, Cham. https://doi.org/10.1007/978-3-031-28486-1_24
Download citation
DOI: https://doi.org/10.1007/978-3-031-28486-1_24
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28485-4
Online ISBN: 978-3-031-28486-1
eBook Packages: Computer ScienceComputer Science (R0)