Operational Domain Name Classification: From Automatic Ground Truth Generation to Adaptation to Missing Values

  • Conference paper
Passive and Active Measurement (PAM 2023)

Abstract

With more than 350 million active domain names and at least 200,000 newly registered domains per day, it is technically and economically challenging for Internet intermediaries involved in domain registration and hosting to monitor them and accurately assess whether they are benign, likely registered with malicious intent, or have been compromised. This observation motivates the design and deployment of automated approaches to support investigators in preventing or effectively mitigating security threats. However, building a domain name classification system suitable for deployment in an operational environment requires meticulous design: from feature engineering and acquiring the underlying data to handling missing values resulting from, for example, data collection errors. The design flaws in some of the existing systems make them unsuitable for such usage despite their high theoretical accuracy. Even worse, they may lead to erroneous decisions, for example, by registrars, such as suspending a benign domain name that has been compromised at the website level, causing collateral damage to the legitimate registrant and website visitors.

In this paper, we propose novel approaches to designing domain name classifiers that overcome the shortcomings of some existing systems. We validate these approaches with a prototype based on the COMAR (COmpromised versus MAliciously Registered domains) system focusing on its careful design, automated and reliable ground truth generation, feature selection, and the analysis of the extent of missing values. First, our classifier takes advantage of automatically generated ground truth based on publicly available domain name registration data. We then generate a large number of machine-learning models, each dedicated to handling a set of missing features: if we need to classify a domain name with a given set of missing values, we use the model without the missing feature set, thus allowing classification based on all other features. We estimate the importance of features using scatter plots and analyze the extent of missing values due to measurement errors.
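The idea of keeping one model per set of missing features can be sketched as follows. This is a minimal illustration, not the COMAR implementation: the feature-set names (`whois`, `dns`, `content`), the single numeric value per feature set, and the toy nearest-centroid classifier are all assumptions made for the example, and the real system uses its own feature sets and learning algorithm.

```python
from itertools import combinations

# Hypothetical feature-set names; the paper's actual feature sets differ.
FEATURE_SETS = ("whois", "dns", "content")


class CentroidModel:
    """Toy nearest-centroid classifier standing in for a real ML model."""

    def __init__(self, used_sets):
        self.used_sets = tuple(used_sets)
        self.centroids = {}

    def fit(self, samples, labels):
        sums, counts = {}, {}
        for sample, label in zip(samples, labels):
            vec = [sample[name] for name in self.used_sets]
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, value in enumerate(vec):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {
            label: [total / counts[label] for total in acc]
            for label, acc in sums.items()
        }
        return self

    def predict(self, sample):
        vec = [sample[name] for name in self.used_sets]

        def dist(label):
            return sum((a - b) ** 2 for a, b in zip(vec, self.centroids[label]))

        return min(self.centroids, key=dist)


def train_model_bank(samples, labels):
    """Train one model per non-empty subset of feature sets.

    Each model is stored under the set of feature sets it does NOT use.
    """
    bank = {}
    for size in range(1, len(FEATURE_SETS) + 1):
        for used in combinations(FEATURE_SETS, size):
            missing = frozenset(FEATURE_SETS) - frozenset(used)
            bank[missing] = CentroidModel(used).fit(samples, labels)
    return bank


def classify(bank, sample):
    """Select the model trained without the sample's missing feature sets."""
    missing = frozenset(n for n in FEATURE_SETS if sample.get(n) is None)
    if missing not in bank:
        raise ValueError("all feature sets are missing; cannot classify")
    return bank[missing].predict(sample)


# Made-up training data, complete in every feature set.
TRAIN_SAMPLES = [
    {"whois": 0.9, "dns": 0.8, "content": 0.7},
    {"whois": 0.8, "dns": 0.9, "content": 0.9},
    {"whois": 0.1, "dns": 0.2, "content": 0.1},
    {"whois": 0.2, "dns": 0.1, "content": 0.3},
]
TRAIN_LABELS = [
    "compromised", "compromised",
    "maliciously_registered", "maliciously_registered",
]

bank = train_model_bank(TRAIN_SAMPLES, TRAIN_LABELS)
# WHOIS collection failed for this domain, so the model without WHOIS is used.
print(classify(bank, {"whois": None, "dns": 0.75, "content": 0.8}))
```

With three feature sets the bank holds seven models (every non-empty subset). The number of models grows exponentially with the number of feature sets, which is why the sketch groups features into sets rather than treating each feature separately.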

Finally, we apply the COMAR classifier to unlabeled phishing URLs and find, among other things, that 73% of corresponding domain names are maliciously registered. In comparison, only 27% are benign domains hosting malicious websites. The proposed system has been deployed at two ccTLD registry operators to support their anti-fraud practices.

Notes

  1. https://www.verisign.com/assets/domain-name-report-Q22022.pdf

  2. https://zonefiles.io

  3. https://github.com/korlabsio/urlshortener

  4. https://github.com/korlabsio/subdomain_providers

  5. https://www.spamhaus.org/statistics/tlds/

  6. https://support.alexa.com/hc/en-us/articles/4410503838999-We-retired-Alexa-com-on-May-1-2022

  7. https://developers.google.com/speed/public-dns/docs/isp

References

  1. Alowaisheq, E., et al.: Cracking the wall of confinement: understanding and analyzing malicious domain take-downs. In: Proceedings of NDSS (2019)

  2. Amazon: Alexa: SEO and Competitive Analysis Software (2022). https://www.alexa.com/

  3. Anti-Phishing Working Group: Global phishing survey: Trends and domain name use in 2016 (2016). https://docs.apwg.org/reports/APWG_Global_Phishing_Report_2015-2016.pdf

  4. Antonakakis, M., Perdisci, R., Dagon, D., Lee, W., Feamster, N.: Building a dynamic reputation system for DNS. In: Proceedings of USENIX Security, p. 18 (2010)

  5. Bayer, J., et al.: Study on domain name system (DNS) abuse: technical report. arXiv preprint arXiv:2212.08879 (2022)

  6. Bilge, L., Kirda, E., Kruegel, C., Balduzzi, M.: EXPOSURE: finding malicious domains using passive DNS analysis. In: Proceedings of 18th NDSS (2011)

  7. Bilge, L., Sen, S., Balzarotti, D., Kirda, E., Kruegel, C.: Exposure: a passive DNS analysis service to detect and report malicious domains. ACM Trans. Inf. Syst. Secur. 16(4) (2014)

  8. Corona, I., et al.: DeltaPhish: detecting phishing webpages in compromised websites. arXiv:1707.00317 (2017)

  9. Daigle, L.: Whois protocol specification. Technical report, RFC Editor (2004)

  10. De Silva, R., Nabeel, M., Elvitigala, C., Khalil, I., Yu, T., Keppitiyagama, C.: Compromised or attacker-owned: a large scale classification and study of hosting domains of malicious URLs. In: Proceedings of USENIX Security, pp. 3721–3738 (2021)

  11. DNS Abuse Framework. https://dnsabuseframework.org/

  12. Donders, A.R.T., van der Heijden, G.J., Stijnen, T., Moons, K.G.: Review: a gentle introduction to imputation of missing values. J. Clin. Epidemiol. 59(10), 1087–1091 (2006)

  13. Emmanuel, T., Maupong, T., Mpoeleng, D., Semong, T., Mphago, B., Tabona, O.: A survey on missing data in machine learning. J. Big Data 8 (2021)

  14. Farsight Security: Passive DNS Historical Internet Database: Farsight DNSDB (2022). https://www.farsightsecurity.com/solutions/dnsdb/

  15. Felegyhazi, M., Kreibich, C., Paxson, V.: On the potential of proactive domain blacklisting. In: Proceedings of 3rd USENIX LEET (2010)

  16. Frosch, T., Kührer, M., Holz, T.: Predentifier: detecting botnet C&C domains from passive DNS data. In: Zeilinger, M., Schoo, P., Hermann, E. (eds.) Advances in IT Early Warning, pp. 78–90. AISEC (2013)

  17. Google: Certificate Transparency. https://certificate.transparency.dev/

  18. Google Safe Browsing. https://safebrowsing.google.com/

  19. Halvorson, T., Der, M.F., Foster, I., Savage, S., Saul, L.K., Voelker, G.M.: From .academy to .zone: an analysis of the new TLD land rush. In: Proceedings of IMC, pp. 381–394 (2015)

  20. Hao, S., Kantchelian, A., Miller, B., Paxson, V., Feamster, N.: PREDATOR: proactive recognition and elimination of domain abuse at time-of-registration. In: Proceedings of ACM SIGSAC, pp. 1568–1579 (2016)

  21. Hollenbeck, S.: Extensible Provisioning Protocol (EPP) Domain Name Mapping. RFC 3731, RFC Editor (2004)

  22. ICANN: EPP Status Codes | What Do They Mean, and Why Should I Know? https://www.icann.org/resources/pages/epp-status-codes-2014-06-16-en

  23. Internet Archive: Wayback Machine. https://archive.org/web/

  24. Kheir, N., Tran, F., Caron, P., Deschamps, N.: Mentor: positive DNS reputation to skim-off benign domains in botnet C&C blacklists. In: Cuppens-Boulahia, N., Cuppens, F., Jajodia, S., Abou El Kalam, A., Sans, T. (eds.) SEC 2014. IAICT, vol. 428, pp. 1–14. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-55415-5_1

  25. Kintis, P., et al.: Hiding in plain sight. In: Proceedings of ACM SIGSAC (2017)

  26. Kohavi, R.: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In: Proceedings of 14th IJCAI, vol. 2, pp. 1137–1143 (1995)

  27. Korczyński, M., Tajalizadehkhoob, S., Noroozian, A., Wullink, M., Hesselman, C., van Eeten, M.: Reputation metrics design to improve intermediary incentives for security of TLDs. In: Proceedings of IEEE EuroS&P (2017)

  28. Korczyński, M., et al.: Cybercrime after the sunrise: a statistical analysis of DNS abuse in new gTLDs. In: Proceedings of ACM ASIACCS (2018)

  29. Le Page, S., Jourdan, G.-V., Bochmann, G.V., Onut, I.-V., Flood, J.: Domain classifier: compromised machines versus malicious registrations. In: Bakaev, M., Frasincar, F., Ko, I.-Y. (eds.) ICWE 2019. LNCS, vol. 11496, pp. 265–279. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-19274-7_20

  30. Le Pochat, V., Van Goethem, T., Tajalizadehkhoob, S., Korczyński, M., Joosen, W.: Tranco: a research-oriented top sites ranking hardened against manipulation. In: Proceedings of NDSS. Internet Society (2019)

  31. Le Pochat, V., et al.: A practical approach for taking down avalanche botnets under real-world constraints. In: Proceedings of 27th NDSS (2020)

  32. Liu, S., Foster, I., Savage, S., Voelker, G.M., Saul, L.K.: Who is .com? Learning to parse WHOIS records. In: Proceedings of IMC, pp. 369–380 (2015)

  33. Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In: Proceedings of 15th ACM SIGKDD ICKDDM, pp. 1245–1254. KDD (2009)

  34. Maroofi, S., Korczyński, M., Hesselman, C., Ampeau, B., Duda, A.: COMAR: classification of compromised versus maliciously registered domains. In: Proceedings of IEEE EuroS&P, pp. 607–623 (2020)

  35. Matthews, B.: Comparison of the predicted and observed secondary structure of T4 Phage Lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Struct. 405(2), 442–451 (1975)

  36. Moura, G.C.M., Müller, M., Davids, M., Wullink, M., Hesselman, C.: Domain names abuse and TLDs: from monetization towards mitigation. In: Proceedings of IFIP/IEEE, pp. 1077–1082 (2017)

  37. Namecheap. https://www.namecheap.com/

  38. Newton, A., Hollenbeck, S.: Registration data access protocol (RDAP) query format. Technical report, RFC Editor (2015)

  39. OpenPhish. https://openphish.com/

  40. PhishLabs: Abuse of HTTPS on Nearly Three-Fourths of all Phishing Sites (2020). https://www.phishlabs.com/blog/abuse-of-https-on-nearly-three-fourths-of-all-phishing-sites/

  41. PhishLabs. https://www.phishlabs.com/

  42. Sectigo Limited: Sectigo® Official - SSL Certificate Authority & PKI Solutions. https://sectigo.com/

  43. SiteAdvisor, M.: https://www.siteadvisor.com/

  44. Spamhaus. https://www.spamhaus.org/

  45. Spooren, J., Vissers, T., Janssen, P., Joosen, W., Desmet, L.: Premadoma: an operational solution for DNS registries to prevent malicious domain registrations. In: 35th ACSAC, pp. 557–567 (2019)

  46. SURBL. https://surbl.org/

  47. Tajalizadehkhoob, S., Böhme, R., Gañán, C., Korczyński, M., Eeten, M.V.: Rotten apples or bad harvest? what we are measuring when we are measuring abuse. ACM Trans. Internet Technol. 18(4) (2018)

  48. Tajalizadehkhoob, S., et al.: Herding vulnerable cats: a statistical approach to disentangle joint responsibility for web security in shared hosting. In: Proceedings of ACM SIGSAC, pp. 553–567 (2017)

  49. Ulevitch, D.: PhishTank Join the fight Against Phishing (2006). https://phishtank.org/

  50. URIBL. https://www.uribl.com/

  51. Wang, Y.M., Beck, D., Wang, J., Verbowski, C., Daniels, B.: Strider typo-patrol: discovery and analysis of systematic typo-squatting. In: Proceedings of USENIX Association, vol. 2, p. 5 (2006)

  52. Zhang, P., et al.: CrawlPhish: large-scale analysis of client-side cloaking techniques in phishing. In: Proceedings of IEEE S&P, pp. 1109–1124 (2021)

Acknowledgments

We thank Benoît Ampeau and Marc van der Wal (AFNIC) and the anonymous reviewers for their valuable feedback, and the Anti-Phishing Working Group, OpenPhish, and PhishTank for providing access to their URL blacklists. This work has been carried out in the framework of the COMAR project funded by SIDN, the .NL Registry, and AFNIC, the .FR Registry. It was partially supported by the Grenoble Alpes Cybersecurity Institute (under contract ANR-15-IDEX-02) and the French Ministry of Research (PERSYVAL-Lab project under contract ANR-11-LABX-0025-01, and DiNS project under contract ANR-19-CE25-0009-01).

Author information

Corresponding author

Correspondence to Jan Bayer.

Appendix

A Machine Learning Metrics

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$
$$ FPR = \frac{FP}{FP+TN} \qquad FNR = \frac{FN}{FN+TP} $$
$$ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}, \tag{2} $$

where TN, TP, FN, and FP represent the numbers of true negatives, true positives, false negatives, and false positives, respectively. We refer to compromised domains as positives and to maliciously registered ones as negatives. Accuracy is the proportion of correctly predicted labels among all samples; FPR and FNR are the false positive rate and the false negative rate. We also use the Matthews Correlation Coefficient (MCC) as defined in Eq. 2 [35]. This metric was developed to evaluate the quality of a binary classification. Its values vary between -1 and +1, where +1 means perfect prediction (the best score), 0 is equivalent to random results, and -1 means that all samples were misclassified (the worst score). In contrast to accuracy, MCC provides a more realistic metric for imbalanced datasets such as ours.
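The metrics above can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name and the example counts are made up, and returning 0 when a denominator is zero is a common convention rather than something stated in the paper.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, FPR, FNR, and MCC from confusion-matrix counts.

    Following the paper's convention, positives are compromised domains
    and negatives are maliciously registered ones.
    """

    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Made-up counts for a reasonably good classifier.
balanced = classification_metrics(tp=40, tn=50, fp=5, fn=5)

# Made-up imbalanced case: every sample is predicted negative.
skewed = classification_metrics(tp=0, tn=95, fp=0, fn=5)

print(balanced)
print(skewed)
```

In the second case accuracy is 0.95 even though no positive sample is detected, while MCC is 0. This is the behavior that makes MCC the more informative metric on imbalanced data.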

B Scatter Plots of Probability Changes

Fig. 10. Scatter plot of probability changes between the full model and the model without features related to WHOIS data (FS2).

Fig. 11. Scatter plot of probability changes between the full model and the model without features related to hyperlinks (FS9).
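One way to prepare the data behind such a scatter plot is to pair, for each domain, the probability predicted by the full model with the probability predicted by the model trained without a given feature set. The sketch below is an assumption about how this could be done, not the authors' procedure: the function name, the 0.5 decision threshold, and the example probabilities are all hypothetical.

```python
def probability_changes(p_full, p_reduced, threshold=0.5):
    """Pair per-domain probabilities from two models and flag label flips.

    p_full    -- probabilities from the model using all feature sets
    p_reduced -- probabilities from the model without one feature set
    threshold -- hypothetical decision threshold between the two classes
    """
    if len(p_full) != len(p_reduced):
        raise ValueError("both models must score the same domains")
    points = []
    for full, reduced in zip(p_full, p_reduced):
        points.append({
            "full": full,                # x coordinate of the scatter point
            "reduced": reduced,          # y coordinate of the scatter point
            "change": reduced - full,    # signed probability change
            "flipped": (full >= threshold) != (reduced >= threshold),
        })
    return points


# Made-up probabilities for three domains.
points = probability_changes([0.9, 0.6, 0.2], [0.85, 0.4, 0.25])
flipped = sum(point["flipped"] for point in points)
print(f"{flipped} of {len(points)} predictions change label")
```

Points far from the diagonal, and especially points whose label flips, suggest that the removed feature set carries information the remaining features cannot replace.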

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bayer, J. et al. (2023). Operational Domain Name Classification: From Automatic Ground Truth Generation to Adaptation to Missing Values. In: Brunstrom, A., Flores, M., Fiore, M. (eds) Passive and Active Measurement. PAM 2023. Lecture Notes in Computer Science, vol 13882. Springer, Cham. https://doi.org/10.1007/978-3-031-28486-1_24

  • DOI: https://doi.org/10.1007/978-3-031-28486-1_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28485-4

  • Online ISBN: 978-3-031-28486-1

  • eBook Packages: Computer Science (R0)
