Scott Helme

Stronger Than Ever: How We Turned a DDoS Attack Into a Lesson in Resilience

Scott Helme — Mon, 03 Feb 2025 14:06:34 GMT

Operating an online service like Report URI comes with the ever-present threat of attack, something we are fully aware of and prepare for as best we can. We are the regular subject of attacks, and because they are mostly handled by our robust systems and automated defences, they usually go unnoticed, but not the most recent one!

Transparency is at our core

I've been open about how things work at Report URI since the beginning, almost a decade ago. How the service has grown and scaled, technical evolutions of our infrastructure, and even the bad bits like incidents and failures, warts and all. This time around is no different and while, as you'll see soon, this wasn't a particularly notable incident with any real impact, I still want to maintain that transparency that we've demonstrated since the beginning.

For those with concerns, let me address those before I dig into the juicy technical details and the changes we've made as a result of the attack. Here's the TLDR;

We were subjected to several attempted DDoS attacks, and the first cohort didn't even raise an alarm, but on the 23rd Jan, we noticed the first impact. For a period of time, our report ingestion servers were heavily loaded and we began to down-sample inbound telemetry. We will often do this for individual customers if they have deployed a misconfiguration and are sending excessive volumes of telemetry, for example, but it is not a regular occurrence at the service level. During this period of time, customers still received telemetry, but at a reduced rate due to the downsampling. This attack subsided without any additional impact.

On the 25th Jan the attack ramped up again, but this time, they were targeting our web application and not our report ingestion servers. I've covered our infrastructure in various posts before, most recently this one, but our web app and our ingestion servers run on completely isolated infrastructure. As the attack ramped up, we were already aware and mitigating, but there was a period of ~11 minutes where unauthenticated users were not able to visit our site. If you were logged in and authenticated during this time, you were able to use our site as normal.

That's the TLDR of the attacks, and there was no other impact, so it was pretty benign overall, but also quite impressive that they were able to bring enough force to bear to have the impact they did!

The ransom demand!

Another interesting twist in this tale, and perhaps the most valuable lesson that many can take away from this, is that we also received a ransom demand during the 25th Jan attack that was clearly nonsense. My guess is that the DDoS attack was a ploy to cause panic and confusion with the goal of rushing us into paying the ransom, a common tactic that the industry sees on a regular basis. Do not fall for these tactics! As the email was sent to my personal email address, I'm sharing it here in my personal capacity for the benefit of everyone. Maybe you've seen the name before, or some of these contact details, or maybe you're just curious what a ransom email might look like!

For those familiar with our infrastructure, which poor Jacob clearly isn't, we don't even use a SQL database, so this idea that someone had found a SQL Injection vulnerability is quite entertaining! Despite that, there was a screenshot and a text file containing what was alleged to be the credentials of some of our users. Despite having no faith that this data was real, it's very easy for us to verify this, so that's exactly what we did. None of the email addresses presented in the data were registered users of our service. I was still curious though, where did this data come from?

There's only one person you turn to when you have questions about stolen credential data, and that is, of course, Troy Hunt from Have I Been Pwned!

I was able to quickly reach out to Stefán Jökull Sigurðarson, a good friend, fellow conference speaker and security nerd, along with being Employee 1.0 at HIBP. I provided the data for him to verify and of course, it came back to a common, recent breach that was completely unrelated to Report URI. This was stolen data that Jacob was trying to pass off as having come from us. His claims were false.

Collaborating with law enforcement

Despite there being no meaningful impact on our service, and having conclusively proven the claims of a data breach to be false, I still felt there would be value in passing details along to law enforcement.

The relevant authority for us to contact in the UK is the National Crime Agency, and having good contacts in this space, I was able to hand over all data that I thought might be worthwhile very quickly. Maybe the only outcome is to know that Jacob has made these threats elsewhere, and that they can be treated with an appropriate level of scepticism, or maybe nothing will come of it at all, but I felt it important to do what we can nonetheless.

With those aspects of this story now nicely covered off, it's time to dig into the technical details and the remediation steps we've already taken, and those that we plan to take!

Sessions for everyone! Or maybe not...

Like most everyone, we use cookies at Report URI to maintain authenticated state in the otherwise stateless protocol that is HTTP. When you visit our site, you can see the __Host-report_uri_sess cookie is set, and this is the cookie we use to tie to the session identifier in our session store.
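As a rough illustration of what issuing a cookie like this involves, here is a minimal sketch using Python's standard library. The `__Host-` prefix requires the `Secure` attribute, `Path=/` and no `Domain` attribute; the `HttpOnly` and `SameSite` choices, the function name and the random session identifier are assumptions for the example, not a description of our actual code.

```python
from http.cookies import SimpleCookie
import secrets


def build_session_cookie(session_id: str) -> str:
    """Build a Set-Cookie value for a __Host- prefixed session cookie."""
    cookie = SimpleCookie()
    cookie["__Host-report_uri_sess"] = session_id
    morsel = cookie["__Host-report_uri_sess"]
    # Required by the __Host- prefix: Secure, Path=/ and no Domain attribute.
    morsel["secure"] = True
    morsel["path"] = "/"
    # Assumed hardening choices for this sketch.
    morsel["httponly"] = True
    morsel["samesite"] = "Lax"
    return morsel.OutputString()


def new_session_id() -> str:
    """Generate an unguessable identifier to use as the session store key."""
    return secrets.token_urlsafe(32)
```

The cookie value is only an opaque identifier; everything else about the session lives in the session store under that key.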

Our session store is run on a dedicated Redis instance hosted on our private network, and it's quite a sizeable server given that its only job is to host our session store. Still, though, this is one area of our infrastructure where we don't mind slightly overprovisioning, making it all the more surprising that this is the part of our infrastructure that hit issues first.

As an in-memory data store, Redis is very capable of high performance operation, which is why we use it for our session store; we want the fastest possible session store we can have. Being an in-memory data store, though, Redis isn't typically geared towards persistence of data, but it does offer some persistence options. The Redis cache that handles our inbound report data, which now exceeds 1,000,000,000 telemetry items per day, has no persistence configured. This cache is designed to buffer the insane amount of inbound telemetry we receive while our consumer servers feed from the cache and process the telemetry into persistent storage in Azure Table Storage. Our session cache though, that does have persistence configured.

We use the Redis Database persistence option detailed in the Redis Persistence docs. When configured, Redis will dump a copy of the in-memory store to disk as a snapshot on a frequency that you can control. Here's our Redis config for snapshotting:

################################ SNAPSHOTTING  ################################
#
# Save the DB on disk:
#
#   save <seconds> <changes>
#
#   Will save the DB if both the given number of seconds and the given
#   number of write operations against the DB occurred.
#
#   In the example below the behaviour will be to save:
#   after 900 sec (15 min) if at least 1 key changed
#   after 300 sec (5 min) if at least 10 keys changed
#   after 60 sec if at least 1000 keys changed
#
#   Note: you can disable saving completely by commenting out all "save" lines.
#
#   It is also possible to remove all the previously configured save
#   points by adding a save directive with a single empty string argument
#   like in the following example:
#
#   save ""
#save 900 1
#save 300 10
save 60 1000
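The save points above boil down to a simple rule: a snapshot is due when, for any configured pair, both the elapsed time and the number of changes have been reached. Here is a minimal sketch of that rule; the function name and arguments are mine, chosen for illustration, and this is not how Redis implements it internally.

```python
from typing import Iterable, Tuple


def should_snapshot(
    save_points: Iterable[Tuple[int, int]],
    seconds_since_last_save: int,
    changes_since_last_save: int,
) -> bool:
    """Return True if any (seconds, changes) save point has been satisfied."""
    return any(
        seconds_since_last_save >= seconds and changes_since_last_save >= changes
        for seconds, changes in save_points
    )


# The single active save point from the config above: "save 60 1000".
SESSION_CACHE_SAVE_POINTS = [(60, 1000)]
```

With only `save 60 1000` active, a quiet period with fewer than 1,000 changed keys never triggers a snapshot, no matter how much time passes.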

We have our Redis session cache configured to take a snapshot and save it to disk every 60 seconds if at least 1,000 keys have changed in the cache. This is a fairly reasonable config that gives us some persistence if there was ever an issue with the in-memory store, and we could recover with minimal loss. Given the hardware resources of the server, I was surprised to see the following errors starting to be reported.

14618:M 25 Jan 09:26:58.034 * 1000 changes in 60 seconds. Saving...
14618:M 25 Jan 09:26:58.034 # Can't save in background: fork: Cannot allocate memory

It's odd because first, the server has an enormous amount of RAM that apparently we were low on, and second, I hadn't received any notifications that the server was low on resources, which is something I'd receive a notification about immediately. Something didn't add up, and I was quickly able to verify that the server was only using a little over 50% of the available RAM, so we still had a huge amount left.

My first thought was how does Redis generate the snapshot and save it to disk. If Redis was trying to create a duplicate of the data in memory, to then dump it to disk, that could explain it, but it didn't make sense as we'd see regular swings in the consumption of RAM on the server, something that the monitoring did not show.

Reading the docs a little more, I was able to confirm that Redis forks and, while the parent process continues to serve clients, the child process dumps the data to disk and then exits. To avoid having to make a duplicate of the data in memory for the child to dump to disk, Redis uses the 'copy-on-write' semantic of the fork() system call. This is a really awesome way of avoiding the inconvenience and resource consumption of duplicating the data just to dump it to disk and then erase it. In short, all of the data in memory is marked as read-only, and the kernel will intercept writes to the data and create a new memory location to write the incoming data, leaving the original data intact for the child process to read and dump to disk. There's a nice summary here in the man page for fork(), but this left me wondering why we were struggling with memory allocation given that fork() should have minimal impact on the memory resources of the server... It turns out, there's an FAQ for that!
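To make the fork-and-snapshot idea concrete, here is a small sketch for a Unix-like system. It is a toy: a Python dict stands in for the data set, a pipe stands in for the dump file, and the function name is an assumption. The point it shows is that the child sees the data as it was at the moment of the fork, even though the parent keeps writing.

```python
import json
import os


def snapshot_via_fork(store: dict) -> dict:
    """Fork, let the child serialise the store, and keep writing in the parent."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: it has its own view of memory as it was at fork time.
        try:
            os.close(read_fd)
            os.write(write_fd, json.dumps(store, sort_keys=True).encode())
            os.close(write_fd)
        finally:
            os._exit(0)

    # Parent: carries on serving writes while the child dumps its view.
    os.close(write_fd)
    store["session-b"] = "created after the fork"

    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return json.loads(b"".join(chunks).decode())
```

The snapshot returned by the child does not contain `session-b`, while the parent's live store does.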

The kernel is responsible for keeping track of memory resources on the system and it has three 'accounting modes' it can use to keep track of that virtual memory and manage it.

       /proc/sys/vm/overcommit_memory
              This file contains the kernel virtual memory accounting
              mode.  Values are:

                     0: heuristic overcommit (this is the default)
                     1: always overcommit, never check
                     2: always check, never overcommit

The default mode here, mode 0, is what is causing our issue with the Redis process fork(). Despite using the copy-on-write semantic, it is theoretically possible that the Redis child process created during the fork will need to create a full duplicate of the data in memory, and the system does not have enough memory to accommodate that once >50% of the RAM has been consumed, which is why the fork() call fails, causing the failure of the snapshot. Now, that's no big deal, right? We can't dump a copy of the data to disk, but the cache should still be functional. There's just one default config option we need to look at in Redis:

# By default Redis will stop accepting writes if RDB snapshots are enabled
# (at least one save point) and the latest background save failed.
# This will make the user aware (in a hard way) that data is not persisting
# on disk properly, otherwise chances are that no one will notice and some
# disaster will happen.
#
# If the background saving process will start working again Redis will
# automatically allow writes again.
#
# However if you have setup your proper monitoring of the Redis server
# and persistence, you may want to disable this feature so that Redis will
# continue to work as usual even if there are problems with disk,
# permissions, and so forth.
stop-writes-on-bgsave-error yes

By default, if the Redis snapshot fails, Redis will disable writes to the cache to preserve the data as it is no longer being backed up, and this is the point at which new sessions for authenticated users on the website can't be created. All existing sessions and data are there, and can be read, but when new sessions are created, well, they can't be, because the write to the cache will fail. This is the crux of the issue and why for a period of ~11 minutes, visitors to the site may have experienced an error.
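The behaviour described above can be modelled in a few lines. This is a toy store of my own, not Redis code: the class, method and exception names are made up for illustration, and only the `stop-writes-on-bgsave-error` semantics are taken from the config comments above.

```python
class WriteRefused(Exception):
    """Raised when the toy store refuses a write after a failed snapshot."""


class ToySessionStore:
    def __init__(self, stop_writes_on_bgsave_error: bool = True) -> None:
        self.stop_writes_on_bgsave_error = stop_writes_on_bgsave_error
        self.last_bgsave_ok = True
        self._data = {}

    def record_bgsave(self, succeeded: bool) -> None:
        """Record the outcome of the most recent background save."""
        self.last_bgsave_ok = succeeded

    def get(self, key: str):
        # Reads are always served from memory.
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        # Writes are refused while the option is on and the last save failed.
        if self.stop_writes_on_bgsave_error and not self.last_bgsave_ok:
            raise WriteRefused("last background save failed")
        self._data[key] = value
```

Existing sessions stay readable after a failed save, which matches what we saw: logged-in users carried on as normal while new sessions could not be written.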

Improvement in session handling

Having read the previous section, it will probably be obvious what most of the remediation steps are going to be, so I will summarise them here to give an overview of what we've done, and what we're going to do.

The first step was to resolve the immediate issue with the session cache, and that was to ensure the cache would keep working even if the snapshot process fails. I disabled this config option on the live server.

config set stop-writes-on-bgsave-error no

The next step was to then persist this change in the config file so that the behaviour will remain even after Redis next restarts.

# By default Redis will stop accepting writes if RDB snapshots are enabled
# (at least one save point) and the latest background save failed.
# This will make the user aware (in a hard way) that data is not persisting
# on disk properly, otherwise chances are that no one will notice and some
# disaster will happen.
#
# If the background saving process will start working again Redis will
# automatically allow writes again.
#
# However if you have setup your proper monitoring of the Redis server
# and persistence, you may want to disable this feature so that Redis will
# continue to work as usual even if there are problems with disk,
# permissions, and so forth.
stop-writes-on-bgsave-error no

Finally, we needed to allow the Redis process to successfully fork() even when the server has <50% of RAM remaining by changing how the kernel was accounting for memory. Changing to mode 1 will allow the kernel to overcommit on memory usage during the fork() and we can rely on the copy-on-write to prevent us from exceeding available RAM.

root@redis-session-cache:~# cat /proc/sys/vm/overcommit_memory
0
root@redis-session-cache:~# echo 1 > /proc/sys/vm/overcommit_memory
root@redis-session-cache:~# cat /proc/sys/vm/overcommit_memory
1
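If you'd rather check the accounting mode from a script than from a shell, a minimal sketch might look like this. The mode descriptions are taken from the man page excerpt earlier; the function names are assumptions for the example.

```python
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (this is the default)",
    1: "always overcommit, never check",
    2: "always check, never overcommit",
}


def describe_overcommit_mode(raw: str) -> str:
    """Translate the contents of overcommit_memory into a description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def read_overcommit_mode(path: str = "/proc/sys/vm/overcommit_memory") -> str:
    """Read the current accounting mode from a Linux host."""
    with open(path) as handle:
        return describe_overcommit_mode(handle.read())
```

Note that writing to the file under /proc only changes the running kernel. To keep the setting across a reboot, the usual approach is a `vm.overcommit_memory = 1` entry in the sysctl configuration.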

The final thing that we're going to change, which is almost complete at the time of writing, requires a little development work and testing. When a visitor hits our site, we create a session for them in the session cache, which in hindsight isn't really necessary, nor even a good idea. We're changing our session handling so that once deployed, a session will only be created in the cache, and a cookie assigned to the user, after a successful authentication. This would mean that the millions upon millions of hits against our website wouldn't have had any impact on our session cache, because without a successful authentication, there would never be any communication with Redis at all. Here's a screen capture of me navigating around the test site with our existing functionality, and you can see the cache hits against Redis in the background.

With the new change deployed, visitors to the site won't have a session created and won't be issued a cookie until after a successful authentication, resulting in no activity against the session cache.
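Here is a rough sketch of the idea, not our actual application code: the store, handler and parameter names are all hypothetical, and a dict with a write counter stands in for Redis.

```python
import secrets
from typing import Optional


class CountingSessionStore:
    """Stand-in for the session cache that counts how often it is written to."""

    def __init__(self) -> None:
        self.writes = 0
        self.sessions = {}

    def create(self, user: str) -> str:
        self.writes += 1
        session_id = secrets.token_urlsafe(32)
        self.sessions[session_id] = user
        return session_id


def handle_request(
    store: CountingSessionStore, user: Optional[str], credentials_valid: bool
) -> Optional[str]:
    """Return a session ID to set as a cookie, or None for anonymous visitors."""
    if user is None or not credentials_valid:
        # Anonymous or failed logins never touch the session store.
        return None
    return store.create(user)
```

With this shape, a flood of unauthenticated requests results in zero writes to the session store; only a successful login creates an entry.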

I'm hoping to have this change out in the next week or so, after it's had some extensive testing, and it will certainly help should we find ourselves in a similar situation again.

Tightening our DDoS protections

As regular readers will know, we make extensive use of Cloudflare's services at Report URI, from caching and CDN services, to bot and DDoS mitigation, Workers, and more. Whilst we have the DDoS mitigation services in place, no protection is ever 100% effective and, of course, a small portion of the traffic did make it through to our origin servers, causing the issues above. Here are our firewall events for the attack on 25th Jan, showing the quite obvious period of time that the attack took place:

The attack that happened on the 23rd Jan was much smaller in volume, but was made up of mostly POST requests with telemetry data sent to our ingestion endpoints, causing a different kind of load, but still an inconvenience.

Now that we have a sizeable sample of data to work with, I've been working my way through and outlining our plan for tightening these controls. As is often the case with security, this is a balancing act of trying to block as much malicious traffic as we can, without adversely impacting legitimate users. This will be a much slower process of making small tweaks and observing their impact over time, but as we go, our rules will become more resilient and more effective at stopping the kind of attacks we're seeing. I'll probably write this process up as a separate blog post as this one is already getting quite large, and so far we've only taken the first steps in starting to tighten our WAF rules, so keep an eye out for that in the future.

Onwards and upwards

Neither of these attacks really had any meaningful impact, none of our data was ever at risk, and most people won't have even noticed anything happened. Despite that, I still wanted to talk about this issue openly, to show our commitment to transparency, but also, to talk about some interesting technical challenges! Maybe you and your organisation will face a similar issue in the future and you can be more aware of the ransom scam, maybe the lessons we learned here are something you can use to avoid similar issues of your own in the future, or maybe this blog post was just an interesting read for you. Either way, these things happen, they happen to everyone, and they're likely to keep on happening, so here's to hoping that we can all share and learn from these experiences.

Let's Encrypt to end OCSP support in 2025

Scott Helme — Mon, 30 Dec 2024 11:01:42 GMT

Well, the writing has been on the wall for some years now, arguably over a decade, but the time has finally come where the largest CA in the World is going to drop support for the Online Certificate Status Protocol.

What is OCSP?

The Online Certificate Status Protocol is a mechanism that was created to check if an SSL (TLS) certificate is revoked. It is quite simple in operation, requiring a client that has been presented with a certificate from a server, to call back to the OCSP Responder hosted by the issuing Certificate Authority, and ask to see if the certificate has been revoked since issuance.

A certificate can be revoked for many reasons, but the most concerning one is that the private key associated with the certificate was compromised. That means it is possible that the server currently presenting the certificate to you may not actually be the site that you think you're connecting to, it could be an attacker or other hostile entity impersonating that site, and you'd have no way to tell. Knowing that the certificate was revoked would prevent you falling victim to this attack, and allow you to discard the certificate and close the connection safely.

What are the problems with OCSP?

The problems with OCSP are plenty, and Let's Encrypt point out their main concerns in their announcement to end support for OCSP, but here is my overview:

Privacy

If you visit a website, let's say adult-entertainment.com, then your browser is going to receive their certificate and then send a request to the issuing CA to say "Hey, is the certificate for adult-entertainment.com revoked?"... You're leaking your browsing activity to the CA who could now track the sites you visit in real-time!

Performance

The OCSP check is an HTTP request/response roundtrip on the network, on top of the necessary DNS and TCP roundtrips too. On top of that, we have any time spent waiting for an answer from the OCSP Responder. This is slowing us down!

Availability

What do we do if the OCSP Responder is down? Well, it turns out, clients don't really care and they just skip the check and accept the certificate anyway, making the whole thing pointless. OCSP checking adds no security!

Depending on who you speak to, there are also other valid concerns with OCSP checking, but these would be my main points. The industry has tried to respond to these over the years, looking for ways to fix the inherent issues, but none have managed to quite make it over the line.

Can OCSP be saved?

Over a decade ago (yes, really, 10+ years ago!), I wrote about OCSP Stapling, which helped to mitigate the performance and privacy costs of OCSP checking. OCSP Stapling is awesome, and everyone should support it, even now, because it does help to protect the privacy and performance of visitors to your site, but sadly, it doesn't add any tangible benefit, it only removes the negatives.

A few years after that, around 2016-2017, the ideas of OCSP Expect-Staple and OCSP Must-Staple were floated. These were potentially the solution to the problem of soft-fail within OCSP, where clients don't care if the OCSP check fails, and they just continue on regardless. Sadly, staking the availability of your site against the availability of an OCSP Responder that you don't control didn't prove to be very popular.
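The difference between soft-fail and hard-fail is easy to see as a decision function. This is a simplified sketch with made-up names, not how any particular client is implemented: `ocsp_status` is "good", "revoked", or None when no answer could be obtained, and `hard_fail` models the Must-Staple style behaviour.

```python
from typing import Optional


def accept_certificate(ocsp_status: Optional[str], hard_fail: bool) -> bool:
    """Decide whether to accept a certificate given the revocation check result."""
    if ocsp_status == "revoked":
        return False
    if ocsp_status == "good":
        return True
    # No answer: soft-fail accepts the certificate, hard-fail rejects it.
    return not hard_fail
```

Under soft-fail, an attacker holding a revoked certificate only needs to block the OCSP request to get the certificate accepted. Under hard-fail, the same outage takes the site offline, which is why it never proved popular.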

Later in 2017, I published Revocation is broken, outlining all of the above, and more, in great detail, and it's well worth a read if you want to get to the bottom of understanding the issues.

Despite the evidence, despite another post in 2020 Demonstrating that revocation checking is pointless, despite OCSP causing major issues for Apple and macOS users, OCSP lived on. But no more!

Let's Encrypt Ending OCSP Support in 2025

I know that Let's Encrypt are only one CA among many, but they are the largest out there in terms of their volume of issued certificates, and by quite a considerable margin. That's why them announcing this change is going to have such an impact, and I think it's going to surface a few issues in the wider ecosystem.

Let's Encrypt first announced their Intent to End OCSP Service in July 2024, and just recently, in December 2024, they announced that they will be Ending OCSP Support in 2025. Here's the current schedule:

January 30, 2025
OCSP Must-Staple requests will fail, unless the requesting account has previously issued a certificate containing the OCSP Must Staple extension

May 7, 2025
Prior to this date we will have added CRL URLs to certificates
On this date we will drop OCSP URLs from certificates
On this date all requests including the OCSP Must Staple extension will fail

August 6, 2025
On this date we will turn off our OCSP responders

source: https://letsencrypt.org/2024/12/05/ending-ocsp/

In an ideal world, this would cause absolutely no problems whatsoever, but we don't live in an ideal world. For over a decade now, certificates have somewhat reliably always had an OCSP URL in them for clients to check against, if they wish, and the removal of that may break some expectations. Looking at my own internal traffic at home, I can use my DNS server running Pi-Hole to check just how popular OCSP traffic is, and it turns out, it's pretty darn popular with OCSP responder domains being some of the most frequently looked up, and blocked. (Side quest: I block all OCSP DNS lookups in my home network)

It's one thing to not be able to communicate with the OCSP Responder and then soft-fail through the error, but what if you're expecting an OCSP URL to be present in the certificate, and it isn't?

That's a lot of traffic!

Can you imagine, as the CA issuing certificates, that every TLS connection established where one of your certificates was served, could potentially result in an OCSP request being sent to you? Imagine being the CA that issued the certificate for google.com, how many OCSP requests do you think you'd receive in a single day?!

I was curious, and it seems that CAs are very secretive about their numbers and I couldn't find anything reliable at all, so I decided to reach out to Josh Aas, Executive Director of Let's Encrypt, and ask! He kindly provided some information and the numbers are staggering!

83 billion CDN hits per week.
11 billion origin hits per week. 

or

~137,000 CDN hits per second.
~18,000 origin hits per second.

Given that OCSP checking is a simple HTTP request/response cycle, it is very suitable to caching, and that's how Let's Encrypt are achieving a ~87% CDN offload. The problem is that OCSP responses can't be valid for a long period of time, as that would defeat their purpose, so the cache will need to be replenished regularly. Either way, handling ~18,000 requests per-second that require database activity is still a pretty sizeable cost for a non-profit organisation like Let's Encrypt that runs entirely on sponsorships and donations. I'm sure they will be happy to see it go and spend that valuable money and effort elsewhere.
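The per-second figures and the offload percentage follow directly from the weekly numbers. Here is the arithmetic as a quick sketch, assuming the origin hits are the subset of CDN hits that could not be served from cache.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800

cdn_hits_per_week = 83_000_000_000
origin_hits_per_week = 11_000_000_000

# Average request rates over a week.
cdn_hits_per_second = cdn_hits_per_week / SECONDS_PER_WEEK
origin_hits_per_second = origin_hits_per_week / SECONDS_PER_WEEK

# Share of requests answered by the CDN without reaching the origin.
cdn_offload = 1 - origin_hits_per_week / cdn_hits_per_week
```

That works out to roughly 137,000 CDN hits and 18,000 origin hits per second, with about 87% of requests served from the CDN cache.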

The future of revocation checking

Whilst I think that, ultimately, the solution for revocation checking is certificates that are so short that we don't need revocation checking, we need something else in the interim. We're a little way off having <= 7 day validity periods on certificates, so what's going to bridge the gap between now and then?

CRLite

I've spoken about CRLite before, and it's a really awesome solution to the problem of revocation checking. You can read more in CRLite: Finally a fix for broken revocation?, but the TLDR is that it had to be a new mechanism that solved all of the existing problems:

⚖️ The size issue of CRLs.
🎭 The privacy issues of OCSP.
📈 The performance costs of online checks.
❌ The single point of failure.
👎 The soft fail approach to revocation checking.

Using a probabilistic data structure called a Bloom Filter, with full details in the linked blog post, it looks like we finally have a solution for the problem of revocation. Mostly...
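For anyone who hasn't met a Bloom Filter before, here is a minimal sketch of a single filter. It is only the basic building block: the sizes, the hashing scheme and the names are assumptions for illustration, and CRLite itself layers several filters to deal with false positives, which is not shown here.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

Adding the serial numbers of revoked certificates and then calling `might_contain` gives a definite "no" for most unrevoked certificates and a "maybe" for revoked ones, all without any network request.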

This is a concern for another blog post, but my gut feeling is that CRLite is really only viable for desktop platforms and powerful mobile devices, and probably unrealistic to consider for most other clients. Whilst that's a huge step forwards from where we are now, which is basically no revocation checking at all, it's still not an ideal scenario that will cater for all clients. That said, I do believe dropping OCSP is the right thing to do and that our efforts as an industry are better directed elsewhere. One thing I will be keeping an eye out for is any unexpected side-effects of removing those OCSP endpoints from certificates!

XSS Ranked #1 Top Threat of 2024 by MITRE and CISA

Scott Helme — Tue, 10 Dec 2024 14:35:54 GMT

As we draw near the end of 2024, MITRE have taken a look back at the security vulnerabilities discovered throughout the year and published their list of the Top 25 Most Dangerous Software Weaknesses, and Report URI is here to help you with the #1 Top Threat: XSS.

Common Weakness Enumeration

The CWE Program is a standardised way of referring to types of security vulnerabilities with a unique ID, allowing a common classification to be used for a particular type of vulnerability across industry. This makes it really easy to ensure that we're referring to the same type of issue, no matter what the source of information is.

The project is maintained and operated by MITRE, and is sponsored by the U.S. Department of Homeland Security (DHS) and the Cybersecurity and Infrastructure Security Agency (CISA), notable company to keep!

The particular CWE finding itself in the #1 spot of Most Dangerous Software Weakness of 2024 is, XSS!

CWE-79: Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting')

I guess that's impressive or concerning, depending on how you look at this, but XSS has managed to get itself promoted from the #2 spot in the 2023 ranking!

Common Vulnerabilities and Exposures

In order to see just how common these XSS vulnerabilities are, I decided to take a look at the data from the CVE Program. The CVE Program is another DHS and CISA backed project that tracks publicly disclosed cybersecurity vulnerabilities in the wild. They assign a unique ID to each vulnerability so they can be tracked, and vulnerabilities can map back to a CWE ID so we know what caused the particular vulnerability. Think of it like this:

Alpha Bank is breached (CVE-1234) with an XSS vulnerability (CWE-79).
Zulu Bank is breached (CVE-5678) with an XSS vulnerability (CWE-79).

I grabbed the latest copy of the CVE JSON data and did some quick parsing to come up with some numbers for the CWE Top 25. In the below table, you can see each type of vulnerability by CWE ID, and then how many occurrences of that type of vulnerability have been discovered in the wild. Let's just say, it's pretty clear why XSS (CWE-79) is the #1 spot, and it's not even close!

CWE ID	Vulnerabilities Caused
CWE-79	4,632
CWE-89	2,070
CWE-352	858
CWE-862	826
CWE-22	643
CWE-125	610
CWE-200	570
CWE-434	524
CWE-20	521
CWE-78	505
CWE-416	474
CWE-94	447
CWE-787	415
CWE-77	358
CWE-269	292
CWE-502	287
CWE-400	272
CWE-918	254
CWE-863	248
CWE-476	201
CWE-287	172
CWE-119	145
CWE-306	128
CWE-190	127
CWE-798	101
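For anyone wanting to try something similar, here is a rough sketch of the counting step. It is not the script I used: it assumes records in the CVE JSON 5 layout, where CWE IDs sit under `containers.cna.problemTypes[].descriptions[].cweId`, and the function name is made up for the example.

```python
from collections import Counter
from typing import Iterable


def count_cwes(records: Iterable[dict]) -> Counter:
    """Count how many CVE records reference each CWE ID."""
    counts = Counter()
    for record in records:
        cna = record.get("containers", {}).get("cna", {})
        seen = set()
        for problem_type in cna.get("problemTypes", []):
            for description in problem_type.get("descriptions", []):
                cwe_id = description.get("cweId")
                if cwe_id:
                    seen.add(cwe_id)
        # Count each CWE once per CVE, even if it is listed more than once.
        counts.update(seen)
    return counts
```

Feeding it the parsed records for a given year and calling `most_common(25)` on the result produces a ranking in the same shape as the table above.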

I think it's fair to say that we have some work to do on this, and for almost a decade now, Report URI has been working on precisely this problem, so we're ready to help.

Detecting and Mitigating XSS

Regular readers will know that I've been a fan of Content Security Policy for a very long time, and any time you see someone talking about CSP, it's almost always going to be in the context of protecting against XSS. That's because CSP is a very flexible and powerful mechanism, but XSS is the most dangerous thing it can protect you against, so that's often the focus of conversation.

CSP is, quite literally, the ultimate defence mechanism for XSS for many reasons, and Report URI makes it easy to get started with your deployment. Think of CSP as 'The Final Boss' for XSS, and we're here to make sure it's not defeated.

I'm not going to go into the technical details of CSP here as I have, quite literally, been talking about CSP for more than a decade on this very blog! What I am going to do is say that we're ready to help, and all you have to do is reach out.

Over the years, we've launched countless new features at Report URI to make our tools more effective and easier to use, we've fully aligned with the PCI DSS v4.0 requirements (6.4.3 and 11.6.1), we regularly process over 1,000,000,000 pieces of telemetry for our customers every single day, and we want to make a difference.

If you'd like more information, we have a dedicated XSS Solution page, or PCI DSS / Magecart solutions pages if that's more your bailiwick, or you can just reach out to us to get the conversation started.

Let's make sure that XSS isn't #1 in 2025!

Report URI Penetration Test 2024

Scott Helme — Wed, 04 Dec 2024 12:04:36 GMT

It's that time of year again! At Report URI, we've just been through our 5th penetration test, and as usual, we're going to publish the results, take a look at what was found, and what we're going to do about it.

Penetration Tests

We're racking up quite the tally of penetration tests now, with 5 years of testing history on show! You can see our tests from the previous years here:

Report URI Penetration Test 2020

Report URI Penetration Test 2021

Report URI Penetration Test 2022

Report URI Penetration Test 2023

Along with those, you're of course looking at our 2024 test and, if I do say so myself, we've had another good year and the results are awesome. Despite continually building and developing the product, introducing countless new features and new code, our efforts in security are paying off when it comes to being scrutinised.

The Results

Here's the summary of our 2024 findings, with only two low-rated issues showing up!

Neither of these issues is anything for us to worry about, as reflected in the Low Severity rating, but they are worth talking about.

Vulnerabilities in Outdated Dependencies Detected

This one I knew was coming and is something we were already actively working towards solving. The problem is an XSS vulnerability in the version of Bootstrap we're using, which was assigned CVE-2024-6484 and Medium Severity. Taking a look at the details, and the provided PoC, it seems that the issue is if you let an attacker inject arbitrary content into HTML attributes, including the href, they can execute arbitrary JS. Here's the PoC code...

Now, Bootstrap or not, I'm pretty sure if you let an attacker inject a javascript: URI into the href attribute of an <a> tag, then you're going to have XSS one way or another, no? Maybe I'm missing something in the nuance of this, and if there's somebody out there who can explain this to me, please drop by the comments below, or email me! There were also two other issues found that were very similar in nature, CVE-2024-6485 and CVE-2024-6531, and I'd love some feedback on these similar issues.

Either way, we're not really concerned about these issues because we don't use the impacted components of Bootstrap, we don't take user content and reflect it into the page in this way, and, of course, we have a Strict Content Security Policy in place. These reasons are ultimately why this issue was rated as a Low severity finding in our report, but we're still going to fix it.
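To make the javascript: URI point concrete, here is a minimal sketch in Python of the kind of scheme check an application could run before reflecting user-supplied content into an href. This is not how Bootstrap or Report URI handle it; the function name `is_safe_href` and the allow-list are assumptions made purely for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; an empty scheme means a relative URL.
ALLOWED_SCHEMES = {"", "http", "https", "mailto"}

# Characters browsers ignore at the start and end of a URL (C0 controls and space).
_EDGE_CHARS = "".join(chr(c) for c in range(0x21))


def is_safe_href(href: str) -> bool:
    """Return True if href uses an allow-listed scheme."""
    # Browsers strip leading/trailing control characters and spaces, and
    # drop tabs and newlines anywhere in a URL, so normalise those first.
    cleaned = href.strip(_EDGE_CHARS)
    for ch in "\t\n\r":
        cleaned = cleaned.replace(ch, "")
    # urlsplit lower-cases the scheme, so "JavaScript:" is caught too.
    return urlsplit(cleaned).scheme in ALLOWED_SCHEMES
```

A check like this only covers the href case discussed here, and it would sit alongside output encoding and a Content Security Policy rather than replace them.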

By the end of the week, we should have our patched version of Bootstrap deployed, which is the road we've decided to take as a short-term measure due to the work involved in moving to v4 or v5 at this time. It solves the problem, completely removes the risk, and comes at a relatively low cost for us, so it's a nice solution.

No Anti-Automation Protection

This one is a non-issue for me because, as we always do, we disabled all of our Cloudflare protections for the origin IP addresses of the penetration testing company during the test. This allows them unfettered access to our application and means they're actually testing our application, rather than the effectiveness of whatever WAF solution we might be using.

Part of that protection that was disabled was things like our WAF Rules, Bot Management, Rate Limiting and everything else we have configured with Cloudflare, so, there's no action required here from us.

Another Strong Year

As we approach the end of the year, our penetration test is a way to validate all of the hard work we've put in throughout the year, and goes to show that the processes and measures we have in place continue to be effective.

Report URI: Simplifying pricing and changes to free accounts

Scott Helme — Mon, 28 Oct 2024 15:26:53 GMT

We've been making great progress on developing new features at Report URI recently, and over the coming months, you're going to see many of them launched! As we've expanded the team to achieve this, and as we want to continue to grow, we're going to be making some changes to support our ongoing activities.

The Paradox of Choice

One of the first things that many will notice is that there are now fewer subscription options to choose from. Including the free-tier plan, there were a total of 9 subscription levels that you could choose between when signing up to Report URI.

Whilst this was originally designed to be helpful by providing more options to our customers, it often left us with difficulty explaining what the most appropriate subscription level was and guiding customers on which one to choose. We've now reduced this down to a much simpler selection of 4 different subscription levels and removed the sliding scale!

For any existing customers already subscribed, we will grandfather you in on your current subscription, and we will continue to honour that for as long as you wish to remain subscribed, so there will be no changes for you to contend with. For anyone joining us going forwards, we hope that this simplified selection will make the choice easier for you to get started and figure out which is the best plan for you.

Changes to free accounts

This is something that I've resisted for a really long time, but we're now announcing changes to the free plans on Report URI. As it currently stands, only 1.98% of users on Report URI are paying customers, and a whopping 98.02% of our users are on a free plan!! This is something that we can't sustain going forwards, and given relentless increases in costs, and that we now process almost 700,000,000 pieces of telemetry per day(!), we need to change. I'd like to handle this in the best way that I can, and give ourselves room to grow and bring on key staff to continue to improve our offering.

First of all, all free accounts will continue to function as they do now until 1st Feb 2025, giving 90+ days of notice about the upcoming change. Second, we don't want to push all of our free users on to the new pricing tier, so we've created a special offer for existing free users to exceed the functionality of your current free plan for only $9.99/mo going forwards, with access to what was previously our lowest paid plan!

This will give those users time to make the decision to upgrade before Feb 2025, and then be grandfathered in to take this plan forwards with them for as long as they'd like!

☕☕

I'm hoping that users on a free account can see that I'm trying to strike the best balance possible between continuing to support them as much as we can, and the ongoing needs of Report URI as we continue to grow and deliver new features. I'm also hoping that the value that we provide as a service is worth a couple of cups of coffee per month, which is all we're asking for in exchange for using our platform. To see if you've qualified, head to the Billing page in your account and look for the Special Offer notice. If you were a registered user on a free account before today, you will see this banner with a link for the offer:

Going Forwards

As I said earlier, for all currently subscribed users, this change doesn't change anything, and your account and subscription will continue to work as they do now for as long as you wish. For our free users, I hope you can understand the choice we've made here and that the value we provide is enough for you to continue using our service for a very small fee.

The 30-day free trial will remain and, as before, it comes with no commitment whatsoever. You can try out the service for the full 30 days and cancel with a single click and only a moment's notice, without ever having to pay a thing. This should allow anyone to get a full grasp of the service and what we offer, even though the free plan will no longer be available.

I'm sure I'll see you all on the other side, and if you have any questions, you can email me directly using scott@ and I'd be happy to help!

Are shorter certificates finally coming?!

Scott Helme — Tue, 15 Oct 2024 10:03:55 GMT

Regular readers will know my views on the validity period of TLS certificates, and how they definitely need to be made shorter than they currently are! We made some good progress on reducing their lifetime over the last few years, but recently, that progress seems to have stalled out... Well, now, we might have our first glimmer of hope!

The story so far

There's a lot of really good information out there on this topic, and I myself have covered this in various ways over the years. I'd definitely recommend my post on Cryptographic Agility as a good starting point, but I will pull the most pertinent information out into this post here. Looking at the historic reductions to certificate lifetimes, we can summarise them as the following maximum limits and when they were introduced:

2012 - 60 months
2015 - 39 months
2018 - 825 days
2020 - 398 days

That means that right now, the longest a certificate can be valid for is 398 days total. If you look at the cadence for change there, you can also see how we fell off the pace a little as an industry, and I've long been waiting for the announcement of the next change. Well, we don't have to wait any longer!

SC-081: Introduce Schedule of Reducing Validity and Data Reuse Periods

You can go and read the full details of the proposed ballot yourself, but let's just dive straight into the good stuff because I'm too excited to delay any further. (Also, please drop by that page and give a 'thumbs up' response to show some support!)

Here's the data from the ballot that will show how, over the next 3 years, it is proposed that we continue our efforts to reduce certificate lifetimes and improve the security of the ecosystem. I've put it in a table to make it easier to digest, and here it is!

Update 14th Nov 2024: the below table and text have been updated to reflect changes to the proposal. The first reduction to 200 days is now proposed to land in March 2026, 6 months later than originally planned, along with the second reduction to 100 days also being postponed by 6 months. The final reduction has changed from being 45 days maximum validity to 47 days maximum validity, and was delayed even further by 11 months until March 2028.

~~Certificate issued on or after~~	~~Certificate issued before~~	~~Maximum number of days for validity~~
~~September 1, 2020~~	~~September 15, 2025~~	~~398~~
~~September 15, 2025~~	~~September 15, 2026~~	~~200~~
~~September 15, 2026~~	~~April 15, 2027~~	~~100~~
~~April 15, 2027~~		~~45~~

Certificate issued on or after	Certificate issued before	Maximum number of days for validity
	March 15, 2026	398
March 15, 2026	March 15, 2027	200
March 15, 2027	March 15, 2028	100
March 15, 2028		47

First of all, this does mean that nothing will happen for almost 18 months... A shame, yes, but, I can also understand why. It's nice to give the industry time to plan and prepare for a change, and, the first change is also a smaller change too.

It's from March 2026 that change starts happening, with the new limit of 200 days on certificates. That still leaves certificates valid for a really long time, so is a gentle introduction into the reduction schedule.

This is followed by a 100 day limit on certificates a whole year later in March 2027, so again, another huge period of time to plan and prepare for the next change.

Finally, we arrive in the distant future of March 2028, when the 47 day lifetime limit will be introduced! At this point, fully automated certificate renewal is obviously the goal, and the path to get there has now been laid out.
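The updated schedule above is simple enough to express as a lookup. Here's a minimal Python sketch, assuming the proposed dates pass as written; the names `SCHEDULE`, `FINAL_LIMIT_DAYS` and `max_validity_days` are my own for illustration, and the sketch ignores the older limits that applied before the 398 day limit.

```python
from datetime import date

# (certificate issued before this date, maximum validity in days),
# taken from the updated proposal table.
SCHEDULE = [
    (date(2026, 3, 15), 398),
    (date(2027, 3, 15), 200),
    (date(2028, 3, 15), 100),
]

# Limit for certificates issued on or after March 15, 2028.
FINAL_LIMIT_DAYS = 47


def max_validity_days(issued: date) -> int:
    """Return the proposed maximum validity for a certificate issued on `issued`."""
    for cutoff, days in SCHEDULE:
        if issued < cutoff:
            return days
    return FINAL_LIMIT_DAYS
```

Something like this could sit in a renewal job to sanity-check that the lifetime you request from your CA won't exceed the limit in force on the issuance date.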

What does this mean?

It means that we're finally making the progress we need to make as an industry. It means that someone has finally been willing to be the person that steps forward and proposes the changes we need, saying the things that need to be said, but that many won't like.

For so many years now, in both of the training courses that I deliver, I've been telling people and organisations that certificates are only going to get shorter, because they simply need to!

If you haven't been preparing for automation in your certificate renewal processes already, you've missed the writing that's been on the wall for many years now. Certificates have only ever gotten shorter as time has gone by, and key industry players have been pushing for shorter certificates for a long time. If you haven't started working on automating your certificate renewal processes, then you should. The best time to start that work was yesterday, the next-best time is today, and the worst time is tomorrow.

Will this ballot pass?

Thus far, the reaction to this proposal has been exactly what I expected it would be. Those that are looking forwards and to the improvement of security are supportive, and those who are not prepared or willing to change are resisting. I'm hoping that the gradual introduction of these changes, with a nice, steady and predictable timeline will help to sway some to do the right thing, but we're just going to have to wait and see how the vote goes. If we graph the proposed changes, and look at them alongside historic changes, we can see that this is a very reasonable plan being proposed.

We should also consider what has happened in the past when this industry desperately wants change for the better, but commercial entities have tried to hold us back, being swayed by their profits rather than what's best for the average Internet user. I have my fingers crossed that we can get this done, but only time will tell... 🔒🌍✅

iOS 18 Quick Tips; Security Edition

Scott Helme — Wed, 02 Oct 2024 11:17:37 GMT

Having recently updated to iOS 18, there are a couple of features that I've immediately enabled now that they're available! I'm going to share with you what those features are, and, a security tip that has been available prior to the release of iOS 18 too.

iOS 18

There are a bunch of good reasons that you should always be making sure that your devices are fully updated, so if you have a recent iPhone, you can update it now. Head to Settings -> General -> Software Update, and your phone will automatically search for any available updates. If there is an update available, then install it, and if you're already updated, then you should see something like this.

Once you're on at least iOS 18, you're ready to continue and look at some of the security features and tips below!

Require Face ID to open any app!

I have to start with this one because it's been something that I've wanted for quite some time now! A few of my apps, like banking apps, have the ability to require Face ID as part of the login process, thus requiring it when you open the app, but so many apps do not have this option. That means that once your phone is unlocked, someone can go through almost any of the apps I have installed without any further challenge.

With iOS 18, you can now require Face ID to launch any app that you like, and it couldn't be easier to enable or disable it! Here's a video of the process, which involves long-pressing on the app icon, and then pressing 'Require Face ID', that's it!

As you can see there, the app I decided to show that on was the Tesla app, the app that allows access to unlock our car and drive our car! Of course, this app is already protected by the lock functionality of the phone, so if my phone was stolen whilst locked, I wouldn't have anything to worry about, but if someone has my unlocked phone, we have a problem! Given the recent rise in mobile phone theft right from your hands, even the Met Police have issued guidance on the problem. At least now, if your phone is snatched from your hand, the thief can't roam freely through important apps on your phone...

Hiding apps on your phone

This is another new feature that came with iOS 18 and it's the ability to hide an app from your phone so that someone with access to the unlocked device can't see it or find it. During the process of requiring Face ID above, there was another option that was to require Face ID and hide the app. Selecting this option will remove all trace of the app from your home screen and it will now be placed inside a 'Hidden' apps folder.

Of course, this 'Hidden' app folder needs to be present on all phones, all of the time, otherwise it would be obvious that you had hidden an app, and for that reason, it is. The Hidden folder also doesn't give any hints as to how many apps are in there, it looks exactly the same if there are zero, two, or eleven hidden apps.

I'm sure the use cases for this will vary enormously, but I can certainly think of a few good reasons to have this feature.

Stop a user switching to other apps

Okay, so this one isn't a new feature in iOS 18, but it is something I've found really useful to have on quite a few occasions. It's called Guided Access, and it allows you to lock your phone on a certain app, requiring Face ID to close that app or switch to another. Head to Settings -> Accessibility -> Guided Access to turn this on.

You can now turn on Guided Access with a triple-click on the lock button and there's an extra feature available too. If you want to prevent access to a certain part of the screen, you can draw an outline around it and it will create a rectangle that blocks that part of the screen too! I've had to use an Apple screenshot here as you can't create a screenshot on your device during the Guided Access setup phase.

You can block out as many parts of the screen as you like, or block out no parts of the screen, and then click Start. Once it's started, you'll see that you can't access those parts of the screen, nor can you close the app or swap to another app! If you want to end Guided Access, just double-click the lock button and Face ID will verify you to end Guided Access. You can now triple-click to activate this at any time you like, and triple-click whilst Guided Access is enabled to change the blocked out areas of your screen.

I've found this handy when having to show someone tickets on my phone in an app, when I'm showing someone photos but I don't want them to be able to switch to other apps, and a whole bunch of other times too.

If you have any other iOS security tips, drop them in the comments below!

Introducing Frame Watch: Monitor payment page activity with ease!

Scott Helme — Mon, 29 Jul 2024 13:04:42 GMT

For a long time, Report URI has been helping website owners deliver a more secure browsing experience for their users. With this latest release of a new feature, called Frame Watch, we're adding yet another capability to our platform to give you more visibility into payment processing on your site.

Payment Pages and Card Holder Data

While Report URI has been around for almost a decade now, there has been a very recent and sharp increase in the demand for our services. The industry body known as the Payment Card Industry Security Standards Council (PCI SSC) set out a document known as the Payment Card Industry Data Security Standard (PCI DSS), which provides minimum security requirements for websites handling Card Holder Data (CHD) via online payments. The latest and most significant overhaul of that standard was released in 2023 and compliance is required by March 2025, so the clock is definitely ticking.

If you'd like my overview of the latest v4.0 standard, you can read that here: PCI DSS 4.0; It's time to get serious on Magecart

There was also a minor release much more recently, and you can read my views on that here: PCI DSS 4.0.1; What's Changed?

Script Watch and Data Watch

To help organisations better meet those requirements, we added two new features to our telemetry monitoring capabilities, known as Script Watch and Data Watch. The purpose of Script Watch is to monitor for new JavaScript dependencies on your site and to notify you of any new dependencies as soon as they are spotted. Data Watch serves a very similar purpose but is looking for external data dependencies, the locations that you're sending data to from your website. If it sees you sending data to a new location for the first time, you would be notified.
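The core idea of "notify on first sighting" can be sketched in a few lines of Python. This is an illustration of the concept only, not how Script Watch or Data Watch are implemented; the function name `find_new_hosts` and the shape of its inputs are assumptions.

```python
from urllib.parse import urlsplit


def find_new_hosts(known_hosts: set, observed_urls: list) -> list:
    """Return hosts seen in observed_urls that are not yet in known_hosts."""
    new_hosts = []
    for url in observed_urls:
        host = urlsplit(url).hostname
        # Skip URLs without a host, hosts we already know, and repeats.
        if host and host not in known_hosts and host not in new_hosts:
            new_hosts.append(host)
    return new_hosts
```

Each host returned would trigger one notification and then be added to the known set, so that the same dependency doesn't alert twice.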

Both of these features have proven to be exceptionally popular and are now, by a comfortable margin, the most popular features on our service. Use of these features did lead to a common theme amongst the feedback during testing, though, and Frame Watch is the answer to those requests!

Frame Watch

Much like Script Watch and Data Watch were looking out for their own relevant activity on your site to notify you, Frame Watch will now be able to monitor for something specific too.

When embedding the JavaScript from your payment provider, all sorts of other sites can be framed and introduced to your site to handle that payment. It can be from the card issuer themself, a form of 3DS challenge where a user is required to enter a code usually sent via SMS, or one of many other possible processes. Many of our customers were asking for visibility into these processes so that they could know what was happening on their site. At the same time, this would also allow our customers to monitor for signs of malicious activity by looking for questionable sources of frames.

Frame Watch is now generally available for all customers on an appropriate plan and you can begin using it right away. If you've not yet tried out Report URI, you can sign up for a 30-day free trial and use the code FRAMEWATCH at checkout to get 50% off your first three months!! This offer is open until the end of Aug 2024 so get signed up now to lock it in.

As always, if there are any questions then please feel free to reach out to me and I'd be happy to help.

I'm a Microsoft MVP again!

Scott Helme — Mon, 15 Jul 2024 12:44:44 GMT

After getting my first MVP Award last year, I'm super happy to see that I have been renewed for 2024!

MVP Developer Security

I'm glad that all of the activities I've continued to do throughout the years have once again received this recognition, and I plan to continue those activities into the future!

It's been a busy year, and while I haven't published quite as many blogs as usual, many of my other activities have certainly ramped up!

If you want to drop by my MVP page on the Microsoft site, you can find it here! Thanks for all the support 😎

PCI DSS 4.0.1; What's Changed?

Scott Helme — Tue, 09 Jul 2024 12:50:40 GMT

Back in April 2022, I published PCI DSS 4.0; It's time to get serious on Magecart, and I was seriously impressed with the stance that the PCI SSC were taking against Magecart and other JS based threats. In this last week, PCI DSS v4.0.1 has been published with a few changes to key sections that I spoke about previously, so let's delve into what those changes are, and what the impact is.

PCI DSS v4.0

If you aren't familiar with the significant changes that came with PCI DSS v4.0, you really should read PCI DSS 4.0; It's time to get serious on Magecart. That blog post goes into detail on what the major new requirements were, and why I was so happy to see them. While we'll be referring back to the new requirements introduced in 4.0 in this blog, I will be focusing mainly on the key changes introduced in the 4.0.1 update.

PCI DSS v4.0.1

Only being a week or two old, the v4.0.1 standard is still quite fresh. As you can probably tell from the new version number, the textual changes are actually quite small, but those small changes do have a significant impact on the strength of the new requirements. I'll be focusing on changes to the 6.4.3 and 11.6.1 requirements in this post, and those requirements largely focus on mitigating Magecart and other JS based threats that exist on payment pages.

Requirement 6: Develop and Maintain Secure Systems and Software

6.4 Public-facing web applications are protected against attacks.

As one of the first major requirements that was designed to tackle the problem of Magecart and other JS based threats head on, requirement 6.4.3 was pretty strong. Here's the original text from the 4.0 version:

A method is implemented to confirm that each
script is authorized.

A method is implemented to assure the integrity
of each script.

An inventory of all scripts is maintained with
written justification as to why each is necessary.

This leaves little room for interpretation, but to ensure that there was no room for interpretation, the PCI SSC even defined the word "necessary", here with my own highlight.

Definitions
“Necessary” for this requirement means that the
entity’s review of each script justifies and confirms
why it is needed for the functionality of the
payment page to accept a payment transaction.

The script had to be necessary to accept a payment transaction, otherwise, it had to go. That would mean a huge reduction in the amount of JS on payment pages, and thus, a huge reduction in the amount of risk. Less JS, more better!

Now that we've summarised requirement 6.4.3 in the 4.0 document, let's take a look at what 6.4.3 looks like now in the 4.0.1 document, with the important change highlighted.

A method is implemented to confirm that each
script is authorized.

A method is implemented to assure the integrity
of each script.

An inventory of all scripts is maintained with
written business or technical justification as to
why each is necessary.

That's a definite softening of the requirement there and this was largely based on feedback from the industry that as previously worded, we wouldn't be allowed a whole bunch of JS on the payment page that was previously allowed. We wouldn't be able to have tracking or marketing tools, or chat bots, or widgety do-hickey things that do 'stuff', leaving us only with the JS needed to process the payment. To my mind, and as stated in my previous post, this massive reduction in JS on this one page was the main benefit of the requirement. By taking such drastic measures to reduce the amount of 3P JS, we were drastically reducing the risk too.

Alongside the change to the requirement, the definition of the word "necessary" was also removed. If we look at the 'Summary of Changes' document that the PCI SSC published, they state:

Remove Definitions that explain what “necessary” means for this requirement (“needed for the functionality of the payment page to accept a payment transaction”).

It is a shame to see such a softening of the requirement based on pressure from the industry, especially when it would have been so effective at mitigating the attacks it set out to mitigate, but this is where we currently stand.
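To show how the three bullets of 6.4.3 fit together (authorisation, integrity, and an inventory with justification), here's a minimal Python sketch. The standard does not prescribe a mechanism, so treat this as one possible approach: it uses a SHA-384 digest in the same `sha384-<base64>` form as Subresource Integrity, and the names `ScriptEntry`, `sri_sha384` and `check_script` are hypothetical.

```python
import base64
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ScriptEntry:
    """One row of a script inventory for a payment page."""
    url: str
    integrity: str      # expected "sha384-<base64>" digest of the script body
    justification: str  # written business or technical justification


def sri_sha384(body: bytes) -> str:
    """Return the SHA-384 digest of body in Subresource Integrity format."""
    digest = hashlib.sha384(body).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def check_script(inventory: dict, url: str, body: bytes) -> str:
    """Classify a script seen on the page against the inventory."""
    entry = inventory.get(url)
    if entry is None:
        return "unauthorised"  # not in the inventory at all
    if entry.integrity != sri_sha384(body):
        return "modified"      # authorised URL, unexpected content
    return "ok"
```

The inventory here is just a dict keyed by script URL; anything returning "unauthorised" or "modified" is what you'd want raised to a human.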

Requirement 11: Test Security of Systems and Networks Regularly

11.6 Unauthorized changes on payment pages are detected and responded to.

After being concerned about the changes to requirement 6.4.3, I was a little worried that requirement 11.6.1 might also face a similar reduction in the strength of the protections. I'm glad to say that this isn't the case and the changes here are a welcome clarification to the wording of the requirement. Looking at the text from the 4.0 document, we had:

A change- and tamper-detection mechanism
is deployed as follows:

To alert personnel to unauthorized modification
(including indicators of compromise, changes,
additions, and deletions) to the HTTP headers
and the contents of payment pages as received
by the consumer browser.

The mechanism is configured to evaluate the
received HTTP header and payment page.

This was clearly aimed at requiring the monitoring of a CSP header if that is the mechanism the site was using to control JS on the payment page. The problem was, the wording of the requirement didn't quite specify that and it left it open to interpretation that all HTTP headers needed to be monitored. It also sounded like the requirement had us monitoring all of the content of a payment page, rather than just the script content. In light of that, the text was updated as follows, highlight mine:

A change- and tamper-detection mechanism
is deployed as follows:

To alert personnel to unauthorized modification
(including indicators of compromise, changes,
additions, and deletions) to the security-
impacting HTTP headers and the script contents
of payment pages as received by the consumer
browser.

The mechanism is configured to evaluate the
received HTTP headers and payment pages.

This is a welcome change to the wording of the requirement and now makes it much easier for me to explain to those seeking to comply with the standard that the measures we had in place at Report URI were already sufficient. Using our CSP reporting service, we can already monitor the content of the CSP response header as a copy is sent with the report, but, the content of other HTTP response headers is not sent, meaning we can't monitor those!

Summary

In short, I do feel that the requirements as they stand in the v4.0.1 standard are still going to help to significantly impact the effectiveness of attacks like Magecart and others, but just not quite as much. I was a little disappointed to see that pressure from the industry has led to a softening of 6.4.3 in terms of what JS is allowed on the payment page, and I have faced a lot of that pressure myself in my role at Report URI. When helping organisations move towards complying with 6.4.3, there was often a lot of resistance to reducing the amount of JS on the payment page. The sales and marketing teams wanted their tracking and analytics, support wanted their chat bots, and, often, the biggest issue is that all JS is just baked into all pages across the site and changing that would be a task in itself.

Still though, I feel that the payment page doesn't hugely benefit from having any of those things present, and for the sake of cleaning up the sheer amount of JS on one page of your site, we've lost the sharp edge that this requirement previously had to cut through the risk!

PCI DSS Compliance

If you'd like help with meeting the new PCI DSS v4.0.1 requirements, specifically 6.4.3 and 11.6.1, then check out Report URI and our PCI DSS Compliance page that gives all of the information on how we can help you. You can sign up for a free trial, with no commitment, and reach out to me if you need help getting started.

Links

PCI DSS v4.0.1 document
PCI DSS v4.0 -> 4.0.1 Summary of Changes document
PCI DSS v4.0 document

Warning users of the Polyfill[.]io supply chain attack!

Scott Helme — Mon, 01 Jul 2024 14:52:46 GMT

I'm sure many of you have heard of the recent issues around the Polyfill supply chain attack. In short, a popular domain used for loading JavaScript, polyfill[.]io, recently changed hands and after that change in ownership, the new owners started to serve malware with the JavaScript. Here's how Report URI has responded so far, and our plans for future action.

Report URI

Report URI has been helping our customers deploy strong application security controls, like Content Security Policy, for almost 10 years. As part of that journey, we have developed features and capabilities that allow us to provide assistance in scenarios such as the one we currently find ourselves dealing with. In order to respond more specifically to the polyfill[.]io threat, we've made the following changes.

Threat Intelligence

Using a blend of our own internal Threat Intelligence feeds, and a selection of external Threat Intelligence data that we subscribe to, we launched a powerful new feature back in 2022 and only months later, we expanded that feature.

With the ability to look for a known Indicator of Compromise (IoC) within your telemetry and raise that within our UI, website operators can see if their site began loading JavaScript from questionable sources. Over the weekend, Report URI started flagging the use of the polyfill[.]io and cdn.polyfill[.]io domains as an IoC and any such reports will now attract the appropriate warning in our UI.

Customers with our Threat Intelligence product can head to their CSP -> Reports page right now and check the "IoC Filter" to see if they have any instances of loading JS from polyfill[.]io or any other known source of compromised JS.

Analysing your Content Security Policy

Alongside this expansion of our existing capability, we've also added a new capability. Given the unique nature of the current polyfill[.]io threat, we've added a new detection to our report ingestion pipeline. Previously, our analysis was focused on resources that were reported as being blocked on your site. Your CSP would define resources that were allowed to be present, and the browser would send a CSP Violation Report for any resources that were not permitted to be present. This is the standard functionality of Content Security Policy.

The problem we now face is that if you have allowed polyfill[.]io or cdn.polyfill[.]io in your CSP, the presence of those resources on your site would not trigger a Violation Report. You've authorised those resources to be present, so as the browser sees it, there is nothing to report on. Fortunately for us, one of the pieces of information that a browser will send us in the Violation Report is a copy of the original CSP your site sent along with the page. Here is an example JSON payload that we would receive:

By analysing the provided copy of your original policy, we can determine if you are permitting dangerous resources to be loaded on your site. Given that the polyfill[.]io and cdn.polyfill[.]io domains were previously trusted, it's possible that they are present in your CSP and now pose a threat. Such Violation Reports that indicate your CSP will allow the loading of dangerous resources will now be flagged as an IoC.
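As a rough illustration of that analysis, here's a minimal Python sketch that reads the `original-policy` field from a `csp-report` JSON body and looks for flagged hosts in any directive. It is not our ingestion pipeline; the function names are hypothetical, the host list is limited to the two domains named in this post, and it only does exact host matching, so wildcard sources such as `*.example.com` are not handled.

```python
import json

# The domains are defanged in this post; they are re-fanged here only for matching.
FLAGGED_HOSTS = {h.replace("[.]", ".") for h in ("polyfill[.]io", "cdn.polyfill[.]io")}


def parse_policy(policy: str) -> dict:
    """Split a CSP string into {directive name: [source expressions]}."""
    directives = {}
    for part in policy.split(";"):
        tokens = part.split()
        if tokens:
            # If a directive is repeated, the first occurrence is kept.
            directives.setdefault(tokens[0].lower(), tokens[1:])
    return directives


def source_host(source: str) -> str:
    """Reduce a source expression like https://host:443/path to its host."""
    host = source.lower()
    if "://" in host:
        host = host.split("://", 1)[1]
    host = host.split("/", 1)[0]
    return host.split(":", 1)[0]


def flagged_sources(report_json: str) -> list:
    """Return 'directive source' strings where the policy allows a flagged host."""
    report = json.loads(report_json)["csp-report"]
    directives = parse_policy(report.get("original-policy", ""))
    return [
        f"{name} {src}"
        for name, sources in directives.items()
        for src in sources
        if source_host(src) in FLAGGED_HOSTS
    ]
```

Any non-empty result means the policy itself permits a flagged host, regardless of which resource the report says was blocked.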

As you can see, the blocked resource in this particular case is not what is attracting the warning, but instead, it's the presence of cdn.polyfill[.]io attracting the warning!

Customer Outreach

Over the past few days we've monitored our telemetry to better understand the risk this issue poses to our customers and how we might best respond. Along with the above changes that are now available to all customers on a suitable plan, we are also beginning an outreach campaign to those customers that we feel are most at risk. Those customers are mostly on our Enterprise tier and you will be hearing directly from your account contact starting today.

PCI DSS 4.0; Certificate Transparency Monitoring is mandatory!

Scott Helme — Mon, 22 Apr 2024 09:59:54 GMT

I've previously covered two of the major new requirements coming in PCI DSS 4.0, and now it's time to take a look at another one! I've long spoken about Certificate Transparency and the major benefits it can bring to security on the Internet, and now it seems the PCI SSC have recognised that with a new requirement in PCI DSS 4.0 that mandates the use of Certificate Transparency!

PCI DSS v4.0

If you didn't see my previous blog post, and you're interested in looking at the other major new requirements, you can read my article PCI DSS 4.0; It's time to get serious on Magecart that covers them in detail. Magecart has been a considerable threat to e-commerce websites for quite a number of years now, and the new requirements in 6.4.3 and 11.6.1 are going to deal these criminal actors a major blow. This blog post is going to cover the new requirement in 4.2.1.1, though, which is going to cover the use of Certificate Transparency.

Certificate Transparency

I first wrote Certificate Transparency, an introduction all the way back in 2017, so CT is certainly not a new technology. I won't go into great detail here, but to give a brief overview, CT gives you a reliable mechanism that can be used to detect and inventory all publicly trusted certificates issued for your domains, even if someone tries to hide their existence from you. The technology behind the operation of CT is quite complicated, but fortunately for us, using CT is trivial! Over at Report URI, we added support for CT Monitoring all the way back in 2019, announced here, so we've been helping domain owners set up CT Monitoring with just a couple of clicks for almost 5 years.

Requirement 4: Protect Cardholder Data with Strong Cryptography During Transmission Over Open, Public Networks

The current objectives outlined in Requirement 4 have existed for a long time, for good reason, and have undergone some changes and additions in the v4.0 release. The new requirement we're going to focus on is 4.2.1.1 which states:

4.2.1.1 An inventory of the entity’s trusted keys and certificates used to protect PAN during transmission is maintained.

In many ways, this makes complete sense once you think about it. Requirement 4 is focused on ensuring the proper encryption of CHD when it is being transmitted, but how can you ensure that if you don't have an inventory of your keys and certificates? This is the primary reason that CT was first introduced! If you can't be sure of all of the certificates that have been issued for your domain, how can you be sure that they aren't being abused to intercept and decrypt your encrypted data? CT was, of course, first introduced to help protect all encrypted Internet traffic as a whole, and CHD in transit falls directly into that category.

Does this really make CT Monitoring mandatory?

Yes, and let me explain why. Let's say I process payments right here on scotthelme.co.uk and, of course, I have certificates and thus keys to ensure that data is encrypted in transit using TLS. I can keep an inventory of the new certificates that I purchase, and I can keep an inventory of the certificates that I have in my possession already, but those aren't the only certificates that might exist. What if someone in my organisation obtains a certificate outside of the proper process? It'd be missed from the inventory. What if someone outside of my organisation tricks a Certificate Authority into issuing them a certificate for scotthelme.co.uk by mistake? It'd be missed from the inventory. What if I simply miss one of my existing certificates from my inventory due to an oversight or a mistake? It'd be missed from the inventory.

The fact is, there are a lot of different ways that certificates might exist and not be accounted for in my inventory. Certificate Transparency Monitoring does not allow for such events to occur. If a publicly trusted certificate exists for my domain, scotthelme.co.uk, then it will be found and it will be included in my inventory. To explain how this works at a technical level is a lengthy blog post in itself, but this was one of the core design requirements of CT, this is exactly how it works and what it enables. This is the "raison d'etre" of Certificate Transparency, you can ensure that no certificates are ever missing from your inventory. What's even better is that you can get started in just a few clicks. Literally.

Setting up CT Monitoring on Report URI

As I mentioned earlier, Report URI has offered CT Monitoring as a public feature for almost 5 years, and I've been involved with CT for much longer than that. All you need to do to enable CT Monitoring is tell us which domains you want to monitor for certificate issuance. In the example above, where I'm transmitting CHD to scotthelme.co.uk, then that is the domain I would need to enable monitoring for. If you have more domains than that, then you simply need to list them all. That's it... You just need to tell us the domains, and we'll do everything else. Here's a screenshot of my configuration in my own Report URI account.

On the Filters page, you provide the list of domains that you want us to monitor. I monitor quite a few domains, because I operate a few different websites, and you may have a single entry or many entries like I do, but all you need to do is provide the list. This is now the list of domains that we will monitor for certificate issuance against on an ongoing basis, and it includes certificates that contain subdomains of those domains too. A certificate containing account.scotthelme.co.uk, for example, would be observed and recorded as part of the scotthelme.co.uk domain entry.
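To make the subdomain behaviour concrete, here is a small Python sketch of how a name found in a certificate could be mapped back to a monitored entry. The function name and the matching rules are my assumptions for illustration; the real matching logic in Report URI is not described in this post.

```python
from typing import Iterable, Optional


def monitored_entry(cert_name: str, monitored: Iterable[str]) -> Optional[str]:
    """Return the monitored domain that covers cert_name, or None if there is none."""
    name = cert_name.lower().rstrip(".")
    # Treat a wildcard name such as *.example.com as belonging to example.com.
    if name.startswith("*."):
        name = name[2:]
    for domain in monitored:
        d = domain.lower().rstrip(".")
        # Match the domain itself or any name ending in ".<domain>".
        # The leading dot stops notscotthelme.co.uk matching scotthelme.co.uk.
        if name == d or name.endswith("." + d):
            return domain
    return None


monitored = ["scotthelme.co.uk", "report-uri.com"]
print(monitored_entry("account.scotthelme.co.uk", monitored))
print(monitored_entry("notscotthelme.co.uk", monitored))
```

Comparing on a label boundary rather than with a plain suffix check is the important design choice here, otherwise look-alike domains would be recorded against the wrong entry.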

Once that list is saved, we will begin monitoring for certificate issuance against those domains. Any certificate issued will then be recorded and, as you can see in the above screenshot, you can request to receive an email notification that a new certificate has been issued if you'd like. Here's what one of those emails looks like, I received this notification only a couple of hours ago.

As you can see, the email contains the basic information from the certificate which, in the majority of cases, is going to be all that you need to determine if you need to take any action. If you would like to investigate further, there is a link to take you to the full details of the certificate for deeper investigation. If you want to browse through all of your certificates that have been issued, you can also do that in your account.

This is a snippet of the list of certificates issued in April that Report URI has detected for my domains, and you can see the certificate that corresponds to the email alert above is in the list too. You can filter and search on this data by any criteria you see in the UI, based on when it was issued, what sub/domain is in the certificate, and who issued it. You can download a copy of the certificate if you want a copy to work with and you can even view a full parsing of the certificate without having to use any tools yourself.

If you would like to view that detailed parsing, here is the link, but before I wrap up here, let's look at the specific text of the requirement again:

4.2.1.1 An inventory of the entity’s trusted keys and certificates used to protect PAN during transmission is maintained.

With CT Monitoring provided by Report URI, we're fully achieving this requirement because:

CT logs contain all publicly trusted certificates for your domain/s, and;
Your certificate contains your public key, so we can inventory those too.

All of this is live and available on Report URI right now and you can get started in minutes. Over the coming months we'll also be releasing a few UI tweaks and feature improvements to make our CT Monitoring even more powerful for your PCI DSS compliance needs, but for now, why not try it out and see how fast you can get it set up?

Google update their Minimum Viable Secure Product

Scott Helme — Tue, 28 Nov 2023 13:50:15 GMT

Back in 2021, Google launched, alongside other organisations, a new security baseline for products known as the Minimum Viable Secure Product. Now, 2 years later, they've released an update to that standard.

This is a cross-post of my article on the Probely blog, you can read the original there.

Minimum Viable Secure Product

You can read the original announcement from Google if you like, but we'll be focusing a lot more on the update released a couple of days ago. The MVSP site is also a great place to get a lot more detail on the project and track future changes or updates.

In terms of what the MVSP project is trying to achieve, I think this snippet from the site gives a really good idea of exactly what it's about:

Minimum Viable Secure Product (MVSP) is a list of essential application security controls that should be implemented in enterprise-ready products and services. The controls are designed to be simple to implement and provide a good foundation for building secure and resilient systems and services. MVSP is based on the experience of contributors in enterprise application security and has been built with contributions from a range of companies.

I've said myself many times in the past that sometimes, we need to focus on getting the basics right before we get carried away on more complex or elaborate risk reduction, and MVSP aligns extremely well with that approach. You should read through all of the requirements outlined on the site, of course, but I'm going to pick a few that are near and dear to my heart to focus on!

§1.1 External Vulnerability Reports

As I sit and write this blog post, I'm currently going through two responsible disclosure processes where I'm desperately trying to get in touch with the organisations in question. I've tried email to customer services, I've raised a support ticket, I've reached out to people listed as employees on LinkedIn and finally, I have to resort to public calls for help:

Hey @capellisport, we've made several attempts to contact you via various channels!

Does anyone have a security contact or some way of reaching an appropriate person? Please reach out, my DMs are open.
— Scott Helme (@Scott_Helme) November 21, 2023

This is downright ridiculous and it should not be this hard to get in touch with an organisation to let them know that they have a serious security issue!! That's precisely what the Security.txt file is for. You can read the full details in that blog post, but the TLDR; It's a simple text file you host with details on how people can contact you to report security vulnerabilities / responsible disclosure.

You can see mine here:
https://scotthelme.co.uk/.well-known/security.txt
https://report-uri.com/.well-known/security.txt
https://securityheaders.com/.well-known/security.txt

§1.4 External Testing

No matter how good your own security processes are, you always need another set of eyes to spot the issues you've missed. As a penetration tester for many years myself, I understand the value of such services from both sides of the conversation, as the 'hacker' conducting the test, and as the company on the receiving end. My own company, Report URI, has just had its annual penetration test completed by an external company and, as always, the full report is published for anyone to see!

I don't think you can just have an annual penetration test and then brush your hands together and say 'all good', though. Penetration tests won't find all issues, and, if you're only having a test once per year, an issue could easily sit around for 6+ months until it's discovered. Not good... That's why it's also a great idea to use a DAST solution like Probely that can scan your application for vulnerabilities on a far more regular basis.

Just like a penetration test, Probely (or any other tool), won't find all issues, but they can find issues much sooner and much cheaper than a penetration test. There's no point in spending thousands of dollars for a penetration tester to find and report issues that you could have found for hundreds of dollars instead. Not only that, but you can find them sooner, and we've all seen the graph on the cost of fixing bugs, right?

Remember that all security vulnerabilities like this are basically just bugs that need fixing! The sooner you find them, the cheaper they are to fix and the less risk you were exposed to.

§2.3 Security Headers

Yeah! Who doesn't love some Security Headers?! Not only do they recommend using Security Headers (the actual HTTP response headers), but they also recommend using Security Headers (our website!) to scan and assess your headers!

You can see the recommendation and link to us right here, and I'm super grateful for our free service to be mentioned like this. Head over to our site and you can perform a free scan that takes 2-3 seconds right now!

§3.3 Vulnerability prevention

After working in the Cyber Security world for so long, one thing that I realised was there are always solutions to any problem you may have, and often, they're quite easy. The problem that I've always come across is that people simply didn't know about the solution, and it's one of the things I focus on in both training courses I deliver. You can see full details on my Training page, but here's the summary of the two training courses I deliver:

Hack Yourself First: In collaboration with Troy Hunt, I deliver his awesome, 2-day workshop where we learn how to hack in to our demo application and then how to defend against those attacks.

Practical TLS and PKI: In collaboration with Ivan Ristic, I deliver his incredible, 2-day training where we deploy and fully configure TLS and PKI in real-world environments. This training course also ties in superbly well with MVSP §2.2 HTTPS-only and §2.8 Encryption!

Do you meet the MVSP requirements?

I'm sure that many of you reading this blog post will be able to quickly flick through the list of requirements and check them off, but can you check off all of them?

It's really interesting to see the direction that MVSP is going in and I wholeheartedly agree with everything that's in there. As the name would imply, this is the Minimum Viable Secure Product and not the Maximum Viable Secure Product, so you could and should be exceeding many of these requirements, especially if you're a security focused company like we are! I'd like to leave you with this quote from the MVSP docs, highlight my own:

Minimum Viable Secure Product (MVSP) is a list of essential application security controls that should be implemented in enterprise-ready products and services.

Report URI Penetration Test 2023

Scott Helme — Mon, 20 Nov 2023 11:05:19 GMT

It's that time of year again at Report URI, right before we start getting festive, that we have our annual penetration test and 2023 is going to be our fourth test that we publish in full.

Penetration Tests

Our previous penetration tests, which were also published publicly, were back in 2020, 2021, 2022, and now it's time for the 2023 edition. Just as it was before, there were no artificial limits placed on the scope of the penetration test and it was a 5-day test. Again, we provided our source code as part of our test because it can help the tester move much more quickly and confirm or even discover issues by looking over the code. I really stressed the point that we wanted to be absolutely confident that we didn't have any issues lurking after another year of introducing new code and bug fixes, so I requested the best tester they had on staff. I also provided as much information as I could, full featured accounts, test data, a demo of the application and access to our source code.

The Results

Maintaining the same number of findings as last year, but swapping around the ratings a little, here is the summary for our 2023 findings.

This is an outcome to be really pleased with, given that we were trying our best to uncover any issues that we may have and leaves me feeling really confident that our existing processes continue to serve us well. Let's take a look through the report in more detail and see exactly what was found.

5.1 Vulnerabilities in Outdated Dependencies Detected

Again?! This one did come as a little bit of a surprise to me and you may recognise this from the 2022 report. This finding is in a different dependency that has since had an issue identified, but I was surprised that we were able to use a dependency with a known vulnerability.

We have a GitHub action that checks all of our JS dependencies for known issues but it seems it was having trouble with this one which I'm investigating further. This has now been resolved and the dependency updated!

It was also noted that our version of Bootstrap was EOL but there are currently no known issues and we have our CSP as an additional control too. This is something we're aware of and it will be addressed in due course.

5.2 Insecure TLS/SSL Configuration

This was only raised as an info item on the report and is also something we're aware of. Given the wide variety of clients that report telemetry to us, we do have a wide array of cipher suites on offer to support them all. We do support the latest and greatest protocol versions and cipher suites, so modern clients will always have the best protection available. We won't be making any changes for this item and if you'd like, you can view our SSL Labs results.

5.3 CSP configured without ‘base-uri’ directive

We had previously and consciously not set the base-uri directive in our Content Security Policy. The <base> tag in HTML allows you to set the base URL for path-relative assets, like so:

The URL used to load the script by the browser would now be:

https://evil-cyber-hacker.com/keylogger.js

The problem the attacker would have is the JS load would still be subject to our strict script-src in our CSP and the evil-cyber-hacker.com domain is not allowed, so the script wouldn't load. Ultimately, the base-uri directive not being present wouldn't allow an attacker to do anything, but we still added it anyway!
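The original HTML example is not reproduced here, so as a stand-in, this short Python sketch shows the same URL resolution using `urllib.parse.urljoin`. The page URL, the `keylogger.js` relative path and the allow-list are assumptions made for illustration, and a browser's URL parser is not identical to Python's, although for a simple relative path like this the result is the same.

```python
from urllib.parse import urljoin, urlsplit

page_url = "https://example.com/account/"   # hypothetical page on our own site
script_src = "keylogger.js"                 # hypothetical path-relative script src

# Without a <base> tag, the relative path resolves against the page URL.
without_base = urljoin(page_url, script_src)

# With an injected <base href="https://evil-cyber-hacker.com/">, it resolves there instead.
with_injected_base = urljoin("https://evil-cyber-hacker.com/", script_src)

# A strict script-src still decides whether the resolved URL may load at all.
allowed_script_hosts = {"example.com"}      # hypothetical allow-list
would_load = urlsplit(with_injected_base).hostname in allowed_script_hosts

print(without_base)
print(with_injected_base)
print(would_load)
```

The last check mirrors the argument in the text: changing the base changes where the script is fetched from, but the fetch is still evaluated against script-src.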

Appendix

We then get on to some really interesting parts of the report that can only be found in the Appendix because they were not actually findings on the test, but they could have been if circumstances were different. I'm really proud of this section and it just goes to show how our Defense In Depth strategy can really pay off!

A1. CodeIgniter Validation Placeholders RCE

When Report URI was first built, it used the CodeIgniter framework exclusively. Over the years we have slowly migrated away from CodeIgniter and that migration is almost complete, with very, very little of our application even touching CodeIgniter now.

A newer CodeIgniter feature called Validation Placeholders contained a pretty serious Remote Code Execution vulnerability. That feature is not supported in the version of CodeIgniter we use, and our code did not use its predecessor feature either. You can read the full details in the security advisory, so I won't reproduce them here, but I was able to quickly evaluate the functionality of Validation Placeholders, and of our usage of Validation Rules, to determine that we were not vulnerable. Despite that, we still took robust action. I will quote directly from the report:

While Report URI used CI 3, which did not have the vulnerable feature, a patch was deployed the same day which addressed any remaining concerns. All validation rules were changed to use arrays, rather than pipe format shown in examples above. This prevents the sort of injection which enabled the placeholders to execute code. A static analysis rule was also added which forces validation rules to use Report URI’s wrapped rules, rather than CI’s default ones. Finally, the wrapped rules were changed to only accept arrays and not strings.

However, without CI 4’s placeholders, untrusted data is never injected into the execution context, as one would expect... As such, while Report URI was never exposed to this vulnerability, additional measures have been taken to further strengthen defences against it. Finally, this issue also encouraged Report URI to accelerate their transition away from CodeIgniter.

A2. Race Condition on Email Change

This was an interesting issue for the tester to come across and one that was pretty easy for us to resolve. By submitting multiple requests to change the email address on the registered account within the same HTTP/2 packet, a user account was able to be 'cloned' into several instances of the same account but with different email addresses associated.

During penetration testing, a race condition vulnerability was identified in the user email address change functionality. While the condition enabled the creation of duplicate user accounts, it was without immediate security risk, due to the robust "fail fast and fail early" principles employed by the application. The duplicates were clones, inheriting access to the original accounts' subscriptions – meaning that each duplicated account had their usage counted against the original account’s limits.

There were no security concerns presented by this issue and it only posed a problem for the person 'cloning' their account because it would break certain functionality of their account and yield no benefits. This issue is possible because of inherent limitations in Azure Table Storage not being accounted for in our code. I wrote about why I chose Table Storage all the way back in 2015.

Table Storage is a simple key:value store and it doesn't support atomic operations. Entities are inserted into a table and have two properties, the Partition Key and Row Key, that form the primary key for lookup and can't be changed. Because of the unique constraint on the combination of the Partition Key and Row Key, we store the user email address in the Row Key to prevent duplication of accounts with the same email address. The Partition Key and Row Key also can't be modified for an entity, they are the only properties that are permanent. This means to change the email address for a user we have to clone their user entity into a new entity with a different email address in the Row Key, insert the new entity and delete the old entity. It was racing the delete process that made this issue possible.

Because Table Storage doesn't support atomic operations, I couldn't fix the problem there, so I turned to our Redis cache, which does support atomic operations. By implementing a new 'user entity write lock' mechanism, we could leverage Redis to easily resolve this problem.

Here's the code to do this in Redis:

$this->redis->set('entityLock:' . $email, '1', ['NX', 'EX' => 15]);

To snip the explanation text from the issue:

The NX option will cause the write to Redis to fail if the key already exists, and the EX option sets the expiry on the key to 15 seconds. This means a user can only change their email address every 15 seconds, but fixes the race condition found in the penetration test.
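To show the NX and EX semantics without needing a Redis server, here is a Python sketch that uses a tiny in-memory stand-in. `FakeRedis`, `set_nx_ex` and `acquire_entity_lock` are hypothetical names of mine, not part of any Redis client, and the stand-in is not thread-safe; in real Redis the guarantee comes from the server executing each SET as a single atomic command.

```python
import time
from typing import Callable, Dict, Tuple


class FakeRedis:
    """In-memory stand-in for one Redis command: SET key value NX EX seconds."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry time)

    def set_nx_ex(self, key: str, value: str, ttl_seconds: int) -> bool:
        now = self._clock()
        existing = self._data.get(key)
        if existing is not None and existing[1] > now:
            return False  # NX: the key exists and has not expired, so the write fails
        self._data[key] = (value, now + ttl_seconds)  # EX: remember when the key expires
        return True


def acquire_entity_lock(store: FakeRedis, email: str, ttl_seconds: int = 15) -> bool:
    """Return True only for the first caller within the TTL window."""
    return store.set_nx_ex("entityLock:" + email, "1", ttl_seconds)


# Drive the clock by hand so the example does not have to sleep.
now = [0.0]
store = FakeRedis(clock=lambda: now[0])

first = acquire_entity_lock(store, "user@example.com")    # lock taken
second = acquire_entity_lock(store, "user@example.com")   # racing request is rejected
now[0] = 16.0                                             # 16 seconds later the key has expired
third = acquire_entity_lock(store, "user@example.com")    # lock can be taken again
print(first, second, third)
```

Only the request that wins the lock would go on to do the insert-then-delete, which is what closes the race window described above.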

To quote the penetration test report:

Despite the lack of security implications, it is worth commending Report URI’s swift remediation of the issue.

A3. Potential Issue – Path Access Control

We have some Controllers that perform functions only intended to be accessed and triggered from the Command-Line Interface. Our Router takes care of ensuring that any requests coming in from Nginx can't hit these Controllers as they are only intended to be called from the CLI. It turns out we had a small bug in our Router code and you could indeed hit these Controllers with HTTP requests. Despite this, there was no issue, and I will quote from the report:

However, all affected controllers inherited from a base command line controller class, whose constructor performed an additional verification of the execution context. Any attempt to create (access) these controllers outside of a command-line context would raise an error, as demonstrated below:

-snip-

As such, while it was never possible to reach the affected controllers, this issue highlights the importance of not making assumptions when security could be affected. Thanks to robust code and multi-layered validation, the application prevented the issue from being exploitable – and in fact, entirely mitigated the vulnerability before it was even discovered. Report URI has since patched the issue to remove any residual risk.

It's hard to imagine what someone could do with this capability, maybe they could update our aggregate count data a little more frequently, but either way, it was still a bug that needed to be fixed!
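The layered check described in the report can be sketched roughly as follows in Python. The class names, the `execution_context` parameter and the `dispatch` function are hypothetical and only mirror the idea; the actual controllers are PHP and the report's own demonstration is snipped above.

```python
class NotCommandLineError(RuntimeError):
    """Raised when a CLI-only controller is constructed outside a CLI context."""


class CommandLineController:
    """Hypothetical base class: verifies the execution context in its constructor."""

    def __init__(self, execution_context: str) -> None:
        if execution_context != "cli":
            raise NotCommandLineError("controller is only available from the command line")
        self.execution_context = execution_context


class UpdateAggregateCounts(CommandLineController):
    """Hypothetical CLI-only job that inherits the constructor check."""

    def run(self) -> str:
        return "aggregate counts updated"


def dispatch(controller_cls, execution_context: str) -> str:
    # Imagine the router check that should reject HTTP requests is missing here.
    # The base class constructor still refuses to build the controller.
    return controller_cls(execution_context).run()


print(dispatch(UpdateAggregateCounts, "cli"))
try:
    dispatch(UpdateAggregateCounts, "http")
except NotCommandLineError as error:
    print("blocked:", error)
```

Because the check lives in the base class, a routing bug on its own is not enough to reach the controller, which is the defence in depth the report is commending.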

Another Success!

I think it's fair to say that this test went exceptionally well and despite all of our best efforts over the last 12 months, there are still improvements that we've had to make as a result of getting the test done. As I've said before, you really want to help a tester as much as you can to get the most value out of a test like this, and as before, here is the full, unredacted PDF report if you'd like a read.

Report URI - 2023 Penetration Test Report

Report URI: A week in numbers! (2023 edition)

Scott Helme — Fri, 10 Nov 2023 15:12:49 GMT

I simply can't believe that Report URI has now processed 1,500,000,000,000+ reports, which is unreal! That's over one trillion, five hundred billion reports... 🤯

This tiny little project, that I had the idea of starting all those years ago, is now processing incredibly large amounts of data on a day-by-day basis. So why don't we take a look at the numbers?

Report URI

If you aren't familiar with Report URI, here's the TLDR: Modern web browsers have a heap of security and monitoring capabilities built in, and when something goes wrong on your site, they can send telemetry to let you know you have a problem. We ingest that telemetry on behalf of our customers, and extract the value from the data.

Over the years, as our customer base has grown, we are of course receiving more and more telemetry from more and more clients. This has also been expanded to include email servers which can also send telemetry about the security of emails you send! If you want any details on our product offering, the Products menu on our homepage will help you out, but this blog post isn't about that, it's about the numbers!

Our infrastructure

In a recent blog post I gave an overview of our infrastructure which really hasn't changed much over the years and has held up really well to the volume of traffic we're handling. I'll repeat the highlights here because it will help make the following data more understandable once I get into it. Here's the diagram that explains our traffic flow, along with the explanation, and you can see it's the exact same diagram I published over 5 years ago when I last spoke about it, and our infrastructure was the same long before that too:

Data is sent by web browsers as a POST request with a JSON payload.
Requests pass through our Cloudflare Worker which aggregates the JSON payloads from many requests, returning a 201 to the client.
Aggregated JSON payloads are dispatched to our origin 'ingestion' servers on a short time interval.
The ingestion servers process the reports into Redis.
The 'consumer' servers take batches of reports from Redis, applying advanced filters, threat intelligence, quota restrictions and per-user settings and filters, before placing them into persistent storage in Azure.
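The aggregation step in the list above can be sketched roughly like this in Python. `ReportAggregator`, its methods and the batch layout are hypothetical, the real Worker runs on Cloudflare rather than in Python, and quota handling is left out entirely.

```python
import json
from collections import Counter
from typing import Dict


class ReportAggregator:
    """Hypothetical edge-side buffer that collapses identical payloads into counts."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def accept(self, raw_body: str) -> int:
        # Re-serialise with sorted keys so equivalent JSON bodies compare equal.
        normalised = json.dumps(json.loads(raw_body), sort_keys=True, separators=(",", ":"))
        self._counts[normalised] += 1
        return 201  # status returned to the client in this sketch

    def flush(self) -> Dict[str, object]:
        """Build one batch for the origin and reset the buffer."""
        batch = {
            "distinct": len(self._counts),
            "total": sum(self._counts.values()),
            "reports": [{"report": json.loads(body), "count": count}
                        for body, count in self._counts.items()],
        }
        self._counts.clear()
        return batch


aggregator = ReportAggregator()
aggregator.accept('{"blocked-uri": "https://a.example/x.js", "violated-directive": "script-src"}')
aggregator.accept('{"violated-directive": "script-src", "blocked-uri": "https://a.example/x.js"}')
aggregator.accept('{"blocked-uri": "https://b.example/y.js", "violated-directive": "script-src"}')
batch = aggregator.flush()
print(batch["distinct"], batch["total"])
```

In a real deployment `flush` would be called on a short timer and its result sent to the ingestion servers as a single request.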

When traffic volumes are low during the day, this entire processing pipeline averages less than sixty seconds from us receiving the report to you having the data in your dashboard and visible to you. When all of America is awake and online, our busiest period of the day, we typically see processing times between three and four minutes, with the odd outliers taking possibly five or six minutes to make it through. Overall, we work as much as we can to get this time down and keep it down, but I think it's pretty reasonable.

The history of the service

I first launched Report URI in May 2015 after having worked on it and used it myself for quite some time and since then, I've covered it extensively right here on my blog. You can find the older blog posts using this tag, and the newer ones using this tag, but any change worth mentioning is something I've talked openly about. Here's a quick overview of how our report volume has grown over time.

Sep 2015 - 250,000 reports processed in a single week
Sep 2016 - 25,000,000 reports per week
Mar 2017 - 80,000,000 reports per week
Jun 2017 - 677,000,000 reports per week
Jul 2018 - 2,602,377,621 reports per week
Jun 2019 - 4,064,516,129 reports per week
Mar 2021 - we hit half a trillion reports processed

That last one was a particularly big milestone and I feel like something that was really worth celebrating. Just think, half-a-trillion reports processed!

It's absolutely wild that I can now say we've processed over HALF A TRILLION REPORTS for our customers!

The current total as of this tweet stands at:

500,205,618,910 reports!!! 😲 https://t.co/jWQNQYX2dP
— Scott Helme (@Scott_Helme) March 14, 2021

We pushed on to hit our one trillionth report a little bit more quickly too! Michal even caught the point when it rolled over.

1 trillion OMG we've just processed 1 TRILLION reports 😲😍🍻 That's a loooooooooooot of JSON & XML! Good job everyone 🕺 @reporturi @Scott_Helme @troyhunt pic.twitter.com/fYohs4LmKD
— Michal Špaček (@spazef0rze) February 5, 2022

Now, as I sit here, we've pushed 1.5 trillion reports... But, enough about where we were, let's talk about where we are and what we're doing today, starting with what volumes we're processing now.

From the top!

Our first point of contact for any inbound report will be Cloudflare, so I'm going to grab the data for the last 7 days from our dashboard to take a look at. Here's just the raw number of requests that we've seen hit our edge.

In the last 7 days we saw over 3,870,000,000 requests coming in! You can see many of our common patterns and trends in terms of the peaks and troughs throughout the day, and also that weekends are generally less busy than weekdays for us too. I love looking at our data egress graph because the only thing our reporting endpoint sends back is empty responses. They're usually either a 201 when we accept a report, or a 429 if you have exceeded your quota, but always empty, and there's a lot of them.

We've served a staggering 3.67TB of empty responses over the last 7 days! I also like to watch trends in how data reaches us and we can also gather some really interesting information at this scale. Take any of the following metrics for example, where we can look at how our service is being used, seeing that most people are, as expected, generating report volume in report-only mode or via our CSP Wizard.

We can also see some interesting data about clients sending reports too.

Of course, we see clients sending us many requests, but we've still received reports from over 80,000,000 unique clients!

That's a lot of browsers...

Through the Cloudflare Worker

All reports that hit our edge go through our Cloudflare Worker for processing. It does some basic sanitisation of the JSON payloads, normalisation, and maintains state about our users to know if they are exceeding their quota so reports can be dumped as early as possible. This of course means that the number of requests hitting our Worker is going to be the same as the number of requests we're receiving at the edge.

What's interesting to see is that at peak times we're receiving around 9,000 requests per second and that's a typical week for us. If a new customer joins suddenly, or an existing user deploys a misconfiguration suddenly, we've seen spikes up and over 16,000 requests per second coming in. As I mentioned in the opening paragraph though, our Worker batches up reports from multiple requests and sends them to our origin after a short period, which you can see in our Subrequests metric. Despite receiving 3,900,000,000 requests in the last 7 days, the Worker has only sent 388,310,000 requests to our origin, meaning we're batching up ~10 reports per request to our origin on average. This is a metric we track to fine tune our aggregation and load, but looking at it live right now, we can see that the numbers line up.
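As a back-of-the-envelope check, the two seven-day totals quoted above can be divided directly. This Python sketch uses only those two figures, the variable names are mine, and it assumes each request at the edge carries one report.

```python
edge_requests = 3_900_000_000    # requests received by the Worker over 7 days
origin_requests = 388_310_000    # subrequests the Worker sent to the origin over 7 days

# Average number of inbound requests folded into each request to the origin.
reports_per_origin_request = edge_requests / origin_requests

# Average inbound rate across the whole week, for comparison with the peak figure.
seconds_in_week = 7 * 24 * 60 * 60
average_edge_rps = edge_requests / seconds_in_week

print(round(reports_per_origin_request, 2))  # roughly 10
print(round(average_edge_rps))               # roughly 6,400 requests per second
```

The weekly average of roughly 6,400 requests per second sits comfortably below the ~9,000 per second seen at peak, which is what you would expect given the daily peaks and troughs.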

Size of the payload coming from the Worker

Number of distinct reports in the payload

Total number of reports in the payload

This translates to over 100 mb/s of JSON coming in to our origin, and bear in mind, this is massively normalised and deduplicated. Our ingestion servers take these reports and then process them into our Redis cache, so our outbound from these servers is pretty similar to the inbound, as not many reports are filtered/rejected at this stage.

Into the Redis cache

Our Redis cache for reports is a bit of a beast, as you may have guessed, and it's where reports sit for a short period of time. The cache acts as a buffer to absorb spikes and also allows for a little bit of deduplication of identical reports that arrive in a short time period, further helping us optimise. Looking at the RAM consumption, you can see reports flowing into the cache and also see when the consumer servers pull a batch of reports out for processing.
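The buffering and deduplication idea can be sketched in a few lines. This is an in-memory stand-in and not how our Redis cache is actually structured: the `ReportBuffer` class, its method names and the choice of a SHA-256 key over canonical JSON are all assumptions for illustration.

```python
import hashlib
import json
from collections import Counter
from typing import Dict, List, Tuple


class ReportBuffer:
    """Hypothetical in-memory stand-in for a deduplicating report cache."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()   # key -> times seen since last drain
        self._bodies: Dict[str, str] = {}   # key -> canonical JSON body

    def add(self, report: dict, count: int = 1) -> None:
        # Canonical serialisation so identical reports produce identical keys.
        canonical = json.dumps(report, sort_keys=True, separators=(",", ":"))
        key = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        self._bodies.setdefault(key, canonical)
        self._counts[key] += count

    def drain(self) -> List[Tuple[str, int]]:
        """Return (body, count) pairs and empty the buffer, as a consumer would."""
        batch = [(self._bodies[key], n) for key, n in self._counts.items()]
        self._counts.clear()
        self._bodies.clear()
        return batch
```

A consumer calling `drain()` gets each distinct report once along with how many times it arrived, which is why a spike of identical reports costs far less downstream than its raw volume suggests.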

At present, we're not at our peak, but the Redis cache is handling almost 4,500 transactions per second, which isn't bad!

Consumption time!

The final stop in our infrastructure is through the consumer servers, which pull out batches of reports from the Redis cache to process into persistent storage in Azure Table Storage. We run a slightly smaller number of consumer servers with a lot more resources and these are the servers that are always being worked hard. Looking at their current CPU usage, they're sat at ~50% CPU as we start to approach the busiest time of the day, but they will still climb from here.

The consumers will only see small spikes in inbound traffic when they pull a batch of reports from Redis, but they will always have a fairly consistent outbound bandwidth to Azure as they're pushing that data out to persistent storage.

From here, it's on to the final stop for report data and after this point, it will be visible in your dashboard to query, and any alerts/notifications will have been sent.

Azure Table Storage

I've used Azure Table Storage since the beginning of Report URI and it's something that I've never been given a good reason to move away from. It's a simple key:value store and is massively scalable, meaning I've never had to learn how to be a DBA and it's always taken care of for me. You can read some of my blog posts about Table Storage, but let's see how much we're using it.

This is our rate of transactions against Table Storage for the last week and as I happen to be writing this blog post on the 1st Nov 2023, all of our users have just had their monthly quota renewed. This happens at 00:00 UTC on the 1st day of each calendar month and it's why there is an enormous spike at the start of this graph and things were a lot more quiet before that. Of course, as we progress through a month, more of our users will exceed their monthly quota and reports will stop making it to persistent storage, mostly being dropped by the Cloudflare Worker and some by the consumer servers. It doesn't stop the reports being sent, it just means that we don't process them into persistent storage. Our current rate of transactions will slowly decline from where it is now down to the lowest levels by the end of the month again. As you would expect, our ingress and egress patterns follow the same trend.
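The renewal rule itself (00:00 UTC on the 1st day of each calendar month) is easy to express. Here's a small sketch; the `next_quota_reset` function is a hypothetical helper written for this post, not something lifted from our codebase.

```python
from datetime import datetime, timezone


def next_quota_reset(now: datetime) -> datetime:
    """Return the next 00:00 UTC on the 1st of a month strictly after `now`."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now_utc = now.astimezone(timezone.utc)
    # Roll over into January of the following year when we're in December.
    if now_utc.month == 12:
        year, month = now_utc.year + 1, 1
    else:
        year, month = now_utc.year, now_utc.month + 1
    return datetime(year, month, 1, tzinfo=timezone.utc)


# Writing this on the 1st Nov 2023, the next renewal is the 1st Dec 2023.
print(next_quota_reset(datetime(2023, 11, 1, 12, 0, tzinfo=timezone.utc)))
```

Working in UTC throughout avoids any ambiguity around daylight saving changes, and it means every user's quota renews at the same instant, which is exactly what produces the spike at the start of the graph.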

Something that was introduced quite some time ago was our 90-day retention period on report data. We will keep aggregate, count and analysis data for as long as you want, but the raw JSON of each inbound report will be purged after 90 days. We had to, simply because we couldn't store that much information for an unlimited period of time. Despite that, we still have an impressive 4.9 TiB of data on disk consumed by 2,250,000,000 entities (reports)!
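For a rough sense of scale, those two figures give an average size per entity, and the 90-day rule is a one-line comparison. This is a back-of-the-envelope sketch only: it assumes the 4.9 TiB is spread evenly across all 2,250,000,000 entities, and `raw_json_expired` is a hypothetical helper rather than our actual purge logic.

```python
from datetime import datetime, timedelta, timezone

TIB = 2 ** 40  # bytes in one tebibyte

bytes_on_disk = 4.9 * TIB
entities = 2_250_000_000

# Average bytes per entity, assuming the storage is spread evenly.
avg_bytes_per_entity = bytes_on_disk / entities
print(f"~{avg_bytes_per_entity / 1024:.1f} KiB per entity on average")

RETENTION = timedelta(days=90)


def raw_json_expired(received_at: datetime, now: datetime) -> bool:
    """True once a report's raw JSON is older than the retention period."""
    return now - received_at > RETENTION
```

That works out at a little over 2 KiB per entity on average.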

That's quite a lot of JSON! 😅

Where do we go from here?

Whilst all of the above are amazing numbers, you may have noticed that our current report volumes are lower than those we have previously peaked at, which were detailed at the start of this post. This is a trend I've been following for a while now and I've been able to put it down to a few things. The biggest impact is that we've made really significant progress in helping our customers get up and running more quickly. Sites always send more reports and telemetry when they're first starting out using these technologies and the faster we can help them get their configurations matured, the faster they can reduce the volume of reports they're sending. Despite this, we're always adding new sites, so even though our users are sending fewer reports on average, constantly bringing new users on board as we grow has prevented our total volume from decreasing too much.

Over the next year or so, we're also helping a lot of sites get ready for the new PCI DSS 4.0 requirements and we're hoping to bring larger providers on board to provide our solution through them. The PCI SSC are putting huge pressure on sites with an e-commerce component to protect against Digital Skimming Attacks (a.k.a. Magecart) by locking down the JavaScript on their payment pages, something that CSP was literally designed for! As the natural choice for a reporting platform, we're well suited to help sites get their CSP defined, tested and deployed with the least friction possible.

I'm excited for what the rest of 2023 will bring, but I'm looking forward to 2024 already. Having built this company up from the first line of code almost 9 years ago, to where we are today, I wonder if it might be time to take the next big step soon! 😎