Enabling Read Access for Web Resources

W3C Working Draft 18 June 2007

This Version:
http://www.w3.org/TR/2007/WD-access-control-20070618/
Latest Version:
http://www.w3.org/TR/access-control/
Previous Versions:
http://www.w3.org/TR/2007/WD-access-control-20070215/
http://www.w3.org/TR/2006/WD-access-control-20060517/
http://www.w3.org/TR/2005/NOTE-access-control-20050613/
Editor:
Anne van Kesteren (Opera Software ASA) <annevk@opera.com>

Abstract

This document defines a mechanism to selectively provide cross-site access to a web resource. Using either an HTTP header or an XML processing instruction (or both), resources can indicate they allow read access from specified hosts (optionally using patterns). When a pattern is used, one can also exclude certain hosts. For instance, allow read access from all subdomains of example.org (*.example.org) with the exception of public.example.org (public.example.org).

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is the 18 June 2007 Working Draft of the "Enabling Read Access for Web Resources" document. This document is produced by a Task Force of the Web Application Formats (WAF) Working Group. The WAF Working Group is part of the Rich Web Clients Activity in the W3C Interaction Domain.

Please send comments to the WAF Working Group's public mailing list public-appformats@w3.org with [access-control] at the start of the subject line. Archives of this list are available. See also W3C mailing list and archive usage guidelines.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

1. Introduction

The web has a rich set of resources that can be combined to build content, applications and feature-rich web sites. A contributor to this richness is web sites including references (e.g. a link or an image inclusion) to resources residing in other domains.

For security reasons, user agents such as web browsers implement a "same origin policy" that allows a document (e.g. some JavaScript) to read, process, or otherwise interrogate the contents of another resource if and only if the other resource resides in the same domain.

This restriction on "read" access to web resources is very strict and generally appropriate. However, there are scenarios where an application would like to "read" data from another resource on the web without these restrictions and in these scenarios the browser's default "security sandbox" has to be extended or eased. For example, a car reservation web site may want to request trip itinerary data from an affiliated airline reservation website to streamline making a car reservation. The easing of read access restrictions is particularly important to web browsers that implement the XMLHttpRequest object and VoiceXML 2.1 browsers using the data element.

To facilitate clear and controlled read access to resources, this specification defines a read access control mechanism that enables a web resource to permit access to its content from external domains when such access would otherwise be prohibited by a same origin policy.

1.1. Conformance Criteria

User agents cannot conform to this specification without also conforming to a specification that uses the access control read policy.

As well as sections marked as non-normative, all diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

In this specification, the words must, must not, should, should not and may are to be interpreted as described in RFC 2119. [RFC2119]

A conformant specification is one that implements all the requirements (the must and must not statements) listed in this specification that are applicable to specifications.

A conformant user agent is one that implements all the requirements listed in this specification that are applicable to user agents, while also being consistent with the requirements listed in the specifications that use the access control read policy.

User agents may optimize any algorithm given in this specification, so long as the end result is indistinguishable from the result that would be obtained by the specification's algorithms. (The algorithms in this specification are generally written with more concern for clarity than efficiency.)

1.1.1. Terminology

The term ToASCII algorithm means that the ToASCII algorithm as described in RFC 3490 is applied with both the AllowUnassigned and UseSTD3ASCIIRules flags set. [RFC3490]

There is a case-insensitive match of strings s1 and s2 if after uppercasing both strings (by mapping a-z to A-Z) they are identical.
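
As a minimal sketch of the definition above, the comparison can be written so that only a-z is mapped to A-Z. The function names are hypothetical. A locale-aware or full Unicode uppercasing routine would not be equivalent, since it also changes characters outside that range.

```python
def ascii_upper(s: str) -> str:
    """Map a-z to A-Z and leave every other character untouched."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


def case_insensitive_match(s1: str, s2: str) -> bool:
    """Return True if s1 and s2 are identical after ASCII uppercasing."""
    return ascii_upper(s1) == ascii_upper(s2)
```

For instance, "HTTP" and "http" match, while U+0131 (dotless i) and "I" do not, because U+0131 is outside a-z and is therefore left as is.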

U+0009, U+000A, U+000D and U+0020 are space characters.

A space-separated list is a string of which the items are separated by one or more space characters (in any order). The string may also be prefixed or suffixed with zero or more of those characters.

To obtain the values from a space-separated list user agents must replace any sequence of space characters (in any order) with a single U+0020 character, dropping any leading or trailing U+0020 character, and then chopping the resulting string at each occurrence of a U+0020 character, dropping that character in the process.
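
The steps above could be sketched as follows. The function name is hypothetical, and returning an empty list for a string that contains only space characters is an assumption of this sketch. The four space characters are spelled out because Python's default whitespace splitting covers additional characters.

```python
import re
from typing import List


def split_space_separated(value: str) -> List[str]:
    """Obtain the values from a space-separated list."""
    # Replace any run of U+0009, U+000A, U+000D and U+0020 with one U+0020.
    collapsed = re.sub(r"[\t\n\r ]+", " ", value)
    # Drop any leading or trailing U+0020.
    trimmed = collapsed.strip(" ")
    # Chop at each U+0020 (assumption: no values when nothing is left).
    return trimmed.split(" ") if trimmed else []
```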

An XML MIME type is text/xml, application/xml or any MIME type ending in +xml.
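
A minimal sketch of this test is given below. The function name is hypothetical. Stripping parameters such as charset and comparing case-insensitively are assumptions of the sketch, not requirements stated above.

```python
def is_xml_mime_type(mime_type: str) -> bool:
    """Return True for text/xml, application/xml or any type ending in +xml."""
    # Assumption: parameters after ";" are ignored and case is not significant.
    essence = mime_type.split(";", 1)[0].strip().lower()
    return essence in ("text/xml", "application/xml") or essence.endswith("+xml")
```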

1.2. Security Considerations

The mechanism defined in this specification extends the "default browser security sandbox" to allow read access for cross-site resources. The extension opens a constrained hole in the browser's "default sandbox".

A user agent running inside a trusted corporate network and executing untrusted content should enforce a sandboxing policy by denying access (to untrusted content). However, it may be appropriate to relax this policy when the user agent is executing only trusted applications that require access to arbitrary resources on the local network. User agent vendors that allow this sandboxing policy to be configured are encouraged to provide guidance on the appropriate settings. It is critical that network administrators understand the security issues pertinent to their environment and configure their systems appropriately. In tandem, developers and web server administrators are to be aware of the dangers of trusting a user agent that can be configured to disable sandboxing.

User agents which implement this specification should take care not to expose other trusted data (cookies, HTTP header data) inappropriately.

User agents which implement this specification should also take care to properly normalize Unicode and to properly interpret IDNs to prevent URI spoofing attacks as outlined in the specification.

Application authors should be aware that content retrieved from another site is not itself trustworthy. Authors should take care not to expose themselves to cross-site scripting attacks by rendering or executing the retrieved content directly without validation.

2. Access Control Read Policy

Specifications using the mechanism defined in this specification need to define when the access control read policy applies to a retrieved resource. For instance, a specification could define that in case of cross-site requests this mechanism is put in place.

The policy described is only safe for HEAD and GET requests. Specifications should not use it for other HTTP methods without specifying extra safety measures. [RFC2616]

An access item is either a single wildcard or a domain, optionally beginning with a wildcard label, optionally prefixed by a scheme and optionally followed by a port. It must match the following EBNF:

access-item    ::= (scheme "://")? domain-pattern (":" port)? | "*"
domain-pattern ::= subdomain | "*." subdomain

scheme and port are used as defined in RFC 3986. subdomain is used as defined in RFC 1034. [RFC3986] [RFC1034]

In addition to matching the above EBNF the ToASCII algorithm must apply successfully (without errors) to each label component of the subdomain (if any) from the access item.

If the port or scheme is omitted a wildcard match is performed on them.

An access item of * matches anything. When * is used as part of domain-pattern it matches any number of label components before the subdomain.
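
The grammar above can be approximated with a regular expression, as in the following sketch. The names AccessItem and parse_access_item are hypothetical. The scheme, port and label patterns are this sketch's reading of RFC 3986 and RFC 1034, and the ToASCII check on each label is not performed here.

```python
import re
from typing import NamedTuple, Optional

# RFC 1034 label: a letter, optionally followed by letters, digits or
# hyphens, and ending in a letter or digit.
LABEL = r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# domain-pattern ::= subdomain | "*." subdomain
DOMAIN = r"(?:\*\.)?" + LABEL + r"(?:\." + LABEL + r")*"
ACCESS_ITEM = re.compile(
    r"(?:(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://)?"
    r"(?P<domain>" + DOMAIN + r")"
    r"(?::(?P<port>[0-9]*))?"
)


class AccessItem(NamedTuple):
    scheme: Optional[str]  # None means the scheme was omitted (wildcard match)
    domain: str            # "*" for the match-anything item
    port: Optional[str]    # None means the port was omitted (wildcard match)


def parse_access_item(text: str) -> Optional[AccessItem]:
    """Return the parsed access item, or None if text does not fit the EBNF."""
    if text == "*":
        return AccessItem(None, "*", None)
    m = ACCESS_ITEM.fullmatch(text)
    if m is None:
        return None
    return AccessItem(m.group("scheme"), m.group("domain"), m.group("port"))
```

Under this sketch a wildcard is accepted only as the complete item or as the leftmost label, so strings such as *.*.example.org or example.* are rejected.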

Several examples of conforming access items:

The following access items would make the user agent deny access to the resource:

The following access items are not identical:

2.1. Content-Access-Control header

Resources to which the access control read policy applies can have one or more Content-Access-Control headers defined which must match the following EBNF:

Content-Access-Control ::= "Content-Access-Control" ":" LWS? ruleset
ruleset        ::= LWS? rule LWS? ("," LWS? rule LWS?)*
rule           ::= rule-type (LWS pattern)+ (LWS "exclude" (LWS pattern)+)?
rule-type      ::= "allow" | "deny"
pattern        ::= "<" access-item ">"

As stated by RFC 2616, multiple Content-Access-Control headers may be combined.

LWS is used as defined by RFC 2616. [RFC2616]

In case resources on a domain are not all in the control of a single person, "deny" rules can be used by authors to deny read access from external resources to the entire domain. Read access from other domains is disallowed by default, but individual resources on the domain could have <?access-control?> processing instructions specified which allow access from other domains. Unlike such processing instructions, which have to be included in each file, HTTP headers can be set across an entire server, making them far more effective. The "exclude" clause can be used to list exclusions to these "deny" rules.

"allow" rules can be used to allow read access from particular domains as long as those domains don't match any of the patterns listed in "exclude".

Content-Access-Control: allow <*.example.org> exclude <*.public.example.org>
Content-Access-Control: allow <webmaster.public.example.org>

This means that every subdomain of example.org can access the resource including webmaster.public.example.org, but with the exclusion of all other subdomains of public.example.org.

Content-Access-Control: allow <example.org> <*.example.org>

This means that example.org and all its subdomains can access the resource.
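
The header grammar in this section could be parsed along the following lines. This is a sketch under stated assumptions: parse_ruleset is a hypothetical name, LWS is simplified to any run of whitespace, the keywords are treated as case-sensitive, and the access item inside each pattern is not validated. Multiple headers are assumed to have been joined with commas beforehand.

```python
from typing import List, Optional, Tuple

# (rule-type, match list, exclude list)
Rule = Tuple[str, List[str], List[str]]


def parse_ruleset(value: str) -> Optional[List[Rule]]:
    """Parse a Content-Access-Control value; return None on a syntax error."""
    rules: List[Rule] = []
    for raw_rule in value.split(","):
        tokens = raw_rule.split()
        if not tokens or tokens[0] not in ("allow", "deny"):
            return None
        match_list: List[str] = []
        exclude_list: List[str] = []
        current = match_list
        seen_exclude = False
        for token in tokens[1:]:
            if token == "exclude" and not seen_exclude and match_list:
                # Patterns after "exclude" go to the exclude list.
                seen_exclude = True
                current = exclude_list
            elif len(token) > 2 and token.startswith("<") and token.endswith(">"):
                current.append(token[1:-1])
            else:
                return None
        # At least one pattern is required after the rule type and after "exclude".
        if not match_list or (seen_exclude and not exclude_list):
            return None
        rules.append((tokens[0], match_list, exclude_list))
    return rules
```

Applied to the first example header above, the sketch yields one "allow" rule with match list ["*.example.org"] and exclude list ["*.public.example.org"].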

2.2. <?access-control?> processing instruction

XML resources may include an <?access-control?> processing instruction within the XML Prolog to indicate, in cases where the access control read policy applies, from which domains they can be fetched. [XML]

The processing instruction takes three pseudo-attributes which each take a space-separated list of access items. These pseudo-attributes are allow, deny and exclude. Either the allow or deny pseudo-attribute must be specified. allow and deny must not be specified at the same time. If an attribute is specified it must contain at least one access item.
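
The authoring constraints in the paragraph above can be illustrated with the following sketch, which takes the data part of the processing instruction (the text after the target name). The name parse_pi_data and the regular expression are assumptions of this sketch. Character and entity references are not handled, whitespace between pseudo-attributes is not enforced, and access item syntax is not checked.

```python
import re
from typing import Dict, List, Optional

# One pseudo-attribute: name="value" or name='value'.
PSEUDO_ATTR = re.compile(r"""\s*([A-Za-z_][\w.-]*)\s*=\s*(?:"([^"<]*)"|'([^'<]*)')""")


def parse_pi_data(data: str) -> Optional[Dict[str, List[str]]]:
    """Return pseudo-attribute name -> access items, or None if non-conforming."""
    attrs: Dict[str, List[str]] = {}
    pos = 0
    while data[pos:].strip() != "":
        m = PSEUDO_ATTR.match(data, pos)
        if m is None:
            return None
        name = m.group(1)
        value = m.group(2) if m.group(2) is not None else m.group(3)
        if name in attrs or name not in ("allow", "deny", "exclude"):
            return None
        # Split on the four space characters only.
        items = [v for v in re.split(r"[\t\n\r ]+", value) if v]
        if not items:
            return None  # a specified attribute needs at least one access item
        attrs[name] = items
        pos = m.end()
    # Exactly one of allow and deny has to be present.
    if ("allow" in attrs) == ("deny" in attrs):
        return None
    return attrs
```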

An <?access-control?> processing instruction that is part of the XML Prolog must be parsed using the same syntax rules as described in the XML Stylesheet PI specification. <?access-control?> processing instructions outside the XML Prolog are ignored. [XMLSSPI]

The above means that the following examples would be non-conforming and would make the user agent deny access to the resource:

3. User agent processing requirements

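When a resource is requested to which the access control read policy is said to apply the user agent must then associate the following with that resource: an allow access flag, initially set to "false", an HTTP access control allow list, an HTTP access control deny list, a PI access control allow list and a PI access control deny list. Each of these lists consists of list items, and each list item has a match list and an exclude list.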

The match lists and exclude lists are unordered lists of access items. The match lists are guaranteed to be non-empty and the exclude lists can be empty.

After associating the aforementioned lists and when all HTTP headers have been received the user agent must run the following algorithm (unless stated otherwise):

  1. Parse the Content-Access-Control headers. If any value does not conform to the syntax required deny access to the resource and terminate the algorithm. If parsed successfully then for each rule run the following steps:

    1. If rule-type is "allow" append a new list item to the HTTP access control allow list where the match list is constructed of each access item following "allow" and the exclude list of each access item following "exclude". If "exclude" is not present the exclude list will be empty.

    2. If rule-type is "deny" append a new list item to the HTTP access control deny list where the match list is constructed of each access item following "deny" and the exclude list of each access item following "exclude". If "exclude" is not present the exclude list will be empty.

  2. Then run the following steps for each list item (if any) in the HTTP access control deny list:

    1. If there is no match for any access item from the match list against the requesting URI process the next list item. If there is no next list item go to the next step in the overall set of steps.

    2. If the exclude list is non-empty and there is a match for any access item from the exclude list against the requesting URI process the next list item. If there is no next list item go to the next step in the overall set of steps.

    3. Deny access to the resource and terminate the overall algorithm.

  3. Run the following steps for each list item (if any) in the HTTP access control allow list:

    1. If there is no match for any access item from the match list against the requesting URI process the next list item. If there is no next list item go to the next step in the overall set of steps.

    2. If the exclude list is non-empty and there is a match for any access item from the exclude list against the requesting URI process the next list item. If there is no next list item go to the next step in the overall set of steps.

    3. Set the allow access flag to "true" and go to the next step in the overall set of steps.

  4. If the requested resource has an XML MIME type go to the next step. Otherwise, if the allow access flag is "false" deny access to the resource and terminate the overall algorithm. If the allow access flag is "true" user agents should grant access to the resource and must terminate the overall algorithm.

  5. Parse the resource as an XML document using a streaming XML parser following the rules set forth in the XML specification up to and including the root element start tag. Then process the encountered <?access-control?> processing instructions (if any).

    If there is either an XML parse error or failure to parse the processing instructions deny access to the resource and terminate the overall algorithm. Otherwise, run the following steps for each <?access-control?> processing instruction:

    1. If the processing instruction has any other pseudo-attributes than deny, allow and exclude, has not exactly two pseudo-attributes or has both deny and allow specified terminate the overall algorithm and deny access to the resource.

    2. Let temp match list be the result of parsing the allow or deny pseudo-attribute value, whichever is present. If any obtained value does not match the access item syntax or if no value was obtained terminate the overall algorithm and deny access to the resource.

    3. If there is an exclude pseudo-attribute let temp exclude list be the result of parsing the exclude pseudo-attribute value. If any obtained value does not match the access item syntax or if no value was obtained terminate the overall algorithm and deny access to the resource. If there is no such pseudo-attribute let temp exclude list be empty.

    4. If there is an allow pseudo-attribute append a new list item to the PI access control allow list where the match list is temp match list and the exclude list is temp exclude list.

      Otherwise, there is a deny pseudo-attribute. Append a new list item to the PI access control deny list where the match list is temp match list and the exclude list is temp exclude list.

  6. Then run the following steps for each list item (if any) in the PI access control deny list:

    1. If there is no match for any access item from the match list against the requesting URI process the next list item. If there is no next list item go to the next step in the overall set of steps.

    2. If the exclude list is non-empty and there is a match for any access item from the exclude list against the requesting URI process the next list item. If there is no next list item go to the next step in the overall set of steps.

    3. Deny access to the resource and terminate the overall algorithm.

  7. Then run the following steps for each list item (if any) in the PI access control allow list:

    1. If there is no match for any access item from the match list against the requesting URI process the next list item. If there is no next list item go to the next step in the overall set of steps.

    2. If the exclude list is non-empty and there is a match for any access item from the exclude list against the requesting URI process the next list item. If there is no next list item go to the next step in the overall set of steps.

    3. Set the allow access flag to "true" and go to the next step in the overall set of steps.

  8. If the allow access flag is "false" deny access to the resource. If the allow access flag is "true" user agents should grant access to the resource.
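
Once the four lists have been populated, steps 2 to 4 and 6 to 8 reduce to a small decision function. The sketch below assumes parsing has already succeeded. The names list_applies and decide are hypothetical, and the matching of a requesting URI against an access item is passed in as the matches parameter rather than implemented here.

```python
from typing import Callable, List, Tuple

ListItem = Tuple[List[str], List[str]]  # (match list, exclude list)
Matcher = Callable[[str, str], bool]    # matches(requesting URI, access item)


def list_applies(items: List[ListItem], origin: str, matches: Matcher) -> bool:
    """True if some list item matches origin and none of its exclusions do."""
    for match_list, exclude_list in items:
        if not any(matches(origin, item) for item in match_list):
            continue  # no match: process the next list item
        if any(matches(origin, item) for item in exclude_list):
            continue  # excluded: process the next list item
        return True
    return False


def decide(origin: str, http_deny: List[ListItem], http_allow: List[ListItem],
           pi_deny: List[ListItem], pi_allow: List[ListItem],
           is_xml: bool, matches: Matcher) -> bool:
    """Return True if access may be granted, False if it is denied."""
    if list_applies(http_deny, origin, matches):      # step 2
        return False
    allow = list_applies(http_allow, origin, matches)  # step 3
    if not is_xml:                                     # step 4
        return allow
    if list_applies(pi_deny, origin, matches):         # step 6
        return False
    # Steps 7 and 8: the flag may have been set by either allow list.
    return allow or list_applies(pi_allow, origin, matches)
```

In this reading a matching deny rule from either source always wins, and the processing instruction lists are consulted only for resources with an XML MIME type.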

The requesting URI is the scheme followed by ://, followed by the domain without any trailing U+002E (.) (if any), followed by :, followed by the port (defaulting to the default port for the scheme) of the resource from which the request originated. If the resource does not have a host-based authority (data: URI scheme for instance) the requesting URI is "null".

Define the above in terms of "origin"? See HTML5...
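
The construction of the requesting URI could be sketched with Python's standard URL parser as shown below. The function name and the DEFAULT_PORTS table are assumptions of this sketch, as is raising an error for a scheme whose default port is not in that table. Internationalized host names are not normalized.

```python
from urllib.parse import urlsplit

# Assumption: a small table of default ports, sufficient for illustration.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def requesting_uri(url: str) -> str:
    """Build scheme://domain:port for the resource that issued the request."""
    parts = urlsplit(url)
    host = parts.hostname
    if not host:
        # No host-based authority, for instance a data: URI.
        return "null"
    if host.endswith("."):
        host = host[:-1]  # drop a trailing U+002E
    port = parts.port
    if port is None:
        if parts.scheme not in DEFAULT_PORTS:
            raise ValueError("no default port known for scheme " + parts.scheme)
        port = DEFAULT_PORTS[parts.scheme]
    return f"{parts.scheme}://{host}:{port}"
```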

To determine whether a requesting URI and an access item match user agents must run the following algorithm:

  1. Let requesting URI be origin and access item be item.

  2. If item is a single U+002A (*) there is a match. Terminate this algorithm.

  3. If origin is "null" there is no match. Terminate this algorithm.

  4. If item has a scheme and it does not case-insensitively match the scheme from origin there is no match. Terminate this algorithm.

  5. If either item or origin has a scheme remove it including the :// sequence following it.

  6. If item has a port and it does not match the port from origin there is no match. Terminate this algorithm.

  7. If either item or origin has a port remove it including the U+003A (:) preceding it.

  8. Let origin list be origin split on the U+002E (.) character (dropping that character in the process) and item list be item split on the U+002E (.) character (dropping that character in the process). Ensure that the order is preserved.

  9. Reverse the order of origin list and item list.

  10. Now process the first list item of both origin list and item list using the following steps:

    1. Let the item from origin list be origin label and the item from item list be item label.

    2. If item label is a single U+002A (*) character move to the next step in the overall set of steps.

    3. Apply the ToASCII algorithm to origin label and item label and store the result in those variables respectively.

    4. If origin label does not case-insensitively match item label there is no match (terminate the overall algorithm).

      Otherwise, apply this set of steps to the next list item of both origin list and item list. If only one of them has a next list item there is no match (terminate the overall algorithm). If neither has a next list item go to the next step in the overall set of steps.

  11. There is a match. Terminate this algorithm.
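
The matching algorithm above might be implemented roughly as follows. All function names are hypothetical. Python's built-in idna codec stands in for the ToASCII algorithm and may not apply exactly the flags required in section 1.1.1. Treating a ToASCII failure as "no match" is an assumption of this sketch, and hosts containing colons (such as IPv6 literals) are not supported.

```python
from typing import Optional, Tuple


def ascii_upper(s: str) -> str:
    """Map a-z to A-Z only."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


def split_parts(text: str) -> Tuple[Optional[str], str, Optional[str]]:
    """Split [scheme://]host[:port] into (scheme, host, port)."""
    scheme = port = None
    if "://" in text:
        scheme, text = text.split("://", 1)
    if ":" in text:
        text, port = text.rsplit(":", 1)
    return scheme, text, port


def to_ascii(label: str) -> str:
    # Approximation of ToASCII using the standard idna codec.
    return label.encode("idna").decode("ascii")


def matches(origin: str, item: str) -> bool:
    """Return True if the requesting URI (origin) matches the access item."""
    if item == "*":                                   # step 2
        return True
    if origin == "null":                              # step 3
        return False
    o_scheme, o_host, o_port = split_parts(origin)
    i_scheme, i_host, i_port = split_parts(item)
    if i_scheme is not None and ascii_upper(i_scheme) != ascii_upper(o_scheme or ""):
        return False                                  # step 4
    if i_port is not None and i_port != o_port:       # step 6
        return False
    origin_labels = o_host.split(".")[::-1]           # steps 8 and 9
    item_labels = i_host.split(".")[::-1]
    for index, item_label in enumerate(item_labels):  # step 10
        if index >= len(origin_labels):
            return False  # origin list ran out before item list
        if item_label == "*":
            return True   # wildcard covers this and all remaining labels
        try:
            same = ascii_upper(to_ascii(origin_labels[index])) == ascii_upper(to_ascii(item_label))
        except UnicodeError:
            return False  # assumption: a ToASCII failure means no match
        if not same:
            return False
    # Item list is exhausted; origin list has to be exhausted as well.
    return len(origin_labels) == len(item_labels)
```

Note that under these steps example.org itself does not match *.example.org, because the wildcard has to correspond to at least one label of the origin.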

References

[RFC1034]
DOMAIN NAMES - CONCEPTS AND FACILITIES, P. Mockapetris. IETF, November 1987.
[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels, S. Bradner. IETF, March 1997.
[RFC2616]
Hypertext Transfer Protocol -- HTTP/1.1, R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, editors. IETF, June 1999.
[RFC3490]
Internationalizing Domain Names in Applications (IDNA), P. Faltstrom, P. Hoffman, A. Costello. IETF, March 2003.
[RFC3986]
Uniform Resource Identifier (URI): Generic Syntax, T. Berners-Lee, R. Fielding, L. Masinter, editors. IETF, January 2005.
[XML]
Extensible Markup Language (XML) 1.0 (Fourth Edition), T. Bray et al., editors. W3C, August 2006.
Namespaces in XML 1.0 (Second Edition), T. Bray et al., editors. W3C, August 2006.
[XMLSSPI]
Associating Style Sheets with XML documents, J. Clark, editor. W3C, June 1999.

Acknowledgements

The editor would like to thank the following people for their contributions to this specification (ordered by first name):

Special thanks to Brad Porter, Matt Oshry and R. Auburn, who helped edit earlier versions of this document.