Copyright © 2012 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This specification defines the term URL, various algorithms for dealing with URLs, and an API for constructing, parsing, and resolving URLs.
The behavior specified in this document for how browsers process URLs might or might not match any particular browser, but browsers might be well-served by adopting the behavior defined herein.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the 24 May 2012 First Public Working Draft of the URL specification. Please send comments to public-webapps@w3.org (archived) with [url] at the start of the subject line.
This document is produced by the Web Applications (WebApps) Working Group. The WebApps Working Group is part of the Rich Web Clients Activity in the W3C Interaction Domain.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This section is non-normative.
This specification is intended to be referenced by other specifications which need conformance requirements for dealing with URLs—principally, conformance requirements for user agents. To that end, this specification:
This section is non-normative.
Browsers parse URLs differently depending on which operating system they’re running on. The problem is that they want to do sensible things for file paths, but file paths look different on Windows and Unix systems.
How should we handle cases where browsers disagree with the regular expression in RFC 3986? Currently, this document aims to describe how browsers behave, but we’ll likely need to compare that to RFC 3986 at some point. Some specific differences that have been brought up on the mailing list:
Everything in this specification is normative except for diagrams, examples, notes and sections marked non-normative.
The key word must in this document is to be interpreted as described in RFC 2119. [RFC2119]
A user agent must also be a conforming implementation of the IDL fragments in this specification, as described in the Web IDL specification. [WEBIDL]
This specification uses terminology from DOM4 and The Web Origin Concept. [DOM] [ORIGIN]
A URL is a string used to identify a resource.
A parsed URL is a user agent’s in-memory representation stored as the result of parsing a URL.
A URL is an absolute URL if resolving it results in the same output regardless of what it is resolved relative to, and that output is not a failure.
An absolute URL is a hierarchical URL if, when resolved and then parsed, there is a character immediately after the scheme component and it is a U+002F SOLIDUS character (/).
An absolute URL is an authority-based URL if, when resolved and then parsed, there are two characters immediately after the scheme component and they are both U+002F SOLIDUS characters (//).
A URL is a valid URL if at least one of the following conditions holds:
The URL is a valid IRI reference and it has no query component. [RFC3987]
The URL is a valid IRI reference and its query component contains no unescaped non-ASCII characters. [RFC3987]
The URL is a valid IRI reference and the character encoding of the URL's Document is UTF-8 or a UTF-16 encoding. [RFC3987]
A string is a valid non-empty URL if it is a valid URL but it is not the empty string.
A string is a valid URL potentially surrounded by spaces if, after stripping leading and trailing whitespace from it, it is a valid URL.
A string is a valid non-empty URL potentially surrounded by spaces if, after stripping leading and trailing whitespace from it, it is a valid non-empty URL.
When a user agent is to strip leading and trailing whitespace from a string, the user agent must remove all space characters that are at the start or end of the string.
The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).
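This example is non-normative. It is a minimal sketch of stripping leading and trailing whitespace as defined above; the names SPACE_CHARACTERS and strip_leading_and_trailing_whitespace are illustrative and do not appear in this specification.

```python
# The five space characters listed in this section (assumed constant name).
SPACE_CHARACTERS = "\u0020\u0009\u000A\u000C\u000D"


def strip_leading_and_trailing_whitespace(s: str) -> str:
    """Remove every space character from the start and end of s."""
    # Passing the explicit set matters: a bare str.strip() would also
    # remove characters such as U+000B, which this section does not list.
    return s.strip(SPACE_CHARACTERS)
```

Note that U+000B LINE TABULATION is not in the list, so a string that starts with it is returned unchanged by this sketch.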
Comparing two strings in an ASCII case-insensitive manner means comparing them exactly, code point for code point, except that the characters in the range U+0041 .. U+005A (that is, LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z) and the corresponding characters in the range U+0061 .. U+007A (that is, LATIN SMALL LETTER A to LATIN SMALL LETTER Z) are considered to also match.
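This example is non-normative. It sketches one possible way to compare two strings in an ASCII case-insensitive manner; the helper names ascii_lower and ascii_case_insensitive_equal are assumptions made for illustration.

```python
def ascii_lower(s: str) -> str:
    """Map only U+0041..U+005A to U+0061..U+007A; leave everything else alone."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def ascii_case_insensitive_equal(a: str, b: str) -> bool:
    """Compare code point for code point after ASCII-only case folding."""
    return ascii_lower(a) == ascii_lower(b)
```

A general Unicode case mapping is deliberately avoided here, because it would treat some non-ASCII characters (for example U+212A KELVIN SIGN) as matching ASCII letters.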
A control character is a character whose value is less than or equal to U+0020 (" ").
A slash character is either U+002F ("/") or U+005C ("\").
TODO: There’s some question as to whether this is necessary for non-file URLs.
An authority terminating character is either a slash character, U+003F ("?"), U+0023 ("#"), or U+003B (";").
TODO: Why is ";" on this list?
During a parsing algorithm, the remaining string is the characters of the input that have not yet been consumed.
The term a UTF-16 encoding refers to any variant of UTF-16: self-describing UTF-16 with a BOM, ambiguous UTF-16 without a BOM, raw UTF-16LE, and raw UTF-16BE. [RFC2781]
This section defines algorithms for dealing with URLs.
To parse a URL into its component parts, the user agent must use the following steps:
Don’t we actually want to “strip leading and trailing whitespace” here? (= removing “space characters” as defined in the Terminology section)
(TODO: Just ALPHA?)
TODO: Windows drive specs!
"file": TODO: File URLs!
"mailto": I think mailto URLs are special, but more testing is required.
TODO: How are we supposed to know at this point if the scheme is hierarchical? Determining if the scheme is hierarchical requires checking whether the first character of the after-scheme is a solidus/slash. So perhaps we need to explicitly say that here or in the “Find a scheme” algorithm; that is, explicitly say, “If the first character of the after-scheme is a solidus (slash character?), then the URL is a hierarchical URL.” Related to that, would it be better to only use the term “hierarchical URL” consistently rather than talking about the scheme being hierarchical? (After all, it really is the URL that’s hierarchical, not the scheme...)
TODO: This might not be the best approach. We need to do more testing of data and javascript URLs.
To find the scheme, the user agent must use the following steps:
To find the authority, path, query, and fragment, the user agent must use the following steps:
To find the user-info, host, and port, the user agent must use the following steps:
To find the username and password, the user agent must use the following steps:
Should we use absolute URL here (as the HTML spec does), instead of resolved URL? Text from the HTML spec:
Resolving a URL is the process of taking a relative URL and obtaining the absolute URL that it implies.
To resolve a URL to an absolute URL relative to either another absolute URL or an element, the user agent must use the following steps. Resolving a URL can result in an error, in which case the URL is not resolvable.
Resolving a URL is the process of taking a relative URL and obtaining the resolved URL that it implies.
To resolve a string relative to a base URL, the user agent must use the following steps:
TODO: We probably need to trim leading and trailing control characters.
TODO: Define valid scheme characters
To resolve a string as a relative URL, the user agent must use the steps in this section.
Given a string relative-url and a parsed URL base-url, determine the resolved URL as follows:
TODO: If base-url’s scheme is not hierarchical, we can’t resolve as a relative URL. We’ll probably want to return an invalid URL. Check what happens when resolving an empty string as a relative URL with a non-hierarchical base.
TODO: Think about the case where the relative-url is empty.
To resolve a string as a scheme-relative URL, the user agent must use the steps in this section.
To resolve a string as an authority-relative URL, the user agent must use the steps in this section.
To resolve a string as a path-relative URL, the user agent must use the steps in this section.
TODO: Can the first character of relative-url be a slash character at this point?
TODO: Can we assume base-url is canonicalized here so that it always has at least one “/” character?
To resolve a string as a query-relative URL, the user agent must use the steps in this section.
To resolve a string as a fragment-relative URL, the user agent must use the steps in this section.
Canonicalizing a URL is the process of taking a parsed URL string and constructing a canonical version of it.
TODO: We probably should mention somewhere that there is not a unique canonicalization for every URL.
To canonicalize a URL, the user agent must use the steps in this section.
TODO: Handle file URLs.
TODO: Distinguish between empty and non-existent queries
TODO: Distinguish between empty and non-existent fragments
To canonicalize a scheme, the user agent must use the steps in this section.
To canonicalize a user-info, the user agent must use the steps in this section.
TODO: which characters?
TODO: which characters?
To canonicalize a host, the user agent must use the steps in this section.
TODO: Handle IP addresses.
TODO: Properly reference IDNA's to-ascii algorithm (we might need a wrapper like we do in the cookie spec).
To perform host escape normalization, the user agent must use the steps in this section.
TODO: Handle percent-unescaping.
To canonicalize a port, the user agent must use the steps in this section.
TODO: ...
To canonicalize a path, the user agent must use the steps in this section.
TODO: Do we need to ensure that paths always start with a slash character?
TODO: Handle "." collapsing.
TODO: Handle percent-unescaping.
TODO: Handle the ambient encoding case.
To canonicalize a query, the user agent must use the steps in this section.
TODO: which characters?
TODO: We need to handle the goofy query escaping format.
To canonicalize a fragment, the user agent must use the steps in this section.
The above algorithm results in the canonicalized fragment containing non-US-ASCII characters.
The query parameter canonicalization of a string s is the query canonicalization of s, modified as follows:
Replace all instances of the U+0026 AMPERSAND (&) character with %26.
Replace all instances of the U+003D EQUALS SIGN (=) character with %3D.
To collect the URL parameters from a string input, run the following algorithm:
Let result be the empty list.
Let parameters be the result of splitting input on the U+0026 AMPERSAND (&) character.
Process each parameter in parameters:
If parameter is the empty string, continue to the next parameter, if any.
If parameter does not contain a U+003D EQUALS SIGN (=) character:
Append a parameter with name parameter and a null value to result.
Continue to the next parameter, if any.
Let name be the (possibly empty) sequence of characters of parameter up to, but not including, the first U+003D EQUALS SIGN (=) character.
Let value be the (possibly empty) sequence of characters of parameter after the first U+003D EQUALS SIGN (=) character.
Append a parameter with name name and value value to result.
Return result.
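This example is non-normative. The steps above can be sketched as follows, assuming a parameter is represented as a (name, value) tuple whose value is None when it is null; that representation and the function name are illustrative only.

```python
from typing import List, Optional, Tuple

# Assumed representation: (name, value), with value None for a null value.
Parameter = Tuple[str, Optional[str]]


def collect_url_parameters(input_string: str) -> List[Parameter]:
    """Split a query string into parameters, following the steps above."""
    result: List[Parameter] = []
    for parameter in input_string.split("&"):
        if parameter == "":
            continue  # empty parameters are skipped
        if "=" not in parameter:
            result.append((parameter, None))  # name only, null value
            continue
        # Split on the first "=" only; later "=" characters stay in the value.
        name, value = parameter.split("=", 1)
        result.append((name, value))
    return result
```

Under these assumptions, the input "a=1&b&&c=&e=f=g" would yield a parameter "b" with a null value, a parameter "c" with an empty value, and a parameter "e" whose value is "f=g".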
The URL parameter serialization of a list of parameters parameters is the result of the following algorithm:
Let result be the empty list.
Process each parameter in parameters:
Let s be the query parameter canonicalization of the parameter's name.
If the parameter's value is non-null:
Append a U+003D EQUALS SIGN (=) character to s.
Append the query parameter canonicalization of the parameter's value to s.
Append s to result.
Return the elements of result concatenated, each separated from the next by a U+0026 AMPERSAND (&) character.
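This example is non-normative. It sketches the serialization steps above under two assumptions: parameters are (name, value) tuples with None standing for a null value, and the query canonicalization, which this draft does not yet fully define, is replaced by a placeholder that returns its input unchanged.

```python
from typing import List, Optional, Tuple

Parameter = Tuple[str, Optional[str]]


def query_canonicalization(s: str) -> str:
    # Placeholder (assumption): the real algorithm is still marked TODO.
    return s


def query_parameter_canonicalization(s: str) -> str:
    """Query canonicalization, then escape "&" and "=" as described above."""
    return query_canonicalization(s).replace("&", "%26").replace("=", "%3D")


def serialize_url_parameters(parameters: List[Parameter]) -> str:
    """Serialize a list of parameters into a query string."""
    result = []
    for name, value in parameters:
        s = query_parameter_canonicalization(name)
        if value is not None:
            s += "=" + query_parameter_canonicalization(value)
        result.append(s)
    return "&".join(result)
```

Escaping "&" and "=" inside names and values keeps them from being mistaken for separators when the result is later split again.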
The port setter preprocessor of the input string is the result of the following algorithm:
If the first character in input is not in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9) then return a one character string containing a single U+0030 DIGIT ZERO (0) character.
Let result be the empty string.
Let c be the first character in input.
While c is U+0030 DIGIT ZERO (0):
Let c be the next character in input.
While c is in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9):
Append c to the result.
Let c be the next character in input.
Return result.
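This example is non-normative. It is a literal sketch of the port setter preprocessor steps above. The function name is illustrative, and treating an empty input as having no leading digit is an assumption, since the steps do not address that case.

```python
DIGITS = "0123456789"


def port_setter_preprocessor(input_string: str) -> str:
    """Skip leading zeros, then collect the run of digits that follows."""
    # Assumption: an empty input is handled like a non-digit first character.
    if not input_string or input_string[0] not in DIGITS:
        return "0"
    i = 0
    # Skip every leading U+0030 DIGIT ZERO.
    while i < len(input_string) and input_string[i] == "0":
        i += 1
    result = ""
    # Collect digits until a non-digit or the end of the input.
    while i < len(input_string) and input_string[i] in DIGITS:
        result += input_string[i]
        i += 1
    return result
```

Read literally, the steps produce the empty string for an input that consists only of zeros, because every digit is consumed by the first loop.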
URL
The URL object can be used by scripts to programmatically construct, parse, and resolve URLs.
[Constructor(DOMString url, optional DOMString baseURL)]
interface URL {
  attribute DOMString protocol;
  attribute DOMString username;
  attribute DOMString password;
  attribute DOMString host;
  attribute DOMString hostname;
  attribute DOMString port;
  attribute DOMString pathname;
  attribute DOMString search;
  attribute DOMString hash;
  attribute DOMString filename;
  readonly attribute DOMString origin;
  sequence<DOMString> getParameterNames();
  sequence<DOMString> getParameterValues(DOMString name);
  boolean hasParameter(DOMString name);
  DOMString? getParameter(DOMString name);
  void setParameter(DOMString name, DOMString value);
  void addParameter(DOMString name, DOMString value);
  void removeParameter(DOMString name);
  void clearParameters();
  stringifier attribute DOMString href;
};
When the URL(url, baseURL) constructor is invoked, these steps must be run:
Store the parsed URL.
protocol [ = value ]
Returns the current scheme of the underlying URL.
Can be set, to change the underlying URL's scheme.
host [ = value ]
Returns the current host and port (if it's not the default port) in the underlying URL.
Can be set, to change the underlying URL's host and port.
The host and the port are separated by a colon. The port part, if omitted, will be assumed to be the current scheme's default port.
username [ = value ]
TODO: ...
password [ = value ]
TODO: ...
hostname [ = value ]
Returns the current host in the underlying URL.
Can be set, to change the underlying URL's host.
port [ = value ]
Returns the current port in the underlying URL.
Can be set, to change the underlying URL's port.
pathname [ = value ]
Returns the current path in the underlying URL.
Can be set, to change the underlying URL's path.
search [ = value ]
Returns the current query component in the underlying URL.
Can be set, to change the underlying URL's query component.
hash [ = value ]
Returns the current fragment identifier in the underlying URL.
Can be set, to change the underlying URL's fragment identifier.
href [ = value ]
TODO: ...
The URL decomposition IDL attributes must act as described in this section.
In addition, the URL interface defines an input, which is a URL that the attributes act on, and a common setter action, which is a set of steps invoked when any of the attributes' setters are invoked.
The ten URL decomposition IDL attributes have similar requirements.
On getting, if the input is an absolute URL that fulfills the condition given in the "getter condition" column corresponding to the attribute in the table below, the user agent must return the part of the input URL given in the "component" column, with any prefixes specified in the "prefix" column appropriately added to the start of the string and any suffixes specified in the "suffix" column appropriately added to the end of the string. Otherwise, the attribute must return the empty string.
On setting, the new value must first be mutated as described by the "setter preprocessor" column, then mutated by %-escaping any characters in the new value that are not valid in the relevant component as given by the "component" column. Then, if the input is an absolute URL and the resulting new value fulfills the condition given in the "setter condition" column, the user agent must make a new string output by replacing the component of the URL given by the "component" column in the input URL with the new value; otherwise, the user agent must let output be equal to the input. Finally, the user agent must invoke the common setter action with the value of output.
When replacing a component in the URL, if the component is part of an optional group in the URL syntax consisting of a character followed by the component, the component (including its prefix character) must be included even if the new value is the empty string.
The previous paragraph applies in particular to the ":" before a <port> component, the "?" before a <query> component, and the "#" before a <fragment> component.
For the purposes of the above definitions, URLs must be parsed using the URL parsing rules defined in this specification.
TODO: Fill out the details for the username, password, and href attributes.
Attribute | Component | Getter Condition | Prefix | Suffix | Setter Preprocessor | Setter Condition |
---|---|---|---|---|---|---|
protocol | <scheme> | — | — | U+003A COLON (:) | Remove all trailing U+003A COLON characters (:) | The new value is not the empty string |
username | <username> | | | | | |
password | <password> | | | | | |
host | <hostport> | input is an authority-based URL | — | — | — | The new value is not the empty string and input is an authority-based URL |
hostname | <host> | input is an authority-based URL | — | — | Remove all leading U+002F SOLIDUS characters (/) | The new value is not the empty string and input is an authority-based URL |
port | <port> | input is an authority-based URL, and contained a <port> component (possibly an empty one) | — | — | Run the port setter preprocessor algorithm, passing the input. | input is an authority-based URL, and the new value, when interpreted as a base-ten integer, is less than or equal to 65535 |
pathname | <path> | input is a hierarchical URL | — | — | If it has no leading U+002F SOLIDUS character (/), prepend a U+002F SOLIDUS character (/) to the new value | input is a hierarchical URL |
search | <query> | input is a hierarchical URL, and contained a <query> component (possibly an empty one) | U+003F QUESTION MARK (?) | — | Remove one leading U+003F QUESTION MARK character (?), if any | input is a hierarchical URL |
hash | <fragment> | input contained a non-empty <fragment> component | U+0023 NUMBER SIGN (#) | — | Remove one leading U+0023 NUMBER SIGN character (#), if any | — |
href | <href> | | | | | |
The table below demonstrates how the getter condition for search produces different results depending on the exact original syntax of the URL:
Input URL | search value | Explanation |
---|---|---|
http://example.com/ | empty string | No <query> component in input URL. |
http://example.com/? | ? | There is a <query> component, but it is empty. The question mark in the resulting value is the prefix. |
http://example.com/?test | ?test | The <query> component has the value "test". |
http://example.com/?test# | ?test | The (empty) <fragment> component is not part of the <query> component. |
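This example is non-normative. It reproduces the rows of the table above with a rough sketch of the search getter. It assumes the input is already known to be a hierarchical absolute URL, and it uses plain string splitting in place of the URL parsing rules; the function name search_value is hypothetical.

```python
def search_value(url: str) -> str:
    """Return the value the search getter would expose for a simple URL."""
    # The fragment starts at the first "#" and is never part of the query.
    before_fragment = url.split("#", 1)[0]
    if "?" not in before_fragment:
        return ""  # no <query> component at all
    # A present but empty <query> still yields the "?" prefix.
    return "?" + before_fragment.split("?", 1)[1]
```

The distinction between a missing query and an empty one is what makes "http://example.com/" and "http://example.com/?" produce different values.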
The following table is similar; it provides a list of what each of the URL decomposition IDL attributes returns for a given input URL.
Input | protocol | host | hostname | port | pathname | search | hash |
---|---|---|---|---|---|---|---|
http://example.com/carrot#question%3f | http: | example.com | example.com | (empty string) | /carrot | (empty string) | #question%3f |
https://www.example.com:4443? | https: | www.example.com:4443 | www.example.com | 4443 | / | ? | (empty string) |
filename attribute
The filename attribute must return the (possibly empty) substring of pathname after the last U+002F SOLIDUS (/) character. (Notice that pathname must contain at least one U+002F SOLIDUS (/) character.)
On setting...
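This example is non-normative. It sketches the getter only, since the setter behaviour is not yet specified; the function name is illustrative, and the pathname is assumed to have been obtained already.

```python
def filename_from_pathname(pathname: str) -> str:
    """Return the substring of pathname after its last "/" character."""
    # rpartition splits at the last "/"; the third element is what follows it.
    return pathname.rpartition("/")[2]
```

For a pathname that ends in a U+002F SOLIDUS (/) character, such as "/a/b/", this sketch returns the empty string.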
origin attribute
The origin attribute must return the Unicode serialization of the stored URL's origin.
getParameterNames() method
The getParameterNames method must run these steps:
Collect the URL parameters from the stored URL's query component and let parameters be the result.
Let result be the empty array.
For each parameter in parameters, if the parameter's name is not contained in result, append the parameter's name to result.
Return result.
getParameterValues() method
The getParameterValues method must run these steps:
Collect the URL parameters from the stored URL's query component and let parameters be the result.
Let result be the empty array.
For each parameter in parameters, if the parameter's name is equal to name, append the parameter's value to result.
Return result.
hasParameter() method
The hasParameter method must run these steps:
Collect the URL parameters from the stored URL's query component and let parameters be the result.
For each parameter in parameters, if the parameter's name is equal to name, return true.
Return false.
getParameter() method
The getParameter method must run these steps:
Let values be the result of invoking the getParameterValues() method with name as argument.
If values is empty, return null, and terminate these steps.
Return the first element of values.
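This example is non-normative. It sketches the four read-only parameter methods as plain functions over a query string. It assumes the query component is passed without its leading "?" and that parameters are (name, value) tuples with None for a null value; all function names are illustrative.

```python
from typing import List, Optional, Tuple


def collect_url_parameters(query: str) -> List[Tuple[str, Optional[str]]]:
    """Collect the URL parameters from a query string."""
    result = []
    for parameter in query.split("&"):
        if parameter == "":
            continue
        name, sep, value = parameter.partition("=")
        # sep is empty when there is no "=", which means a null value.
        result.append((name, value if sep else None))
    return result


def get_parameter_names(query: str) -> List[str]:
    result: List[str] = []
    for name, _ in collect_url_parameters(query):
        if name not in result:  # each name is listed once, in first-seen order
            result.append(name)
    return result


def get_parameter_values(query: str, name: str) -> List[Optional[str]]:
    return [v for n, v in collect_url_parameters(query) if n == name]


def has_parameter(query: str, name: str) -> bool:
    return any(n == name for n, _ in collect_url_parameters(query))


def get_parameter(query: str, name: str) -> Optional[str]:
    values = get_parameter_values(query, name)
    return values[0] if values else None
```

With the query "a=1&b=2&a=3&c", the name "a" has two values and get_parameter returns only the first of them.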
setParameter() method
The setParameter method must run these steps:
If name is the empty string and name is null, throw a "SyntaxError" and terminate these steps.
Collect the URL parameters from the stored URL's query component and let parameters be the result.
Remove all parameters from parameters with name name.
Append a parameter to parameters with name name and value value.
Let serialized-parameters be the URL parameter serialization of parameters.
Replace the stored URL's query component with serialized-parameters.
addParameter() method
The addParameter method must run these steps:
If name is the empty string and name is null, throw a "SyntaxError" and terminate these steps.
Collect the URL parameters from the stored URL's query component and let parameters be the result.
Append a parameter to parameters with name name and value value.
Let serialized-parameters be the URL parameter serialization of parameters.
Replace the stored URL's query component with serialized-parameters.
removeParameter() method
The removeParameter method must run these steps:
Collect the URL parameters from the stored URL's query component and let parameters be the result.
Remove all parameters from parameters with name name.
Let serialized-parameters be the URL parameter serialization of parameters.
Replace the stored URL's query component with serialized-parameters.
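This example is non-normative. It sketches the setParameter and removeParameter steps as functions that take a query string and return the new one. The name validation step is omitted, the query canonicalization is assumed to leave its input unchanged because this draft does not yet define it, and all function names are illustrative.

```python
from typing import List, Optional, Tuple

Parameter = Tuple[str, Optional[str]]


def collect_url_parameters(query: str) -> List[Parameter]:
    result = []
    for parameter in query.split("&"):
        if parameter == "":
            continue
        name, sep, value = parameter.partition("=")
        result.append((name, value if sep else None))
    return result


def escape(s: str) -> str:
    # Assumption: only the "&" and "=" replacements are applied here.
    return s.replace("&", "%26").replace("=", "%3D")


def serialize_url_parameters(parameters: List[Parameter]) -> str:
    parts = []
    for name, value in parameters:
        parts.append(escape(name) if value is None else escape(name) + "=" + escape(value))
    return "&".join(parts)


def set_parameter(query: str, name: str, value: str) -> str:
    """Drop every parameter called name, then append the new one."""
    parameters = [(n, v) for n, v in collect_url_parameters(query) if n != name]
    parameters.append((name, value))
    return serialize_url_parameters(parameters)


def remove_parameter(query: str, name: str) -> str:
    """Drop every parameter called name."""
    parameters = [(n, v) for n, v in collect_url_parameters(query) if n != name]
    return serialize_url_parameters(parameters)
```

Because the new parameter is appended after the removal, setting an existing name moves it to the end of the query in this sketch.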
clearParameters() method
The clearParameters method must run these steps:
Replace the stored URL's query component with the empty string.