
Abstract

This specification defines the term URL, various algorithms for dealing with URLs, and an API for constructing, parsing, and resolving URLs.

The behavior specified in this document for how browsers process URLs might or might not match any particular browser, but browsers might be well-served by adopting the behavior defined herein.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is the 24 May 2012 First Public Working Draft of the URL specification. Please send comments to public-webapps@w3.org (archived) with [url] at the start of the subject line.

This document is produced by the Web Applications (WebApps) Working Group. The WebApps Working Group is part of the Rich Web Clients Activity in the W3C Interaction Domain.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

Goals

This specification is intended to be referenced by other specifications which need conformance requirements for dealing with URLs—principally, conformance requirements for user agents. To that end, this specification:

Issues

1 Conformance

Everything in this specification is normative except for diagrams, examples, notes and sections marked non-normative.

The key word must in this document is to be interpreted as described in RFC 2119. [RFC2119]

A user agent must also be a conforming implementation of the IDL fragments in this specification, as described in the Web IDL specification. [WEBIDL]

This specification uses terminology from DOM4 and The Web Origin Concept. [DOM] [ORIGIN]

2 Terminology

A parsed URL is a user agent’s in-memory representation of the result of parsing a URL.

A URL is an absolute URL if resolving it results in the same output regardless of what it is resolved relative to, and that output is not a failure.

An absolute URL is a hierarchical URL if, when resolved and then parsed, there is a character immediately after the scheme component and it is a U+002F SOLIDUS character (/).

An absolute URL is an authority-based URL if, when resolved and then parsed, there are two characters immediately after the scheme component and they are both U+002F SOLIDUS characters (//).

A string is a valid non-empty URL if it is a valid URL but it is not the empty string.

When a user agent is to strip leading and trailing whitespace from a string, the user agent must remove all space characters that are at the start or end of the string.

The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).
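
As a non-normative illustration, the two definitions above can be sketched in Python. The names `SPACE_CHARACTERS` and `strip_leading_and_trailing_whitespace` are chosen for this example only and are not defined by this specification.

```python
# The five space characters listed above:
# SPACE, CHARACTER TABULATION, LINE FEED, FORM FEED, CARRIAGE RETURN.
SPACE_CHARACTERS = "\u0020\u0009\u000A\u000C\u000D"


def strip_leading_and_trailing_whitespace(s: str) -> str:
    """Remove every space character at the start or end of s."""
    # The explicit character set matters: a bare str.strip() would also
    # remove other whitespace, such as U+000B or U+00A0, which are not
    # space characters for the purposes of this specification.
    return s.strip(SPACE_CHARACTERS)
```

Characters in the middle of the string are never removed, and whitespace outside the five listed characters is left in place even at the ends.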

Comparing two strings in an ASCII case-insensitive manner means comparing them exactly, code point for code point, except that the characters in the range U+0041 .. U+005A (that is, LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z) and the corresponding characters in the range U+0061 .. U+007A (that is, LATIN SMALL LETTER A to LATIN SMALL LETTER Z) are considered to also match.
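
A minimal, non-normative sketch of this comparison in Python follows. The helper name `ascii_case_insensitive_equal` is an assumption made for the example.

```python
# Map LATIN CAPITAL LETTER A..Z (U+0041..U+005A) to the corresponding
# small letters (U+0061..U+007A); every other code point is left alone.
_ASCII_LOWER = {cp: cp + 0x20 for cp in range(0x41, 0x5B)}


def ascii_case_insensitive_equal(a: str, b: str) -> bool:
    """Compare code point for code point, folding only ASCII letters."""
    return a.translate(_ASCII_LOWER) == b.translate(_ASCII_LOWER)
```

Unlike `str.lower()`, this leaves non-ASCII characters untouched, so U+212A KELVIN SIGN does not match "k" even though general Unicode lowercasing would map one to the other.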

A control character is a character whose value is less than or equal to U+0020 (" ").

TODO: There’s some question as to whether this is necessary for non-file URLs.

An authority terminating character is a slash character, U+003F ("?"), U+0023 ("#"), or U+003B (";").

During a parsing algorithm, the remaining string is the characters of the input that have not yet been consumed.

The term "a UTF-16 encoding" refers to any variant of UTF-16: self-describing UTF-16 with a BOM, ambiguous UTF-16 without a BOM, raw UTF-16LE, and raw UTF-16BE. [RFC2781]

3 Algorithms

3.1 Parse a URL

To parse a URL into its component parts, the user agent must use the following steps:

3.1.1 Find the scheme

3.1.2 Find the authority, path, query, and fragment

To find the authority, path, query, and fragment, the user agent must use the following steps:

3.1.3 Find the user-info, host, and port

To find the user-info, host, and port, the user agent must use the following steps:

3.1.4 Find the username and password

3.2 Resolve a URL

Should we use absolute URL here (as the HTML spec does), instead of resolved URL? Text from the HTML spec:
Resolving a URL is the process of taking a relative URL and obtaining the absolute URL that it implies.
To resolve a URL to an absolute URL relative to either another absolute URL or an element, the user agent must use the following steps. Resolving a URL can result in an error, in which case the URL is not resolvable.

Resolving a URL is the process of taking a relative URL and obtaining the resolved URL that it implies.

To resolve a string relative to a base URL, the user agent must use the following steps:

3.2.1 Resolve a string as a relative URL

To resolve a string as a relative URL, the user agent must use the steps in this section.

Given a string relative-url and a parsed URL base-url, determine the resolved URL as follows:

TODO: If base-url’s scheme is not hierarchical, we can’t resolve as a relative URL. We’ll probably want to return an invalid URL. Check what happens when resolving an empty string as a relative URL with a non-hierarchical base.

3.2.2 Resolve a string as a scheme-relative URL

To resolve a string as a scheme-relative URL, the user agent must use the steps in this section.

3.2.3 Resolve a string as an authority-relative URL

To resolve a string as an authority-relative URL, the user agent must use the steps in this section.

3.2.4 Resolve a string as a path-relative URL

To resolve a string as a path-relative URL, the user agent must use the steps in this section.

TODO: Can the first character of relative-url be a slash character at this point?

TODO: Can we assume base-url is canonicalized here so that it always has at least one “/” character?

3.2.5 Resolve a string as a query-relative URL

To resolve a string as a query-relative URL, the user agent must use the steps in this section.

3.2.6 Resolve a string as a fragment-relative URL

To resolve a string as a fragment-relative URL, the user agent must use the steps in this section.
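
To give a feel for the kinds of relative URL named in sections 3.2.2 to 3.2.6, the following non-normative sketch resolves one example of each against a single base. It uses Python's `urllib.parse.urljoin`, which follows RFC 3986-style reference resolution rather than the algorithms of this specification, so results may differ in edge cases. The pairing of each input form with a section name is an assumption based on the section titles, and the base URL and inputs are made up for the example.

```python
from urllib.parse import urljoin

BASE = "http://example.com/a/b?q#f"

# (assumed kind of relative URL, input string)
EXAMPLES = [
    ("scheme-relative", "//other.example/x"),
    ("authority-relative", "/x"),
    ("path-relative", "c"),
    ("query-relative", "?z"),
    ("fragment-relative", "#g"),
]

# Resolve each input against BASE with the standard library.
RESOLVED = {kind: urljoin(BASE, relative) for kind, relative in EXAMPLES}

for kind, resolved in RESOLVED.items():
    print(f"{kind}: {resolved}")
```

Each kind replaces the base from a different point onwards: a scheme-relative input keeps only the scheme, while a fragment-relative input keeps everything except the fragment.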

3.3 Canonicalize a URL

Canonicalizing a URL is the process of taking a parsed URL string and constructing a canonical version of it.

TODO: We probably should mention somewhere that there is not a unique canonicalization for every URL.

3.3.1 Canonicalize a scheme

3.3.2 Canonicalize a user-info

3.3.3 Canonicalize a host

3.3.3.1 Host escape normalization

To perform host escape normalization, the user agent must use the steps in this section.

3.3.4 Canonicalize a port

3.3.5 Canonicalize a path

3.3.6 Canonicalize a query

3.3.7 Canonicalize a fragment

The above algorithm results in the canonicalized fragment containing non-US-ASCII characters.

3.4 Canonicalize query parameters

The query parameter canonicalization of a string s is the query canonicalization of s, modified as follows:

3.5 Collect URL parameters

To collect the URL parameters from a string input, run the following algorithm:

3.6 URL parameter serialization

The URL parameter serialization of a list of parameters `parameters` is the result of the following algorithm:

3.7 Port setter preprocessor

The port setter preprocessor of the input string is the result of the following algorithm:

4 Interface URL

The URL object can be used by scripts to programmatically construct, parse, and resolve URLs.

4.1 Constructor

4.2 The URL decomposition IDL attributes

The URL decomposition IDL attributes must act as described in this section.

In addition, the URL interface defines an input, which is a URL that the attributes act on, and a common setter action, which is a set of steps invoked when any of the attributes' setters are invoked.

The ten URL decomposition IDL attributes have similar requirements.

On getting, if the input is an absolute URL that fulfills the condition given in the "getter condition" column corresponding to the attribute in the table below, the user agent must return the part of the input URL given in the "component" column, with any prefixes specified in the "prefix" column appropriately added to the start of the string and any suffixes specified in the "suffix" column appropriately added to the end of the string. Otherwise, the attribute must return the empty string.

On setting, the new value must first be mutated as described by the "setter preprocessor" column, then mutated by %-escaping any characters in the new value that are not valid in the relevant component as given by the "component" column. Then, if the input is an absolute URL and the resulting new value fulfills the condition given in the "setter condition" column, the user agent must make a new string output by replacing the component of the URL given by the "component" column in the input URL with the new value; otherwise, the user agent must let output be equal to the input. Finally, the user agent must invoke the common setter action with the value of output.

When replacing a component in the URL, if the component is part of an optional group in the URL syntax consisting of a character followed by the component, the component (including its prefix character) must be included even if the new value is the empty string.

The previous paragraph applies in particular to the ":" before a <port> component, the "?" before a <query> component, and the "#" before a <fragment> component.
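
The rule about keeping the prefix character can be illustrated for the <query> component with a short, non-normative Python sketch. The function name `replace_query` is hypothetical. The sketch assumes the new value has already been through the setter preprocessor and %-escaping steps, and it splits the URL with plain string operations rather than the parsing rules of this specification.

```python
def replace_query(url: str, new_query: str) -> str:
    """Replace the <query> component of url, always keeping the "?" prefix.

    Assumes url is a hierarchical absolute URL and new_query is already
    preprocessed and %-escaped.
    """
    # Everything from the first "#" onwards is the fragment; keep it as-is.
    before_fragment, hash_sep, fragment = url.partition("#")
    # Drop the old query, if there was one.
    before_query = before_fragment.partition("?")[0]
    # The "?" is emitted even when new_query is the empty string.
    return before_query + "?" + new_query + hash_sep + fragment
```

For example, replacing the query of `http://example.com/?test#top` with the empty string gives `http://example.com/?#top`, not `http://example.com/#top`.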

For the purposes of the above definitions, URLs must be parsed using the URL parsing rules defined in this specification.

TODO: Fill out the details for the username, password, and href attributes.

Attribute	Component	Getter Condition	Prefix	Suffix	Setter Preprocessor	Setter Condition
`protocol`	<scheme>	—	—	U+003A COLON (:)	Remove all trailing U+003A COLON characters (:)	The new value is not the empty string
`username`	<username>
`password`	<password>
`host`	<hostport>	input is an authority-based URL	—	—	—	The new value is not the empty string and input is an authority-based URL
`hostname`	<host>	input is an authority-based URL	—	—	Remove all leading U+002F SOLIDUS characters (/)	The new value is not the empty string and input is an authority-based URL
`port`	<port>	input is an authority-based URL, and contained a <port> component (possibly an empty one)	—	—	Run the port setter preprocessor algorithm, passing the input.	input is an authority-based URL, and the new value, when interpreted as a base-ten integer, is less than or equal to 65535
`pathname`	<path>	input is a hierarchical URL	—	—	If it has no leading U+002F SOLIDUS character (/), prepend a U+002F SOLIDUS character (/) to the new value	input is a hierarchical URL
`search`	<query>	input is a hierarchical URL, and contained a <query> component (possibly an empty one)	U+003F QUESTION MARK (?)	—	Remove one leading U+003F QUESTION MARK character (?), if any	input is a hierarchical URL
`hash`	<fragment>	input contained a non-empty <fragment> component	U+0023 NUMBER SIGN (#)	—	Remove one leading U+0023 NUMBER SIGN character (#), if any	—
`href`	<href>
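
The simpler entries in the "Setter Preprocessor" column can be sketched in Python as follows. This is a non-normative illustration: the function names are hypothetical, the port setter preprocessor is omitted because it is defined by its own algorithm in section 3.7, and the later %-escaping and setter-condition steps are not shown.

```python
def preprocess_protocol(value: str) -> str:
    # Remove all trailing U+003A COLON characters.
    return value.rstrip(":")


def preprocess_hostname(value: str) -> str:
    # Remove all leading U+002F SOLIDUS characters.
    return value.lstrip("/")


def preprocess_pathname(value: str) -> str:
    # Prepend a U+002F SOLIDUS if the value does not already start with one.
    return value if value.startswith("/") else "/" + value


def preprocess_search(value: str) -> str:
    # Remove one leading U+003F QUESTION MARK, if any.
    return value[1:] if value.startswith("?") else value


def preprocess_hash(value: str) -> str:
    # Remove one leading U+0023 NUMBER SIGN, if any.
    return value[1:] if value.startswith("#") else value
```

Note the difference between "all" and "one": `protocol` and `hostname` strip every matching character, while `search` and `hash` strip at most a single prefix character.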

The table below demonstrates how the getter condition for `search` produces different values depending on the exact original syntax of the URL:

Input URL	`search` value	Explanation
`http://example.com/`	empty string	No <query> component in input URL.
`http://example.com/?`	`?`	There is a <query> component, but it is empty. The question mark in the resulting value is the prefix.
`http://example.com/?test`	`?test`	The <query> component has the value "`test`".
`http://example.com/?test#`	`?test`	The (empty) <fragment> component is not part of the <query> component.
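
The rows of this table can be reproduced with a small, non-normative Python sketch. The function name `search_value` is hypothetical. The sketch assumes the input is already a hierarchical absolute URL and locates the components with plain string splitting rather than the parsing rules of this specification.

```python
def search_value(url: str) -> str:
    """Return what the `search` getter would yield for a hierarchical URL."""
    # The fragment, if any, is not part of the <query> component.
    before_fragment = url.partition("#")[0]
    _, question_mark, query = before_fragment.partition("?")
    if not question_mark:
        # No <query> component at all: the getter condition fails.
        return ""
    # A <query> component is present (possibly empty): add the "?" prefix.
    return "?" + query
```

The distinction between a missing <query> component and an empty one comes entirely from whether a "?" was present in the input.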

The following table is similar; it provides a list of what each of the URL decomposition IDL attributes returns for a given input URL.

Input	`protocol`	`host`	`hostname`	`port`	`pathname`	`search`	`hash`
`http://example.com/carrot#question%3f`	`http:`	`example.com`	`example.com`	(empty string)	`/carrot`	(empty string)	`#question%3f`
`https://www.example.com:4443?`	`https:`	`www.example.com:4443`	`www.example.com`	`4443`	`/`	`?`	(empty string)

4.3 The filename attribute

The filename attribute must return the (possibly empty) substring of pathname after the last U+002F SOLIDUS (/) character. (Notice that pathname must contain at least one U+002F SOLIDUS (/) character.)
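
A minimal, non-normative Python sketch of this rule follows. The function takes the value of pathname as a string and, as noted above, assumes it contains at least one U+002F SOLIDUS (/) character.

```python
def filename(pathname: str) -> str:
    """Return the substring of pathname after its last "/" character."""
    # rpartition splits on the last occurrence of the separator; the third
    # element is everything after it, which may be the empty string.
    return pathname.rpartition("/")[2]
```

For instance, a pathname of `/carrot` gives `carrot`, while a pathname of `/` or one ending in a slash gives the empty string.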

URL

W3C Working Draft 24 May 2012

Abstract

Status of this Document

Table of Contents

Goals

Issues

1 Conformance

2 Terminology

3 Algorithms

3.1 Parse a URL

3.1.1 Find the scheme

3.1.2 Find the authority, path, query, and fragment

3.1.3 Find the user-info, host, and port

3.1.4 Find the username and password

3.2 Resolve a URL

3.2.1 Resolve a string as a relative URL

3.2.2 Resolve a string as a scheme-relative URL

3.2.3 Resolve a string as an authority-relative URL

3.2.4 Resolve a string as a path-relative URL

3.2.5 Resolve a string as a query-relative URL

3.2.6 Resolve a string as a fragment-relative URL

3.3 Canonicalize a URL

3.3.1 Canonicalize a scheme

3.3.2 Canonicalize a user-info

3.3.3 Canonicalize a host

3.3.3.1 Host escape normalization

3.3.4 Canonicalize a port

3.3.5 Canonicalize a path

3.3.6 Canonicalize a query

3.3.7 Canonicalize a fragment

3.4 Canonicalize query parameters

3.5 Collect URL parameters

3.6 URL parameter serialization

3.7 Port setter preprocessor

4 Interface URL

4.1 Constructor

4.2 The URL decomposition IDL attributes

4.3 The filename attribute

4.4 The origin attribute

4.5 The getParameterNames() method

4.6 The getParameterValues() method

4.7 The hasParameter() method

4.8 The getParameter() method

4.9 The setParameter() method

4.10 The addParameter() method

4.11 The removeParameter() method

4.12 The clearParameters() method

References

Normative references
