
Robot policy

From Wikitech

This is a summary of the policies for crawlers, bots and remote loading websites that wish to operate on Wikipedia and other Wikimedia websites.

User agent

Main article: m:User-Agent policy

Send a User-Agent header which identifies you (the bot operator). It should contain a URL or email address which allows us to contact you if something goes wrong and we need to impose a block.

Wikimedia does not vary its content depending on the User-Agent header. You do not need to send a browser-like string to get the same content as everyone else.
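A descriptive User-Agent can be set like this; the bot name, URL, and email address below are placeholders, not a registered identity:

```python
import urllib.request

# Hypothetical descriptive User-Agent: tool name, version, and contact
# details (URL and email are placeholders for your own).
USER_AGENT = "ExampleWikiBot/1.0 (https://example.org/bot; bot@example.org)"

req = urllib.request.Request(
    "https://en.wikipedia.org/wiki/Example",
    headers={"User-Agent": USER_AGENT},
)
print(req.get_header("User-agent"))
```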

Cache hit rate

To minimise the effect of your crawler on our server cluster resources, please attempt to cooperate with our frontend caching system.

Send the header Accept-Encoding: gzip or similar, to reduce the amount of data sent over the network.
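With plain urllib, for example, you would send the header yourself and decompress the body on receipt; the `gzip.compress` call below is a stand-in for a real response body, so no network request is made:

```python
import gzip
import urllib.request

# Ask for a gzip-compressed response. urllib does not decompress
# automatically, so the body must go through gzip.decompress().
req = urllib.request.Request(
    "https://en.wikipedia.org/wiki/Example",
    headers={"Accept-Encoding": "gzip"},
)

# Stand-in for a compressed response body (no network call here):
body = gzip.compress(b"<html>...</html>")
html = gzip.decompress(body)
print(html)
```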

If you want the HTML content of articles, either fetch the main article page at http://en.wikipedia.org/wiki/ARTICLE_NAME or use the page/html/ REST API endpoint, and extract the content you need from those pages. Do not use api.php or other URL schemes.
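A sketch of building the REST endpoint URL; the title encoding shown (percent-encoding via `quote`) is an assumption about how you store titles, and the project hostname is just an example:

```python
from urllib.parse import quote

def rest_html_url(title, project="en.wikipedia.org"):
    """URL of a page's HTML via the cache-friendly page/html REST endpoint."""
    return f"https://{project}/api/rest_v1/page/html/{quote(title, safe='')}"

print(rest_html_url("Article name"))
```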

When working with the Action API, consider setting the maxage and/or smaxage parameters to enable caching of the response.
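For example, an Action API query could be built with both parameters set; the five-minute value is illustrative, not a recommendation:

```python
from urllib.parse import urlencode

# Action API query with maxage/smaxage so the client and Wikimedia's
# shared caches may reuse the response for five minutes (illustrative value).
params = {
    "action": "query",
    "prop": "info",
    "titles": "Example",
    "format": "json",
    "maxage": 300,   # client-side cache lifetime, seconds
    "smaxage": 300,  # shared (server-side) cache lifetime, seconds
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```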

API usage

Main article: mw:API:Etiquette

Some api.php queries are very expensive, due to underlying database access. Database servers are expensive and time on them needs to be strictly limited. Therefore, API queries which take more than one second of server time should not be executed in any significant volume, regardless of request rate.

Wikipedia's bulk data is stored on hard drives. Random access to that data requires seeking and is expensive for us. Thus, rvprop=content can be used to fetch old revisions of pages, but only in small volumes and low request rates (single threaded, 2 second delay). Very recent revisions, less than 3 hours old, are much cheaper to fetch than old revisions, due to caching. Downloading every new revision shortly after the edit occurs is acceptable if you have a strong need for this data. Downloading every new revision on a batched daily basis is not acceptable.
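The single-threaded, 2-second-delay pattern could be sketched as a pacing generator; `fetch_revision` in the usage comment is a placeholder, not a real function:

```python
import time

REQUEST_DELAY = 2.0  # single-threaded, 2-second delay, per the policy above

def paced(items, delay=REQUEST_DELAY, sleep=time.sleep):
    """Yield items one at a time, pausing `delay` seconds between them."""
    for i, item in enumerate(items):
        if i:
            sleep(delay)
        yield item

# Hypothetical usage:
# for revid in paced([1001, 1002, 1003]):
#     fetch_revision(revid)  # placeholder for your single-threaded fetch
```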

In general, XML and SQL downloads are preferred over any kind of API request as a means of fetching bulk data.

Avoiding accidental DoS

Crawlers which have a negative impact on general site performance may be blocked indefinitely.

The most common cause of accidental DoS is uncontrolled concurrency. Limit the number of connections you make to our servers. This limit should be on a per-destination-IP basis, not per-domain, since our server cluster hosts many domains.
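One way to bound concurrency is a semaphore guarding every outgoing request; the limit of 2 below is an illustrative safe default, not a sanctioned number:

```python
import threading

# Illustrative cap on simultaneous connections. The policy asks for a
# per-destination-IP limit; a small fixed number is the safe default.
MAX_CONNECTIONS = 2
_slots = threading.BoundedSemaphore(MAX_CONNECTIONS)

def fetch_with_limit(fetch, url):
    """Run fetch(url) while holding one of the limited connection slots."""
    with _slots:
        return fetch(url)
```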

To avoid exacerbating a temporary server overload, throttle your request rate in proportion to our response time.
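A minimal sketch of such throttling, assuming a delay of twice the last response time with a small floor (both values are assumptions, not policy numbers):

```python
def adaptive_delay(response_seconds, factor=2.0, floor=0.1):
    """Delay before the next request, proportional to how long the server
    took to answer the last one; slower responses mean fewer requests.
    The factor and floor here are illustrative."""
    return max(floor, factor * response_seconds)

print(adaptive_delay(1.5))   # slow response: back off to 3.0 seconds
print(adaptive_delay(0.02))  # fast response: keep the 0.1 second floor
```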

Request rate

Most users should limit themselves to a single connection, with a small delay (100ms or so) between requests to avoid a tight loop when there is an error condition.

Some users (for example the major search engines) have been specifically authorised to request pages at a higher rate than this.

Remote loaders

Wikimedia system administrators take a dim view of websites which proxy client requests to Wikimedia in order to dynamically generate a rebranded version of Wikipedia which is framed with ads. We consider this to be an inappropriate, uncompensated co-option of our server resources for the profit of an external party. These websites have often been blocked.

However, we are not opposed in principle to reuse of the content of Wikimedia projects, as long as that reuse follows the relevant content license. We do not currently offer any means for remote loading websites to compensate us for use of server resources under contract, due to the high administration costs of such a scheme.

Hence, our current advice to remote loading websites is:

  • Minimise your use of our server resources by cooperating with our caching system (see above), and by implementing a cache on your own server.
  • Avoid allowing your users to DoS our servers by proxy, for example by limiting the number of Apache children running on your server.
  • Express your gratitude to the Wikimedia Foundation for providing this service by donating a portion of your income, or by sponsoring an event. Wikimedia is a charity, supported by donations from users.
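The local cache suggested above could be as simple as an in-memory store with a time-to-live; this is a minimal sketch (the five-minute TTL is an assumption), not a production cache:

```python
import time

class TtlCache:
    """Minimal in-memory cache with a time-to-live, as one way for a
    remote loader to avoid re-fetching the same page from Wikimedia."""

    def __init__(self, ttl=300.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}

    def put(self, key, value):
        self._store[key] = (value, self.clock())

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # expired: drop and refetch upstream
            return None
        return value
```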