The source analyzed here is from the MongoDB 2.6 branch.
Query operations in mongod
As mentioned in the article on mongod's initialization, when the server receives a message from a client it calls MyMessageHandler::process to handle it.
class MyMessageHandler : public MessageHandler {
public:
...
virtual void process( Message& m , AbstractMessagingPort* port , LastError * le) {
while ( true ) {
...
DbResponse dbresponse;
try {
assembleResponse( m, dbresponse, port->remote() );
}
catch ( const ClockSkewException & ) {
log() << "ClockSkewException - shutting down" << endl;
exitCleanly( EXIT_CLOCK_SKEW );
}
...
}
}
};
The DbResponse object (DbResponse dbresponse;) encapsulates the server's response after a message is handled. Before diving into the message handling itself, look at the Operations enum, which lists all of MongoDB's wire-protocol operation types:
enum Operations {
opReply = 1, /* reply. responseTo is set. */
dbMsg = 1000, /* generic msg command followed by a string */
dbUpdate = 2001, /* update object */
dbInsert = 2002, // insert documents
//dbGetByOID = 2003,
dbQuery = 2004, // query
dbGetMore = 2005, // fetch more results from an existing cursor
dbDelete = 2006, // delete documents
dbKillCursors = 2007 // close cursors
};
The Message object records the operation type of the current message. This article analyzes only the dbQuery path; the other operations will be covered in later articles.
As seen above, process calls assembleResponse to handle the message and fill in the response object (DbResponse dbresponse). The following excerpt is from assembleResponse:
int op = m.operation();
bool isCommand = false;
DbMessage dbmsg(m);
if ( op == dbQuery ) {
const char *ns = dbmsg.getns();
if (strstr(ns, ".$cmd")) {
isCommand = true;
opwrite(m);
if( strstr(ns, ".$cmd.sys.") ) {
if( strstr(ns, "$cmd.sys.inprog") ) {
inProgCmd(m, dbresponse);
return;
}
if( strstr(ns, "$cmd.sys.killop") ) {
killOp(m, dbresponse);
return;
}
if( strstr(ns, "$cmd.sys.unlock") ) {
unlockFsync(ns, m, dbresponse);
return;
}
}
}
else {
opread(m); // not a command: record the read op (e.g. for the diagnostic log)
}
}
...
Before reading the code above, one MongoDB source concept needs introducing: the namespace (abbreviated ns). An ns identifies a collection together with its database and is usually written as "db name" + "." + "collection name". If the ns contains ".$cmd", the current operation is a command rather than a plain query. So the code above first checks whether the request is a database command; if it is, the command is handled and the function returns.
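The ns convention described above can be sketched as a small standalone helper. Note these are hypothetical names, not the real MongoDB API (the real code uses NamespaceString and raw strstr checks), and this simplified version only recognizes a collection part that is exactly "$cmd":

```cpp
#include <string>
#include <utility>

// Hypothetical helper: split an ns string "db.collection" into its
// database and collection parts at the first '.', mirroring how
// NamespaceString interprets a namespace.
std::pair<std::string, std::string> splitNs(const std::string& ns) {
    std::string::size_type dot = ns.find('.');
    if (dot == std::string::npos)
        return { ns, "" };                       // no collection part
    return { ns.substr(0, dot), ns.substr(dot + 1) };
}

// A query whose collection part is "$cmd" is really a command.
bool isCommandNs(const std::string& ns) {
    return splitNs(ns).second == "$cmd";
}
```

So a query against "test.$cmd" would be routed down the command path, while "test.foo" is an ordinary collection query.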
// Increment op counters.
switch (op) {
case dbQuery:
if (!isCommand) {
//increment the query op counter (reported via serverStatus opcounters)
globalOpCounters.gotQuery();
}
else {
// Command counting is deferred, since it is not known yet whether the command
// needs counting.
}
break;
...
}
...
//now the main event: run the query
if ( op == dbQuery ) {
if ( handlePossibleShardedMessage( m , &dbresponse ) )
return;
receivedQuery(c , dbresponse, m );
}
All of the code so far is just dispatch: each kind of request is routed to the function that handles it, and query requests go to receivedQuery.
static bool receivedQuery(Client& c, DbResponse& dbresponse, Message& m ) {
...
DbMessage d(m);
QueryMessage q(d);
auto_ptr< Message > resp( new Message() );
CurOp& op = *(c.curop());
try {
NamespaceString ns(d.getns());
cout << "receivedQuery NamespaceString : " << d.getns() << endl;
if (!ns.isCommand()) {
//authorization check for the query
// Auth checking for Commands happens later.
Client* client = &cc();
Status status = client->getAuthorizationSession()->checkAuthForQuery(ns, q.query);
audit::logQueryAuthzCheck(client, ns, q.query, status.code());
uassertStatusOK(status);
}
dbresponse.exhaustNS = newRunQuery(m, q, op, *resp);
verify( !resp->empty() );
}
catch (...)
{
...
}
...
return ok;
}
receivedQuery has two parts: running the query, and handling the result (the second part is elided here). Before the query runs, an authorization check is performed; if the current user lacks permission on this collection, an exception is thrown. If the check passes, newRunQuery executes the query.
/**
* Run the query 'q' and place the result in 'result'.
*/
std::string newRunQuery(Message& m, QueryMessage& q, CurOp& curop, Message &result);
Now comes the heart of query execution. The whole process covers loading the data, parsing the query, and scanning the collection for matches. Since I am not yet deeply familiar with MongoDB, some of the scan-and-match details are beyond me for now, so they are skipped here; this article follows the overall query flow, leaving the internals for later study.
const NamespaceString nsString(ns);
uassert(16256, str::stream() << "Invalid ns [" << ns << "]", nsString.isValid());
// Set curop information.
curop.debug().ns = ns;
curop.debug().ntoreturn = q.ntoreturn;
curop.debug().query = q.query;
curop.setQuery(q.query);
// If the query is really a command, run it.
if (nsString.isCommand()) {
int nToReturn = q.ntoreturn;
uassert(16979, str::stream() << "bad numberToReturn (" << nToReturn
<< ") for $cmd type ns - can only be 1 or -1",
nToReturn == 1 || nToReturn == -1);
curop.markCommand();
BufBuilder bb;
bb.skip(sizeof(QueryResult));
BSONObjBuilder cmdResBuf;
if (!runCommands(ns, q.query, curop, bb, cmdResBuf, false, q.queryOptions)) {
uasserted(13530, "bad or malformed command request?");
}
curop.debug().iscommand = true;
// TODO: Does this get overwritten/do we really need to set this twice?
curop.debug().query = q.query;
QueryResult* qr = reinterpret_cast<QueryResult*>(bb.buf());
bb.decouple();
qr->setResultFlagsToOk();
qr->len = bb.len();
curop.debug().responseLength = bb.len();
qr->setOperation(opReply);
qr->cursorId = 0;
qr->startingFrom = 0;
qr->nReturned = 1;
result.setData(qr, true);
return "";
}
Earlier code already handled a few special commands (killop, unlock, and so on); this block handles any remaining commands and returns immediately. If the request is not a command, execution continues into the core of the query path:
// This is a read lock. We require this because if we're parsing a $where, the
// where-specific parsing code assumes we have a lock and creates execution machinery that
// requires it.
Client::ReadContext ctx(q.ns);
Collection* collection = ctx.ctx().db()->getCollection( ns );
// Parse the qm into a CanonicalQuery.
CanonicalQuery* cq;
Status canonStatus = CanonicalQuery::canonicalize(q, &cq);
if (!canonStatus.isOK()) {
uasserted(17287, str::stream() << "Can't canonicalize query: " << canonStatus.toString());
}
verify(cq);
QLOG() << "Running query:\n" << cq->toString();
LOG(2) << "Running query: " << cq->toStringShort();
// Parse, canonicalize, plan, transcribe, and get a runner.
Runner* rawRunner = NULL;
// We use this a lot below.
const LiteParsedQuery& pq = cq->getParsed();
// We'll now try to get the query runner that will execute this query for us. There
// are a few cases in which we know upfront which runner we should get and, therefore,
// we shortcut the selection process here.
//
// (a) If the query is over a collection that doesn't exist, we get a special runner
// that's is so (a runner) which doesn't return results, the EOFRunner.
//
// (b) if the query is a replication's initial sync one, we get a SingleSolutinRunner
// that uses a specifically designed stage that skips extents faster (see details in
// exec/oplogstart.h)
//
// Otherwise we go through the selection of which runner is most suited to the
// query + run-time context at hand.
Status status = Status::OK();
if (collection == NULL) {
rawRunner = new EOFRunner(cq, cq->ns());
}
else if (pq.hasOption(QueryOption_OplogReplay)) {
status = getOplogStartHack(collection, cq, &rawRunner);
}
else {
// Takes ownership of cq.
size_t options = QueryPlannerParams::DEFAULT;
if (shardingState.needCollectionMetadata(pq.ns())) {
options |= QueryPlannerParams::INCLUDE_SHARD_FILTER;
}
status = getRunner(cq, &rawRunner, options);
}
if (!status.isOK()) {
// NOTE: Do not access cq as getRunner has deleted it.
uasserted(17007, "Unable to execute query: " + status.reason());
}
The code above covers data loading, query parsing, and query-plan selection. Let's walk through it in a bit more detail.
// This is a read lock. We require this because if we're parsing a $where, the
// where-specific parsing code assumes we have a lock and creates execution machinery that
// requires it.
Client::ReadContext ctx(q.ns);
As the comment says, this takes a "read lock", but it actually does more than that.
/** "read lock, and set my context, all in one operation"
* This handles (if not recursively locked) opening an unopened database.
*/
class ReadContext : boost::noncopyable {
public:
ReadContext(const std::string& ns, const std::string& path=storageGlobalParams.dbpath);
Context& ctx() { return *c.get(); }
private:
scoped_ptr<Lock::DBRead> lk;
scoped_ptr<Context> c;
};
ReadContext acts somewhat like a proxy or adapter: it holds a Context object and layers a Lock::DBRead on top to provide the read lock.
/** "read lock, and set my context, all in one operation"
* This handles (if not recursively locked) opening an unopened database.
*/
Client::ReadContext::ReadContext(const string& ns, const std::string& path) {
{
lk.reset( new Lock::DBRead(ns) );
Database *db = dbHolder().get(ns, path);
if( db ) {
c.reset( new Context(path, ns, db) );
return;
}
}
// we usually don't get here, so doesn't matter how fast this part is
{
if( Lock::isW() ) {
// write locked already
DEV RARELY log() << "write locked on ReadContext construction " << ns << endl;
c.reset(new Context(ns, path));
}
else if( !Lock::nested() ) {
lk.reset(0);
{
Lock::GlobalWrite w;
Context c(ns, path);
}
// db could be closed at this interim point -- that is ok, we will throw, and don't mind throwing.
lk.reset( new Lock::DBRead(ns) );
c.reset(new Context(ns, path));
}
else {
uasserted(15928, str::stream() << "can't open a database from a nested read lock " << ns);
}
}
}
In the ReadContext constructor, the database named in the ns is locked first (as noted earlier, an ns contains both the database name and the collection name), then the Database object is looked up by ns and database path. A Database object represents one database (its data-loading logic is not analyzed here). If the db object is found, the context is set up and the constructor returns.
If the db object is not found, execution falls through to:
lk.reset( new Lock::DBRead(ns) );
c.reset(new Context(ns, path));
Context has several constructors; the one used here creates the db object and loads its data. With the database locked, the core query path begins.
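The open-on-first-use behavior can be sketched with a toy database holder. All names below are illustrative, not the real dbHolder()/Database API, and the real code guards this with the locking dance shown above:

```cpp
#include <map>
#include <memory>
#include <string>

// Toy stand-in for a database object.
struct Database { std::string name; };

// Sketch of a holder that caches open databases by name and "opens"
// a database lazily the first time it is requested.
class DbHolder {
public:
    // Return the database for this ns if already open, else nullptr
    // (mirrors the fast path in ReadContext's constructor).
    Database* get(const std::string& ns) {
        auto it = dbs_.find(dbName(ns));
        return it == dbs_.end() ? nullptr : it->second.get();
    }
    // Slow path: open (create) the database on first access.
    Database* getOrCreate(const std::string& ns) {
        std::string db = dbName(ns);
        auto& slot = dbs_[db];
        if (!slot) slot.reset(new Database{db});  // lazy "open"
        return slot.get();
    }
private:
    static std::string dbName(const std::string& ns) {
        return ns.substr(0, ns.find('.'));   // ns = "db.collection"
    }
    std::map<std::string, std::unique_ptr<Database>> dbs_;
};
```

Two namespaces in the same database resolve to the same cached Database object, which is why the constructor only needs the db-name prefix of the ns.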
// Parse the qm into a CanonicalQuery.
CanonicalQuery* cq;
Status canonStatus = CanonicalQuery::canonicalize(q, &cq);
The query message is first parsed into a canonicalized query object; mainly this converts the BSON query into a MatchExpression tree, which is much easier to evaluate.
Then a Runner object is obtained to execute the query:
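As a rough illustration of why a parsed tree is easier to evaluate than raw BSON, here is a miniature, purely hypothetical match-expression tree (not the real mongo::MatchExpression API); documents are reduced to string-to-int maps, with only equality and AND supported:

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

using Doc = std::map<std::string, int>;  // toy document model

// Base node of the expression tree.
struct MatchExpr {
    virtual ~MatchExpr() {}
    virtual bool matches(const Doc& d) const = 0;
};

// Leaf: field == value, like {a: 1}.
struct EqExpr : MatchExpr {
    std::string field; int value;
    EqExpr(std::string f, int v) : field(std::move(f)), value(v) {}
    bool matches(const Doc& d) const override {
        auto it = d.find(field);
        return it != d.end() && it->second == value;
    }
};

// Internal node: all children must match, like {a: 1, b: 2}.
struct AndExpr : MatchExpr {
    std::vector<std::unique_ptr<MatchExpr>> children;
    bool matches(const Doc& d) const override {
        for (const auto& c : children)
            if (!c->matches(d)) return false;
        return true;
    }
};
```

Once the query is in this form, scanning a collection is just calling matches() on each candidate document, and a planner can inspect the tree to pick indexes.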
Runner* rawRunner = NULL;
// We use this a lot below.
const LiteParsedQuery& pq = cq->getParsed();
// We'll now try to get the query runner that will execute this query for us. There
// are a few cases in which we know upfront which runner we should get and, therefore,
// we shortcut the selection process here.
//
// (a) If the query is over a collection that doesn't exist, we get a special runner
// that's is so (a runner) which doesn't return results, the EOFRunner.
//
// (b) if the query is a replication's initial sync one, we get a SingleSolutinRunner
// that uses a specifically designed stage that skips extents faster (see details in
// exec/oplogstart.h)
//
// Otherwise we go through the selection of which runner is most suited to the
// query + run-time context at hand.
Status status = Status::OK();
if (collection == NULL) {
rawRunner = new EOFRunner(cq, cq->ns());
}
else if (pq.hasOption(QueryOption_OplogReplay)) {
status = getOplogStartHack(collection, cq, &rawRunner);
}
else {
// Takes ownership of cq.
size_t options = QueryPlannerParams::DEFAULT;
if (shardingState.needCollectionMetadata(pq.ns())) {
options |= QueryPlannerParams::INCLUDE_SHARD_FILTER;
}
status = getRunner(cq, &rawRunner, options);
}
The code above calls getRunner to obtain a Runner, which iterates over the collection and returns the documents matching the query.
Each Runner represents one query-execution strategy; mongod decides which Runner to use based on the parsed query, much like the strategy pattern:
IDHackRunner: used when the query can be answered directly through the "_id" index, i.e. the condition is a simple lookup on "_id".
CachedPlanRunner: used when a previously cached plan exists for this query shape.
MultiPlanRunner: used when the QueryPlanner produces multiple candidate QuerySolutions; it tries them and keeps the best.
SingleSolutionRunner: the counterpart of MultiPlanRunner, used when planning yields exactly one solution (simple queries).
SubPlanRunner: used for certain $or queries, planning each clause separately (I have not fully worked this one out).
These Runners are all fairly complex; analyzing any one of them in depth would take considerable time, covering collection scan algorithms, segmented query processing, and more. The core of mongod's query execution is encapsulated here, so a deep dive is deferred for now.
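The strategy-style relationship between these Runners can be sketched as follows. This is a simplified model with illustrative names: the real Runner::getNext returns a Runner::RunnerState and fills in a BSONObj, while here documents are plain strings and the collection is an in-memory vector:

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum RunnerState { RUNNER_ADVANCED, RUNNER_EOF };

// Common interface: each concrete Runner is one execution strategy.
struct Runner {
    virtual ~Runner() {}
    virtual RunnerState getNext(std::string* out) = 0;
};

// EOFRunner: the collection does not exist, so it never yields a result.
struct EOFRunner : Runner {
    RunnerState getNext(std::string*) override { return RUNNER_EOF; }
};

// Stand-in for the scan-based runners: walk an in-memory vector
// instead of real on-disk extents.
struct CollectionScanRunner : Runner {
    explicit CollectionScanRunner(std::vector<std::string> docs)
        : docs_(std::move(docs)) {}
    RunnerState getNext(std::string* out) override {
        if (pos_ >= docs_.size()) return RUNNER_EOF;
        *out = docs_[pos_++];
        return RUNNER_ADVANCED;
    }
    std::vector<std::string> docs_;
    std::size_t pos_ = 0;
};
```

The caller only sees the Runner interface; which concrete strategy sits behind it is decided once, up front, exactly as in the getRunner selection code above.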
// Run the query.
// bb is used to hold query results
// this buffer should contain either requested documents per query or
// explain information, but not both
BufBuilder bb(32768);
bb.skip(sizeof(QueryResult));
...
while (Runner::RUNNER_ADVANCED == (state = runner->getNext(&obj, NULL))) {
// Add result to output buffer. This is unnecessary if explain info is requested
if (!isExplain) {
bb.appendBuf((void*)obj.objdata(), obj.objsize());
}
// Count the result.
++numResults;
...
}
After obtaining the Runner, it is used to fetch the results: Runner::getNext returns the next matching document, each result is appended to the output buffer, and the buffer is then packaged into result and sent back to the client.
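The drain loop above can be sketched in a self-contained form (toy types; the real loop appends raw BSON bytes to a BufBuilder via appendBuf):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy document source: getNext returns true while there is another
// document, playing the role of RUNNER_ADVANCED.
struct DocSource {
    std::vector<std::string> docs;
    std::size_t pos = 0;
    bool getNext(std::string* out) {
        if (pos >= docs.size()) return false;   // RUNNER_EOF
        *out = docs[pos++];
        return true;                            // RUNNER_ADVANCED
    }
};

// Pull documents until EOF, append each to one output buffer, and
// count the results -- the shape of the newRunQuery result loop.
std::size_t drainToBuffer(DocSource& src, std::string& buf) {
    std::string doc;
    std::size_t numResults = 0;
    while (src.getNext(&doc)) {
        buf.append(doc);   // like bb.appendBuf(obj.objdata(), obj.objsize())
        ++numResults;
    }
    return numResults;
}
```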
At this point the overall shape of query processing is clear. I have only touched on the data-loading and query-planning parts before moving past them, mainly because my understanding there is still limited and anything I wrote now would likely be wrong. MongoDB's code changes substantially between versions; I referred to many earlier write-ups of other versions, whose authors I really admire for how thoroughly they analyzed things, but mapping their analyses onto this version's source still left me confused in places. The data-loading and query-planning parts will each get their own article once I understand them properly.