实验背景
近一年来发生几起的数据库连接被打满的情况,初步分析是应用使用连接数量“超过了”连接池(DBCP1.4)的上限,导致数据库连接被打满,其中一个结论是连接池的bug导致
问题分析
1、DBCP连接池的Bug导致连接数超过大小设置,其根本原因是在连接池初始化的时候,有可能创建多个连接池导致
protected synchronized DataSource createDataSource()
throws SQLException {
if (closed) {
throw new SQLException("Data source is closed");
}
// Return the pool if we have already created it
if (dataSource != null) {
return (dataSource);
}
// create factory which returns raw physical connections
ConnectionFactory driverConnectionFactory = createConnectionFactory();
// create a pool for our connections
createConnectionPool();
// Set up statement pool, if desired
GenericKeyedObjectPoolFactory statementPoolFactory = null;
if (isPoolPreparedStatements()) {
statementPoolFactory = new GenericKeyedObjectPoolFactory(null,
-1, // unlimited maxActive (per key)
GenericKeyedObjectPool.WHEN_EXHAUSTED_FAIL,
0, // maxWait
1, // maxIdle (per key)
maxOpenPreparedStatements);
}
// Set up the poolable connection factory
createPoolableConnectionFactory(driverConnectionFactory, statementPoolFactory, abandonedConfig);
// Create and return the pooling data source to manage the connections
createDataSourceInstance();
try {
for (int i = 0 ; i < initialSize ; i++) {
connectionPool.addObject();
}
} catch (Exception e) {
throw new SQLNestedException("Error preloading the connection pool", e);
}
return dataSource;
}
createDataSource方法里边调用createConnectionPool方法,如果后面方式执行失败,比如createPoolableConnectionFactory,当再次调用createDataSource,又会再次调用createConnectionPool,从而初始化多次连接池,下面是createConnectionPool
protected void createConnectionPool() {
// Create an object pool to contain our active connections
GenericObjectPool gop;
if ((abandonedConfig != null) && (abandonedConfig.getRemoveAbandoned())) {
gop = new AbandonedObjectPool(null,abandonedConfig);
}
else {
gop = new GenericObjectPool();
}
gop.setMaxActive(maxActive);
gop.setMaxIdle(maxIdle);
gop.setMinIdle(minIdle);
gop.setMaxWait(maxWait);
gop.setTestOnBorrow(testOnBorrow);
gop.setTestOnReturn(testOnReturn);
gop.setTimeBetweenEvictionRunsMillis(timeBetweenEvictionRunsMillis);
gop.setNumTestsPerEvictionRun(numTestsPerEvictionRun);
gop.setMinEvictableIdleTimeMillis(minEvictableIdleTimeMillis);
gop.setTestWhileIdle(testWhileIdle);
connectionPool = gop;
}
显然这个情况只能是模块启动初始化的时候产生,且连接都不会被使用,而线上出现的问题是连接都在执行sql,不是这种情况!在1.4x的这个bug已经修复
2、启动参数分析,从启动参数入手,看一下是不是连接池主动释放连接,导致正在使用的连接被释放,从而创建新的连接,给我们的现象是“连接数”超过了连接池的限制,下面是一个入库的连接池参数
partition1.driverClassName=com.mysql.jdbc.Driver
partition1.initialSize=2
partition1.maxActive=25
partition1.minIdle=2
partition1.maxIdle=5
partition1.maxWait=3000
partition1.threadPoolSize=10
partition1.logAbandoned=true
partition1.testWhileIdle=true
partition1.testOnReturn=false
partition1.testOnBorrow=true
partition1.validationQuery=select now()
//在每次空闲连接回收器线程(如果有)运行时检查的连接数量
partition1.numTestsPerEvictionRun=5
//在空闲连接回收器线程运行期间休眠的时间值,以毫秒为单位
partition1.timeBetweenEvictionRunsMillis=30000
//连接在池中保持空闲而不被空闲连接回收器线程
partition1.minEvictableIdleTimeMillis=180000
//设置了rmoveAbandoned=true 那么在getNumActive()快要到getMaxActive()的时候,系统会进行无效的Connection的回收,回收的 Connection为removeAbandonedTimeout(默认300秒)中设置的秒数后没有使用的Connection
partition1.removeAbandoned=true
//强制回收连接的时间,单位秒
partition1.removeAbandonedTimeout=18
注意connectionProperties参数
<bean id="partition[j]" class="org.apache.commons.dbcp.BasicDataSource"
destroy-method="close">
<property name="driverClassName" value="${partition[j].driverClassName}" ></property>
<property name="url" value="${partition[j].url}" ></property>
<property name="username" value="${partition[j].username}" ></property>
<property name="password" value="${partition[j].password}" ></property>
<property name="defaultAutoCommit" value="false" ></property>
<property name="maxActive" value="${partition[j].maxActive}" ></property>
<property name="maxIdle" value="${partition[j].maxIdle}" ></property>
<property name="maxWait" value="${partition[j].maxWait}" ></property>
<property name="initialSize" value="${partition[j].initialSize}" ></property>
<property name="minIdle" value="${partition[j].minIdle}" ></property>
<property name="logAbandoned" value="${partition[j].logAbandoned}" ></property>
<property name="testWhileIdle" value="${partition[j].testWhileIdle}" ></property>
<property name="testOnReturn" value="${partition[j].testOnReturn}" ></property>
<property name="testOnBorrow" value="${partition[j].testOnBorrow}" ></property>
<property name="validationQuery" value="${partition[j].validationQuery}" ></property>
<property name="numTestsPerEvictionRun" value="${partition[j].numTestsPerEvictionRun}" ></property>
<property name="timeBetweenEvictionRunsMillis" value="${partition[j].timeBetweenEvictionRunsMillis}" ></property>
<property name="minEvictableIdleTimeMillis" value="${partition[j].minEvictableIdleTimeMillis}" ></property>
<property name="removeAbandoned" value="${partition[j].removeAbandoned}" ></property>
<property name="removeAbandonedTimeout" value="${partition[j].removeAbandonedTimeout}" ></property>
<property name="connectionProperties" value="useUnicode=true;
characterEncoding=utf8;initialTimeout=1;connectTimeout=1000;socketTimeout=6000;
rewriteBatchedStatements=true;autoReconnectForPools=true;autoReconnect=true;maxReconnects=1;
failOverReadOnly=false;roundRobinLoadBalance=true;allowMultiQueries=true"></property>
</bean>
从上面参数我们重点关注removeAbandonedTimeout 这个参数的意义
//创建连接
public Object borrowObject() throws Exception {
if (config != null
&& config.getRemoveAbandoned()
&& (getNumIdle() < 2)
&& (getNumActive() > getMaxActive() - 3) ) {
removeAbandoned();
}
Object obj = super.borrowObject();
if (obj instanceof AbandonedTrace) {
((AbandonedTrace) obj).setStackTrace();
}
if (obj != null && config != null && config.getRemoveAbandoned()) {
synchronized (trace) {
trace.add(obj);
}
}
...
private void removeAbandoned() {
// Generate a list of abandoned connections to remove
long now = System.currentTimeMillis();
long timeout = now - (config.getRemoveAbandonedTimeout() * 1000);
ArrayList remove = new ArrayList();
synchronized (trace) {
Iterator it = trace.iterator();
while (it.hasNext()) {
AbandonedTrace pc = (AbandonedTrace) it.next();
if (pc.getLastUsed() > timeout) {
continue;
}
if (pc.getLastUsed() > 0) {
remove.add(pc);
}
}
}
// Now remove the abandoned connections
Iterator it = remove.iterator();
while (it.hasNext()) {
AbandonedTrace pc = (AbandonedTrace) it.next();
if (config.getLogAbandoned()) {
pc.printStackTrace();
}
try {
invalidateObject(pc);
} catch (Exception e) {
e.printStackTrace();
}
}
}
代从代码可以看出,配置了removeAbandonedOnBorrow和removeAbandonedTimeout就是触发正在使用的连接remove,假如这个时候,数据库正在执行sql,就会出现连接池已经断开连接,而数据库的连接还在执行,造成数据库的连接超过连接池的显现,如果少量慢sql执行时间大于removeAbandonedTimeout,不会出现问题,出现大量的慢sql,就会导致数据库中的这种连接越来越多,最后把数据库打满
实验过程
创建测试库
mkdir /usr/local/data/mysql
docker run -d -e MYSQL_ROOT_PASSWORD=root --name centos/mysql-57-centos7 -v /usr/local/data/mysql:/var/lib/mysql -p 3306:3306 mysql --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci --lower_case_table_names=1
创建测试库
create database test ;
构建测试类,核心代码
public void setUp() throws Exception {
ds = createDataSource();
ds.setDriverClassName("com.mysql.jdbc.Driver");
ds.setUrl("jdbc:mysql://127.0.0.1:3306/test?allowMultiQueries=true&characterEncoding=UTF-8&autoReconnect=true&failOverReadOnly=false&useSSL=false&allowPublicKeyRetrieval=true");
ds.setUsername("root");
ds.setPassword("Htbuy@2016");
ds.setMaxActive(1);
ds.setMaxWait(1000);
ds.setTestWhileIdle(true);
ds.setTestOnBorrow(true);
ds.setTestOnReturn(false);
ds.setValidationQuery("select now()");
ds.setNumTestsPerEvictionRun(5);
ds.setMinEvictableIdleTimeMillis(2000);
ds.setLogAbandoned(true);
ds.setRemoveAbandoned(true);
ds.setRemoveAbandonedTimeout(1);
}
public void testAbandoned() throws Exception {
for (int i = 0; i < 20; i++) {
Thread t = new Thread(new Runnable() {
@Override
public void run() {
try {
Connection conn = ds.getConnection();
Statement statement = conn.createStatement();
ResultSet resultSet = null;
#模拟慢sql
resultSet = statement.executeQuery("select sleep(100),now()");
while (resultSet.next()) {
System.out.println("result+" + resultSet.getString(1));
}
resultSet.close();
statement.close();
conn.close();
} catch (Exception ex) {
System.out.println(ex.getMessage());
}
System.out.println(Thread.currentThread().getName() + "---------------------- end----------------------");
}
});
t.setName(i + "");
t.start();
Thread.sleep((i + 1) * 1000);
}
System.out.println(Thread.currentThread().getName() + "---------------------- end----------------------");
Thread.sleep(1000000);
}
通过运行上面代码,并没有像我们所期望那样,超过连接的最大限制1,通过RemoveAbandoned=true和RemoveAbandonedTimeout=1 应该已经触发了连接池的Abandoned机制,但是都阻塞到下面的一行代码,DelegatingStatement的close方法
/**
* Close this DelegatingStatement, and close
* any ResultSets that were not explicitly closed.
*/
public void close() throws SQLException {
try {
try {
if (_conn != null) {
_conn.removeTrace(this);
_conn = null;
}
// The JDBC spec requires that a statment close any open
// ResultSet's when it is closed.
// FIXME The PreparedStatement we're wrapping should handle this for us.
// See bug 17301 for what could happen when ResultSets are closed twice.
List resultSets = getTrace();
if( resultSets != null) {
ResultSet[] set = (ResultSet[]) resultSets.toArray(new ResultSet[resultSets.size()]);
for (int i = 0; i < set.length; i++) {
set[i].close();
}
clearTrace();
}
//阻塞地方
_stmt.close();
}
catch (SQLException e) {
handleException(e);
}
}
finally {
_closed = true;
}
}
当我们在url设置socketTimeout=1000,这时候阻塞的地方成功执行完成,超过数据库出现大量的连接数1,问题重现!同时出现典型的日志The last packet successfully received from the server was 1,001 milliseconds ago. The last packet sent successfully to the server was 1,001 milliseconds ago.
当设置socketTimeout=1000,RemoveAbandnotallow=false ,并没有重现问题,且都是大量的等待连接超时
原因是下面的代码差异,开启RemoveAbandoned,使用的是AbandonedObjectPool,而不开启是默认的GenericObjectPool,AbandonedObjectPool增加了Abandoned逻辑
*/
protected void createConnectionPool() {
// Create an object pool to contain our active connections
GenericObjectPool gop;
if ((abandonedConfig != null) && (abandonedConfig.getRemoveAbandoned())) {
gop = new AbandonedObjectPool(null,abandonedConfig);
}
else {
gop = new GenericObjectPool();
}
gop.setMaxActive(maxActive);
gop.setMaxIdle(maxIdle);
gop.setMinIdle(minIdle);
gop.setMaxWait(maxWait);
gop.setTestOnBorrow(testOnBorrow);
gop.setTestOnReturn(testOnReturn);
gop.setTimeBetweenEvictionRunsMillis(timeBetweenEvictionRunsMillis);
gop.setNumTestsPerEvictionRun(numTestsPerEvictionRun);
gop.setMinEvictableIdleTimeMillis(minEvictableIdleTimeMillis);
gop.setTestWhileIdle(testWhileIdle);
connectionPool = gop;
}
总结
当RemoveAbandoned=true, 且执行时间超过socketTimeout ,达到RemoveAbandonedTimeout的触发点时,就会导致数据库连接数超过连接池的限制,注意这种情况关闭模块是没用的,sql还在数据库中执行,应该直接kill或者切库操作!!
优化方案
1、关闭RemoveAbandoned设置,目前wms系统的还是AP和TP混合型,有很大的几率触发这种情况
2、经过邱玉堃复核模块测试,需要关闭socketTimeout参数,引用的1.4.jar版本与源码的1.4版本存在差异
3、 注意当mysql-connector-java 版本小于5.1.45就会出现这个bug