问题现象

1. 应用日志输出:"[DUBBO] Thread pool is EXHAUSTED!"

arthas 分析dubbo线程池满了 dubbo线程池耗尽_arthas 分析dubbo线程池满了

2. 当时服务提供者的处理时间变长 

arthas 分析dubbo线程池满了 dubbo线程池耗尽_数据库_02

 

原因分析

    我们都知道线程池溢出会触发拒绝策略,dubbo的线程池拒绝策略是抛异常,但是在异常抛出前dump一份当前jvm的堆栈信息(非常棒!),默认保存在当前用户的路径下(可通过dump.directory修改dump路径)。

    粗看一下栈信息,大量的dubbo线程(dubbo提供者的线程命名规则是:DubboServerHander-ip:port-thread-id)都在等待同一把锁。 很明显,这些线程都排队在从druid的数据库连接池获取连接

"DubboServerHandler--thread-158" Id=588 WAITING on java.util.concurrent.locks.ReentrantLock$FairSync@4f8c2a0e owned by "DubboServerHandler--thread-155" Id=586
	at sun.misc.Unsafe.park(Native Method)
	-  waiting on java.util.concurrent.locks.ReentrantLock$FairSync@4f8c2a0e
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
	at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
	at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1207)
	at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:1095)
	at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1075)
	at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1065)
	at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:104)
	at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111)
	at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77)
	at org.mybatis.spring.transaction.SpringManagedTransaction.openConnection(SpringManagedTransaction.java:82)
	at org.mybatis.spring.transaction.SpringManagedTransaction.getConnection(SpringManagedTransaction.java:68)

继续往下看还有几个Dubbo线程在另一个位置竞争上面的锁,这里的线程是从数据库拿回数据后准备把"数据库连接"还给druid的连接池。

"DubboServerHandler--thread-157" Id=600 WAITING on java.util.concurrent.locks.ReentrantLock$FairSync@4f8c2a0e owned by "DubboServerHandler--thread-155" Id=586
	at sun.misc.Unsafe.park(Native Method)
	-  waiting on java.util.concurrent.locks.ReentrantLock$FairSync@4f8c2a0e
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
	at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
	at com.alibaba.druid.pool.DruidDataSource.recycle(DruidDataSource.java:1434)
	at com.alibaba.druid.pool.DruidPooledConnection.recycle(DruidPooledConnection.java:320)
	at com.alibaba.druid.pool.DruidPooledConnection.close(DruidPooledConnection.java:265)
	at org.springframework.jdbc.datasource.DataSourceUtils.doCloseConnection(DataSourceUtils.java:343)
	at org.springframework.jdbc.datasource.DataSourceUtils.doReleaseConnection(DataSourceUtils.java:330)
	at org.springframework.jdbc.datasource.DataSourceUtils.releaseConnection(DataSourceUtils.java:296)
	at org.mybatis.spring.transaction.SpringManagedTransaction.close(SpringManagedTransaction.java:127)
	at org.apache.ibatis.executor.BaseExecutor.close(BaseExecutor.java:90)
	at org.apache.ibatis.executor.CachingExecutor.close(CachingExecutor.java:64)
	at sun.reflect.GeneratedMethodAccessor556.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:63)
	at com.sun.proxy.$Proxy210.close(Unknown Source)
	at org.apache.ibatis.session.defaults.DefaultSqlSession.close(DefaultSqlSession.java:264)
	at org.mybatis.spring.SqlSessionUtils.closeSqlSession(SqlSessionUtils.java:193)
	at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:454)
	at com.sun.proxy.$Proxy133.selectOne(Unknown Source)
	at org.mybatis.spring.SqlSessionTemplate.selectOne(SqlSessionTemplate.java:166)

心里大概有点谱了,再往下看,发现部分的dubbo线程在等待一个条件变量,看druid的源码,这里的线程先是获取到了锁,然后去"druid连接池"拿"数据库连接"时发现连接池是空的,所以在"条件变量notEmpty"等待并释放锁,同时唤醒创建"数据库连接"的线程

"DubboServerHandler--thread-156" Id=587 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5f99b4a9
	at sun.misc.Unsafe.park(Native Method)
	-  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5f99b4a9
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
	at com.alibaba.druid.pool.DruidDataSource.pollLast(DruidDataSource.java:1664)
	at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1225)
	at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:1095)
	at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1075)
	at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1065)
	at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:104)
	at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111)
	at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77)
	at org.mybatis.spring.transaction.SpringManagedTransaction.openConnection(SpringManagedTransaction.java:82)
	at org.mybatis.spring.transaction.SpringManagedTransaction.getConnection(SpringManagedTransaction.java:68)
	at org.apache.ibatis.executor.BaseExecutor.getConnection(BaseExecutor.java:338)

很明显,很多线程都在等待拿"数据库连接",连接池中连接不够,肯定要创建新的连接吧,继续看发现druid有一个线程是创建连接,而很不幸这个线程也在竞争这把锁

"Druid-ConnectionPool-Create-636701037" Id=476 WAITING on java.util.concurrent.locks.ReentrantLock$FairSync@4f8c2a0e owned by "DubboServerHandler-ip:port-thread-155" Id=586
	at sun.misc.Unsafe.park(Native Method)
	-  waiting on java.util.concurrent.locks.ReentrantLock$FairSync@4f8c2a0e
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
	at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
	at com.alibaba.druid.pool.DruidDataSource$CreateConnectionThread.run(DruidDataSource.java:2068)

看到这里,大家肯定很好奇,这么多线程都在竞争这把锁或者在条件变量上等待,那么这把锁被哪个线程获取了呢,它又在干什么。我们看到这个拿到锁的线程,去连接池拿连接,但是同样连接池是空的,所以它正在发送empty信号唤醒创建"数据库连接"的线程。前面也看到了,创建连接的线程也在竞争锁。

"DubboServerHandler-ip:port-thread-155" Id=586 RUNNABLE
	at com.alibaba.druid.pool.DruidDataSource.emptySignal(DruidDataSource.java:3081)
	at com.alibaba.druid.pool.DruidDataSource.pollLast(DruidDataSource.java:1646)
	at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1225)
	at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:1095)
	at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1075)
	at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1065)
	at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:104)
	at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111)
	at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77)
	at org.mybatis.spring.transaction.SpringManagedTransaction.openConnection(SpringManagedTransaction.java:82)
	...

	Number of locked synchronizers = 2
	- java.util.concurrent.locks.ReentrantLock$FairSync@4f8c2a0e
	- java.util.concurrent.ThreadPoolExecutor$Worker@27eae930

    Dubbo默认线程数是200,出现问题时158个线程在从连接池获取连接时因为竞争锁而阻塞在锁的同步队列上;9个线程从数据库返回数据准备释放连接时同样因为竞争锁而阻塞在同步队列上;32个线程拿数据连接拿不到而等待在notempty的条件变量上;1个线程占有锁但是其发现连接池没有数据库连接正在发送empty信号唤醒创建连接的线程,其实信号发送完以后它自己就会在notempty的条件变量上等待。

    Druid默认只有一个创建连接的线程,此时它也因为竞争锁而阻塞。

    我们假设一下,占有锁的那个线程发送完empty信号后,自己就会在notempty上等待并且释放锁,那么下一个竞争到锁的大概率还是158个等待获取连接的线程竞争到(数量上碾压准备释放连接的线程和创建连接的线程)。

    即便是准备释放连接的线程,那么只有1个线程能拿到连接,然后呢,这个线程从数据库拿到数据后又得竞争锁去释放连接

   或者创建连接的线程竞争到锁,那么它创建1个数据库连接,并且放入到池子中同时发送notempty信号,唤醒1个等待notempty上的线程,释放锁,然后就是159个获取连接的线程和9个释放连接的线程之间竞争锁,结果可想而知。

 

解决方案

    造成这样一个局面的是请求突然过来,并且数量远远大于连接池现有的连接数,连接池中保有的连接数通常跟initialSize、minIdle、keepAlive、maxEvictableIdleTimeMillis这4个参数有关,其中initialSize、minIdle最为相关。同时还跟系统的请求量有关,如果系统的请求量一直比较平稳,那么连接池中连接数也将维持在一个稳定的水平。

参数

默认值

说明

initialSize

0

初始化数据源连接池时初始创建的连接数量

minIdle

0

最小空闲连接的数量,如果keepAlive=true,初始化连接不小于minIdle

keepAlive

false

是否开启连接保活特性。如果keepAlive为true,那么初始化的连接将不小于minIdle。保活特性就是定时对空闲连接中不多于minIdle的连接进行检测,同时设置活跃时间,避免因为超过最大可空闲时间而被关闭。

maxEvictableIdleTimeMillis

1000L * 60L * 60L * 7 即7小时

最大可空闲时间。如果一个连接空闲的时间超过这个时间将被关闭

    调整initialSize和minIdle值能缓解甚至避免这个问题(一般这两个值设为同一个值),如何估算initialSize和minIdle?参考峰值QPS,二八原则很玄幻,我们假设请求量也最少二八原则,80%的请求在20%的时间内过来,那么依次估算峰值QPS*0.8的请求就是同时过来的请求量。然后呢,我们确保同时过来的请求有一半能够立马获取到连接,那么当这个数据返回时,释放连接的线程和等待连接的线程各50%的概率竞争到锁。建议就按这个公式:峰值QPS * 0.8 / 2 / 机器数 。具体的可上下调整,宜大不宜小。