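Conclusion
Cause 1: configuration problem. The recordDataTTL and otherMetricsDataTTL settings in the configuration file do not take effect, which can be considered a bug.
Solution: Method 1: set minuteMetricsDataTTL, hourMetricsDataTTL and dayMetricsDataTTL explicitly. Record data is then deleted according to the value of dayMetricsDataTTL. Method 2: modify the source code.
Cause 2: Skywalking bug. In skywalking-6.2.0, index deletion is broken when nameSpace is set; fixing it requires changing the source and recompiling.
Solution: Method 1: set the namespace to empty. Method 2: modify the source code.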
Environment
Skywalking version: 6.2.0
ES instances: three instances, 4 cores * 14G each, started with docker
OAPServer: a single instance, 1500M
agent nodes (i.e. JVM instances): about 50
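Troubleshooting process
1. Configuration problem
Reading the source code leads to the core logic that deletes historical ES data (the deleteHistory method). For each model (e.g. Segment, the various Metrics) it first computes a cutoff time from the model's Downsampling setting and the DataTTLConfig; any index older than that cutoff has to be deleted.
Downsampling is an enum: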
public enum Downsampling {
    None(0, ""), Second(1, "second"), Minute(2, "minute"), Hour(3, "hour"), Day(4, "day"), Month(5, "month");

    private final int value;
    private final String name;

    Downsampling(int value, String name) {
        this.value = value;
        this.name = name;
    }

    public int getValue() {
        return value;
    }

    public String getName() {
        return name;
    }
}
DataTTLConfig holds the expiry time configured for each data type, i.e. records and metrics:
@Setter
@Getter
public class DataTTLConfig {
    private int recordDataTTL;
    private int minuteMetricsDataTTL;
    private int hourMetricsDataTTL;
    private int dayMetricsDataTTL;
    private int monthMetricsDataTTL;
}
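Back to the deleteHistory logic: the key part is how the cutoff time timeBefore is calculated. The cutoff depends only on the model's Downsampling and on the DataTTLConfig.
The implementation of StorageTTL is ElasticsearchStorageTTL, whose only job is to return the TTLCalculator that matches a given Downsampling. Take the TTLCalculator implementation EsMinuteTTLCalculator as an example: it computes the cutoff from the current time and the minuteMetricsDataTTL value of DataTTLConfig, with the unit being days, while EsHourTTLCalculator uses hourMetricsDataTTL. Each TTLCalculator therefore corresponds to one field of DataTTLConfig.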
public class ElasticsearchStorageTTL implements StorageTTL {
    @Override public TTLCalculator calculator(Downsampling downsampling) {
        switch (downsampling) {
            case Month:
                return new MonthTTLCalculator();
            case Hour:
                return new EsHourTTLCalculator();
            case Minute:
                return new EsMinuteTTLCalculator();
            default:
                return new DayTTLCalculator();
        }
    }
}

public class EsMinuteTTLCalculator implements TTLCalculator {
    @Override public long timeBefore(DateTime currentTime, DataTTLConfig dataTTLConfig) {
        return Long.valueOf(currentTime.plusDays(0 - dataTTLConfig.getMinuteMetricsDataTTL()).toString("yyyyMMdd"));
    }
}
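To make the cutoff arithmetic concrete, here is a minimal sketch of the same idea using java.time instead of Joda. The class CutoffSketch, the method names and the "-yyyyMMdd" index suffix are assumptions made for illustration; this is not the Skywalking source.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for the cutoff check; not the Skywalking code.
class CutoffSketch {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Same idea as TTLCalculator.timeBefore: today minus ttlDays, encoded as a yyyyMMdd number.
    static long timeBefore(LocalDate today, int ttlDays) {
        return Long.parseLong(today.minusDays(ttlDays).format(DAY));
    }

    // Assumes every index name ends with "-yyyyMMdd"; keeps the ones strictly older than the cutoff.
    static List<String> expired(List<String> indexNames, long timeBefore) {
        List<String> result = new ArrayList<>();
        for (String name : indexNames) {
            long suffix = Long.parseLong(name.substring(name.lastIndexOf('-') + 1));
            if (suffix < timeBefore) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long cutoff = timeBefore(LocalDate.of(2019, 7, 10), 2);
        System.out.println(cutoff); // 20190708
        List<String> indices = Arrays.asList("segment-20190707", "segment-20190708", "segment-20190709");
        System.out.println(expired(indices, cutoff)); // [segment-20190707]
    }
}
```

With a TTL of 2 days, only indices dated before the cutoff day are selected; the index for the cutoff day itself is kept.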
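This also explains why the recordDataTTL setting has no effect. The Downsampling of Record types is Second, but as shown above ElasticsearchStorageTTL has no case for Second, so Second falls through to the default branch and gets a DayTTLCalculator. DayTTLCalculator uses the dayMetricsDataTTL value of DataTTLConfig, so recordDataTTL is never consulted.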
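The fall-through can be reproduced with a stripped-down version of the switch. In this sketch the class TtlFallbackSketch and the method ttlFieldFor are hypothetical; instead of returning a calculator, the method returns the name of the DataTTLConfig field that the corresponding calculator would read (the Month mapping is assumed by analogy with the others).

```java
// Simplified stand-ins for the classes quoted above; only the shape of the switch is kept.
class TtlFallbackSketch {
    enum Downsampling { None, Second, Minute, Hour, Day, Month }

    // Returns the DataTTLConfig field that the selected calculator would use.
    static String ttlFieldFor(Downsampling downsampling) {
        switch (downsampling) {
            case Month:
                return "monthMetricsDataTTL";
            case Hour:
                return "hourMetricsDataTTL";
            case Minute:
                return "minuteMetricsDataTTL";
            default:
                // No case for Second, so record data ends up here together with Day.
                return "dayMetricsDataTTL";
        }
    }

    public static void main(String[] args) {
        for (Downsampling d : Downsampling.values()) {
            System.out.println(d + " -> " + ttlFieldFor(d));
        }
    }
}
```

Nothing in the switch ever maps to recordDataTTL, which matches the behaviour observed in the running system.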
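Next we only need to find out where DataTTLConfig comes from. In deleteHistory it is read from the ConfigService of CoreModule, which means it originates from the core module section of application.yml. However, when StorageModuleElasticsearchProvider starts, it overwrites the DataTTLConfig of CoreModule with the values from StorageModuleElasticsearchConfig: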
org.apache.skywalking.oap.server.storage.plugin.elasticsearch.StorageModuleElasticsearchProvider
private void overrideCoreModuleTTLConfig() {
    ConfigService configService = getManager().find(CoreModule.NAME).provider().getService(ConfigService.class);
    configService.getDataTTLConfig().setRecordDataTTL(config.getRecordDataTTL());
    configService.getDataTTLConfig().setMinuteMetricsDataTTL(config.getMinuteMetricsDataTTL());
    configService.getDataTTLConfig().setHourMetricsDataTTL(config.getHourMetricsDataTTL());
    configService.getDataTTLConfig().setDayMetricsDataTTL(config.getDayMetricsDataTTL());
    configService.getDataTTLConfig().setMonthMetricsDataTTL(config.getMonthMetricsDataTTL());
}
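Now look at StorageModuleElasticsearchConfig, focusing on the settings related to otherMetricsDataTTL. The code below shows that the author intended minuteMetricsDataTTL, hourMetricsDataTTL and dayMetricsDataTTL to be filled in automatically whenever otherMetricsDataTTL is assigned. But at startup the system writes the Field directly through reflection, so setOtherMetricsDataTTL is never invoked. That is why setting otherMetricsDataTTL in the configuration file has no effect: the system simply keeps the default value of 2.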
@Getter
public class StorageModuleElasticsearchConfig extends ModuleConfig {
    @Setter private int recordDataTTL = 7;
    @Setter private int minuteMetricsDataTTL = 2;
    @Setter private int hourMetricsDataTTL = 2;
    @Setter private int dayMetricsDataTTL = 2;
    private int otherMetricsDataTTL = 0;
    @Setter private int monthMetricsDataTTL = 18;

    public void setOtherMetricsDataTTL(int otherMetricsDataTTL) {
        if (otherMetricsDataTTL > 0) {
            minuteMetricsDataTTL = otherMetricsDataTTL;
            hourMetricsDataTTL = otherMetricsDataTTL;
            dayMetricsDataTTL = otherMetricsDataTTL;
        }
    }
}
The code that populates the Config through reflection at startup:
org.apache.skywalking.oap.server.library.module.ModuleDefine
private void copyProperties(ModuleConfig dest, Properties src, String moduleName,
    String providerName) throws IllegalAccessException {
    if (dest == null) {
        return;
    }
    Enumeration<?> propertyNames = src.propertyNames();
    while (propertyNames.hasMoreElements()) {
        String propertyName = (String)propertyNames.nextElement();
        Class<? extends ModuleConfig> destClass = dest.getClass();
        try {
            Field field = getDeclaredField(destClass, propertyName);
            field.setAccessible(true);
            field.set(dest, src.get(propertyName));
        } catch (NoSuchFieldException e) {
            logger.warn(propertyName + " setting is not supported in " + providerName + " provider of " + moduleName + " module");
        }
    }
}
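The effect of writing the field directly can be shown in isolation. The sketch below is an assumption-laden reduction: ReflectionBypassSketch and EsConfig are hypothetical names, Lombok is removed, and only two fields are kept. It contrasts a value written through Field.set with the same value passed through the setter.

```java
import java.lang.reflect.Field;

// Hypothetical reduction of the config class above; not the Skywalking source.
class ReflectionBypassSketch {
    static class EsConfig {
        int minuteMetricsDataTTL = 2;
        int otherMetricsDataTTL = 0;

        // Same shape as the original setter: it only propagates the value.
        void setOtherMetricsDataTTL(int value) {
            if (value > 0) {
                minuteMetricsDataTTL = value;
            }
        }
    }

    // Mimics copyProperties: the field is written directly, the setter is never called.
    static EsConfig loadByField(int configured) {
        EsConfig config = new EsConfig();
        try {
            Field field = EsConfig.class.getDeclaredField("otherMetricsDataTTL");
            field.setAccessible(true);
            field.set(config, configured);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        return config;
    }

    public static void main(String[] args) {
        EsConfig viaField = loadByField(45);
        System.out.println(viaField.otherMetricsDataTTL);  // 45
        System.out.println(viaField.minuteMetricsDataTTL); // 2, the default survives

        EsConfig viaSetter = new EsConfig();
        viaSetter.setOtherMetricsDataTTL(45);
        System.out.println(viaSetter.minuteMetricsDataTTL); // 45
    }
}
```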
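Solutions for the configuration problem
Method 1: set the minuteMetricsDataTTL, hourMetricsDataTTL and dayMetricsDataTTL parameters directly in the configuration file instead of relying on the otherMetricsDataTTL entry that ships with it. Record data is then deleted according to the value of dayMetricsDataTTL.
Method 2: modify the source code so that setOtherMetricsDataTTL is called explicitly.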
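One possible shape for Method 2 is sketched below, under the assumption that a hook can be run once after the config object has been populated by reflection. The names ManualSetterSketch, EsConfig and applyOtherMetricsDataTTL are hypothetical, and where exactly such a call would live in the Skywalking code base is left open.

```java
// Hypothetical sketch of "call the setter manually"; not an actual Skywalking patch.
class ManualSetterSketch {
    static class EsConfig {
        int minuteMetricsDataTTL = 2;
        int hourMetricsDataTTL = 2;
        int dayMetricsDataTTL = 2;
        int otherMetricsDataTTL = 0;

        void setOtherMetricsDataTTL(int value) {
            if (value > 0) {
                minuteMetricsDataTTL = value;
                hourMetricsDataTTL = value;
                dayMetricsDataTTL = value;
            }
        }
    }

    // Re-applies the value that reflection wrote, so the setter's side effects finally run.
    static void applyOtherMetricsDataTTL(EsConfig config) {
        config.setOtherMetricsDataTTL(config.otherMetricsDataTTL);
    }

    public static void main(String[] args) {
        EsConfig config = new EsConfig();
        config.otherMetricsDataTTL = 45; // the state reflection leaves behind
        applyOtherMetricsDataTTL(config);
        System.out.println(config.dayMetricsDataTTL); // 45
    }
}
```

Note that with this approach a positive otherMetricsDataTTL overwrites any explicitly configured minute, hour or day value, so the two styles of configuration should not be mixed.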
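2. The index deletion bug
This problem is fairly obvious. In deleteHistory, the indices are looked up by alias, and every index found to be expired is passed to the deletion logic. The fault is in deleteIndex. As shown below, it prepends the namespace to the incoming indexName before deleting. But the indexName passed in already contains the namespace, because it was read straight from ES via the alias. Prepending the namespace a second time produces a name that does not exist, so the index is not found and the deletion fails.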
public boolean deleteIndex(String indexName) throws IOException {
    indexName = formatIndexName(indexName);
    DeleteIndexRequest request = new DeleteIndexRequest(indexName);
    DeleteIndexResponse response;
    response = client.indices().delete(request);
    logger.debug("delete {} index finished, isAcknowledged: {}", indexName, response.isAcknowledged());
    return response.isAcknowledged();
}

public String formatIndexName(String indexName) {
    if (StringUtils.isNotEmpty(namespace)) {
        return namespace + "_" + indexName;
    }
    return indexName;
}
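The double prefix is easy to reproduce without an ES cluster. In this sketch the class NamespacePrefixSketch, the namespace "prod" and the index name are made-up values for illustration; only formatIndexName follows the logic quoted above.

```java
// Hypothetical stand-in for the ES client wrapper; only the naming logic is reproduced.
class NamespacePrefixSketch {
    private final String namespace;

    NamespacePrefixSketch(String namespace) {
        this.namespace = namespace;
    }

    // Same rule as the original: prepend "namespace_" when a namespace is configured.
    String formatIndexName(String indexName) {
        if (namespace != null && !namespace.isEmpty()) {
            return namespace + "_" + indexName;
        }
        return indexName;
    }

    public static void main(String[] args) {
        NamespacePrefixSketch client = new NamespacePrefixSketch("prod");
        // Name as it would come back from the alias lookup: the namespace is already included.
        String fromEs = "prod_segment-20190707";
        System.out.println(client.formatIndexName(fromEs)); // prod_prod_segment-20190707
    }
}
```

With an empty namespace the name passes through unchanged, which is why clearing the namespace works around the bug.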
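The fix is simple: add a deleteIndexWithFullIndexName method that deletes by the name exactly as given, and have the history deletion code call deleteIndexWithFullIndexName directly instead of deleteIndex.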
public boolean deleteIndex(String indexName) throws IOException {
    String fullIndexName = formatIndexName(indexName);
    return deleteIndexWithFullIndexName(fullIndexName);
}

public boolean deleteIndexWithFullIndexName(String fullIndexName) throws IOException {
    DeleteIndexRequest request = new DeleteIndexRequest(fullIndexName);
    DeleteIndexResponse response;
    response = client.indices().delete(request);
    logger.debug("delete {} index finished, isAcknowledged: {}", fullIndexName, response.isAcknowledged());
    return response.isAcknowledged();
}