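1. Background
In recommender systems the usual targets are CTR and CVR. Both are heavily imbalanced: positives are far rarer than negatives. Without sampling, the model easily becomes biased, which leads to unstable online performance and poor generalization.
2. Sample processing and sampling
2.1 Sample cleaning
- Raw behavior logs can contain small defects for various reasons (tracking instrumentation, delayed reporting, and so on). The most common one is the same record appearing in both the positive and the negative samples. This is easy to overlook. The impact is usually small because the volume is low, but it is still better to remove it.
- For such duplicates, drop the negative copy and keep the positive one; if you have plenty of samples, dropping both is also fine.
- When joining features, different time windows can yield different values for the same feature (numeric features such as click counts); take the maximum.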
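The two cleaning rules above can be sketched as follows. This is a minimal illustration on plain Scala collections rather than Spark, so it runs standalone. The `Record` case class, its fields (`key`, `label`, `clicks`), and `cleanSamples` are hypothetical names, and `key` is assumed to identify one impression uniquely.

```scala
// Hypothetical record layout: `key` identifies one impression,
// `label` is 1 (positive) or 0 (negative), `clicks` is a numeric feature.
final case class Record(key: String, label: Int, clicks: Long)

object SampleCleaning {
  // For each key: if it appears with both labels, keep the positive label;
  // if the feature value differs between copies, keep the maximum.
  def cleanSamples(records: Seq[Record]): Seq[Record] =
    records
      .groupBy(_.key)
      .values
      .map { dup =>
        Record(dup.head.key, dup.map(_.label).max, dup.map(_.clicks).max)
      }
      .toSeq
      .sortBy(_.key) // stable order, only to make the output easy to inspect
}
```

On an RDD the same rule could presumably be expressed with `reduceByKey` over `(key, record)` pairs, merging two copies by taking the larger label and the larger feature value.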
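2.2 Random sampling
This is the most common and most direct approach. It is usually applied to the negative samples: keep a random subset of them.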
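// labelIdx: column index of the label.
// ratio: target number of negatives per positive, supplied by the caller.
val labelIdx = getIndex(sampleFormat, "label")
// limit = -1 keeps trailing empty columns so mkString can restore the row.
val sample = rdd.map(_.split("\t", -1))
val posRdd = sample.filter(x => x(labelIdx) == "1")
val negRdd = sample.filter(x => x(labelIdx) == "0")
val posNum = posRdd.count()
val negNum = negRdd.count()
// Keep at most posNum * ratio negatives; compute in Double to avoid integer division.
val targetNegRatio =
  if (negNum == 0) 0.0 else math.min(negNum.toDouble, posNum * ratio.toDouble) / negNum
// Sample negatives without replacement, with a fixed seed for reproducibility.
val targetNegRdd = negRdd.sample(false, targetNegRatio, 2021)
val targetSampleRdd = posRdd.union(targetNegRdd).map(x => x.mkString("\t"))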
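The keep fraction for negatives is the only arithmetic in this step, and it is easy to get wrong when the counts are integers. Below is a small sketch, assuming `ratio` means the desired number of negatives per positive (the source does not define it); `negKeepFraction` is a hypothetical helper name.

```scala
object NegativeSampling {
  // Fraction of negatives to keep so that, in expectation,
  // the number of kept negatives is about min(negNum, posNum * ratio).
  // posNum * ratio is a Double, so the division does not truncate
  // the way a Long / Long division would.
  def negKeepFraction(posNum: Long, negNum: Long, ratio: Double): Double =
    if (negNum <= 0L) 0.0
    else math.min(negNum.toDouble, posNum * ratio) / negNum
}
```

For example, with 1,000 positives, 50,000 negatives, and `ratio = 5.0`, the fraction is 5,000 / 50,000 = 0.1. With only 3,000 negatives the cap applies and every negative is kept.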
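2.3 User-level sampling
User-level sampling is finer-grained than random sampling. For each user, the ratio between the positive and negative samples that user contributes is capped at a threshold $\alpha$. For a user with no positive samples, one of that user's samples is kept with probability $\beta$. Both thresholds are yours to choose.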
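A minimal sketch of this rule on plain Scala collections, under one reading of the text: `alpha` is taken as the maximum number of negatives kept per positive for a single user, and `beta` as the probability of keeping one sample for a user with no positives. The row layout `(uid, label, payload)` and the name `sampleByUser` are hypothetical.

```scala
import scala.util.Random

object UserLevelSampling {
  // One row = (uid, label, payload); label is "1" or "0". Layout is hypothetical.
  type Row = (String, String, String)

  // alpha: max negatives kept per positive for a single user (assumed meaning).
  // beta:  probability of keeping one sample for a user with no positives.
  def sampleByUser(rows: Seq[Row], alpha: Double, beta: Double, seed: Long): Seq[Row] = {
    val rng = new Random(seed)
    // Sort users so the result is reproducible for a given seed.
    rows.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (_, userRows) =>
      val (pos, neg) = userRows.partition(_._2 == "1")
      if (pos.nonEmpty) {
        // Cap this user's negatives at alpha * (number of positives).
        val maxNeg = math.floor(pos.size * alpha).toInt
        pos ++ rng.shuffle(neg).take(maxNeg)
      } else if (rng.nextDouble() < beta) {
        // No positives: keep one randomly chosen sample with probability beta.
        Seq(neg(rng.nextInt(neg.size)))
      } else Seq.empty[Row]
    }
  }
}
```

On Spark the per-user grouping would likely be done with `groupByKey` on `uid`; very active users can then produce skewed partitions, so a per-user cap applied before grouping may be worth considering.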
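2.4 Click-level sampling
Negative samples are produced only for users who have at least one click (positive sample). A user with no positive samples contributes no negative samples either.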
val labelIdx = getIndex(sampleFormat, "label")
val uidIdx = getIndex(sampleFormat, "uid")
// limit = -1 keeps trailing empty columns so mkString can restore the row.
val sample = rdd.map(_.split("\t", -1))
val posRdd = sample.filter(x => x(labelIdx) == "1").map(x => (x(uidIdx), x))
val negRdd = sample.filter(x => x(labelIdx) == "0").map(x => (x(uidIdx), x))
// Users with at least one positive sample.
val posUidRdd = posRdd.keys.distinct().map(uid => (uid, 1))
// Keep every positive; keep negatives only for users that have a positive.
// A semi-join on uid keeps each sample at most once.
val _targetPosRdd = posRdd.values
val _targetNegRdd = negRdd.join(posUidRdd).map(x => x._2._1)
val posNum = _targetPosRdd.count()
val negNum = _targetNegRdd.count()
// Compute in Double to avoid integer division.
val targetNegRatio =
  if (negNum == 0) 0.0 else math.min(negNum.toDouble, posNum * ratio.toDouble) / negNum
val targetNegRdd = _targetNegRdd.sample(false, targetNegRatio, 2021)
val targetSampleRdd = _targetPosRdd.union(targetNegRdd).map(x => x.mkString("\t"))
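The filtering rule itself is easy to check on a toy input. The sketch below uses plain Scala collections and a hypothetical `(uid, label)` row layout; `keepClickUsers` is an illustrative name. It keeps each sample at most once, whereas an inner join of positives and negatives on `uid` emits one output per matching pair, so a user with several clicks and several non-clicks would have samples repeated.

```scala
object ClickLevelSampling {
  // One row = (uid, label); label is "1" (click) or "0" (no click).
  type Row = (String, String)

  // Keep all positives, and keep negatives only for users with at least one positive.
  def keepClickUsers(rows: Seq[Row]): Seq[Row] = {
    val clickUids: Set[String] = rows.collect { case (uid, "1") => uid }.toSet
    rows.filter { case (uid, label) => label == "1" || clickUids.contains(uid) }
  }
}
```

Collecting the click users into an in-memory `Set` only works when that set is small; on large data a join against the distinct click users is the usual substitute.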
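2.5 Scene sample reweighting
A single recommendation scene often has little traffic. Training on that scene's data alone may fail to converge and give poor results.
A common approach is to train on site-wide data, which gives the model enough samples to train properly. This brings its own problem: if user behavior is distributed differently across scenes, the site-wide data skews the model for the target scene. One way to reduce this effect is to reweight the target scene's samples.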
val labelIdx = getIndex(sampleFormat, "label")
val sceneIdx = getIndex(sampleFormat, "scene")
// limit = -1 keeps trailing empty columns so mkString can restore the row.
val sample = rdd.map(_.split("\t", -1))
// Oversample the target scene: with replacement, each row appears
// reweightRatio times in expectation.
val sceneSample = sample.filter(x => x(sceneIdx) == scene)
  .sample(true, reweightRatio, 2021)
val otherSample = sample.filter(x => x(sceneIdx) != scene)
val scenePosSample = sceneSample.filter(x => x(labelIdx) == "1")
val sceneNegSample = sceneSample.filter(x => x(labelIdx) == "0")
val otherPosSample = otherSample.filter(x => x(labelIdx) == "1")
val otherNegSample = otherSample.filter(x => x(labelIdx) == "0")
val scenePosNum = scenePosSample.count()
val sceneNegNum = sceneNegSample.count()
val otherPosNum = otherPosSample.count()
val otherNegNum = otherNegSample.count()
val targetPosNum = scenePosNum + otherPosNum
val totalNegNum = sceneNegNum + otherNegNum
// Target negatives = positives * ratio; compute in Double and cap the fraction at 1.0.
val targetNegNum = targetPosNum * ratio.toDouble
val targetNegRatio = if (totalNegNum == 0) 0.0 else math.min(1.0, targetNegNum / totalNegNum)
val targetPosSample = scenePosSample.union(otherPosSample)
val targetNegSample = sceneNegSample.union(otherNegSample)
  .sample(false, targetNegRatio, 2021)
val targetSample = targetPosSample.union(targetNegSample)
  .map(x => x.mkString("\t"))
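2.6 Conversion sample reweighting
Many recommendation scenes have several core business metrics, such as clicks, conversions, likes, and favorites. Click samples are usually the most numerous, and samples of the other behaviors are scarce. If only a CTR model is trained, consider using the other behaviors to reweight the click samples. This changes the actual behavior distribution of the samples, but it deliberately adds more high-quality samples and often helps.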
val labelIdx = getIndex(sampleFormat, "label")
val orderIdx = getIndex(sampleFormat, "order")
// limit = -1 keeps trailing empty columns so mkString can restore the row.
val sample = rdd.map(_.split("\t", -1))
// Oversample rows that led to an order: with replacement, each row appears
// reweightRatio times in expectation.
val orderSample = sample.filter(x => x(orderIdx) == "1")
  .sample(true, reweightRatio, 2021)
val otherSample = sample.filter(x => x(orderIdx) != "1")
val orderPosSample = orderSample.filter(x => x(labelIdx) == "1")
val orderNegSample = orderSample.filter(x => x(labelIdx) == "0")
val otherPosSample = otherSample.filter(x => x(labelIdx) == "1")
val otherNegSample = otherSample.filter(x => x(labelIdx) == "0")
val posSample = orderPosSample.union(otherPosSample)
val negSample = orderNegSample.union(otherNegSample)
val posNum = posSample.count()
val negNum = negSample.count()
// Keep about posNum * sampleRatio negatives; compute in Double and cap the fraction at 1.0.
val targetNegRatio =
  if (negNum == 0) 0.0 else math.min(1.0, posNum * sampleRatio.toDouble / negNum)
val targetSample = negSample.sample(false, targetNegRatio, 2021)
  .union(posSample)
  .map(x => x.mkString("\t"))