我有两个数据帧,两个数据帧都包含一个不规则间隔的毫秒分辨率时间戳列。我的目标是匹配行,以便对于每个匹配的行,1)第一个时间戳总是小于或等于第二个时间戳,2)匹配的时间戳对于满足1)的所有时间戳对最接近
关键函数名称位searchsort(),这是一个在numpy下的函数,使用的方法如下所示:
import pandas as pd import numpy as np np.random.seed(0) base = np.array(["2013-01-01 00:00:00"], "datetime64[ns]") a = (np.random.rand(30)*1000000*1000).astype(np.int64)*1000000 t1 = base + a t1.sort() b = (np.random.rand(10)*1000000*1000).astype(np.int64)*1000000 t2 = base + b t2.sort()
使用 searchsorted() 找到索引 t1 对于每一个 t2:
idx = np.searchsorted(t1, t2) - 1 mask = idx >= 0 df = pd.DataFrame({"t1":t1[idx][mask], "t2":t2[mask]})
输出如下所示:
t1 t2 0 2013-01-02 06:49:13.287000 2013-01-03 16:29:15.612000 1 2013-01-05 16:33:07.211000 2013-01-05 21:42:30.332000 2 2013-01-07 04:47:24.561000 2013-01-07 04:53:53.948000 3 2013-01-07 14:26:03.376000 2013-01-07 17:01:35.722000 4 2013-01-07 14:26:03.376000 2013-01-07 18:22:13.996000 5 2013-01-07 14:26:03.376000 2013-01-07 18:33:55.497000 6 2013-01-08 02:24:54.113000 2013-01-08 12:23:40.299000 7 2013-01-08 21:39:49.366000 2013-01-09 14:03:53.689000 8 2013-01-11 08:06:36.638000 2013-01-11 13:09:08.078000
此外,关于本文中提到的关键的函数,这篇文章也有稍微的讲解: