Writing Spark data to Redis and reading Redis from Spark

Reposted

mob64ca14122c74 2024-01-02 12:05:09


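Contents

1. Master
2. Worker
3. Create on the Master, iterate on the Workers
4. Iterate by partition on the Workers
5. Use a static field, iterate by partition
6. Use a singleton, iterate by partition
7. Use a singleton, define on the Driver, iterate by partition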

I ran into this kind of problem over the past few days and looked up some material online; these are my notes.

1. Master

Collect all of the data back to the master (the Spark driver) and process it there in one place.

Connection pool code:

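import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

// Logs is the author's own logging helper (not shown in the article).
public class TestRedisPool {
	private JedisPool pool = null;
	public TestRedisPool(String ip, int port, String passwd, int database) {
		if (pool == null) {
			JedisPoolConfig config = new JedisPoolConfig();
			config.setMaxTotal(500);             // maximum number of connections
			config.setMaxIdle(30);               // maximum number of idle connections
			config.setMinIdle(5);                // minimum number of idle connections
			config.setMaxWaitMillis(1000 * 10);  // wait at most 10 s for a free connection
			config.setTestWhileIdle(false);
			config.setTestOnBorrow(false);
			config.setTestOnReturn(false);
			// 10000 is the connection timeout in milliseconds
			pool = new JedisPool(config, ip, port, 10000, passwd, database);
			Logs.debug("init:" + pool);
		}
	}
	public JedisPool getRedisPool() {
		return pool;
	}
	public String set(String key,String value){
		Jedis jedis = null;
		try {
			jedis = pool.getResource();
			return jedis.set(key, value);
		} catch (Exception e) {
			e.printStackTrace();
			return "0";
		} finally {
			// jedis is still null if getResource() threw, so guard against a NullPointerException
			if (jedis != null) {
				jedis.close();  // returns the connection to the pool
			}
		}
	}
}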

Usage:

List<String> list = Arrays.asList("a","b","c","d", "e");
JavaRDD<String> javaRDD = new JavaSparkContext(spark.sparkContext()).parallelize(list, 3);
TestRedisPool testRedisPool = new TestRedisPool(redisIp, port, passwd, dbNum);
List<String> lst = javaRDD.collect();
for(String s:lst) {
	testRedisPool.set(s, getDateString2(0));
}

2. Worker

Initialize the connection pool while iterating on the workers.

javaRDD.foreach(new VoidFunction<String>() {
	@Override
	public void call(String s) throws Exception {
		TestRedisPool testRedisPool = new TestRedisPool(redisIp, port, passwd, dbNum);
		Logs.debug(testRedisPool.getRedisPool());
		testRedisPool.set(s, getDateString2(0));
	}
});

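Every element is visited, and TestRedisPool does not need to be serializable. However, a new Redis connection pool is created for every single element of the RDD. Even with short-lived connections this puts heavy pressure on Redis, and it is extremely inefficient.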

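3. Create on the Master, iterate on the Workers

Create one instance on the Master and use that instance while iterating on the Workers. In principle this works, provided the class is serializable. A broadcast variable can additionally keep the instance cached on each Worker, which reduces network transfer when the instance is used.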

TestRedisPool testRedisPool = new TestRedisPool(redisIp, port, passwd, dbNum);
javaRDD.foreach(new VoidFunction<String>() {
	@Override
	public void call(String s) throws Exception {
		Logs.debug(testRedisPool.getRedisPool());
		testRedisPool.set(s, getDateString2(0));
	}
});

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
...
Serialization stack:
	- object not serializable (class: redis.clients.jedis.JedisPool, value: redis.clients.jedis.JedisPool@3e4f80cb)

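The error says that JedisPool cannot be serialized. Even if TestRedisPool implements Serializable, its JedisPool member field does not support serialization, so this approach is unusable whenever a member field cannot be serialized.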
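
The failure can be reproduced without Spark or Redis. The following is a minimal sketch using plain Java serialization; `FakePool` and `PoolHolder` are hypothetical stand-ins for JedisPool and TestRedisPool, not classes from the article or from Jedis.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for JedisPool: it does NOT implement Serializable.
class FakePool {
}

// Serializable wrapper with a non-serializable instance field, like TestRedisPool in method 3.
class PoolHolder implements Serializable {
    private static final long serialVersionUID = 1L;
    private final FakePool pool = new FakePool();
}

class NonSerializableFieldDemo {

    // Returns true if Java serialization rejects the object.
    static boolean failsToSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return false;
        } catch (NotSerializableException e) {
            return true;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(failsToSerialize(new PoolHolder()));  // true
        System.out.println(failsToSerialize("plain string"));    // false
    }
}
```

Marking the wrapper Serializable is not enough: serialization walks every non-static, non-transient field, and the first one that is not serializable aborts the whole write.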

4. Iterate by partition on the Workers

javaRDD.foreachPartition(new VoidFunction<Iterator<String>>() {
	@Override
	public void call(Iterator<String> stringIterator) throws Exception {
		TestRedisPool testRedisPool = new TestRedisPool(redisIp, port, passwd, dbNum);
		while (stringIterator.hasNext()) {
			Logs.debug(testRedisPool.getRedisPool());
			testRedisPool.set(stringIterator.next(), getDateString2(0));
		}
	}
});

TestRedisPool does not need to be serializable, and only one Redis connection pool is created per partition.
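
To make the difference from method 2 concrete, here is a small sketch that only counts constructor calls. It does not use Spark: the partitions are simulated with nested lists, the split of five elements into three partitions is an arbitrary assumption, and `FakePool` is a hypothetical stand-in for TestRedisPool.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class PartitionPoolCountDemo {
    // Counts how many stand-in pools have been constructed.
    static final AtomicInteger created = new AtomicInteger();

    // Hypothetical stand-in for TestRedisPool; construction is the expensive part.
    static class FakePool {
        FakePool() {
            created.incrementAndGet();
        }

        void set(String key, String value) {
            // no-op in this sketch
        }
    }

    // Method 2 style: a new pool for every element.
    static int perElement(List<List<String>> partitions) {
        created.set(0);
        for (List<String> partition : partitions) {
            for (String s : partition) {
                new FakePool().set(s, "v");
            }
        }
        return created.get();
    }

    // Method 4 style: one pool per partition, reused for all of its elements.
    static int perPartition(List<List<String>> partitions) {
        created.set(0);
        for (List<String> partition : partitions) {
            FakePool pool = new FakePool();
            for (String s : partition) {
                pool.set(s, "v");
            }
        }
        return created.get();
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a"), Arrays.asList("b", "c"), Arrays.asList("d", "e"));
        System.out.println(perElement(partitions));    // 5
        System.out.println(perPartition(partitions));  // 3
    }
}
```

The number of pools drops from the number of elements to the number of partitions, which matters far more on a real RDD with millions of elements than in this five-element example.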

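5. Use a static field, iterate by partition

Above, we managed to create one connection pool per partition, but a machine usually hosts several partitions, so how can the number of pools be reduced further? A static field exists only once per JVM, so if the Redis connection pool is declared static, each worker (strictly speaking, each executor JVM) creates only one Redis connection pool.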

public class TestRedisPool {
	private static JedisPool pool = null;
	...
}

Incorrect usage:

TestRedisPool testRedisPool = new TestRedisPool(redisIp, port, passwd, dbNum);
javaRDD.foreachPartition(new VoidFunction<Iterator<String>>() {
	@Override
	public void call(Iterator<String> stringIterator) throws Exception {
		while (stringIterator.hasNext()) {
			Logs.debug(testRedisPool.getRedisPool());
			testRedisPool.set(stringIterator.next(), getDateString2(0));
		}
	}
});

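When the TestRedisPool instance is created on the Master like this, the static pool is never initialized on the workers, so the call there fails with a java.lang.NullPointerException.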
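
The reason is that static fields are not part of an object's serialized form, and deserialization does not run the class's constructor. The sketch below simulates this in a single JVM with plain Java serialization; `StaticHolder` is a hypothetical stand-in for the static version of TestRedisPool, and clearing the static field by hand stands in for "a fresh executor JVM".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for TestRedisPool with "private static JedisPool pool".
class StaticHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    // Static fields belong to the class, so serialization skips them.
    static Object pool = null;
    final String name;

    StaticHolder(String name) {
        this.name = name;
        if (pool == null) {
            pool = new Object();  // initialized in the constructor, as in method 5
        }
    }
}

class StaticFieldDemo {

    // Simulates driver -> executor: serialize, forget the static state, deserialize.
    static boolean poolIsNullAfterTransfer() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(new StaticHolder("driver"));
            }
            StaticHolder.pool = null;  // a fresh JVM starts with the static field unset
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                StaticHolder copy = (StaticHolder) in.readObject();
                // The instance field travelled; the static field did not, and the
                // constructor was not run again during deserialization.
                return "driver".equals(copy.name) && StaticHolder.pool == null;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(poolIsNullAfterTransfer());  // true
    }
}
```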

Correct usage:

javaRDD.foreachPartition(new VoidFunction<Iterator<String>>() {
	@Override
	public void call(Iterator<String> stringIterator) throws Exception {
		TestRedisPool testRedisPool = new TestRedisPool(redisIp, port, passwd, dbNum);
		while (stringIterator.hasNext()) {
			Logs.debug(testRedisPool.getRedisPool());
			testRedisPool.set(stringIterator.next(), getDateString2(0));
		}
	}
});

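Again, TestRedisPool does not need to be serializable. Here an instance is created separately in each partition. The partitions are processed by concurrent task threads, so in effect three threads try to obtain the JedisPool at the same time, and because the null check in the constructor is not synchronized, the pool is initialized three times. Turning it into a singleton solves the repeated initialization.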

6. Use a singleton, iterate by partition

Connection pool code:

package com.project.uitl;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

/**
 * Redis connection pool utility
 *
 */
public class JedisPoolUtil {
    
    private static final String HOST = "132.232.6.208";
    private static final int PORT = 6381;
    
    private static volatile JedisPool jedisPool = null;
    
    private JedisPoolUtil() {}
    
    /**
     * Returns the JedisPool instance (singleton).
     * @return the JedisPool instance
     */
    public static JedisPool getJedisPoolInstance() {
        if (jedisPool == null) {
            synchronized (JedisPoolUtil.class) {
                if (jedisPool == null) {
                    
                    JedisPoolConfig poolConfig = new JedisPoolConfig();
                    poolConfig.setMaxTotal(1000);           // maximum number of connections
                    poolConfig.setMaxIdle(32);              // maximum number of idle connections
                    poolConfig.setMaxWaitMillis(100*1000);  // maximum wait time (ms)
                    poolConfig.setTestOnBorrow(true);       // validate connections so a borrowed one is usable
                    
                    jedisPool = new JedisPool(poolConfig, HOST, PORT);
                }
            }
        }
        
        return jedisPool;
    }
    
    /**
     * Borrows a Jedis instance (connection) from the pool.
     * @return a Jedis instance
     */
    public static Jedis getJedisInstance() {
        
        return getJedisPoolInstance().getResource();
    }
    
    /**
     * Returns a Jedis object (connection) to the pool.
     * @param jedisPool the connection pool (unused; kept so existing callers still compile)
     * @param jedis the connection object
     */
    public static void release(JedisPool jedisPool, Jedis jedis) {
        
        if (jedis != null) {
            jedis.close();  // returns the connection to its pool; replaces the deprecated returnResourceObject
        }
    }
}

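In the code above, volatile ensures that no thread can observe a JedisPool that has not finished initializing, and synchronized resolves contention between threads. Used together, the two keywords implement lazy initialization with double-checked locking.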
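
The effect of double-checked locking can be checked in isolation. This is a minimal sketch, not the article's code: `LazyPoolHolder` is a hypothetical class, a plain `Object` stands in for JedisPool, and the thread count is arbitrary.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical lazy singleton; a plain Object stands in for JedisPool.
class LazyPoolHolder {
    static final AtomicInteger initCount = new AtomicInteger();
    private static volatile Object pool;

    private LazyPoolHolder() {}

    static Object getInstance() {
        Object local = pool;                      // first check, without locking
        if (local == null) {
            synchronized (LazyPoolHolder.class) {
                local = pool;                     // second check, under the lock
                if (local == null) {
                    initCount.incrementAndGet();
                    local = new Object();
                    pool = local;
                }
            }
        }
        return local;
    }
}

class DoubleCheckedLockingDemo {

    // Releases all threads at once and returns how many times the pool was initialized.
    static int initCountAfterConcurrentAccess(int threads) {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                LazyPoolHolder.getInstance();
            });
            workers[i].start();
        }
        start.countDown();
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return LazyPoolHolder.initCount.get();
    }

    public static void main(String[] args) {
        System.out.println(initCountAfterConcurrentAccess(8));  // 1
    }
}
```

However many threads race on the first call, only the one that wins the lock runs the initialization; the others see the non-null field on the second check.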

javaRDD.foreachPartition(new VoidFunction<Iterator<String>>() {
	@Override
	public void call(Iterator<String> stringIterator) throws Exception {
		TestRedisPool testRedisPool = new TestRedisPool(redisIp, port, passwd, dbNum);
		while (stringIterator.hasNext()) {
			Logs.debug("class:" + testRedisPool);
			Logs.debug("pool:" + testRedisPool.getRedisPool());
			testRedisPool.set(stringIterator.next(), getDateString2(0));
		}
	}
});

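Now the JedisPool is initialized only once, and only one JedisPool exists in the whole JVM. However, multiple TestRedisPool objects are still created. Defining the object on the Master and distributing it to the Workers as a broadcast variable solves this; in that case TestRedisPool must be serializable.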

7. Use a singleton, define on the Driver, iterate by partition

TestRedisPool testRedisPool = new TestRedisPool(redisIp, port, passwd, dbNum);
final Broadcast<TestRedisPool> broadcastRedis = new JavaSparkContext(spark.sparkContext()).broadcast(testRedisPool);
javaRDD.foreachPartition(new VoidFunction<Iterator<String>>() {
	@Override
	public void call(Iterator<String> stringIterator) throws Exception {
		TestRedisPool redisClient1 = broadcastRedis.getValue();
		while (stringIterator.hasNext()) {
			Logs.debug("class:" + redisClient1);
			Logs.debug("pool:" + redisClient1.getRedisPool());
			redisClient1.set(stringIterator.next(), getDateString2(0));
		}
	}
});

Now TestRedisPool is defined on the Master and broadcast to every Worker, while each worker still only ever holds a single JedisPool instance.

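Some readers may wonder why JedisPool no longer runs into the serialization problem (method 3), or into the problem of the workers not being able to obtain the JedisPool once it is declared static (the first case in method 5). In method 3, JedisPool is an ordinary instance field, so it is serialized together with the object; because JedisPool itself does not support serialization, the task fails. In method 5, the pool is a static field, and a static field belongs to the class rather than to any instance, so serializing TestRedisPool succeeds. But the pool is created and initialized on the Master and is never shipped to the nodes, so every node sees a null JedisPool and the call fails. Method 7 uses lazy initialization: the JedisPool is only created when it is first used, so the initialization actually happens on the nodes and neither problem arises.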
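
The pattern behind method 7 can be sketched without Spark or Redis: a serializable object that carries only configuration, plus a static pool that is created lazily on first use. In the sketch below `LazyRedisClient` is a hypothetical stand-in for TestRedisPool, a string stands in for JedisPool, the host and port are made-up values, and a serialization round trip stands in for the broadcast.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for TestRedisPool in method 7: only configuration is serialized.
class LazyRedisClient implements Serializable {
    private static final long serialVersionUID = 1L;

    static final AtomicInteger poolInits = new AtomicInteger();
    // Stand-in for JedisPool. Static, so it is never serialized; created on first use.
    private static volatile Object pool;

    private final String host;
    private final int port;

    LazyRedisClient(String host, int port) {
        this.host = host;  // note: no pool is created in the constructor
        this.port = port;
    }

    private static Object pool(String host, int port) {
        Object local = pool;
        if (local == null) {
            synchronized (LazyRedisClient.class) {
                local = pool;
                if (local == null) {
                    poolInits.incrementAndGet();
                    local = "pool@" + host + ":" + port;  // a real client would build a JedisPool here
                    pool = local;
                }
            }
        }
        return local;
    }

    String set(String key, String value) {
        pool(host, port);  // the first call in this JVM creates the pool
        return "OK";
    }
}

class LazyBroadcastDemo {

    // Serializes a client that was never used, deserializes it, uses the copy,
    // and returns how many pools have been created in this JVM.
    static int poolInitsAfterRoundTrip() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(new LazyRedisClient("localhost", 6379));
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                LazyRedisClient copy = (LazyRedisClient) in.readObject();
                copy.set("a", "1");
                copy.set("b", "2");
            }
            return LazyRedisClient.poolInits.get();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(poolInitsAfterRoundTrip());  // 1
    }
}
```

Serialization succeeds because the only instance fields are a string and an int, and the pool is built on whichever JVM first calls `set`, which on a cluster would be the executor rather than the driver.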
