Can Iceberg tables use the INSERT OVERWRITE statement to insert data?

Reposted

小蝌蚪 2025-01-06 22:17:30

Tags: big data, iceberg, presto, data lake in practice. Category: architecture, backend development

Contents

Introduction to Iceberg
Environment setup
Hands-on CRUD

Presto

Configuration
Testing
Conclusions

Trino

Introduction
Configuration
Testing
Conclusions

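Introduction to Iceberg

The official site, https://iceberg.apache.org/, already describes Iceberg itself in detail, so this article focuses on hands-on use and the pitfalls I ran into. I start with the Presto and Trino engines mainly because I could not find detailed documentation on using them with Iceberg, whereas Spark and Flink are covered by many more articles. Pitfalls met with other engines may follow in a later post.
Some articles do not follow the latest official documentation, or reach conclusions without much hands-on testing (for example, that Iceberg does not yet support row-level updates). In my own tests, the latest presto-0.276 and Flink 1.15 do not yet support SQL deletes or updates that filter on non-partition keys, while Spark and Trino can both perform row-level updates directly in SQL.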

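Environment setup

A few components need to be installed first. For testing, a single-node deployment is enough. The versions I used:
hadoop-3.2.3
hive-3.1.2 (mainly for the metastore)
presto-0.276
trino-397 (requires JDK 17.0.3 or later)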

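Hands-on CRUD

Presto

Configuration

The official documentation describes the Iceberg configuration in detail.
Presto supports two catalog types, hive and hadoop, selected through iceberg.catalog.type. We will create one of each.

hadoop type
Only the core configuration is shown here, in etc/catalog/iceberg.properties. The file name iceberg is arbitrary; it becomes the catalog name.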

connector.name=iceberg
hive.metastore.uri=thrift://127.0.0.1:9083
iceberg.catalog.type=hadoop
iceberg.catalog.warehouse=hdfs://127.0.0.1:8020/user/iceberg/hadoop_db

hive type
etc/catalog/iceberg1.properties

connector.name=iceberg
hive.metastore.uri=thrift://localhost:9083
iceberg.catalog.type=hive
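
For a repeatable setup, the two catalog files can be written from a small script. This is only a sketch: PRESTO_HOME and its default value are an assumption about where Presto is installed, while the property values are copied from the two listings in this section.

```shell
#!/bin/sh
# Sketch: write the two Presto catalog files described in this section.
# PRESTO_HOME is an assumed install location (not from the article); adjust it.
PRESTO_HOME="${PRESTO_HOME:-./presto-server-0.276}"
CATALOG_DIR="$PRESTO_HOME/etc/catalog"
mkdir -p "$CATALOG_DIR"

# hadoop-type catalog, exposed in Presto as "iceberg"
cat > "$CATALOG_DIR/iceberg.properties" <<'EOF'
connector.name=iceberg
hive.metastore.uri=thrift://127.0.0.1:9083
iceberg.catalog.type=hadoop
iceberg.catalog.warehouse=hdfs://127.0.0.1:8020/user/iceberg/hadoop_db
EOF

# hive-type catalog, exposed in Presto as "iceberg1"
cat > "$CATALOG_DIR/iceberg1.properties" <<'EOF'
connector.name=iceberg
hive.metastore.uri=thrift://localhost:9083
iceberg.catalog.type=hive
EOF

# Show what was written
ls "$CATALOG_DIR"
```

Presto reads its catalog files at startup, so restart the server after adding or changing them.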

Testing

Connect to Presto with ./presto-cli.jar --server localhost:8080 --catalog iceberg, which selects the hadoop-type catalog created above. Then create the test_db schema and switch to it:

presto> create schema test_db;
CREATE SCHEMA
presto> 
presto> use test_db;
USE
presto:test_db>

You can check the HDFS path to verify that a directory was created under the warehouse you configured.
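
The directory check can be run from the shell as sketched below. The warehouse URI is the one from iceberg.properties. That the schema appears as a test_db subdirectory directly under the warehouse root is an assumption about the hadoop catalog layout, so treat the second check as a guess rather than a guarantee.

```shell
#!/bin/sh
# Sketch: check that the hadoop-type catalog created a directory for the schema.
WAREHOUSE="hdfs://127.0.0.1:8020/user/iceberg/hadoop_db"  # from iceberg.properties
SCHEMA="test_db"                                          # from CREATE SCHEMA
SCHEMA_PATH="${WAREHOUSE}/${SCHEMA}"                      # assumed layout: <warehouse>/<schema>

if command -v hdfs >/dev/null 2>&1; then
    # List the warehouse root, then test for the schema directory.
    hdfs dfs -ls "$WAREHOUSE"
    if hdfs dfs -test -d "$SCHEMA_PATH"; then
        echo "found: $SCHEMA_PATH"
    else
        echo "missing: $SCHEMA_PATH"
    fi
else
    echo "hdfs CLI not on PATH; skipping check of $SCHEMA_PATH"
fi
```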


Create the test1 table and insert some rows:

presto:test_db> CREATE TABLE test1 ( 
             ->     "id" bigint,                        
             ->     "data" varchar                      
             ->  )             ;
CREATE TABLE
presto:test_db> show create table test1;
             Create Table             
--------------------------------------
 CREATE TABLE iceberg.test_db.test1 ( 
    "id" bigint,                      
    "data" varchar                    
 )                                    
 WITH (                               
    format = 'PARQUET'                
 )                                    
(1 row)

Query 20220926_094457_00039_yd26t, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
81ms [0 rows, 0B] [0 rows/s, 0B/s]

presto:test_db> 
presto:test_db> insert into test1 values (1, '张三'), (2, '李四');
INSERT: 2 rows

Query 20220926_094551_00040_yd26t, FINISHED, 1 node
Splits: 35 total, 35 done (100.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]

presto:test_db> select * from test1;
 id | data 
----+------
  1 | 张三 
  2 | 李四 
(2 rows)

Query 20220926_094607_00041_yd26t, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
111ms [2 rows, 426B] [17 rows/s, 3.73KB/s]

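Attempting a row-level delete or update fails. The DELETE is rejected because the connector only deletes whole partitions, and the UPDATE statement is not even recognized by the parser in this version.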

presto:test_db> delete from test1 where id = 1;
Query 20220926_094632_00042_yd26t failed: This connector only supports delete where one or more partitions are deleted entirely

presto:test_db>  update test1 set data = 'update' where id = 1; 
Query 20220926_094724_00043_yd26t failed: line 1:1: mismatched input 'update'. Expecting: 'ALTER', 'ANALYZE', 'CALL', 'COMMIT', 'CREATE', 'DEALLOCATE', 'DELETE', 'DESC', 'DESCRIBE', 'DROP', 'EXECUTE', 'EXPLAIN', 'GRANT', 'INSERT', 'PREPARE', 'REFRESH', 'RESET', 'REVOKE', 'ROLLBACK', 'SET', 'SHOW', 'START', 'TRUNCATE', 'USE', <query>
update test1 set data = 'update' where id = 1
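
The DELETE error message suggests that deletes which remove whole partitions are accepted. A minimal sketch of that case follows. It has not been verified against presto-0.276; the table test_part and its dt column are made up for illustration, the PRESTO_CLI path is an assumption, and the partitioning table property comes from the Presto Iceberg connector documentation rather than from the session in this article.

```shell
#!/bin/sh
# Sketch: partition-level delete on an Iceberg table through presto-cli.
# test_part and dt are hypothetical names; PRESTO_CLI is an assumed path.
PRESTO_CLI="${PRESTO_CLI:-./presto-cli.jar}"
SQL_FILE="${TMPDIR:-/tmp}/iceberg_partition_delete.sql"

cat > "$SQL_FILE" <<'EOF'
CREATE TABLE test_part (
    id bigint,
    data varchar,
    dt varchar
)
WITH (format = 'PARQUET', partitioning = ARRAY['dt']);

INSERT INTO test_part VALUES
    (1, 'a', '2022-09-25'),
    (2, 'b', '2022-09-26');

-- The predicate uses only the partition column, so a whole partition is removed.
DELETE FROM test_part WHERE dt = '2022-09-25';

SELECT * FROM test_part;
EOF

if [ -x "$PRESTO_CLI" ]; then
    "$PRESTO_CLI" --server localhost:8080 --catalog iceberg --schema test_db --file "$SQL_FILE"
else
    echo "presto-cli not found at $PRESTO_CLI; SQL written to $SQL_FILE"
fi
```

By contrast, a predicate such as id = 1 on the same table would still be expected to fail with the partition error, because it does not line up with partition boundaries.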

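Likewise, connecting through the hive-type catalog, creating a schema named hive_test and a table test1 in it, and running the same statements gives the same result.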

./presto-cli.jar --server localhost:8080 --catalog iceberg1
create schema hive_test;
use hive_test;

CREATE TABLE test1 ( 
    "id" bigint,                        
    "data" varchar                      
);
insert into test1 values (1, '张三'), (2, '李四');
select * from test1;
delete from test1 where id = 1;
update test1 set data = 'update' where id = 1;

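Conclusions

Presto supports Iceberg and offers two catalog types, hive and hadoop.
As of presto-0.276, Presto SQL does not support UPDATE on Iceberg tables, and DELETE only works when it removes whole partitions.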

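Trino

Introduction

Trino (https://trino.io/) was created by the founders of Presto after they left Facebook over disagreements with the company. See https://www.sohu.com/a/441573139_315839

Configuration

The official documentation currently also lists two catalog options, hive and glue. We use hive for this demonstration. The main configuration, in etc/catalog/iceberg.properties, is: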

connector.name=iceberg
hive.metastore.uri=thrift://9.135.12.10:9083

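Testing

Trino needs JDK 17 to run, and its HTTP port must be changed so that it does not clash with Presto's. If you do not want to change the system-wide JAVA_HOME, set it separately in bin/launcher: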

export JAVA_HOME=/data/opt/jdk-17.0.4.1
export PATH=$JAVA_HOME/bin:$PATH
java -version
# Java 17 must be selected before the launch command below
exec "$(dirname "$0")/launcher.py" "$@"

Connect with trino-cli to test. Running show schemas shows that the hive_test schema created earlier through Presto is visible here.

./trino-cli.jar --server localhost:8081 --catalog iceberg

trino> show schemas;
       Schema       
--------------------
 default            
 hive_test          
(4 rows)

However, the rows returned for that table are currently all NULL:

trino> select * from hive_test.test1;
  id  | data 
------+------
 NULL | NULL 
 NULL | NULL 
(2 rows)

Query 20220926_102503_00883_99pn2, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.09 [2 rows, 305B] [21 rows/s, 3.2KB/s]

Next, create a table test2 and run insert, delete, update, and select against it:
