1. HDFS file block error
Problem description:
The following error is thrown when using Hive to load an HDFS file:
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Spark job failed due to task failures: Cannot obtain block length for LocatedBlock{BP-1984322900-192.168.102.3-1594185446267:blk_1180034904_106295094; getBlockSize()=4179; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.102.11:9866,DS-cb5a2e07-20e9-45fd-869b-5d8b4ad170a4,DISK], DatanodeInfoWithStorage[192.168.102.9:9866,DS-74706bce-bb23-4aaf-a6eb-ceaa9bdbf38c,DISK], DatanodeInfoWithStorage[192.168.102.5:9866,DS-57f122fb-b6ca-437c-a52e-5f81efdd239c,DISK]]}
Analysis:
The message "Cannot obtain block length for LocatedBlock" shows that an HDFS block is in an abnormal state: its length cannot be determined.
The cause was a CDH restart the previous day, which left some HDFS files in an unclosed, open-for-write state.
Solution:
Run an fsck check against the HDFS path that Hive was loading:
hdfs fsck /flume/ex_trade/date=20220222 -openforwrite
Connecting to namenode via http://master1.zd.prod:9870/fsck?ugi=jay&openforwrite=1&path=%2Fflume%2Fex_trade%2Fdate%3D20220222
FSCK started by jay (auth:SIMPLE) from /192.168.102.11 for path /flume/ex_trade/date=20220222 at Wed Feb 23 16:37:10 CST 2022
/flume/ex_trade/date=20220222/ex_trade_.1645508080614 1494 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
/flume/ex_trade/date=20220222/ex_trade_.1645519019856 4179 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
Status: HEALTHY
Number of data-nodes: 6
Number of racks: 1
Total dirs: 1
Total symlinks: 0
Replicated Blocks:
Total size: 51340251 B
Total files: 7
Total blocks (validated): 7 (avg. block size 7334321 B)
Minimally replicated blocks: 5 (71.42857 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.142857
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Blocks queued for replication: 0
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
Blocks queued for replication: 0
FSCK ended at Wed Feb 23 16:37:10 CST 2022 in 2 milliseconds
The filesystem under path '/flume/ex_trade/date=20220222' is HEALTHY
The files ex_trade_.1645508080614 and ex_trade_.1645519019856 are flagged OPENFORWRITE, i.e. they were never closed.
Run the lease recovery command on each of them:
hdfs debug recoverLease -path /flume/ex_trade/date=20220222/ex_trade_.1645508080614 -retries 3
hdfs debug recoverLease -path /flume/ex_trade/date=20220222/ex_trade_.1645519019856 -retries 3
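If a restart leaves many files open under the same directory, recovering the leases one by one gets tedious. Below is a minimal sketch (the directory is just this example's path; substitute your own) that feeds every OPENFORWRITE file reported by fsck into hdfs debug recoverLease:
# list files still open for write, then recover the lease on each one
for f in $(hdfs fsck /flume/ex_trade/date=20220222 -openforwrite | grep 'OPENFORWRITE' | awk '{print $1}'); do
  hdfs debug recoverLease -path "$f" -retries 3
done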
Re-run the Hive load of the HDFS file; the problem is resolved.
2. Log not rolled. Name node is in safe mode.
[root@cdhm01 hadoop-hdfs]# hdfs dfsadmin -safemode enter
Safe mode is ON
[root@cdhm01 hadoop-hdfs]# hdfs dfsadmin -safemode leave
Safe mode is OFF
Manually entering and then leaving safe mode clears the error.
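To confirm the current state before or after toggling, the standard status subcommand can be used:
hdfs dfsadmin -safemode get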
3. Canary test cannot create files in the directory /tmp/.cloudera_health_monitoring_canary_files
In most cases this happens right after a restart, while the NameNode is still in safe mode. Simply wait for it to exit automatically; if it stays in safe mode for a long time, leave safe mode manually (as in issue 2).
4. The cluster has xx missing blocks out of xxx total blocks. Percentage of missing blocks: x%
There are many possible reasons for a cluster to report missing blocks: physical disk damage; nodes decommissioned or taken offline abnormally; nodes dropping out under high load (memory exhausted and the host hangs), network congestion, or OS-level problems; or, on a CDH cluster, the agent losing contact with the server, abnormal shutdowns, and heartbeat timeouts. Any of these can make missing-block alerts appear in the cluster web UI.
- Check for missing blocks
hdfs fsck /
The specific missing files and directories are listed further down in the output:
CORRUPT FILES: 54           # corrupt files
MISSING BLOCKS: 56          # missing blocks
MISSING SIZE: 273708122 B   # total size of the missing blocks
This command can be used to pinpoint them:
hdfs fsck / | egrep -v '^\.+$' | grep -v eplica
Look for the following in the output:
UNDER MIN REPL'D BLOCKS: 56 (80.0 %)   # proportion of blocks below minimum replication
CORRUPT FILES: 54           # corrupt files
MISSING BLOCKS: 56          # missing blocks
MISSING SIZE: 273708122 B   # total size of the missing blocks
********************************
Missing blocks: 56          # missing blocks
Corrupt blocks: 0           # corrupt blocks
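A quicker way to list only the affected files is the -list-corruptfileblocks option of fsck:
hdfs fsck / -list-corruptfileblocks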
Inspect the state of a specific file's blocks:
hdfs fsck /hbase/oldWALs/pv2-00000000000000000007.log
Delete files with missing blocks that are not important:
hdfs fsck -delete /hbase/oldWALs/pv2-00000000000000000003.log
Note: do NOT use
hdfs dfs -rm <file path>
For important files you can attempt recovery like this:
hdfs debug recoverLease -path /hbase/oldWALs/pv2-00000000000000000007.log
It reports success, but in most cases I have not actually seen it recover anything.
Missing blocks can still come back on their own (for example when an offline DataNode rejoins); corrupt blocks are best deleted:
hdfs fsck -delete
5. could only be written to 0 of the 1 minReplication nodes. There are 3 datanode(s) running
The cause is that no DataNode could be found to write to. Why none could be found needs to be investigated, starting with the network.
Common culprits: the network is unreachable, a port is blocked, the hostname HDFS returns cannot be resolved (no hosts mapping configured), or, on dual-NIC machines, the IP HDFS hands out is not the IP that is actually reachable.
hdfs dfsadmin -report
Use this command to check whether the IPs it reports are ones you can actually reach, and whether HDFS itself is accessing DataNodes directly by IP or by hostname.
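A quick connectivity sketch built on that report (nc is assumed to be installed; the awk pattern assumes the usual "Name: ip:port (host)" report line format; 9866 is the default dfs.datanode.address data-transfer port used elsewhere in this document):
# pull the DataNode IPs out of the report, then test name resolution and the data-transfer port
for ip in $(hdfs dfsadmin -report | awk -F'[: ]+' '/^Name:/{print $2}'); do
  getent hosts "$ip"        # is there a hosts/DNS mapping for this address?
  nc -zv -w 3 "$ip" 9866    # can we reach the DataNode data-transfer port?
done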
In the vast majority of cases there is no need to format the NameNode.
6. distcp: Neither source file listing nor source paths
This was a permissions problem: the source file listing could not be read.
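For reference, distcp accepts either source paths on the command line or a listing file passed with -f; the paths below are placeholders, not from the original:
hadoop distcp -f hdfs://nn1:8020/tmp/srclist hdfs://nn2:8020/dest    # srclist must be readable by the submitting user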
7. distcp error: Check-sum mismatch between hdfs source and target
The error message "Source and target differ in block-size. Use -pb to preserve block-sizes during copy" indicates that the old and new environments use different block sizes.
When HDFS writes a file it uses a configured block size (128 MB by default). DistCp reads files from the source cluster and writes them to the new cluster using the dfs.blocksize of its MR job, which also defaults to 128 MB:
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>
After DistCp finishes writing a file it verifies it with a block-based checksum. Because the file's block size differs between the old and new clusters, the file is split into blocks differently, and the verification fails.
In this case the file was 10000 blocks on the old cluster but 97656 blocks on the new one, so the physical on-disk layout differs and the checksums on the two sides do not match.
Adding the -pb flag to distcp preserves the source cluster's block size.
Change the command to: hadoop distcp -pb
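A complete invocation would look roughly like the following (cluster addresses and paths are placeholders):
hadoop distcp -pb hdfs://old-nn:8020/warehouse/src_table hdfs://new-nn:8020/warehouse/src_table    # -pb preserves the source block size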
8. HDFS HA: Cannot find any valid remote NN to service request
All NameNodes are in standby state, and inspecting them shows:
Cannot find any valid remote NN to service request
Solution:
- Stop the Hadoop services
- On all of the NameNodes, run
hdfs zkfc -formatZK
Then restart the Hadoop services.
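To verify that one NameNode has become active afterwards (nn1/nn2 stand for your dfs.ha.namenodes service IDs, used here as placeholders):
hdfs haadmin -getServiceState nn1    # expect one of these to report "active"
hdfs haadmin -getServiceState nn2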
10. Apache Hadoop NameNode fails to start after enabling Kerberos: Principal not defined in configuration
The configuration item dfs.web.authentication.kerberos.principal was missing.
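A typical hdfs-site.xml entry for it looks like this (EXAMPLE.COM is a placeholder realm; _HOST is expanded by Hadoop to the local hostname):
<property>
  <name>dfs.web.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>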
For details, see the relevant configuration section of the blog post 原生apache hadoop3.3.1集群安装配置Kerberos (Mumunu- on CSDN).
11. HDFS error: Premature EOF from inputStream
This can have many causes, for example a network interruption, insufficient disk space, or the file being deleted while the data stream operation is in progress. To resolve it, retry the transfer, or check that the network connection and disk space are healthy. In my case the cause was a disk whose read/write speed would occasionally drop sharply.
It can also happen when the number of concurrent data-transfer connections a DataNode handles exceeds the configured limit; increase dfs.datanode.max.transfer.threads, for example to 8192.
You can also increase dfs.datanode.handler.count at the same time, as in the example below.
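A sketch of the corresponding hdfs-site.xml changes (8192 is the value suggested above; the handler count shown is only an illustration, tune it for your cluster):
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>16</value>  <!-- illustrative value only -->
</property>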
12. Disk mount reports no error but the disk does not get mounted
A failed disk in the cluster was replaced, but after the replacement the mount did not take effect:
mount /dev/sda /data1
The mount command returns no error, but df -h shows no /data1.
Checking the system log /var/log/messages reveals the following related message:
systemd: Unit data4.mount is bound to inactive unit dev-disk-by\x2duuid-32ae8f30\x2d1992\x2d488c\x2da811\x2d078385a40b31.device. Stopping, too
Fix:
Reload the systemd unit definitions:
systemctl daemon-reload
Mount the disk again:
mount /dev/sda /data1
The disk is now mounted correctly.
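Since the systemd message points at a stale by-uuid device unit, it is also worth checking that /etc/fstab references the replacement disk's new UUID; a sketch (/dev/sda and /data1 are this section's example names):
blkid /dev/sda          # print the new filesystem UUID
grep data1 /etc/fstab   # confirm the fstab entry uses that UUID (update it if not)
systemctl daemon-reload # regenerate the mount units after any fstab edit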