1. HDFS file block anomaly

Problem description:

Loading HDFS files with Hive fails with the following error:

FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Spark job failed due to task failures: Cannot obtain block length for LocatedBlock{BP-1984322900-192.168.102.3-1594185446267:blk_1180034904_106295094; getBlockSize()=4179; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.102.11:9866,DS-cb5a2e07-20e9-45fd-869b-5d8b4ad170a4,DISK], DatanodeInfoWithStorage[192.168.102.9:9866,DS-74706bce-bb23-4aaf-a6eb-ceaa9bdbf38c,DISK], DatanodeInfoWithStorage[192.168.102.5:9866,DS-57f122fb-b6ca-437c-a52e-5f81efdd239c,DISK]]}
 22/02/23 16:31:17 ERROR ql.Driver: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Spark job failed due to task failures: Cannot obtain block length for LocatedBlock{BP-1984322900-192.168.102.3-1594185446267:blk_1180034904_106295094; getBlockSize()=4179; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.102.11:9866,DS-cb5a2e07-20e9-45fd-869b-5d8b4ad170a4,DISK], DatanodeInfoWithStorage[192.168.102.9:9866,DS-74706bce-bb23-4aaf-a6eb-ceaa9bdbf38c,DISK], DatanodeInfoWithStorage[192.168.102.5:9866,DS-57f122fb-b6ca-437c-a52e-5f81efdd239c,DISK]]}

Analysis:

The message "Cannot obtain block length for LocatedBlock" tells us that an HDFS file block is in an abnormal state: the block length cannot be determined.

The cause was a CDH restart the previous day, which left HDFS files that were being written in an unclosed (open-for-write) state.

Solution:

Run a check against the HDFS path that Hive was loading from:

hdfs fsck /flume/ex_trade/date=20220222 -openforwrite

Connecting to namenode via http://master1.zd.prod:9870/fsck?ugi=jay&openforwrite=1&path=%2Fflume%2Fex_trade%2Fdate%3D20220222
 FSCK started by jay (auth:SIMPLE) from /192.168.102.11 for path /flume/ex_trade/date=20220222 at Wed Feb 23 16:37:10 CST 2022
 /flume/ex_trade/date=20220222/ex_trade_.1645508080614 1494 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
 /flume/ex_trade/date=20220222/ex_trade_.1645519019856 4179 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
 Status: HEALTHY
  Number of data-nodes:  6
  Number of racks:               1
  Total dirs:                    1
  Total symlinks:                0
  
 Replicated Blocks:
  Total size:    51340251 B
  Total files:   7
  Total blocks (validated):      7 (avg. block size 7334321 B)
  Minimally replicated blocks:   5 (71.42857 %)
  Over-replicated blocks:        0 (0.0 %)
  Under-replicated blocks:       0 (0.0 %)
  Mis-replicated blocks:         0 (0.0 %)
  Default replication factor:    3
  Average block replication:     2.142857
  Missing blocks:                0
  Corrupt blocks:                0
  Missing replicas:              0 (0.0 %)
  Blocks queued for replication: 0
  
 Erasure Coded Block Groups:
  Total size:    0 B
  Total files:   0
  Total block groups (validated):        0
  Minimally erasure-coded block groups:  0
  Over-erasure-coded block groups:       0
  Under-erasure-coded block groups:      0
  Unsatisfactory placement block groups: 0
  Average block group size:      0.0
  Missing block groups:          0
  Corrupt block groups:          0
  Missing internal blocks:       0
  Blocks queued for replication: 0
 FSCK ended at Wed Feb 23 16:37:10 CST 2022 in 2 milliseconds
  
  
 The filesystem under path '/flume/ex_trade/date=20220222' is HEALTHY

The output shows that the files ex_trade_.1645508080614 and ex_trade_.1645519019856 were never closed (OPENFORWRITE).

Run the lease recovery command for each file:

hdfs debug recoverLease -path /flume/ex_trade/date=20220222/ex_trade_.1645508080614 -retries 3

hdfs debug recoverLease -path /flume/ex_trade/date=20220222/ex_trade_.1645519019856 -retries 3
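If many files are left open, the recovery can be scripted. A minimal sketch, assuming each OPENFORWRITE line of the fsck output begins with the file path, as in the output above:

# list open files under the directory and recover the lease on each one
hdfs fsck /flume/ex_trade/date=20220222 -openforwrite 2>/dev/null \
  | grep OPENFORWRITE | awk '{print $1}' \
  | while read -r f; do hdfs debug recoverLease -path "$f" -retries 3; done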

Re-run the Hive load of the HDFS files; the problem is resolved.

2. Log not rolled. Name node is in safe mode.

[root@cdhm01 hadoop-hdfs]# hdfs dfsadmin -safemode enter
Safe mode is ON

[root@cdhm01 hadoop-hdfs]# hdfs dfsadmin -safemode leave 
Safe mode is OFF

Manually enter and then leave safe mode once, as shown above.
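You can also check the current state first, or simply wait for the NameNode to exit safe mode on its own:

hdfs dfsadmin -safemode get     # print whether safe mode is ON or OFF
hdfs dfsadmin -safemode wait    # block until the NameNode leaves safe mode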

3. Canary test cannot create files in the directory /tmp/.cloudera_health_monitoring_canary_files

In most cases this happens right after a restart, while the NameNode is still in safe mode. Just wait for it to exit automatically; if it stays in safe mode for a long time, leave safe mode manually.
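To verify that writes work again before Cloudera re-runs its canary, you can do a similar probe by hand (the path below is just an example scratch directory, not the canary directory itself):

hdfs dfsadmin -safemode get                       # confirm safe mode is OFF
hdfs dfs -mkdir -p /tmp/canary_check              # example scratch directory
hdfs dfs -touchz /tmp/canary_check/_probe         # create an empty test file
hdfs dfs -rm -skipTrash /tmp/canary_check/_probe  # clean up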

4. The cluster has xx missing blocks. The cluster has xxx blocks in total. Percentage of missing blocks: x%

There are many possible causes for missing blocks: physical disk damage; nodes going offline or being decommissioned abnormally; nodes dropping out when the cluster is under high load (memory exhausted and hung, network congestion, OS-level problems); or, in a CDH cluster, the agent losing contact with the server, abnormal decommissioning, heartbeat timeouts, and so on. Any of these can make missing blocks show up in the web UI.

  1. Check for missing blocks:
hdfs fsck /

The output lists the specific missing files and directories.


Scroll down in the output to the summary:

CORRUPT FILES:        54            # corrupt files
  MISSING BLOCKS:       56            # missing blocks
  MISSING SIZE:         273708122 B   # total size of the missing data

You can use this command to pinpoint them:

hdfs fsck / | egrep -v '^\.+$' | grep -v eplica

In the output, look for:

UNDER MIN REPL'D BLOCKS:      56 (80.0 %)   # blocks below minimum replication (percentage affected)
  CORRUPT FILES:        54                    # corrupt files
  MISSING BLOCKS:       56                    # missing blocks
  MISSING SIZE:         273708122 B           # total size of the missing data
  ********************************
 Missing blocks:                56            # missing blocks
 Corrupt blocks:                0             # corrupt blocks
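If you only want the list of affected files, fsck also supports a dedicated option that prints the block ID and file path for every block with no healthy replica:

hdfs fsck / -list-corruptfileblocks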
Inspect a specific file's blocks:
hdfs fsck  /hbase/oldWALs/pv2-00000000000000000007.log
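Adding a few flags shows, for that single file, each block and which DataNodes are supposed to hold its replicas, which helps judge whether the data is recoverable:

hdfs fsck /hbase/oldWALs/pv2-00000000000000000007.log -files -blocks -locations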

Delete the files with missing blocks that are not important:

hdfs fsck /hbase/oldWALs/pv2-00000000000000000003.log -delete
Note: never use the following for this purpose:

hdfs dfs -rm <file path>

For important files, you can try to repair them like this:

hdfs debug recoverLease -path /hbase/oldWALs/pv2-00000000000000000007.log

Although the command reports success, in most cases I have not seen it actually repair anything.
HDFS may still be able to recover blocks that are merely missing; for corrupt blocks, deletion is still the recommendation:

hdfs fsck <path> -delete

5. could only be written to 0 of the 1 minReplication nodes, there are 3 datanode(s) running

The cause is that no DataNode could be found to write to. Why none was found needs to be investigated from the network side and similar angles.

Common causes: the network is unreachable, ports are blocked, the hostname HDFS obtained cannot be resolved (no mapping configured), or, with dual NICs, the IP HDFS obtained is not the one that is actually reachable.

hdfs dfsadmin -report

Use this command to check whether the IPs reported are ones you can actually reach, and whether HDFS itself is accessing nodes directly by IP or via hostnames. A quick connectivity check is sketched below.
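A quick manual check from the client (or NameNode) host, assuming dn01 is one of the DataNode hostnames from the report above and the default Hadoop 3.x ports:

getent hosts dn01   # does the hostname resolve, and to the IP you expect?
nc -vz dn01 9866    # is the DataNode data-transfer port reachable?
nc -vz dn01 9867    # is the DataNode IPC port reachable?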

In the vast majority of cases there is no need to format the NameNode.

6. distcp: Neither source file listing nor source paths

This is a permissions problem: the source file listing cannot be read.
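A quick way to confirm is to list the source path as the same user that runs the distcp job (the user name and path below are placeholders):

sudo -u distcp_user hdfs dfs -ls /path/to/source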

7. distcp error: Check-sum mismatch between hdfs source and target

From the error message "Source and target differ in block-size. Use -pb to preserve block-sizes during copy", the block sizes of the old and new environments apparently differ.

HDFS sets a block size when writing (128 MB by default). When distcp reads files from the source cluster and writes them to the new cluster, by default it uses the dfs.blocksize of the MR job, i.e. 128 MB:

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>
After distcp finishes writing a file, it validates the copy based on the physical block layout. Because the file's block size differs between the old and new clusters, it is split into a different number of blocks, and the validation fails.
In the old cluster the file consisted of 10000 blocks; in the new cluster, 97656 blocks. The physical on-disk layout therefore differs, and the checksum comparison between the two sides fails.
Adding the -pb option to distcp preserves the source cluster's block size.
Change the command to: hadoop distcp -pb ...
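A full example of the adjusted command (the cluster addresses and paths are placeholders):

hadoop distcp -pb hdfs://old-nn:8020/src/path hdfs://new-nn:8020/dst/path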

8. HDFS HA: Cannot find any valid remote NN to service request

All NameNodes are in the standby state, and checking the logs shows:

Cannot find any valid remote NN to service request
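You can confirm the state of each NameNode with haadmin (nn1 and nn2 are example service IDs taken from dfs.ha.namenodes.<nameservice>):

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2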

Solution:

  1. Stop the Hadoop services.
  2. On all NameNode hosts, run:
hdfs zkfc -formatZK

Then restart the Hadoop services.

10. Apache Hadoop NameNode fails to start after Kerberos is deployed: Principal not defined in configuration

The dfs.web.authentication.kerberos.principal property was not configured.

For details, see the CSDN post 原生apache hadoop3.3.1集群安装配置Kerberos_Mumunu-的博客-CSDN博客.

The relevant configuration:
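A sketch of the missing properties in hdfs-site.xml (the realm EXAMPLE.COM and the keytab path are placeholders; use the values from your own cluster):

<property>
  <name>dfs.web.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>dfs.web.authentication.kerberos.keytab</name>
  <value>/etc/security/keytabs/spnego.service.keytab</value>
</property>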

11. HDFS error: Premature EOF from inputStream

There are many possible causes for this, such as a network interruption, insufficient disk space, or the file being deleted while the data stream operation was in progress. To resolve it, you can retry the data transfer, or check that the network connection and disk space are healthy. In my case the cause was that disk read/write speed occasionally became very slow.

It also occurs when the number of concurrent data-transfer connections handled by a DataNode exceeds the configured limit. Increase dfs.datanode.max.transfer.threads, e.g. to 8192.

You can also increase dfs.datanode.handler.count (the number of DataNode server threads).
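The corresponding hdfs-site.xml entries (8192 is the value mentioned above; the handler count of 30 is only an example, the default is 10):

<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>30</value>
</property>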

12. Disk mount reports no error but the filesystem does not get mounted

A failed disk in the cluster was replaced, but after the replacement the mount did not succeed:

mount  /dev/sda  /data1

The mount command reports no error, but df -h shows no /data1.

Checking the system log /var/log/messages reveals the following:

systemd: Unit data4.mount is bound to inactive unit dev-disk-by\x2duuid-32ae8f30\x2d1992\x2d488c\x2da811\x2d078385a40b31.device. Stopping, too

Fix:

Reload the systemd manager configuration:

systemctl daemon-reload

Mount the disk again:

mount  /dev/sda  /data1

df -h now shows the disk mounted correctly.
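One extra point, not part of the incident above but worth checking after a disk swap (an assumption, since the log shows a stale device UUID): if /etc/fstab mounts the data directory by UUID, the replacement disk has a new UUID and the old entry must be updated before the daemon-reload:

blkid /dev/sda            # show the new disk's UUID
vi /etc/fstab             # replace the stale UUID in the /data1 entry
systemctl daemon-reload   # regenerate systemd mount units from fstab
mount -a                  # mount everything listed in fstab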