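Problem description:
Different endpoints of the service kept returning 502. The errors were spread across different endpoints and different nginx instances, which was puzzling.

Competition production log platform:

error.log in nginx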
 2020/12/23 16:59:59 [error] 22636#0: *380224130 no live upstreams while connecting to upstream, client: 100.117.86.88, server: aa.code.com, request: "GET /api/competition/process/student/detail?competitionId=127 HTTP/1.1", upstream: "http://aa_xes/api/competition/process/student/detail?competitionId=127", host: "aa.code.com", referrer: "https://aa.code.com/?id=c6a4761c4b3974d6fe56d77c8ebe3a0a&code=7597a7b2e4d0c50ac96db0cefca6f30448bda50049bf3307ab4a5e4e030afb88796788528826034928717be5efd3da6e"
 2020/12/23 16:59:59 [error] 22636#0: *380224133 no live upstreams while connecting to upstream, client: 100.117.86.51, server: aa.code.com, request: "GET /api/competition/user/public/getCompPage?id=c6a4761c4b3974d6fe56d77c8ebe3a0a HTTP/1.1", upstream: "http://aa_xes/api/competition/user/public/getCompPage?id=c6a4761c4b3974d6fe56d77c8ebe3a0a", host: "aa.code.com", referrer: "https://aa.code.com/?id=c6a4761c4b3974d6fe56d77c8ebe3a0a&code=443082759c2302f47aa47b7c0a92bed72bc1f473c4da309c7e5650e7d3e662e1"
Error entries in access.log
 {"@timestamp":"2020-12-23T17:56:29+08:00","cookie_id":"-","client_ip":"100.117.86.102","remote_user":"-","request_method":"GET","domain":"aa.code.com","user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36","xff":"218.79.230.248","upstream_addr":"172.16.1.52:8380","upstream_response_time":"0.017","request_time":"0.017","size":"113","idc_tag":"tjtx","status":"500","upstream_status":"500","host":"bcy-nginx01","via":"-","protocol":"http","request_uri":"/api/competition/process/student/detail?competitionId=127","http_referer":"https://aa.code.com/?id=c6a4761c4b3974d6fe56d77c8ebe3a0a&code=3e2418e23c5b47051dfd61ce13b6f5916c7ff04509bfb837ebb7d08b2d85912f"}
Error log in the service
 2020-12-23 18:12:50.355[ERROR][  XNIO-1 task-4]       c.t.c.w.e.GlobalExceptionHandler        : [Handle_Exception] - java.lang.NullPointerException
         at com.tal.competition.service.CompetitionProcessItemService.lambda$findStudentProcesses$2(CompetitionProcessItemService.java:385)
         at java.base/java.util.ArrayList.forEach(ArrayList.java:1507)
         at com.tal.competition.service.CompetitionProcessItemService.findStudentProcesses(CompetitionProcessItemService.java:332)
First, understand upstream
  upstream jingsai_xes {
       server 172.16.1.53:8380;
       server 172.16.1.51:8380;
       server 172.16.1.48:8380;
       server 172.16.1.49:8380;
       server 172.16.1.47:8380;
       server 172.16.1.52:8380;
       server 172.16.1.44:8380;
       server 172.16.1.50:8380;
       server 172.16.1.45:8380;
       server 172.16.1.46:8380;
       check interval=2000 rise=2 fall=1 timeout=1000 type=tcp port=8380;
  }
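All of these are the configured upstream servers. Meaning of proxy_next_upstream: if an idempotent request fails on one machine, nginx passes it to the next server in the upstream.
Understanding the max_fails parameter: max_fails defaults to 1 and fail_timeout defaults to 10 seconds. In other words, by default, once a backend server has 1 failed attempt within 10 seconds it is treated as faulty and marked unavailable, and nginx waits 10 seconds before sending requests to it again. The conditions error, timeout and invalid_header always count as failed attempts, whereas http_500, http_502, http_503 and http_504 count only when they are listed in proxy_next_upstream.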
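The interaction of these two mechanisms can be sketched with a small simulation. This is a simplified model written for illustration, not nginx's actual implementation: the names Peer, proxy and backend_status are made up here, the defaults max_fails=1 and fail_timeout=10 are the ones described above, and the model assumes that any status in retry_on (standing in for the statuses listed in proxy_next_upstream) counts as a failed attempt. Real nginx does more bookkeeping than this, which the sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    """One upstream server with passive failure accounting (simplified model)."""
    addr: str
    max_fails: int = 1          # default described in the text
    fail_timeout: float = 10.0  # seconds, default described in the text
    fails: int = 0
    last_fail: float = 0.0

    def available(self, now: float) -> bool:
        # A peer is skipped while it has reached max_fails within fail_timeout.
        if self.fails >= self.max_fails and now - self.last_fail <= self.fail_timeout:
            return False
        return True

    def record_failure(self, now: float) -> None:
        # Start a fresh count once the previous window has expired.
        if now - self.last_fail > self.fail_timeout:
            self.fails = 0
        self.fails += 1
        self.last_fail = now


def proxy(peers, now, backend_status, retry_on=frozenset({500})):
    """Try each available peer in turn and return the status the client sees."""
    live = [p for p in peers if p.available(now)]
    if not live:
        return 502  # the "no live upstreams" case from error.log
    status = 502
    for peer in live:
        status = backend_status(peer)  # hypothetical callback: peer -> HTTP status
        if status not in retry_on:
            return status
        peer.record_failure(now)  # a retried status counts as a failed attempt
    return status
```

With ten peers and a backend that always answers 500 for one URL, the first call to proxy() tries every peer and marks each one as failed. A second call within 10 seconds finds no available peer and returns 502, even for a request that would have succeeded. With retry_on empty, the 500 goes straight back to the client and no peer is marked as failed.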
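Reference: http://blog.chinaunix.net/uid-29580597-id-4415903.html
check interval=2000 rise=2 fall=1 timeout=1000 type=tcp port=8098;
# Every node in the load-balancing pool is checked once every 2 seconds. After 2 successful checks the realserver is marked up; after 1 failed check it is marked down. The health check times out after 1 second, and the check type is a TCP connection.
Adding this option changes the directive to: proxy_next_upstream error timeout http_500 non_idempotent; and the problem was finally solved.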
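This means that requests using methods such as POST, LOCK and PATCH, which are not idempotent on the server, are not retried by default; to retry them anyway, this option has to be added.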
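The retry rule for non-idempotent methods can be written down as a small predicate. This is only an illustrative sketch of the rule as described in the nginx documentation; the function name may_retry and its parameters are made up for this example.

```python
# Methods that nginx treats as non-idempotent for proxy_next_upstream.
NON_IDEMPOTENT_METHODS = frozenset({"POST", "LOCK", "PATCH"})


def may_retry(method: str, request_sent: bool, non_idempotent: bool = False) -> bool:
    """Return True if a failed request may be passed to the next server.

    request_sent:   whether the request was already sent to a server.
    non_idempotent: whether the non_idempotent option is set.
    """
    if method.upper() in NON_IDEMPOTENT_METHODS and request_sent:
        return non_idempotent
    return True
```

The requests in the logs above are all GET, so this rule does not block their retries; it only matters for POST, LOCK and PATCH.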
Reference: https://zhuanlan.zhihu.com/p/35803906
Scenario in which a single nginx considers all servers failed:
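Calls to /api/competition/process/student/detail?competitionId=127 reaching this machine within 10 seconds.
Root cause:
A request for competitionId=127 enters nginx and is forwarded to one server, which returns 500; nginx then retries the request on every other machine in turn. If this request reaches the machines twice within 10 seconds, nginx considers all machines unavailable and reports no live upstreams while connecting to upstream.
Why does only the competition service have this problem?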
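The other services do not have http_500 configured in proxy_next_upstream.
errlog statistics for the minute 5:00
nginx 01
Only seconds 7-14 had no "no live upstreams while connecting to upstream" errors,
and the access log shows that requests did reach nginx during seconds 07-08.
nginx 02
Seconds 16-20 had no "no live upstreams while connecting to upstream" errors.
nginx 03
Seconds 02-06, 29-32, 41-45 and 54-60 had no "no live upstreams while connecting to upstream" errors.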
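A per-second tally like the one above can be produced with a short script. The sketch below is one possible way to do it, not the method used for the figures above. The function names quiet_seconds and as_ranges are made up for this example, and it assumes each error.log entry sits on a single line that starts with a timestamp such as "2020/12/23 16:59:59 [error]".

```python
import re
from collections import Counter

# Timestamp at the start of an nginx error.log line,
# e.g. "2020/12/23 16:59:59 [error] ...".
TS = re.compile(r"^\s*(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):(\d{2}) \[error\]")


def quiet_seconds(lines, minute, needle="no live upstreams"):
    """Return the seconds (0-59) of `minute` with no line containing `needle`.

    `minute` uses the log's own format, e.g. "2020/12/23 16:59".
    """
    hits = Counter()
    for line in lines:
        m = TS.match(line)
        if m and m.group(1) == minute and needle in line:
            hits[int(m.group(2))] += 1
    return [s for s in range(60) if hits[s] == 0]


def as_ranges(seconds):
    """Collapse a sorted list of seconds into (start, end) runs."""
    runs = []
    for s in seconds:
        if runs and s == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], s)  # extend the current run
        else:
            runs.append((s, s))          # start a new run
    return runs
```

Feeding it the lines of one nginx instance's error.log (for example, by iterating over the opened file) and passing the result to as_ranges() gives the quiet intervals for that instance, which can then be compared against the access log.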
Configuration change
proxy_next_upstream_tries 3
The status code returned is 502
nginx configuration before the change:
    server {
        listen        80;
         server_name   aa.code.com;
         proxy_next_upstream   http_500  http_502 http_503 http_504 error timeout invalid_header;
         proxy_set_header Accept-Encoding "";
         client_max_body_size 64M;
         access_log      /data/log/nginx/aa-access.log main_json;
         location / {
             proxy_read_timeout      3600;
             proxy_connect_timeout   300;
             proxy_set_header Host $http_host;
             proxy_set_header X-Real-IP $remote_addr;
             proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
             proxy_set_header X-Forwarded-Proto https;
             root /data/webroot/cp-website/;
        }
    }
nginx configuration after the change (http_500 removed from proxy_next_upstream)
    server {
        listen        80;
         server_name   aa.code.com;
         proxy_next_upstream     http_502 http_503 http_504 error timeout invalid_header;
         proxy_set_header Accept-Encoding "";
         client_max_body_size 64M;
         access_log      /data/log/nginx/aa-access.log main_json;
         location / {
             proxy_read_timeout      3600;
             proxy_connect_timeout   300;
             proxy_set_header Host $http_host;
             proxy_set_header X-Real-IP $remote_addr;
             proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
             proxy_set_header X-Forwarded-Proto https;
             root /data/webroot/cp-website/;
         }