
Retinanet evaluation spikes memory usage on TPUs, crashes training #10528

Closed
@jacob-zietek

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • I checked to make sure that this issue has not been filed already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/r2.8.0/official/vision/beta/train.py
https://github.com/tensorflow/models/blob/r2.8.0/official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml

2. Describe the bug

There are exponentially increasing memory spikes on TPUs during the training and evaluation of Retinanet, which eventually cause a crash during training. This bug was found while working on the beta project yolov4-tiny. I observed that training needed to be restarted frequently, and the logs contained thousands of lines of...

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 290, in __del__
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_resource_variable_ops.py", line 257, in destroy_resource_op
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 7186, in raise_from_not_ok_status
tensorflow.python.framework.errors_impl.AbortedError: Unable to find a context_id matching the specified one (12469621923045235436). Perhaps the worker was restarted, or the context was GC'd? [Op:DestroyResourceOp]
Exception ignored in: <function EagerResourceDeleter.__del__ at 0x7f58d152dc80>

in every training log file. Retinanet has the same issue. This crashing was observed on both v3-8 and v2-256 TPUs.

This bug is apparent in both the training output file (stderr and stdout) and the GCP TPU Dashboard charts. The output file shows the crash happening during evaluation, and the memory spikes on the GCP TPU Dashboard occur only during evaluation. I provide the logs and pictures of the TPU memory usage in the additional context section. This bug was observed in "train_and_eval" mode; it does not occur in "train" mode.
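
Since the spikes only happen during evaluation, a possible stopgap (not a fix) is to run training and evaluation as separate jobs. The commands below are only a sketch: they assume train.py's --mode flag also accepts 'train' and 'eval' on their own, and {CONFIG_FILE_HERE} stands for the config path used in the reproduction steps below.

# train-only job (memory spikes were not observed in 'train' mode)
nohup python3 -m official.vision.beta.train --mode=train --experiment=retinanet_resnetfpn_coco --model_dir={MODEL_DIR_HERE} --config_file={CONFIG_FILE_HERE} --tpu={TPU_NAME_HERE} > ../retinanet_train.txt &
# separate evaluation job reading checkpoints from the same model_dir
# (it would need to target a different TPU, or run after training finishes)
nohup python3 -m official.vision.beta.train --mode=eval --experiment=retinanet_resnetfpn_coco --model_dir={MODEL_DIR_HERE} --config_file={CONFIG_FILE_HERE} --tpu={EVAL_TPU_NAME_HERE} > ../retinanet_eval.txt &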

3. Steps to reproduce

Create a v3-8 TPU with version 2.8.0.
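
For reference, the TPU node can be created with something like the following (a sketch; {ZONE_HERE} is a placeholder and the exact --version string for the TF 2.8 runtime may differ):

gcloud compute tpus create {TPU_NAME_HERE} --zone={ZONE_HERE} --accelerator-type=v3-8 --version=2.8.0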

Create a new GCP Compute Engine VM with the disk image Debian GNU/Linux 10 Buster + TF 2-8-0 and SSH into it.

git clone https://github.com/tensorflow/models.git
cd models
git checkout r2.8.0
pip3 install -r official/requirements.txt

Install the COCO dataset; I used a GCP bucket to store mine.
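
For completeness, this is roughly how a COCO copy can be staged in a bucket (a sketch: {YOUR_BUCKET_HERE} is a placeholder, and the TFRecord conversion itself is done with the Model Garden's COCO conversion script, whose exact path and flags should be checked in your checkout):

# download COCO 2017 images and annotations
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
# ... unzip, then convert to TFRecord shards named train* / val* ...
# copy the resulting shards to a GCS bucket
gsutil -m cp coco_tfrecords/train* gs://{YOUR_BUCKET_HERE}/coco/
gsutil -m cp coco_tfrecords/val* gs://{YOUR_BUCKET_HERE}/coco/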

Modify the official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml config to use your COCO dataset instead of tfds (I did this because the dataset was already stored in one of my buckets). It should look like...

runtime:
  distribution_strategy: 'tpu'
  mixed_precision_dtype: 'bfloat16'
task:
  annotation_file: ''  # Can't use annotation file when tfds is used.
  losses:
    l2_weight_decay: 0.0001
  model:
    num_classes: 91
    max_level: 7
    min_level: 3
    input_size: [640, 640, 3]
    norm_activation:
      activation: relu
      norm_epsilon: 0.001
      norm_momentum: 0.99
      use_sync_bn: true
  train_data:
    # tfds_name: 'coco/2017'
    # tfds_split: 'train'
    drop_remainder: true
    dtype: bfloat16
    global_batch_size: 256
    input_path: 'gs://cam2-datasets/coco/train*'
    is_training: true
    shuffle_buffer_size: 1000
  validation_data:
    # tfds_name: 'coco/2017'
    # tfds_split: 'validation'
    drop_remainder: true
    dtype: bfloat16
    global_batch_size: 8
    input_path: 'gs://cam2-datasets/coco/val*'
    is_training: false
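
(Alternatively, instead of editing the YAML in place, the two input paths can probably be overridden on the command line; this is a sketch assuming train.py exposes the common --params_override flag and that it accepts comma-separated key=value overrides:)

python3 -m official.vision.beta.train --mode=train_and_eval --experiment=retinanet_resnetfpn_coco --model_dir={MODEL_DIR_HERE} --config_file=~/models/official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml --params_override="task.train_data.input_path=gs://{YOUR_BUCKET_HERE}/coco/train*,task.validation_data.input_path=gs://{YOUR_BUCKET_HERE}/coco/val*" --tpu={TPU_NAME_HERE}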

In ~/models run the training script...

nohup python3 -m official.vision.beta.train --mode=train_and_eval --experiment=retinanet_resnetfpn_coco --model_dir={MODEL_DIR_HERE} --config_file=~/models/official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml --tpu={TPU_NAME_HERE} > ../retinanet.txt &

Mine looked like...

nohup python3 -m official.vision.beta.train --mode=train_and_eval --experiment=retinanet_resnetfpn_coco --model_dir=gs://cam2-models/new-yolov4-tiny/retinanet/ --config_file=/home/cam2tensorflow/working/models/official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml --tpu=tf-yolo-1 > ../retinanet.txt &

You will see in ../retinanet.txt that the training crashes with errors, and in the TPU monitoring dashboard you should see exponentially increasing spikes in memory usage during evaluation.
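
A quick way to confirm the failure mode in the log is to count the recurring error lines from the traceback above, e.g.:

grep -c "Unable to find a context_id" ../retinanet.txt
grep -c "DestroyResourceOp" ../retinanet.txt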

4. Expected behavior

The training should run all the way through without crashing due to memory issues on the TPU.

5. Additional context

  • TPU memory usage in the GCP TPU Dashboard
  • Retinanet training log after one crash

6. System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Linux Debian 10.11
    Debian GNU/Linux 10 Buster + TF 2-8-0 Disk on GCP
  • TensorFlow installed from (source or binary): Pre-installed on disk
  • TensorFlow version (use command below): 2.8.0
  • Python version: 3.7.3
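
The version command referred to above is the usual one-liner:

python3 -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"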

Labels

  • models:official (models that come under the official repository)
  • stat:awaiting response (waiting on input from the contributor)
  • type:bug (bug in the code)
