
Retinanet evaluation spikes memory usage on TPUs, crashes training #10528

Closed
@jacob-zietek

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • I checked to make sure that this issue has not been filed already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/r2.8.0/official/vision/beta/train.py
https://github.com/tensorflow/models/blob/r2.8.0/official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml

2. Describe the bug

There are exponentially increasing memory spikes on TPUs during the training and evaluation of Retinanet, which eventually cause a crash during training. This bug was found while working on the beta project yolov4-tiny. I observed that training needed to be restarted frequently, and the logs contained thousands of lines of...

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 290, in __del__
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_resource_variable_ops.py", line 257, in destroy_resource_op
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 7186, in raise_from_not_ok_status
tensorflow.python.framework.errors_impl.AbortedError: Unable to find a context_id matching the specified one (12469621923045235436). Perhaps the worker was restarted, or the context was GC'd? [Op:DestroyResourceOp]
Exception ignored in: <function EagerResourceDeleter.__del__ at 0x7f58d152dc80>

in every training log file. Retinanet has the same issue. This crashing was observed on both v3-8 and v2-256 TPUs.

This bug is apparent in both the training output file (stderr and stdout) and the GCP TPU Dashboard charts. The output file shows the crash happening during evaluation, and the memory spikes on the GCP TPU Dashboard occur only during evaluation. I provide the logs and pictures of the TPU memory usage in the additional context section. This bug was observed in "train_and_eval" mode; it does not occur in "train" mode.
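
Since the spikes only happen during evaluation, a possible stopgap (not a fix) is to run training and evaluation as separate jobs. The commands below are only a sketch: they assume train.py's --mode flag also accepts 'train' and 'eval' on their own, and {CONFIG_FILE_HERE} stands for the config path used in the reproduction steps below.

# train-only job (memory spikes were not observed in 'train' mode)
nohup python3 -m official.vision.beta.train --mode=train --experiment=retinanet_resnetfpn_coco --model_dir={MODEL_DIR_HERE} --config_file={CONFIG_FILE_HERE} --tpu={TPU_NAME_HERE} > ../retinanet_train.txt &
# separate evaluation job reading checkpoints from the same model_dir
# (it would need to target a different TPU, or run after training finishes)
nohup python3 -m official.vision.beta.train --mode=eval --experiment=retinanet_resnetfpn_coco --model_dir={MODEL_DIR_HERE} --config_file={CONFIG_FILE_HERE} --tpu={EVAL_TPU_NAME_HERE} > ../retinanet_eval.txt &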

3. Steps to reproduce

Create a v3-8 TPU with version 2.8.0.
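
For reference, the TPU node can be created with something like the following (a sketch; {ZONE_HERE} is a placeholder and the exact --version string for the TF 2.8 runtime may differ):

gcloud compute tpus create {TPU_NAME_HERE} --zone={ZONE_HERE} --accelerator-type=v3-8 --version=2.8.0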

Create a new GCP Compute Engine VM with the disk image Debian GNU/Linux 10 Buster + TF 2-8-0 and SSH into it.

git clone https://github.com/tensorflow/models.git
cd models
git checkout r2.8.0
pip3 install -r official/requirements.txt

Install the COCO dataset; I used a GCP bucket to store mine.
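
For completeness, this is roughly how a COCO copy can be staged in a bucket (a sketch: {YOUR_BUCKET_HERE} is a placeholder, and the TFRecord conversion itself is done with the Model Garden's COCO conversion script, whose exact path and flags should be checked in your checkout):

# download COCO 2017 images and annotations
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
# ... unzip, then convert to TFRecord shards named train* / val* ...
# copy the resulting shards to a GCS bucket
gsutil -m cp coco_tfrecords/train* gs://{YOUR_BUCKET_HERE}/coco/
gsutil -m cp coco_tfrecords/val* gs://{YOUR_BUCKET_HERE}/coco/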

Modify the official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml config to use your COCO dataset instead of tfds (I did this because the dataset was already stored in one of my buckets). It should look like...

runtime:
  distribution_strategy: 'tpu'
  mixed_precision_dtype: 'bfloat16'
task:
  annotation_file: ''  # Can't use annotation file when tfds is used.
  losses:
    l2_weight_decay: 0.0001
  model:
    num_classes: 91
    max_level: 7
    min_level: 3
    input_size: [640, 640, 3]
    norm_activation:
      activation: relu
      norm_epsilon: 0.001
      norm_momentum: 0.99
      use_sync_bn: true
  train_data:
    # tfds_name: 'coco/2017'
    # tfds_split: 'train'
    drop_remainder: true
    dtype: bfloat16
    global_batch_size: 256
    input_path: 'gs://cam2-datasets/coco/train*'
    is_training: true
    shuffle_buffer_size: 1000
  validation_data:
    # tfds_name: 'coco/2017'
    # tfds_split: 'validation'
    drop_remainder: true
    dtype: bfloat16
    global_batch_size: 8
    input_path: 'gs://cam2-datasets/coco/val*'
    is_training: false
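
(Alternatively, instead of editing the YAML in place, the two input paths can probably be overridden on the command line; this is a sketch assuming train.py exposes the common --params_override flag and that it accepts comma-separated key=value overrides:)

python3 -m official.vision.beta.train --mode=train_and_eval --experiment=retinanet_resnetfpn_coco --model_dir={MODEL_DIR_HERE} --config_file=~/models/official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml --params_override="task.train_data.input_path=gs://{YOUR_BUCKET_HERE}/coco/train*,task.validation_data.input_path=gs://{YOUR_BUCKET_HERE}/coco/val*" --tpu={TPU_NAME_HERE}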

In ~/models run the training script...

nohup python3 -m official.vision.beta.train --mode=train_and_eval --experiment=retinanet_resnetfpn_coco --model_dir={MODEL_DIR_HERE} --config_file=~/models/official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml --tpu={TPU_NAME_HERE} > ../retinanet.txt &

Mine looked like...

nohup python3 -m official.vision.beta.train --mode=train_and_eval --experiment=retinanet_resnetfpn_coco --model_dir=gs://cam2-models/new-yolov4-tiny/retinanet/ --config_file=/home/cam2tensorflow/working/models/official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml --tpu=tf-yolo-1 > ../retinanet.txt &

You will see in ../retinanet.txt that the training crashes with errors, and in the TPU monitoring dashboard you should see exponentially increasing spikes in memory usage during evaluation.
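
A quick way to confirm the failure mode in the log is to count the recurring error lines from the traceback above, e.g.:

grep -c "Unable to find a context_id" ../retinanet.txt
grep -c "DestroyResourceOp" ../retinanet.txt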

4. Expected behavior

The training should run all the way through without crashing due to memory issues on the TPU.

5. Additional context

  • TPU memory usage in the GCP TPU Dashboard
  • Retinanet training log after one crash

6. System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Linux Debian 10.11
    Debian GNU/Linux 10 Buster + TF 2-8-0 Disk on GCP
  • TensorFlow installed from (source or binary): Pre-installed on disk
  • TensorFlow version (use command below): 2.8.0
  • Python version: 3.7.3
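
The version command referred to above is the usual one-liner:

python3 -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"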

Labels

  • models:official (models that come under the official repository)
  • stat:awaiting response (waiting on input from the contributor)
  • type:bug (bug in the code)
