This paper describes our USTC_NELSLIP system submitted to the Open Automatic Speech Recognition Challenge (OpenASR21) for the Constrained condition, where only a 10-hour speech dataset is allowed for training, while the use of additional text data is unrestricted. To improve low-resource speech recognition performance, we collect external text data for language modeling and train a text-to-speech (TTS) model to generate speech-text paired data. Our system is built on a conventional hybrid framework, in which multiple subsystems are developed using different acoustic neural network architectures and different data augmentation methods. System fusion is then employed to obtain the final result. Experiments on the OpenASR21 challenge show that the proposed system achieves the best performance on all test languages.