In training programs for human simultaneous interpreters, trainee speech is transcribed into text for quality assessment. Although interpreter speech contains irregular speech events such as hesitations, filled pauses, and self-repairs, automatic speech recognition (ASR) is expected to be introduced to reduce the labor of transcription. In these training programs, the source language text is available to the ASR system because the training materials are prepared in advance. We therefore propose a Transformer-based end-to-end ASR model with an auxiliary source language text input for transcribing simultaneous interpretation. Because a sufficient amount of interpreter speech paired with source language text is not available for training the model, we conducted an initial evaluation by simulating such paired data, repurposing the inputs and outputs of large-scale corpora developed for end-to-end speech translation (ST). Our proposed model significantly reduced word error rates (WERs) on four ST corpora: MuST-C English speech - Dutch text, MuST-C English speech - German text, CoVoST 2 English speech - Japanese text, and our original TED-based English speech - Japanese text corpus.
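One plausible way to wire an auxiliary source language text input into an encoder-decoder ASR model is to encode the text with a separate encoder and let the decoder attend over the concatenation of the speech-encoder and text-encoder states. The NumPy sketch below illustrates only this attention-level fusion; the tensor shapes, the concatenation strategy, and all variable names are assumptions for illustration, not the exact architecture described above.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Standard scaled dot-product attention: q (Tq, d), k/v (Tk, d)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 16
speech_states = rng.normal(size=(50, d))  # hypothetical speech-encoder outputs
text_states = rng.normal(size=(20, d))    # hypothetical source-text encoder outputs

# Fuse the two encoders by concatenating their states along the time axis,
# so decoder cross-attention can draw on both modalities.
memory = np.concatenate([speech_states, text_states], axis=0)  # (70, d)

dec_query = rng.normal(size=(1, d))  # one decoder step's query vector
context = scaled_dot_product_attention(dec_query, memory, memory)
print(context.shape)  # (1, 16)
```

In a full model the concatenated memory would feed every decoder layer's cross-attention; a gating or separate cross-attention per encoder is an equally common fusion choice.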