SPEECH API v2
The SPEECH API is a WebSocket-based API that allows you to transcribe, translate, and dub audio streams in real time. This guide will help you get started with the basics of using the SPEECH API:
- Create sessions
- Send audio data over a secure WebSocket (WS) channel
- Receive transcript results over a secure WS channel
- Receive translation results over a secure WS channel
- Receive voiceover results over a secure WS channel
TypeScript SDK
For a more convenient way to interact with the API, we provide a TypeScript SDK that can be installed via npm:
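The install command below is a sketch; the published package name is not given here, so replace the placeholder with the SDK's actual name:

```shell
# Replace <sdk-package-name> with the SDK's published npm package name.
npm install <sdk-package-name>
```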
This SDK provides types for all WebSocket messages. For example, when receiving a WebSocket message, you can use the following code to parse it:
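As a minimal sketch (the SDK's actual exported type and function names may differ; the message shape below is an assumption based on the events described in this guide):

```typescript
// Illustrative sketch; the SDK's real exported types may be named differently.
// import type { ViewerMessage } from "<sdk-package-name>";

// Parse an incoming WebSocket text frame into a typed message.
function parseViewerMessage(raw: string): {
  events: { kind: string; payload: Record<string, unknown> }[];
} {
  return JSON.parse(raw);
}

// Example with a hand-written frame:
const msg = parseViewerMessage(
  '{"events":[{"kind":"partial","payload":{"text":"Hel"}}]}'
);
```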
Session Management
Speech sessions can be managed either through the VIDEO.TAXI Web interface or via the GraphQL API. The full API description can be found at https://service.video.taxi/graphiql.
List Existing Sessions
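As an illustrative sketch only — the operation and field names here are assumptions, so verify them against the schema at https://service.video.taxi/graphiql:

```graphql
# Hypothetical query shape; field names may differ in the real schema.
query {
  speechSessions {
    id
    name
  }
}
```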
Retrieve Session Details
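A hedged sketch of fetching one session; the operation name, argument, and fields are assumptions to be checked against the real schema:

```graphql
# Hypothetical query; replace SESSION_ID and verify field names in GraphiQL.
query {
  speechSession(id: "SESSION_ID") {
    id
    name
    senderUrl
    viewerUrl
  }
}
```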
Create a New Session
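A sketch of what a creation mutation might look like; the mutation name, arguments, and returned fields are assumptions, not the authoritative schema:

```graphql
# Hypothetical mutation; consult https://service.video.taxi/graphiql
# for the actual mutation name and arguments.
mutation {
  createSpeechSession(name: "My session") {
    id
    senderUrl
    viewerUrl
  }
}
```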
WebSocket Messages
Sender
Once the sender URL has been obtained, clients can send audio data over it. Virtually every streaming container format supported by FFmpeg is accepted, such as WebM and MPEG-TS. After establishing the WS connection, simply send your binary audio frames through it.
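A sketch of a sender client using the standard WebSocket API; the sender URL and chunk size are placeholders, and the chunking helper is only an illustration of sending binary frames:

```typescript
// Split a buffer into binary frames of at most `size` bytes.
function chunk(buf: Uint8Array, size: number): Uint8Array[] {
  const frames: Uint8Array[] = [];
  for (let i = 0; i < buf.length; i += size) {
    frames.push(buf.subarray(i, i + size));
  }
  return frames;
}

// Connect to the sender URL (placeholder) and stream audio frames.
function streamAudio(senderUrl: string, audio: Uint8Array): void {
  const ws = new WebSocket(senderUrl);
  ws.binaryType = "arraybuffer";
  ws.addEventListener("open", () => {
    // 16 KiB frames are an arbitrary illustrative choice.
    for (const frame of chunk(audio, 16 * 1024)) {
      ws.send(frame);
    }
    // Closing the socket signals the end of the audio stream.
  });
}
```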
Viewer
Once the viewer URL has been obtained, clients can connect to the socket to receive transcription events. Every message follows the same base format:
For example:
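The shape below is an assumed sketch of the base format described above — a message carrying a list of events, each with a kind and a kind-specific payload — not the authoritative schema; field names inside the payloads are illustrative:

```typescript
// Assumed base shape: every message carries a list of events.
interface SpeechEvent {
  kind: string;                     // e.g. "partial", "transcript", ...
  payload: Record<string, unknown>; // kind-specific fields
}
interface SpeechMessage {
  events: SpeechEvent[];
}

// Illustrative message grouping a final transcript with its translation:
const message: SpeechMessage = {
  events: [
    {
      kind: "transcript",
      payload: {
        identifier: "abc-1",
        text: "Hello world.",
        latency: 842.5,
        from_ms: 0,
        to_ms: 1200,
      },
    },
    {
      kind: "translation",
      payload: { identifier: "abc-1", text: "Hallo Welt." },
    },
  ],
};
```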
The list of events does not have a fixed length. Events are usually grouped together for the UI to render changes cohesively. Here are the event kinds with their descriptions:
- partial: A temporary transcript.
- transcript: A final transcript.
- translation: The translation of a transcript.
- voiceover: The playback link of a translation.
- end_of_stream: Indicates that the session is temporarily closed.
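A handler over these event kinds might look like the following sketch (the event shape and payload field names are assumptions):

```typescript
// Minimal dispatcher over the event kinds listed above.
type SpeechEvent = { kind: string; payload: Record<string, unknown> };

function describe(event: SpeechEvent): string {
  switch (event.kind) {
    case "partial":
      return `partial: ${event.payload.text}`; // temporary transcript
    case "transcript":
      return `final: ${event.payload.text}`; // final transcript
    case "translation":
      return `translation: ${event.payload.text}`;
    case "voiceover":
      return `voiceover at ${event.payload.url}`; // field name assumed
    case "end_of_stream":
      return "session temporarily closed";
    default:
      return `unknown event kind: ${event.kind}`; // stay forward-compatible
  }
}
```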
Notes:
- latency is the processing latency of the pipeline, measured from the time the sentence was spoken until it was completely transcribed, expressed in milliseconds (float).
- The identifier in the payload of translation and voiceover events always refers to the originating transcript event; a transcript and its derived translation and voiceover therefore all share the same identifier.
- from_ms and to_ms indicate the time range within the playback during which the sentence was heard.
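Because translation and voiceover events carry the identifier of their originating transcript, related events can be grouped together, as in this sketch (payload field name as described above; the event shape is assumed):

```typescript
type SpeechEvent = {
  kind: string;
  payload: { identifier?: string } & Record<string, unknown>;
};

// Group transcript-derived events by their shared identifier.
function groupByIdentifier(events: SpeechEvent[]): Map<string, SpeechEvent[]> {
  const groups = new Map<string, SpeechEvent[]>();
  for (const ev of events) {
    const id = ev.payload.identifier;
    if (typeof id !== "string") continue; // e.g. end_of_stream has no identifier
    const bucket = groups.get(id) ?? [];
    bucket.push(ev);
    groups.set(id, bucket);
  }
  return groups;
}
```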