SPEECH API v2
The SPEECH API is a WebSocket-based API that allows you to transcribe, translate, and dub audio streams in real time. This guide covers the basics of using the SPEECH API:
- Create sessions
- Send audio data over a secure WebSocket (WS) channel
- Receive transcript results over a secure WS channel
- Receive translation results over a secure WS channel
- Receive voiceover results over a secure WS channel
TypeScript SDK
For a more convenient way to interact with the API, we provide a TypeScript SDK that can be installed via npm:
npm install @tv1-eu/videotaxi-api-typings
This SDK provides types for all WebSocket messages. For example, when receiving a WebSocket message, you can use the following code to parse it:
import { WebSocketMessage } from '@tv1-eu/videotaxi-api-typings';

// setup websockets, not included here
viewerSocket.onmessage = (event: MessageEvent) => {
  const message: WebSocketMessage = JSON.parse(event.data);
  // do something with it
};
Session Management
Speech sessions can be managed either through the VIDEO.TAXI Web interface or via the GraphQL API. The full API description can be found at https://service.video.taxi/graphiql.
List Existing Sessions
query {
  speechRooms(limit: 5, offset: 0) {
    id
    name
    translationLanguages
  }
}
Retrieve Session Details
query {
  speechRoom(id: "session-id") {
    id
    name
    translationLanguages
    # URL to the web interface for following the transcript via a browser.
    viewerWebUrl
    # WebSocket URL for a custom UI. Must be opened within 15 minutes.
    viewerSocketUrl(languageCode: "en-US", enable_voiceover: true)
    # WebSocket URL for transmitting audio. Must be opened within 15 minutes.
    masterSocketUrl(languageCode: "it")
  }
}
Create a New Session
mutation {
  createSpeechRoom(
    name: "Live-Event",
    # Languages available for translation. The original language is always included.
    translationLanguages: ["it", "de", "en-US"]
  ) {
    id
    name
  }
}
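The queries and mutations above could be sent from code roughly as follows. This is only a sketch: `buildGraphQLRequest` is a hypothetical helper, and both the bearer-token `Authorization` header and the use of a plain JSON POST are assumptions that this guide does not confirm. The endpoint URL is deliberately left to the caller, since only the GraphiQL page is documented here.

```typescript
// Hypothetical helper: builds fetch() options for a GraphQL POST request.
interface GraphQLRequestInit {
  method: "POST";
  headers: { [headerName: string]: string };
  body: string;
}

function buildGraphQLRequest(
  query: string,
  variables: { [key: string]: unknown } = {},
  authToken?: string, // assumption: token-based authentication
): GraphQLRequestInit {
  const headers: { [headerName: string]: string } = {
    "Content-Type": "application/json",
  };
  if (authToken) {
    headers["Authorization"] = `Bearer ${authToken}`;
  }
  // The conventional GraphQL-over-HTTP body is a JSON object
  // carrying the query text and its variables.
  return { method: "POST", headers, body: JSON.stringify({ query, variables }) };
}

const createRoomMutation = `
  mutation {
    createSpeechRoom(name: "Live-Event", translationLanguages: ["it", "de", "en-US"]) {
      id
      name
    }
  }`;

const createRoomRequest = buildGraphQLRequest(createRoomMutation);

// Usage, with endpointUrl supplied by the caller (not documented in this guide):
// const response = await fetch(endpointUrl, createRoomRequest);
// const { data, errors } = await response.json();
```

Keeping the request construction separate from the network call makes it easy to check the payload before anything is sent.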
WebSocket Messages
Sender
Once the sender URL (masterSocketUrl) has been obtained, clients can send audio data over it. Virtually every streaming container format supported by FFmpeg is accepted, such as WebM and MPEG-TS. After establishing the WS connection, send your audio data through it as binary messages.
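As an illustration, the sending side might look like the sketch below. The names `BinarySender`, `splitIntoChunks`, and `sendAudio` are hypothetical, and the 4096-byte chunk size is an arbitrary assumption: this guide does not state a required or maximum message size. The input is assumed to be audio already packaged in a supported container such as WebM.

```typescript
// Minimal shape of the socket we need; a browser WebSocket satisfies it.
interface BinarySender {
  send(data: Uint8Array): void;
}

// Splits a buffer into consecutive chunks of at most `chunkSize` bytes.
function splitIntoChunks(data: Uint8Array, chunkSize: number): Uint8Array[] {
  if (chunkSize <= 0) {
    throw new RangeError("chunkSize must be positive");
  }
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < data.length; offset += chunkSize) {
    // subarray() creates a view and clamps the end to the buffer length.
    chunks.push(data.subarray(offset, offset + chunkSize));
  }
  return chunks;
}

// Sends each chunk as one binary message; returns the number of messages sent.
function sendAudio(socket: BinarySender, data: Uint8Array, chunkSize = 4096): number {
  const chunks = splitIntoChunks(data, chunkSize);
  for (const chunk of chunks) {
    socket.send(chunk);
  }
  return chunks.length;
}

// Usage in a browser, assuming masterSocketUrl was fetched via GraphQL:
// const socket = new WebSocket(masterSocketUrl);
// socket.binaryType = "arraybuffer";
// socket.onopen = () => sendAudio(socket, containerBytes);
```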
Viewer
Once the viewer URL (viewerSocketUrl) has been obtained, clients can connect to the socket to receive transcription events. Every message follows the same base format:
{"events": []}
For example:
{
  "events": [
    {
      "kind": "transcript",
      "payload": {
        "id": "BHI67vVr",
        "sentence_id": "abc",
        "text": "So if you like, he was sort of part of the, he was, he was part of the poachers.",
        "latency": 2939.328806,
        "speaker": "S5",
        "created_at": 1720014373412957000,
        "from_ms": 286730.0,
        "to_ms": 291290.0
      }
    },
    {
      "kind": "partial",
      "payload": {
        "text": "He was",
        "latency": 2939.328806
      }
    }
  ]
}
The list of events does not have a fixed length. Events are usually grouped together for the UI to render changes cohesively. Here are the event kinds with their descriptions:
partial: A temporary transcript.
{
  "kind": "partial",
  "payload": {
    "id": "BHI67vVr",
    "text": "So if you like, he was sort of part of the. He was he was part of the poachers and then he became a gamekeeper he was head of britain's",
    "latency": 596.662145
  }
}
transcript: A final transcript.
{
  "kind": "transcript",
  "payload": {
    "id": "BHI67vVr",
    "sentence_id": "abc",
    "text": "So if you like, he was sort of part of the, he was, he was part of the poachers.",
    "latency": 2939.328806,
    "speaker": "S5",
    "created_at": 1720014373412957000,
    "from_ms": 286730.0,
    "to_ms": 291290.0
  }
}
translation: The translation of a transcript.
{
  "kind": "translation",
  "payload": {
    "id": "BHI67vVr",
    "sentence_id": "abc",
    "text": "Wenn man so will, war er also Teil der Wilderer.",
    "original": "So if you like, he was sort of part of the, he was, he was part of the poachers.",
    "latency": 3067.328804,
    "speaker": "S5",
    "created_at": 1720014373412957000,
    "from_ms": 286730.0,
    "to_ms": 291290.0
  }
}
voiceover: The playback link of a translation. DEPRECATED: please use the voice message instead.
{ "kind": "voiceover", "payload": { "id": "BHI67vVr", "text": "Wenn man so will, war er also Teil der Wilderer.", "original": "So if you like, he was sort of part of the, he was, he was part of the poachers.", "latency": 4880.66211, "speaker": "S5", "created_at": 1720014373412957000, "playback_uri": "https://video-taxi-client-data.s3.eu-west-1.amazonaws.com/B8C4D7C4D031EDC3762D7CD2BCA5FACDC432722A.aac?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUYKVZWFGAMK6VTIZ%2F20240703%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20240703T134615Z&X-Amz-Expires=43200&X-Amz-SignedHeaders=host&X-Amz-Signature=2cbf459ffdc9fad5c9b3d3797d120de563c69e876b2490d69a553926701fc516", "from_ms": 286730.0, "to_ms": 291290.0 }}
voice: Includes raw audio data for playback. Each sentence is made up of multiple smaller portions of audio (deltas); seq indicates the order of the audio deltas. audio is base64-encoded (with padding) signed 16-bit little-endian linear PCM (s16le) audio at a sample rate of 24 kHz.
{
  "kind": "voice",
  "payload": {
    "id": "BHI67vVr",
    "sentence_id": "abc",
    "text": "Wenn man so will, war er also Teil der Wilderer.",
    "latency": 4880.66211,
    "speaker": "S5",
    "created_at": 1720014373412957000,
    "audio": "",
    "seq": 1,
    "from_ms": 286730.0,
    "to_ms": 291290.0
  }
}
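Decoding the audio field of voice events could be sketched as follows. The function names and the `VoiceDelta` type are hypothetical, grouping deltas by sentence is left to the caller, and the duration calculation assumes mono audio, which this guide does not state.

```typescript
const VOICE_SAMPLE_RATE_HZ = 24000; // sample rate stated in this guide

// Decodes padded base64 into signed 16-bit little-endian PCM samples.
function decodeVoiceAudio(base64Audio: string): Int16Array {
  const binary = atob(base64Audio);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  const view = new DataView(bytes.buffer);
  const samples = new Int16Array(Math.floor(bytes.length / 2));
  for (let i = 0; i < samples.length; i++) {
    // `true` selects little-endian byte order, independent of the host CPU.
    samples[i] = view.getInt16(i * 2, true);
  }
  return samples;
}

// Only the payload fields this sketch needs.
interface VoiceDelta {
  seq: number;
  audio: string;
}

// Orders the deltas of one sentence by seq and concatenates their samples.
function assembleSentenceAudio(deltas: VoiceDelta[]): Int16Array {
  const ordered = deltas.slice().sort((a, b) => a.seq - b.seq);
  const parts = ordered.map((delta) => decodeVoiceAudio(delta.audio));
  const total = parts.reduce((sum, part) => sum + part.length, 0);
  const joined = new Int16Array(total);
  let offset = 0;
  for (const part of parts) {
    joined.set(part, offset);
    offset += part.length;
  }
  return joined;
}

// Playback duration in milliseconds, assuming a single (mono) channel.
function pcmDurationMs(samples: Int16Array): number {
  return (samples.length / VOICE_SAMPLE_RATE_HZ) * 1000;
}
```

For playback in a browser, the samples would typically be scaled to floats in the range [-1, 1] and copied into a Web Audio buffer created with the 24 kHz sample rate.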
end_of_stream: Indicates that the session is temporarily closed.
{ "kind": "end_of_stream", "payload": { "reason": "normal" }}
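A custom viewer might fold incoming messages into UI state along these lines. The types below are simplified local stand-ins for the SDK typings, and `TranscriptView`, `emptyView`, and `applyMessage` are hypothetical names. Clearing the partial text whenever a final transcript arrives is an assumption about the intended rendering, not documented behavior.

```typescript
// Simplified local types for illustration; the SDK typings are authoritative.
interface SpeechEvent {
  kind: string;
  payload: { text?: string; [key: string]: unknown };
}

interface EventEnvelope {
  events: SpeechEvent[];
}

// Hypothetical state a custom UI might keep.
interface TranscriptView {
  finals: string[];       // final transcripts, in arrival order
  translations: string[]; // translations, in arrival order
  partial: string;        // latest temporary transcript
  ended: boolean;         // true once end_of_stream was seen
}

function emptyView(): TranscriptView {
  return { finals: [], translations: [], partial: "", ended: false };
}

// Applies all events of one message in order and returns a new state,
// so the UI can render the grouped changes in a single update.
function applyMessage(view: TranscriptView, rawMessage: string): TranscriptView {
  const envelope = JSON.parse(rawMessage) as EventEnvelope;
  const next: TranscriptView = {
    finals: view.finals.slice(),
    translations: view.translations.slice(),
    partial: view.partial,
    ended: view.ended,
  };
  for (const ev of envelope.events) {
    const text = ev.payload.text ?? "";
    switch (ev.kind) {
      case "partial":
        next.partial = text;
        break;
      case "transcript":
        next.finals.push(text);
        next.partial = ""; // assumption: a final transcript supersedes the partial
        break;
      case "translation":
        next.translations.push(text);
        break;
      case "end_of_stream":
        next.ended = true;
        break;
      default:
        break; // voice, voiceover, and unknown kinds are ignored in this sketch
    }
  }
  return next;
}
```

With the two-event example message shown earlier, the transcript is appended first and the following partial then becomes the current temporary text.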
Notes:
- latency is the processing latency of the pipeline from the time the sentence was spoken until it was transcribed completely, expressed in milliseconds (float).
- The identifier in the payload of translation and voiceover events always points to the originating transcript event, so all events belonging to the same sentence share the same identifier.
- from_ms and to_ms indicate the timeframe of the playback in which the sentence was heard.
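The notes above suggest a simple way to pair events: match on the shared identifier and derive the sentence duration from from_ms and to_ms. The following is a minimal sketch under that reading; `TimedSentence`, `PairedSentence`, and `pairById` are hypothetical names, and it assumes one translation language per viewer socket.

```typescript
// Only the payload fields this sketch needs.
interface TimedSentence {
  id: string;
  text: string;
  from_ms: number;
  to_ms: number;
}

interface PairedSentence {
  id: string;
  original: string;
  translation: string | null; // null while no translation has arrived
  durationMs: number;         // length of the spoken sentence
}

// Pairs each transcript with the translation carrying the same id.
function pairById(transcripts: TimedSentence[], translations: TimedSentence[]): PairedSentence[] {
  const translatedText: { [id: string]: string } = {};
  for (const translation of translations) {
    translatedText[translation.id] = translation.text;
  }
  return transcripts.map((transcript) => ({
    id: transcript.id,
    original: transcript.text,
    translation: Object.prototype.hasOwnProperty.call(translatedText, transcript.id)
      ? translatedText[transcript.id]
      : null,
    durationMs: transcript.to_ms - transcript.from_ms,
  }));
}

// Values taken from the example events in this guide.
const paired = pairById(
  [{ id: "BHI67vVr", text: "So if you like, he was sort of part of the, he was, he was part of the poachers.", from_ms: 286730.0, to_ms: 291290.0 }],
  [{ id: "BHI67vVr", text: "Wenn man so will, war er also Teil der Wilderer.", from_ms: 286730.0, to_ms: 291290.0 }],
);
// paired[0].durationMs is 4560
```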