SPEECH API v2

The SPEECH API is a WebSocket-based API that lets you transcribe, translate, and dub audio streams in real time. This guide will help you get started with the basics of using the SPEECH API:

  • Create sessions
  • Send audio data over a secure WebSocket (WS) channel
  • Receive transcript results over a secure WS channel
  • Receive translation results over a secure WS channel
  • Receive voiceover results over a secure WS channel

TypeScript SDK

For a more convenient way to interact with the API, we provide a TypeScript SDK that can be installed via npm:

npm install @tv1-eu/videotaxi-api-typings

This SDK provides types for all WebSocket messages. For example, when receiving a WebSocket message, you can use the following code to parse it:

import { WebSocketMessage } from '@tv1-eu/videotaxi-api-typings';

// setup websockets, not included here
viewerSocket.onmessage = (event: MessageEvent) => {
  const message: WebSocketMessage = JSON.parse(event.data);
  // do something with it
};

Session Management

Speech sessions can be managed either through the VIDEO.TAXI Web interface or via the GraphQL API. The full API description can be found at https://service.video.taxi/graphiql.
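The queries below can be sent with any GraphQL client. The following is a minimal sketch using fetch; note that the endpoint path (`/graphql`) and the Bearer-token Authorization scheme are assumptions here, so check your account documentation for the actual values.

```typescript
// Build the POST body for a GraphQL request (pure, so it is easy to test).
export function buildGraphQLBody(
  query: string,
  variables?: Record<string, unknown>,
): string {
  return JSON.stringify(variables ? { query, variables } : { query });
}

// Send a query to the API. Endpoint path and auth scheme are assumptions.
export async function graphqlRequest<T>(query: string, token: string): Promise<T> {
  const response = await fetch("https://service.video.taxi/graphql", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: buildGraphQLBody(query),
  });
  if (!response.ok) {
    throw new Error(`GraphQL request failed with status ${response.status}`);
  }
  const { data, errors } = (await response.json()) as { data: T; errors?: unknown };
  if (errors) throw new Error(JSON.stringify(errors));
  return data;
}
```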

List Existing Sessions

query {
  speechRooms(limit: 5, offset: 0) {
    id
    name
    translationLanguages
  }
}

Retrieve Session Details

query {
  speechRoom(id: "session-id") {
    id
    name
    translationLanguages
    # URL to the web interface for following the transcript via a browser.
    viewerWebUrl
    # WebSocket URL for a custom UI. Must be opened within 15 minutes.
    viewerSocketUrl(languageCode: "EN", enable_voiceover: true)
    # WebSocket URL for transmitting audio. Must be opened within 15 minutes.
    masterSocketUrl(languageCode: "it")
  }
}

Create a New Session

mutation {
  createSpeechRoom(
    name: "Live-Event",
    # Languages available for translation. The original language is always included.
    translationLanguages: ["IT", "DE", "EN-US"]
  ) {
    id
    name
  }
}

WebSocket Messages

Sender

Once the sender URL has been obtained, clients can send audio data over it. Virtually every streaming container format supported by FFmpeg is accepted, such as WebM and MPEG-TS. After establishing the WS connection, simply send your binary audio frames through it.
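The sending side can be sketched as follows, assuming a Node.js environment with the built-in global WebSocket (Node 21+); the file path and chunk size are illustrative, and the masterSocketUrl comes from the GraphQL query above.

```typescript
import { createReadStream } from "node:fs";

// The master URL must be a secure WebSocket endpoint.
export function isSecureSocketUrl(url: string): boolean {
  return url.startsWith("wss://");
}

// Sketch: stream a local WebM file over the master socket as binary frames.
export function streamAudio(masterSocketUrl: string, path: string): void {
  if (!isSecureSocketUrl(masterSocketUrl)) {
    throw new Error("expected a wss:// URL");
  }
  const socket = new WebSocket(masterSocketUrl);
  socket.addEventListener("open", () => {
    // Read the file in small chunks and forward each as one binary frame.
    const audio = createReadStream(path, { highWaterMark: 16 * 1024 });
    audio.on("data", (chunk) => socket.send(chunk as Buffer));
    audio.on("end", () => socket.close());
  });
}
```

In a live setup you would feed the socket from an encoder (e.g. ffmpeg producing MPEG-TS on stdout) instead of a file, but the framing is the same: raw binary chunks, no wrapping protocol.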

Viewer

Once the viewer URL has been obtained, clients can connect to the socket to receive transcription events. Every message follows the same base format:

{"events": []}

For example:

{
  "events": [
    {
      "kind": "transcript",
      "payload": {
        "id": "BHI67vVr",
        "text": "So if you like, he was sort of part of the, he was, he was part of the poachers.",
        "latency": 2939.328806,
        "speaker": "S5",
        "created_at": 1720014373412957000,
        "from_ms": 286730.0,
        "to_ms": 291290.0
      }
    },
    {
      "kind": "partial",
      "payload": {
        "text": "He was",
        "latency": 2939.328806
      }
    }
  ]
}
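A message like the one above can be fanned out to per-kind handlers. The sketch below assumes only the shapes shown in the examples; the WebSocketMessage type from the SDK is the authoritative definition.

```typescript
// One event inside a viewer message, as in the examples in this guide.
interface SpeechEvent {
  kind: string;
  payload: Record<string, unknown>;
}

type Handler = (payload: Record<string, unknown>) => void;

// Parse one raw WebSocket message and route each event to its handler.
// Returns the kinds seen, in order, which is handy for logging and tests.
export function dispatchMessage(
  raw: string,
  handlers: Partial<Record<string, Handler>>,
): string[] {
  const { events } = JSON.parse(raw) as { events: SpeechEvent[] };
  const seen: string[] = [];
  for (const { kind, payload } of events) {
    handlers[kind]?.(payload); // unknown kinds are silently skipped
    seen.push(kind);
  }
  return seen;
}
```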

The list of events does not have a fixed length. Events are usually grouped together for the UI to render changes cohesively. Here are the event kinds with their descriptions:

partial: A temporary transcript.

{
  "kind": "partial",
  "payload": {
    "text": "So if you like, he was sort of part of the. He was he was part of the poachers and then he became a gamekeeper he was head of britain's",
    "latency": 596.662145
  }
}

transcript: A final transcript.

{
  "kind": "transcript",
  "payload": {
    "id": "BHI67vVr",
    "text": "So if you like, he was sort of part of the, he was, he was part of the poachers.",
    "latency": 2939.328806,
    "speaker": "S5",
    "created_at": 1720014373412957000,
    "from_ms": 286730.0,
    "to_ms": 291290.0
  }
}

translation: The translation of a transcript.

{
  "kind": "translation",
  "payload": {
    "id": "BHI67vVr",
    "text": "Wenn man so will, war er also Teil der Wilderer.",
    "original": "So if you like, he was sort of part of the, he was, he was part of the poachers.",
    "latency": 3067.328804,
    "speaker": "S5",
    "created_at": 1720014373412957000,
    "from_ms": 286730.0,
    "to_ms": 291290.0
  }
}

voiceover: A synthesized audio rendition of a translated transcript. playback_uri points to the generated audio file.

{
  "kind": "voiceover",
  "payload": {
    "id": "BHI67vVr",
    "text": "Wenn man so will, war er also Teil der Wilderer.",
    "original": "So if you like, he was sort of part of the, he was, he was part of the poachers.",
    "latency": 4880.66211,
    "speaker": "S5",
    "created_at": 1720014373412957000,
    "playback_uri": "https://video-taxi-client-data.s3.eu-west-1.amazonaws.com/B8C4D7C4D031EDC3762D7CD2BCA5FACDC432722A.aac?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUYKVZWFGAMK6VTIZ%2F20240703%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20240703T134615Z&X-Amz-Expires=43200&X-Amz-SignedHeaders=host&X-Amz-Signature=2cbf459ffdc9fad5c9b3d3797d120de563c69e876b2490d69a553926701fc516",
    "from_ms": 286730.0,
    "to_ms": 291290.0
  }
}

end_of_stream: Indicates that the session is temporarily closed.

{
  "kind": "end_of_stream",
  "payload": {
    "reason": "normal"
  }
}

Notes:

  • latency is the processing latency of the pipeline, expressed in milliseconds (float), measured from the time the sentence was spoken until it was completely transcribed.
  • The id in the payload of translation and voiceover events always refers to the originating transcript event, so all three events carry the same value.
  • from_ms and to_ms indicate the time range within the playback during which the sentence was heard.
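
Because translation and voiceover events reuse the transcript's id, a viewer can merge them into one row per sentence. A minimal sketch, assuming only the payload fields shown in the examples above:

```typescript
// One merged row per final transcript, keyed by the shared id.
interface TranscriptRow {
  text: string;
  translation?: string;
  voiceoverUri?: string;
}

// Fold one event into the row map; partial events carry no id and are skipped.
export function applyEvent(
  rows: Map<string, TranscriptRow>,
  kind: string,
  payload: { id?: string; text?: string; playback_uri?: string },
): void {
  if (!payload.id) return;
  const row = rows.get(payload.id) ?? { text: "" };
  if (kind === "transcript") row.text = payload.text ?? "";
  if (kind === "translation") row.translation = payload.text;
  if (kind === "voiceover") row.voiceoverUri = payload.playback_uri;
  rows.set(payload.id, row);
}
```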