SPEECH API v2
The SPEECH API is a WebSocket-based API that allows you to transcribe, translate, and dub audio streams in real time. This guide will help you get started with the basics of using the SPEECH API:
- Create sessions
- Send audio data over a secure WebSocket (WS) channel
- Receive transcript results over a secure WS channel
- Receive translation results over a secure WS channel
- Receive voiceover results over a secure WS channel
TypeScript SDK
For a more convenient way to interact with the API, we provide a TypeScript SDK that can be installed via npm:
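The install command below is a sketch; the published package name is not given here, so replace the placeholder with the SDK's actual name:

```shell
# Replace <sdk-package-name> with the SDK's published npm package name.
npm install <sdk-package-name>
```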
This SDK provides types for all WebSocket messages. For example, when receiving a WebSocket message, you can use the following code to parse it:
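As a minimal sketch (the SDK's actual exported type and function names may differ; the message shape below is an assumption based on the events described in this guide):

```typescript
// Illustrative sketch; the SDK's real exported types may be named differently.
// import type { ViewerMessage } from "<sdk-package-name>";

// Parse an incoming WebSocket text frame into a typed message.
function parseViewerMessage(raw: string): {
  events: { kind: string; payload: Record<string, unknown> }[];
} {
  return JSON.parse(raw);
}

// Example with a hand-written frame:
const msg = parseViewerMessage(
  '{"events":[{"kind":"partial","payload":{"text":"Hel"}}]}'
);
```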
Session Management
Speech sessions can be managed either through the VIDEO.TAXI Web interface or via the GraphQL API. The full API description can be found at https://service.video.taxi/graphiql.
List Existing Sessions
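As an illustrative sketch only — the operation and field names here are assumptions, so verify them against the schema at https://service.video.taxi/graphiql:

```graphql
# Hypothetical query shape; field names may differ in the real schema.
query {
  speechSessions {
    id
    name
  }
}
```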
Retrieve Session Details
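A hedged sketch of fetching one session; the operation name, argument, and fields are assumptions to be checked against the real schema:

```graphql
# Hypothetical query; replace SESSION_ID and verify field names in GraphiQL.
query {
  speechSession(id: "SESSION_ID") {
    id
    name
    senderUrl
    viewerUrl
  }
}
```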
Create a New Session
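A sketch of what a creation mutation might look like; the mutation name, arguments, and returned fields are assumptions, not the authoritative schema:

```graphql
# Hypothetical mutation; consult https://service.video.taxi/graphiql
# for the actual mutation name and arguments.
mutation {
  createSpeechSession(name: "My session") {
    id
    senderUrl
    viewerUrl
  }
}
```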
WebSocket Messages
Sender
Once the sender URL has been obtained, clients can send audio data over it. Virtually every streaming container format supported by FFmpeg is accepted, such as WebM and MPEG-TS. After establishing the WS connection, simply send your binary audio frames through it.
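A sketch of a sender client using the standard WebSocket API; the sender URL and chunk size are placeholders, and the chunking helper is only an illustration of sending binary frames:

```typescript
// Split a buffer into binary frames of at most `size` bytes.
function chunk(buf: Uint8Array, size: number): Uint8Array[] {
  const frames: Uint8Array[] = [];
  for (let i = 0; i < buf.length; i += size) {
    frames.push(buf.subarray(i, i + size));
  }
  return frames;
}

// Connect to the sender URL (placeholder) and stream audio frames.
function streamAudio(senderUrl: string, audio: Uint8Array): void {
  const ws = new WebSocket(senderUrl);
  ws.binaryType = "arraybuffer";
  ws.addEventListener("open", () => {
    // 16 KiB frames are an arbitrary illustrative choice.
    for (const frame of chunk(audio, 16 * 1024)) {
      ws.send(frame);
    }
    // Closing the socket signals the end of the audio stream.
  });
}
```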
Viewer
Once the viewer URL has been obtained, clients can connect to the socket to receive transcription events. Every message follows the same base format:
For example:
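The shape below is an assumed sketch of the base format described above — a message carrying a list of events, each with a kind and a kind-specific payload — not the authoritative schema; field names inside the payloads are illustrative:

```typescript
// Assumed base shape: every message carries a list of events.
interface SpeechEvent {
  kind: string;                     // e.g. "partial", "transcript", ...
  payload: Record<string, unknown>; // kind-specific fields
}
interface SpeechMessage {
  events: SpeechEvent[];
}

// Illustrative message grouping a final transcript with its translation:
const message: SpeechMessage = {
  events: [
    {
      kind: "transcript",
      payload: {
        identifier: "abc-1",
        text: "Hello world.",
        latency: 842.5,
        from_ms: 0,
        to_ms: 1200,
      },
    },
    {
      kind: "translation",
      payload: { identifier: "abc-1", text: "Hallo Welt." },
    },
  ],
};
```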
The list of events does not have a fixed length. Events are usually grouped together for the UI to render changes cohesively. Here are the event kinds with their descriptions:
- partial: A temporary transcript.
- transcript: A final transcript.
- translation: The translation of a transcript.
- voiceover: The playback link of a translation.
- end_of_stream: Indicates that the session is temporarily closed.
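A handler over these event kinds might look like the following sketch (the event shape and payload field names are assumptions):

```typescript
// Minimal dispatcher over the event kinds listed above.
type SpeechEvent = { kind: string; payload: Record<string, unknown> };

function describe(event: SpeechEvent): string {
  switch (event.kind) {
    case "partial":
      return `partial: ${event.payload.text}`; // temporary transcript
    case "transcript":
      return `final: ${event.payload.text}`; // final transcript
    case "translation":
      return `translation: ${event.payload.text}`;
    case "voiceover":
      return `voiceover at ${event.payload.url}`; // field name assumed
    case "end_of_stream":
      return "session temporarily closed";
    default:
      return `unknown event kind: ${event.kind}`; // stay forward-compatible
  }
}
```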
Notes:
- latency is the processing latency of the pipeline, measured from the time the sentence was spoken until it was completely transcribed, expressed in milliseconds (float).
- The identifier in the payload of translation and voiceover events always refers to the originating transcript event; a transcript and its derived translation and voiceover therefore all share the same identifier.
- from_ms and to_ms indicate the time range within the playback during which the sentence was heard.
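Because translation and voiceover events carry the identifier of their originating transcript, related events can be grouped together, as in this sketch (payload field name as described above; the event shape is assumed):

```typescript
type SpeechEvent = {
  kind: string;
  payload: { identifier?: string } & Record<string, unknown>;
};

// Group transcript-derived events by their shared identifier.
function groupByIdentifier(events: SpeechEvent[]): Map<string, SpeechEvent[]> {
  const groups = new Map<string, SpeechEvent[]>();
  for (const ev of events) {
    const id = ev.payload.identifier;
    if (typeof id !== "string") continue; // e.g. end_of_stream has no identifier
    const bucket = groups.get(id) ?? [];
    bucket.push(ev);
    groups.set(id, bucket);
  }
  return groups;
}
```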