
Multi-Module Pipeline: Keyword-Searchable Transcription

This document details a multi-module pipeline that takes an audio file, transcribes it, and makes the resulting transcript keyword-searchable.

This pipeline facilitates easy navigation and retrieval of specific content within audio and video archives, improving accessibility and content management. It can, for example, be applied in video platforms, educational institutions, and businesses for efficient content indexing, search engine optimization, and enhancing user engagement through targeted content delivery.

The document is divided into the following sections:

Pipeline Setup
Processing an Input File
Performing Keyword Search

Pipeline Setup

To achieve what we've described above, let's set up a pipeline consisting of the following modules, in sequence:

A transcribe module.
A json-to-txt module.
A keyword-db module.

We do this by leveraging the create_pipeline method, as follows:

# create a pipeline as detailed above
pipeline = krixik.create_pipeline(name="multi_keyword_searchable_transcription", module_chain=["transcribe", "json-to-txt", "keyword-db"])

Processing an Input File

A pipeline's valid input formats are determined by its first module—in this case, a transcribe module. Therefore, this pipeline only accepts audio file inputs.

Let's take a quick look at a test file before processing.

# examine contents of input file
import IPython

IPython.display.Audio(data_dir + "input/Interesting Facts About Colombia.mp3")

We will use the default models for every module in the pipeline, so the modules argument of the process method can be omitted.

# process the file through the pipeline, as described above
process_output = pipeline.process(
    local_file_path=data_dir + "input/Interesting Facts About Colombia.mp3",  # the initial local filepath where the input file is stored
    local_save_directory=data_dir + "output",  # the local directory that the output file will be saved to
    expire_time=60 * 30,  # process data will be deleted from the Krixik system in 30 minutes
    wait_for_process=True,  # wait for process to complete before returning IDE control to user
    verbose=False,  # do not display process update printouts upon running code
)

The output of this process is printed below. To learn more about each component of the output, review documentation for the process method.

Because the output of this particular module-model pair is an SQLite database file, process_output is null. However, the output file has been saved to the location listed under the process_output_files key. The file_id of the processed input is used as a filename prefix for the output file.

# nicely print the output of this process
import json

print(json.dumps(process_output, indent=2))

{
  "status_code": 200,
  "pipeline": "multi_keyword_searchable_transcription",
  "request_id": "4932e263-585e-47e2-859f-6a65c7b23d53",
  "file_id": "6b4d7dc2-4010-4f8b-9a18-9b54ed8c14dd",
  "message": "SUCCESS - output fetched for file_id 6b4d7dc2-4010-4f8b-9a18-9b54ed8c14dd.Output saved to location(s) listed in process_output_files.",
  "warnings": [],
  "process_output": null,
  "process_output_files": [
    "../../../data/output/6b4d7dc2-4010-4f8b-9a18-9b54ed8c14dd.db"
  ]
}
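Since the saved output is a database file rather than readable text, it can be handy to peek inside it locally. The following is a minimal sketch using Python's standard sqlite3 module. It only lists table names, because the internal schema of the keyword-db output is not described here and we make no assumption about it. The list_tables helper and the demo_keywords table are hypothetical names used purely for illustration; the throwaway database stands in for the real output file.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    # open read-only so that inspecting the file can never modify it
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


# build a throwaway database so this sketch runs without a real output file;
# the table name below is hypothetical and NOT the actual keyword-db schema
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.db")
    demo_conn = sqlite3.connect(demo_path)
    demo_conn.execute("CREATE TABLE demo_keywords (keyword TEXT, line_number INTEGER)")
    demo_conn.commit()
    demo_conn.close()
    demo_tables = list_tables(demo_path)

print(demo_tables)
```

To try this on the real output, pass the saved path instead, for example list_tables(process_output["process_output_files"][0]).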

Performing Keyword Search

Krixik's keyword_search method enables keyword search on documents processed through pipelines that end with the keyword-db module.

Since our pipeline satisfies this condition, it has access to the keyword_search method. Let's use it to query our text for a few keywords, as below:

# perform keyword search over the file in the pipeline
keyword_output = pipeline.keyword_search(query="lets talk about the country of Colombia", file_ids=[process_output["file_id"]])

# nicely print the output of this process
print(json.dumps(keyword_output, indent=2))

{
  "status_code": 200,
  "request_id": "c63d8207-0c12-43ca-8f87-cbc3c00c0883",
  "message": "Successfully queried 1 user file.",
  "warnings": [
    {
      "WARNING: the following words in the query are in the stop_words list and thus no results will be returned for them": [
        "about",
        "the",
        "of"
      ]
    }
  ],
  "items": [
    {
      "file_id": "6b4d7dc2-4010-4f8b-9a18-9b54ed8c14dd",
      "file_metadata": {
        "file_name": "krixik_generated_file_name_yngkbbnerk.mp3",
        "symbolic_directory_path": "/etc",
        "file_tags": [],
        "num_lines": 1,
        "created_at": "2024-06-05 14:50:54",
        "last_updated": "2024-06-05 14:50:54"
      },
      "search_results": [
        {
          "keyword": "country",
          "line_number": 1,
          "keyword_number": 7
        },
        {
          "keyword": "talk",
          "line_number": 1,
          "keyword_number": 118
        },
        {
          "keyword": "countries",
          "line_number": 1,
          "keyword_number": 142
        }
      ]
    }
  ]
}
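The response above nests its hits under items and search_results, and reports ignored query words under warnings. As a rough sketch of how such a response might be post-processed, the two helpers below group hits by keyword for each file and collect the words flagged in the warnings. The helper names summarize_hits and ignored_words are hypothetical, and sample_output is an abridged copy of the response printed above, so the sketch assumes other responses share that shape.

```python
from collections import defaultdict

# abridged copy of the keyword_search response printed above
sample_output = {
    "warnings": [
        {
            "WARNING: the following words in the query are in the stop_words list and thus no results will be returned for them": [
                "about",
                "the",
                "of",
            ]
        }
    ],
    "items": [
        {
            "file_id": "6b4d7dc2-4010-4f8b-9a18-9b54ed8c14dd",
            "search_results": [
                {"keyword": "country", "line_number": 1, "keyword_number": 7},
                {"keyword": "talk", "line_number": 1, "keyword_number": 118},
                {"keyword": "countries", "line_number": 1, "keyword_number": 142},
            ],
        }
    ],
}


def summarize_hits(response):
    """Map each file_id to {keyword: [(line_number, keyword_number), ...]}."""
    summary = {}
    for item in response.get("items", []):
        hits = defaultdict(list)
        for result in item.get("search_results", []):
            position = (result["line_number"], result["keyword_number"])
            hits[result["keyword"]].append(position)
        summary[item["file_id"]] = dict(hits)
    return summary


def ignored_words(response):
    """Collect every word listed in the response's warnings."""
    words = []
    for warning in response.get("warnings", []):
        for listed in warning.values():
            words.extend(listed)
    return words


hits_by_file = summarize_hits(sample_output)
skipped = ignored_words(sample_output)
print(hits_by_file)
print(skipped)
```

In practice, keyword_output from the call above would be passed in place of sample_output.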