Creating a Pipeline

This overview on creating pipelines is divided into the following sections:

The create_pipeline Method
A Single-Module Pipeline
A Multi-Module Pipeline
Module Sequence Validation
Pipeline Name Repetition

The `create_pipeline` Method

The create_pipeline method instantiates a new pipeline. It takes two arguments, both required:

name (str): The name of your new pipeline. Set it wisely: a pipeline's name is its key identifier, and no two pipelines can share the same name.
module_chain (list): The ordered list of modules that make up your new pipeline.

See the Krixik documentation for the current list of available modules. Remember that as long as each module's output matches the next module's input, any combination of modules is fair game, including chains that repeat a module.

A Single-Module Pipeline

Let's use the create_pipeline method to create a single-module pipeline. We'll use the parser module, which divides input text files into shorter snippets.

# create a pipeline with a single parser module
pipeline = krixik.create_pipeline(name="create_pipeline_1_parser", module_chain=["parser"])

Make sure that you have initialized your session before executing this code.

Note that the name argument can be any string you like. The module_chain list, however, can only contain established module identifiers.
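
The constraint on module_chain can be illustrated with a small pre-check run before calling create_pipeline. This is a minimal sketch, not Krixik's implementation: the helper name check_module_names is hypothetical, and the KNOWN_MODULES set lists only the four modules named on this page rather than the full catalog.

```python
# Illustrative subset only: the modules mentioned on this page
# (an assumption for this sketch, not the full Krixik module list).
KNOWN_MODULES = {"parser", "text-embedder", "vector-db", "caption"}


def check_module_names(module_chain):
    """Hypothetical helper: raise ValueError for any unknown module identifier."""
    if not isinstance(module_chain, list) or not module_chain:
        raise ValueError("module_chain must be a non-empty list of module identifiers")
    unknown = [m for m in module_chain if m not in KNOWN_MODULES]
    if unknown:
        raise ValueError(f"unknown module identifiers: {unknown}")
    return module_chain


# A chain made only of known identifiers passes through unchanged.
check_module_names(["parser"])

# A chain containing an unknown identifier is rejected.
try:
    check_module_names(["parser", "not-a-module"])
except ValueError as err:
    print(err)
```

A check like this only catches misspelled or unknown identifiers. It says nothing about whether the modules can be connected in the given order.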

A Multi-Module Pipeline

Now let's set up a pipeline consisting of three modules in sequence: a parser module, a text-embedder module, and a vector-db module. This module_chain comes up often: it's the basic document-based semantic (a.k.a. vector) search pipeline.

The pipeline setup syntax is the same as above. The order of the modules in module_chain is the order in which they process pipeline input:

# create a basic semantic (vector) search multi-module pipeline
pipeline = krixik.create_pipeline(name="create_pipeline_2_parser_embedder_vector", module_chain=["parser", "text-embedder", "vector-db"])

An array of multi-module pipeline examples is available in the Krixik documentation.

Module Sequence Validation

Upon create_pipeline execution, the Krixik CLI confirms that the indicated modules will run properly in the provided sequence. If they cannot, which is generally because one module's output does not match the next module's input, an explanatory local exception is raised.

For example, attempting to build a two-module pipeline that sequentially consists of a parser module and a caption module will rightly fail and produce a local exception. This is because the parser module outputs a JSON file, while the caption module accepts only image input, as the error message below indicates:

# attempt to create a pipeline sequentially comprised of a parser and a caption module
pipeline = krixik.create_pipeline(name="create_pipeline_3_parser_caption", module_chain=["parser", "caption"])

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

Cell In[4], line 3
      1 # attempt to create a pipeline sequentially comprised of a parser and a caption module
----> 3 pipeline_3 = krixik.create_pipeline(name="create_pipeline_3_parser_caption",
      4                                     module_chain=["parser", "caption"])


File c:\Users\Lucas\Desktop\krixikdocsnoodle\myenv\Lib\site-packages\krixik\main.py:70, in krixik.create_pipeline(cls, name, module_chain)
     68         raise ValueError(f"module_chain item - {item} - is not a currently one of the currently available modules -{available_modules}")
     69 module_chain_ = [Module(m_name) for m_name in module_chain]
---> 70 custom = BuildPipeline(name=name, module_chain=module_chain_)
     71 return cls.load_pipeline(pipeline=custom)


File c:\Users\Lucas\Desktop\krixikdocsnoodle\myenv\Lib\site-packages\krixik\pipeline_builder\pipeline.py:63, in BuildPipeline.__init__(self, name, module_chain, config_path)
     61 chain_check(module_chain)
     62 for module in module_chain:
---> 63     self._add(module)
     64 self.test_connections()


File c:\Users\Lucas\Desktop\krixikdocsnoodle\myenv\Lib\site-packages\krixik\pipeline_builder\pipeline.py:86, in BuildPipeline._add(self, module, insert_index)
     83 self.__module_chain_configs.append(module.config)
     84 self.__module_chain_output_process_keys.append(module.output_process_key)
---> 86 self.test_connections()


File c:\Users\Lucas\Desktop\krixikdocsnoodle\myenv\Lib\site-packages\krixik\pipeline_builder\pipeline.py:160, in BuildPipeline.test_connections(self)
    158 # check format compatibility
    159 if prev_module_output_format != curr_module_input_format:
--> 160     raise TypeError(
    161         f"format type mismatch between {prev_module.name} - whose output format is {prev_module_output_format} - and {curr_module.name} - whose input format is {curr_module_input_format}"
    162     )
    164 # check process key type compatibility
    165 if prev_module_output_process_key_type != curr_module_input_process_key_type:


TypeError: format type mismatch between parser - whose output format is json - and caption - whose input format is image
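
The adjacency check visible in the traceback above can be sketched in a few lines. This is a simplified stand-in, not the library's code: check_connections and MODULE_FORMATS are hypothetical names, and only the parser output format (json) and the caption input format (image) are taken from the error message. Every other format in the table is an illustrative assumption.

```python
# Input/output formats per module. Only parser -> json and caption <- image are
# confirmed by the error message above; the remaining entries are assumptions.
MODULE_FORMATS = {
    "parser": {"input": "text", "output": "json"},
    "caption": {"input": "image", "output": "json"},
    "text-embedder": {"input": "json", "output": "npy"},
    "vector-db": {"input": "npy", "output": "faiss"},
}


def check_connections(module_chain):
    """Compare each module's output format with the next module's input format."""
    for prev_name, curr_name in zip(module_chain, module_chain[1:]):
        prev_out = MODULE_FORMATS[prev_name]["output"]
        curr_in = MODULE_FORMATS[curr_name]["input"]
        if prev_out != curr_in:
            raise TypeError(
                f"format type mismatch between {prev_name} - whose output format is {prev_out} "
                f"- and {curr_name} - whose input format is {curr_in}"
            )


# Compatible under the assumed format table: no exception is raised.
check_connections(["parser", "text-embedder", "vector-db"])

# Incompatible: the parser's json output does not match the caption module's image input.
try:
    check_connections(["parser", "caption"])
except TypeError as err:
    print(err)
```

Pairing each module with its successor via zip means a single-module chain has no pairs to compare and passes trivially.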

Pipeline Name Repetition

Krixik will not allow you to create a pipeline with the name of a pipeline you have already created. The only exception is if the new pipeline has a module chain identical to the old one.

If you attempt to create a new pipeline with the name of an existing pipeline but a different module_chain, pipeline instantiation will not fail: the create_pipeline method runs without issue. However, once two pipelines with the same name and different module_chains exist and you have processed a file through one of them, you will not be allowed to process a file through the other because of pipeline name duplication.
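
Because this conflict only surfaces at processing time, it can help to catch it earlier on the client side. The following is a minimal sketch that assumes you keep your own record of created pipelines. PipelineNameRegistry is a hypothetical class, not part of Krixik, and it mirrors only the rule described above: reusing a name is allowed when the module chain is identical.

```python
class PipelineNameRegistry:
    """Hypothetical local bookkeeping: remember the module_chain used for each pipeline name."""

    def __init__(self):
        self._chains = {}

    def register(self, name, module_chain):
        """Record a pipeline, rejecting a reused name with a different module_chain."""
        existing = self._chains.get(name)
        if existing is not None and existing != list(module_chain):
            raise ValueError(
                f"pipeline name {name!r} is already used with module_chain {existing}"
            )
        self._chains[name] = list(module_chain)


registry = PipelineNameRegistry()
registry.register("my-pipeline", ["parser"])

# Same name with an identical chain is allowed.
registry.register("my-pipeline", ["parser"])

# Same name with a different chain is flagged before any file is processed.
try:
    registry.register("my-pipeline", ["parser", "text-embedder"])
except ValueError as err:
    print(err)
```

Calling register just before each create_pipeline call would flag a reused name at creation time instead of at the first processing attempt.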