Creating a Pipeline
This overview on creating pipelines is divided into the following sections:
- The create_pipeline Method
- A Single-Module Pipeline
- A Multi-Module Pipeline
- Module Sequence Validation
- Pipeline Name Repetition
The create_pipeline Method
The create_pipeline method instantiates new pipelines. It's a very simple method that takes two arguments, both required:
- name (str): The name of your new pipeline. Set it wisely: pipeline names are their key identifiers, and no two pipelines can share the same name.
- module_chain (list): The sequential list of modules that your new pipeline is comprised of.
See the Krixik documentation for the current list of available Krixik modules. Remember that as long as each module's output matches the next module's input, any combination of modules is fair game, including chains that repeat a module.
A Single-Module Pipeline
Let's use the create_pipeline method to create a single-module pipeline. We'll use the parser module, which divides input text files into shorter snippets.
# create a pipeline with a single parser module
pipeline = krixik.create_pipeline(name="create_pipeline_1_parser", module_chain=["parser"])
Make sure that you have initialized your session before executing this code.
Note that the name argument can be any string you like. The module_chain list, however, may contain only established module identifiers.
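To catch a misspelled identifier before calling create_pipeline, a small local check along the following lines may help. This is only a sketch: check_module_names and KNOWN_MODULES are hypothetical names, and the tuple holds just the modules mentioned in this overview, not Krixik's actual list.

```python
# Hypothetical list of identifiers, limited to the modules named in this overview.
# The real list of available modules is maintained by Krixik and may differ.
KNOWN_MODULES = ("parser", "text-embedder", "vector-db", "caption")


def check_module_names(module_chain):
    """Raise ValueError if module_chain is not a list of known identifiers."""
    if not isinstance(module_chain, list):
        raise ValueError("module_chain must be a list of module identifiers")
    for item in module_chain:
        if item not in KNOWN_MODULES:
            raise ValueError(
                f"module_chain item - {item} - is not one of the known modules - {list(KNOWN_MODULES)}"
            )
    return module_chain


# a valid chain passes through unchanged
check_module_names(["parser", "text-embedder", "vector-db"])
```

A chain containing an unknown string, such as a typo like "parsr", raises a ValueError that names the offending item.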
A Multi-Module Pipeline
Now let's set up a pipeline consisting of three modules in sequence: a parser module, a text-embedder module, and a vector-db module. This is a popular module_chain: it's the basic document-based semantic (a.k.a. vector) search pipeline.
Pipeline setup syntax is the same as above. The order of the modules in module_chain is the order in which they'll process pipeline input:
# create a basic semantic (vector) search multi-module pipeline
pipeline = krixik.create_pipeline(name="create_pipeline_2_parser_embedder_vector", module_chain=["parser", "text-embedder", "vector-db"])
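The ordering rule can be pictured with a toy model in plain Python. The sketch below does not call Krixik or do any real parsing or embedding. The hypothetical helpers make_stage and run_in_order only record which stage ran, to show that each stage receives the previous stage's output.

```python
from functools import reduce


def make_stage(name):
    """Return a toy stage that appends its own name to the trace it receives."""
    def stage(trace):
        return trace + [name]
    return stage


def run_in_order(module_chain):
    """Feed an empty trace through one toy stage per entry, first to last."""
    stages = [make_stage(name) for name in module_chain]
    return reduce(lambda data, stage: stage(data), stages, [])


print(run_in_order(["parser", "text-embedder", "vector-db"]))
# prints ['parser', 'text-embedder', 'vector-db']
```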
More multi-module pipeline examples can be found in the Krixik documentation.
Module Sequence Validation
When create_pipeline executes, the Krixik CLI checks that the indicated modules will run properly in the provided sequence. If they cannot, generally because one module's output does not match the next module's input, an explanatory local exception is raised.
For example, attempting to build a two-module pipeline that sequentially consists of a parser module and a caption module will rightly fail and produce a local exception. This is because the parser module outputs a JSON file, while the caption module accepts only image input, as the error message below indicates:
# attempt to create a pipeline sequentially comprised of a parser and a caption module
pipeline = krixik.create_pipeline(name="create_pipeline_3_parser_caption", module_chain=["parser", "caption"])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[4], line 3
1 # attempt to create a pipeline sequentially comprised of a parser and a caption module
----> 3 pipeline_3 = krixik.create_pipeline(name="create_pipeline_3_parser_caption",
4 module_chain=["parser", "caption"])
File c:\Users\Lucas\Desktop\krixikdocsnoodle\myenv\Lib\site-packages\krixik\main.py:70, in krixik.create_pipeline(cls, name, module_chain)
68 raise ValueError(f"module_chain item - {item} - is not a currently one of the currently available modules -{available_modules}")
69 module_chain_ = [Module(m_name) for m_name in module_chain]
---> 70 custom = BuildPipeline(name=name, module_chain=module_chain_)
71 return cls.load_pipeline(pipeline=custom)
File c:\Users\Lucas\Desktop\krixikdocsnoodle\myenv\Lib\site-packages\krixik\pipeline_builder\pipeline.py:63, in BuildPipeline.__init__(self, name, module_chain, config_path)
61 chain_check(module_chain)
62 for module in module_chain:
---> 63 self._add(module)
64 self.test_connections()
File c:\Users\Lucas\Desktop\krixikdocsnoodle\myenv\Lib\site-packages\krixik\pipeline_builder\pipeline.py:86, in BuildPipeline._add(self, module, insert_index)
83 self.__module_chain_configs.append(module.config)
84 self.__module_chain_output_process_keys.append(module.output_process_key)
---> 86 self.test_connections()
File c:\Users\Lucas\Desktop\krixikdocsnoodle\myenv\Lib\site-packages\krixik\pipeline_builder\pipeline.py:160, in BuildPipeline.test_connections(self)
158 # check format compatibility
159 if prev_module_output_format != curr_module_input_format:
--> 160 raise TypeError(
161 f"format type mismatch between {prev_module.name} - whose output format is {prev_module_output_format} - and {curr_module.name} - whose input format is {curr_module_input_format}"
162 )
164 # check process key type compatibility
165 if prev_module_output_process_key_type != curr_module_input_process_key_type:
TypeError: format type mismatch between parser - whose output format is json - and caption - whose input format is image
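The format comparison visible in the traceback can be approximated in a few lines. The following is a simplified sketch, not Krixik's implementation: MODULE_FORMATS and check_chain_formats are hypothetical names, the table covers only the two modules discussed here, and the caption module's output format is an assumption.

```python
# Formats for the two modules discussed above. The parser's text input and JSON
# output and the caption module's image input come from this overview; the
# caption module's output format is an assumption made for illustration.
MODULE_FORMATS = {
    "parser": {"input": "text", "output": "json"},
    "caption": {"input": "image", "output": "json"},  # output format assumed
}


def check_chain_formats(module_chain):
    """Raise TypeError if any module's output format differs from the next module's input format."""
    for prev_name, curr_name in zip(module_chain, module_chain[1:]):
        prev_out = MODULE_FORMATS[prev_name]["output"]
        curr_in = MODULE_FORMATS[curr_name]["input"]
        if prev_out != curr_in:
            raise TypeError(
                f"format type mismatch between {prev_name} - whose output format is {prev_out} - "
                f"and {curr_name} - whose input format is {curr_in}"
            )


try:
    check_chain_formats(["parser", "caption"])
except TypeError as err:
    print(err)
```

Only adjacent pairs are compared, so a single-module chain always passes this particular check.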
Pipeline Name Repetition
Krixik does not allow two pipelines to share a name unless the new pipeline's module chain is identical to the old one's.
If you create a new pipeline with the name of a previous pipeline but a different module_chain, pipeline instantiation will not fail: the create_pipeline method runs without issue. However, once you have processed a file through one of two same-named pipelines with different module_chains, you will not be allowed to process a file through the other, because the pipeline name is duplicated.
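Because the conflict only surfaces at processing time, it can be useful to catch name reuse locally, before create_pipeline is called. The class below is a hypothetical helper, not part of Krixik. It only remembers names used through the same guard object in the current Python process, so it cannot see pipelines created elsewhere.

```python
class PipelineNameGuard:
    """Hypothetical local helper: remembers the module_chain each name was first used with."""

    def __init__(self):
        self._chains = {}

    def check(self, name, module_chain):
        """Return name if it is new or reused with an identical chain; raise ValueError otherwise."""
        chain = tuple(module_chain)
        previous = self._chains.setdefault(name, chain)
        if previous != chain:
            raise ValueError(
                f"pipeline name {name!r} was already used with module_chain {list(previous)}"
            )
        return name


guard = PipelineNameGuard()
guard.check("create_pipeline_1_parser", ["parser"])  # first use of the name
guard.check("create_pipeline_1_parser", ["parser"])  # identical chain, allowed
```

Calling guard.check with the same name and a different chain raises a ValueError, which mirrors the rule described above without contacting Krixik.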