The Parameterizable process Method
The process method, available on every Krixik pipeline, is invoked whenever you wish to process files through a pipeline.
This overview of the process method is divided into the following sections:
- Core process Method Arguments
- Basic Usage and Output Breakdown
- Selecting Models Via the modules Argument
- Using your own Models
- Optional Metadata Arguments
- Metadata Argument Defaults
- Automatic File Type Conversions
- Output Size Cap
Core process Method Arguments
The process method takes five basic arguments (in addition to the modules argument and a series of optional metadata arguments, all discussed further below). These five arguments are:
- local_file_path: (required, str) The local file path of the file you wish to process through the pipeline.
- local_save_directory: (optional, str) The local directory you want process output saved to. Defaults to the current working directory.
- expire_time: (optional, int) The amount of time (in seconds) that process output remains on Krixik servers. Defaults to 1800 seconds, which is 30 minutes.
- wait_for_process: (optional, bool) Indicates whether or not Krixik should wait for your process to complete before returning control of your IDE or notebook. True tells Krixik to wait until the process is complete, so you won't be able to execute anything else in the meantime. False tells Krixik that you wish to regain control as soon as file upload to the Krixik system has concluded. When set to False, processing status can be examined via the process_status method. Defaults to True.
- verbose: (optional, bool) Determines if Krixik should immediately display process update printouts at your terminal/notebook. Defaults to True.
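When wait_for_process is set to False, you typically need some way to check back until processing has finished. The following is a minimal, generic polling sketch and not part of the Krixik client. The is_done callable is hypothetical: you would write it yourself around the process_status method, whose exact arguments and response format are not covered in this section.

```python
import time
from typing import Callable


def wait_until_done(
    is_done: Callable[[], bool],
    interval_s: float = 2.0,
    timeout_s: float = 1800.0,
) -> bool:
    """Poll is_done() until it returns True or timeout_s elapses.

    is_done is a hypothetical callable supplied by you; in practice it would
    wrap a call to process_status and interpret its response.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_done():
            return True
        time.sleep(interval_s)
    return False


# Stand-in for a real status check: reports completion on the third poll.
polls = {"count": 0}


def fake_is_done() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


finished = wait_until_done(fake_is_done, interval_s=0.01, timeout_s=5.0)
print(finished, polls["count"])
```

The default timeout of 1800 seconds simply mirrors the default expire_time; choose whatever interval and timeout suit your workload.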
Basic Usage and Output Breakdown
Let's first create a single-module pipeline to demonstrate the process method with. We'll use a sentiment module.
# create single-module pipeline for process demo
pipeline = krixik.create_pipeline(name="process_method_1_sentiment", module_chain=["sentiment"])
We've locally created a JSON file that holds three snippets that simulate online product reviews. The snippets read as follows:
- This recliner is the best damn seat I've ever come across. When I fall asleep on it, which is often, I sleep like a baby.
- This recliner is terrible. It broke on its way out of the box, and no matter what I try, it doesn't recline. Avoid at all costs.
- I've sat on a lot of recliners in my life. I've forgotten about most of them. I'll forget about this one as well.
Keep in mind that input JSON files must follow a very specific format. If they don't, they'll be rejected by Krixik.
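As a rough illustration, here is one way such a file could be written locally. The layout used here (a list of objects, each with a single "snippet" key) is an assumption inferred from the "snippet" field that appears in the process output; consult the Krixik JSON input format documentation for the authoritative specification. The temporary directory is only a stand-in for your own input directory.

```python
import json
import os
import tempfile

reviews = [
    "This recliner is the best damn seat I've ever come across. "
    "When I fall asleep on it, which is often, I sleep like a baby.",
    "This recliner is terrible. It broke on its way out of the box, and no "
    "matter what I try, it doesn't recline. Avoid at all costs.",
    "I've sat on a lot of recliners in my life. I've forgotten about most of "
    "them. I'll forget about this one as well.",
]

# Assumption: each entry is an object with a single "snippet" key.
records = [{"snippet": text} for text in reviews]

input_dir = tempfile.mkdtemp()  # stand-in for your own input directory
input_path = os.path.join(input_dir, "recliner_reviews.json")
with open(input_path, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

# Read the file back to confirm it is valid JSON with the expected layout.
with open(input_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(len(reloaded), list(reloaded[0].keys()))
```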
# process short input file
process_demo_output = pipeline.process(
    local_file_path=data_dir + "input/recliner_reviews.json",  # the initial local filepath where the input JSON file is stored
    local_save_directory=data_dir + "output",  # the local directory that the output file will be saved to
    expire_time=60 * 30,  # process data will be deleted from the Krixik system in 30 minutes
    wait_for_process=True,  # wait for process to complete before returning IDE control to user
    verbose=False,  # do not display process update printouts upon running code
)
Now let's print the output of the process. Because the output of this particular module-model pair is in JSON format, we can print it nicely with the following code:
# nicely print the output of the above process
import json
print(json.dumps(process_demo_output, indent=2))
{
"status_code": 200,
"pipeline": "process_method_1_sentiment",
"request_id": "339ef4dd-5c97-4822-b450-aea700bc6021",
"file_id": "6a314cdb-6938-4663-aef5-a0258341c120",
"message": "SUCCESS - output fetched for file_id 6a314cdb-6938-4663-aef5-a0258341c120.Output saved to location(s) listed in process_output_files.",
"warnings": [],
"process_output": [
{
"snippet": "This recliner is the best damn seat I've ever come across. When I fall asleep on it, which is often, I sleep like a baby.",
"positive": 0.871,
"negative": 0.129,
"neutral": 0.0
},
{
"snippet": "This recliner is terrible. It broke on its way out of the box, and no matter what I try, it doesn't recline. Avoid at all costs.",
"positive": 0.001,
"negative": 0.999,
"neutral": 0.0
},
{
"snippet": "I've sat on a lot of recliners in my life. I've forgotten about most of them. I'll forget about this one as well.",
"positive": 0.001,
"negative": 0.999,
"neutral": 0.0
}
],
"process_output_files": [
"../../../data/output/6a314cdb-6938-4663-aef5-a0258341c120.json"
]
}
Let's break down the output:
- status_code: The HTTP status code for this process (e.g. "200", "500")
- pipeline: The name of the pipeline we just ran process on.
- request_id: The unique ID associated with this execution of process.
- file_id: The unique server-side ID for the now-processed file (and thus its associated output).
- message: This message specifies SUCCESS or FAILURE for the method call and offers detail.
- warnings: A message list that includes any warnings related to the method call.
- process_output: The output of the process. In this case, since the output is in JSON format, it's easily printable in a code notebook.
- process_output_files: A list of file names and file paths generated as process outputs and saved locally.
We can see from process_output that our sentiment analysis pipeline has worked correctly. Each of the product reviews has been assigned a sentiment value breakdown between positive, negative, and neutral.
In addition to being printed here, this process output is also stored in the file indicated in process_output_files. Let's load it in and confirm that it shows the same process output we received above:
# load in process output from file
import json
with open(process_demo_output["process_output_files"][0], "r") as file:
    print(json.dumps(json.load(file), indent=2))
[
{
"snippet": "This recliner is the best damn seat I've ever come across. When I fall asleep on it, which is often, I sleep like a baby.",
"positive": 0.871,
"negative": 0.129,
"neutral": 0.0
},
{
"snippet": "This recliner is terrible. It broke on its way out of the box, and no matter what I try, it doesn't recline. Avoid at all costs.",
"positive": 0.001,
"negative": 0.999,
"neutral": 0.0
},
{
"snippet": "I've sat on a lot of recliners in my life. I've forgotten about most of them. I'll forget about this one as well.",
"positive": 0.001,
"negative": 0.999,
"neutral": 0.0
}
]
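Once the output is loaded, a common next step is to reduce each score breakdown to a single label. The sketch below is plain Python, not a Krixik feature: it copies the scores shown above (snippets shortened for brevity) and picks the highest-scoring label for each review. The helper name dominant_sentiment is our own.

```python
# Scores copied from the process output above; snippets are shortened here.
process_output = [
    {"snippet": "This recliner is the best damn seat...", "positive": 0.871, "negative": 0.129, "neutral": 0.0},
    {"snippet": "This recliner is terrible...", "positive": 0.001, "negative": 0.999, "neutral": 0.0},
    {"snippet": "I've sat on a lot of recliners...", "positive": 0.001, "negative": 0.999, "neutral": 0.0},
]

LABELS = ("positive", "negative", "neutral")


def dominant_sentiment(item: dict) -> str:
    """Return the label with the highest score for one output item."""
    return max(LABELS, key=lambda label: item[label])


labels = [dominant_sentiment(item) for item in process_output]
print(labels)  # ['positive', 'negative', 'negative']
```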
Selecting Models Via the modules Argument
The modules argument to the process method is optional, but through it you can access a wealth of parameterization options. This argument allows you to parameterize how each module operates, including (when applicable) which AI model is active within it.
The modules argument takes the form of a dictionary with dictionaries within it. On a single-module pipeline it looks like this:
modules={'<module name>': {'model':'<model selection>', 'params': <dictionary of parameters>}}
Bear in mind that model names are case sensitive.
An example for a single-module pipeline that holds a caption module would specifically look like this, blip-image-captioning-base being the available model selected:
modules={'caption': {'model':'blip-image-captioning-base', 'params': {}}}
In the above example params is an empty dictionary because caption module models don't take any parameters. Other types of models do, such as the text-embedder module models. This is what the modules argument might look like for a single-module text-embedder pipeline:
modules={'text-embedder': {'model':'multi-qa-MiniLM-L6-cos-v1', 'params': {'quantize': False}}}
quantize is a parameter that you can set for text-embedder module models, and only for text-embedder module models.
The modules argument syntax for multi-module pipelines is similar to the above, but in that case there's one sub-dictionary for every module. For instance, the modules argument for a vector (semantic) search pipeline that sequentially chains together parser, text-embedder, and vector-db modules might look like this:
modules={'parser': {'model': 'fixed', 'params': {'chunk_size': 10, 'overlap_size': 5}},
         'text-embedder': {'model': 'all-MiniLM-L6-v2', 'params': {}},
         'vector-db': {'model': 'faiss', 'params': {}}}
Note that any modules not explicitly called out will take their default values. Specifying one module's model or params does not oblige you to specify every module in the pipeline. Consequently, given that the text-embedder and vector-db modules in the code immediately above are being set to their default values, you could achieve the exact same thing by removing them from the code and leaving only the parser module, as follows:
modules={'parser': {'model':'fixed', 'params': {"chunk_size": 10, "overlap_size": 5}}}
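Because the modules argument is a nested dictionary, typos in its keys are easy to make. The following is a small local sanity check, sketched under the assumption that each module entry contains only the 'model' and 'params' keys shown above. It is not Krixik's own validation, and the helper name check_modules_shape is hypothetical.

```python
ALLOWED_KEYS = {"model", "params"}


def check_modules_shape(modules: dict) -> list:
    """Return a list of human-readable problems; an empty list means the shape looks fine."""
    problems = []
    for module_name, config in modules.items():
        if not isinstance(config, dict):
            problems.append(f"{module_name}: expected a dict, got {type(config).__name__}")
            continue
        unexpected = set(config) - ALLOWED_KEYS
        if unexpected:
            problems.append(f"{module_name}: unexpected keys {sorted(unexpected)}")
        if not isinstance(config.get("params", {}), dict):
            problems.append(f"{module_name}: 'params' should be a dict")
    return problems


full = {
    "parser": {"model": "fixed", "params": {"chunk_size": 10, "overlap_size": 5}},
    "text-embedder": {"model": "all-MiniLM-L6-v2", "params": {}},
    "vector-db": {"model": "faiss", "params": {}},
}
parser_only = {"parser": full["parser"]}
typo = {"parser": {"modle": "fixed"}}  # misspelled 'model' key

print(check_modules_shape(full))
print(check_modules_shape(parser_only))
print(check_modules_shape(typo))
```

The check only looks at structure. Whether a given model name or parameter is valid for a module is still decided by Krixik.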
Find detail on each of our current modules, including available models for each, here.
Using your own Models
Do you have a model—either one you've developed or one you've fine-tuned—that you'd like to use on Krixik?
Please click here to learn how to do so!
Optional Metadata Arguments
The process method also takes a variety of optional metadata arguments. These do not change how process runs or treats data. Instead, they make your processed files easier to retrieve and organize. You can think of them as a file system for the files you've processed through your pipelines.
Optional metadata arguments include:
- symbolic_directory_path (str) - A UNIX-formatted directory path under your account in the Krixik system. Default is /etc.
- file_name (str) - A custom file name that must end with the file extension of the original input file. Default is a randomly-generated string (see below).
- symbolic_file_path (str) - A combination of symbolic_directory_path and file_name in a single argument. Default is a concatenation of the default of each.
- file_tags (list) - A list of custom file tags (each a key-value pair). Default is an empty list.
- file_description (str) - A custom file description. Default is an empty string.
The first four of these (symbolic_directory_path, file_name, symbolic_file_path, and file_tags) can be used as arguments to the list method and to the keyword_search and semantic_search methods.
Note that a file you process through one pipeline is only accessible to that pipeline. If you upload a file to a certain symbolic_directory_path on a certain pipeline, for instance, you will not be able to list, search, or otherwise access it from any other pipeline, even if you target the same symbolic_directory_path from there.
Also note that a symbolic_file_path cannot be duplicated within a pipeline. In other words, if on a certain pipeline you process a file to a specified symbolic_directory_path and file_name, Krixik will not allow you to process any other files with that same combination of symbolic_directory_path and file_name.
Let's call the process method once more. We'll use the same product review file as before, but expand our line of code with some of these optional metadata arguments:
# process short input file with optional metadata arguments
process_demo_output = pipeline.process(
local_file_path=data_dir + "input/recliner_reviews.json",
local_save_directory=data_dir + "output",
expire_time=60 * 30,
wait_for_process=True,
verbose=False,
symbolic_directory_path="/my/custom/filepath",
file_name="product_reviews.json",
file_tags=[{"category": "furniture"}, {"product code": "recliner-47b-u11"}],
file_description="Three product reviews for the Orwell Cloq recliner.",
)
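The relationship between symbolic_directory_path, file_name, and symbolic_file_path described above can be sketched locally. This is an illustration only: it assumes the two parts are joined with a forward slash, and the helper names are hypothetical rather than part of the Krixik client.

```python
import os
import posixpath


def build_symbolic_file_path(symbolic_directory_path: str, file_name: str) -> str:
    """Join a UNIX-style symbolic directory and a file name (assumed '/' separator)."""
    return posixpath.join(symbolic_directory_path, file_name)


def extension_matches(local_file_path: str, file_name: str) -> bool:
    """Check that file_name ends with the same extension as the local input file."""
    local_ext = os.path.splitext(local_file_path)[1].lower()
    custom_ext = posixpath.splitext(file_name)[1].lower()
    return local_ext == custom_ext


used_paths = set()  # local record of symbolic file paths already used on one pipeline


def register(symbolic_file_path: str) -> None:
    """Raise if this symbolic file path was already used, mirroring the no-duplicates rule."""
    if symbolic_file_path in used_paths:
        raise ValueError(f"duplicate symbolic_file_path: {symbolic_file_path}")
    used_paths.add(symbolic_file_path)


path = build_symbolic_file_path("/my/custom/filepath", "product_reviews.json")
print(path)  # /my/custom/filepath/product_reviews.json
print(extension_matches("input/recliner_reviews.json", "product_reviews.json"))  # True
register(path)
```

Calling register(path) a second time raises ValueError, which is the local analogue of Krixik refusing a duplicated symbolic_file_path within a pipeline.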
Metadata Argument Defaults
- If no file_name is provided, a random one is generated. It takes the form krixik_generated_file_name_{10 random chars}.ext, where .ext is the extension of the input file provided in local_file_path.
- If no symbolic_directory_path is provided, it takes the default value /etc.
- Note that you cannot define any child directories under the symbolic_directory_path /etc; it is the catch-all directory, and is not meant to be built under.
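To make the default file_name pattern concrete, the sketch below generates a name of the same form locally. It is only an imitation of the pattern: the character set used for the 10 random characters (lowercase letters and digits) is an assumption, and Krixik generates the real name server-side.

```python
import os
import secrets
import string


def default_file_name(local_file_path: str) -> str:
    """Imitate the krixik_generated_file_name_{10 random chars}.ext pattern."""
    ext = os.path.splitext(local_file_path)[1]  # e.g. ".json"
    alphabet = string.ascii_lowercase + string.digits  # assumption: actual character set unknown
    suffix = "".join(secrets.choice(alphabet) for _ in range(10))
    return f"krixik_generated_file_name_{suffix}{ext}"


name = default_file_name("input/recliner_reviews.json")
print(name)
```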
Automatic File Type Conversions
For certain modules, the process method automatically converts the format of some local_file_path input files. Conversions currently done by Krixik are:
- pdf -> txt
- docx -> txt
- pptx -> txt
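If you need to anticipate which format a file will be in after these conversions, the list above can be expressed as a simple lookup. This helper is a hypothetical local convenience, not part of the Krixik client, and it ignores the fact that conversion only applies for certain modules.

```python
import os

# Conversions listed above, keyed by input extension.
CONVERSIONS = {".pdf": ".txt", ".docx": ".txt", ".pptx": ".txt"}


def effective_extension(local_file_path: str) -> str:
    """Return the extension a file would have after any automatic conversion."""
    ext = os.path.splitext(local_file_path)[1].lower()
    return CONVERSIONS.get(ext, ext)


print(effective_extension("slides/quarterly.pptx"))  # .txt
print(effective_extension("input/recliner_reviews.json"))  # .json
```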
Output Size Cap
The current size limit on output generated by the process method is 5MB.