Skip to content

búsqueda semántica básica

Open In Colab

Pipeline Multimodular: Búsqueda Semántica Básica

🇺🇸 English version of this document

Este documento detalla un pipeline multimodular que recibe un archivo de texto como entrada y habilita búsqueda semántica sobre él.

El documento se divide en las siguientes secciones:

Monta tu Pipeline

Para lograr lo arriba descrito, monta un pipeline que consiste de los siguientes módulos en secuencia:

Para esto usarás el método create_pipeline de la siguiente manera:

# crear el pipeline descrito
pipeline = krixik.create_pipeline(name="multi_busqueda_semantica_basica", module_chain=["parser", "text-embedder", "vector-db"])

Procesa un Archivo de Entrada

Examina el archivo de prueba antes de continuar:

# examina el archivo de entrada
with open(data_dir + "input/1984_corto.txt", "r") as file:
    print(file.read())
It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape the
vile wind, slipped quickly through the glass doors of Victory Mansions,
though not quickly enough to prevent a swirl of gritty dust from entering
along with him.

The hallway smelt of boiled cabbage and old rag mats. At one end of it a
coloured poster, too large for indoor display, had been tacked to the wall.
It depicted simply an enormous face, more than a metre wide: the face of a
man of about forty-five, with a heavy black moustache and ruggedly handsome
features. Winston made for the stairs. It was no use trying the lift. Even
at the best of times it was seldom working, and at present the electric
current was cut off during daylight hours. It was part of the economy drive
in preparation for Hate Week. The flat was seven flights up, and Winston,
who was thirty-nine and had a varicose ulcer above his right ankle, went
slowly, resting several times on the way. On each landing, opposite the
lift-shaft, the poster with the enormous face gazed from the wall. It was
one of those pictures which are so contrived that the eyes follow you about
when you move. BIG BROTHER IS WATCHING YOU, the caption beneath it ran.

Inside the flat a fruity voice was reading out a list of figures which had
something to do with the production of pig-iron. The voice came from an
oblong metal plaque like a dulled mirror which formed part of the surface
of the right-hand wall. Winston turned a switch and the voice sank
somewhat, though the words were still distinguishable. The instrument
(the telescreen, it was called) could be dimmed, but there was no way of
shutting it off completely. He moved over to the window: a smallish, frail
figure, the meagreness of his body merely emphasized by the blue overalls
which were the uniform of the party. His hair was very fair, his face
naturally sanguine, his skin roughened by coarse soap and blunt razor
blades and the cold of the winter that had just ended.

Outside, even through the shut window-pane, the world looked cold. Down in
the street little eddies of wind were whirling dust and torn paper into
spirals, and though the sun was shining and the sky a harsh blue, there
seemed to be no colour in anything, except the posters that were plastered
everywhere. The black-moustachio'd face gazed down from every commanding
corner. There was one on the house-front immediately opposite. BIG BROTHER
IS WATCHING YOU, the caption said, while the dark eyes looked deep into
Winston's own. Down at street level another poster, torn at one corner,
flapped fitfully in the wind, alternately covering and uncovering the
single word INGSOC. In the far distance a helicopter skimmed down between
the roofs, hovered for an instant like a bluebottle, and darted away again
with a curving flight. It was the police patrol, snooping into people's
windows. The patrols did not matter, however. Only the Thought Police
mattered.

Behind Winston's back the voice from the telescreen was still babbling away
about pig-iron and the overfulfilment of the Ninth Three-Year Plan. The
telescreen received and transmitted simultaneously. Any sound that Winston
made, above the level of a very low whisper, would be picked up by it,
moreover, so long as he remained within the field of vision which the metal
plaque commanded, he could be seen as well as heard. There was of course
no way of knowing whether you were being watched at any given moment. How
often, or on what system, the Thought Police plugged in on any individual
wire was guesswork. It was even conceivable that they watched everybody all
the time. But at any rate they could plug in your wire whenever they wanted
to. You had to live--did live, from habit that became instinct--in the
assumption that every sound you made was overheard, and, except in
darkness, every movement scrutinized.

Winston kept his back turned to the telescreen. It was safer; though, as he
well knew, even a back can be revealing. A kilometre away the Ministry of
Truth, his place of work, towered vast and white above the grimy landscape.
This, he thought with a sort of vague distaste--this was London, chief
city of Airstrip One, itself the third most populous of the provinces of
Oceania. He tried to squeeze out some childhood memory that should tell him
whether London had always been quite like this. Were there always these
vistas of rotting nineteenth-century houses, their sides shored up with
baulks of timber, their windows patched with cardboard and their roofs
with corrugated iron, their crazy garden walls sagging in all directions?
And the bombed sites where the plaster dust swirled in the air and the
willow-herb straggled over the heaps of rubble; and the places where the
bombs had cleared a larger patch and there had sprung up sordid colonies
of wooden dwellings like chicken-houses? But it was no use, he could not
remember: nothing remained of his childhood except a series of bright-lit
tableaux occurring against no background and mostly unintelligible.

The Ministry of Truth--Minitrue, in Newspeak [Newspeak was the official
language of Oceania. For an account of its structure and etymology see
Appendix.]--was startlingly different from any other object in sight. It
was an enormous pyramidal structure of glittering white concrete, soaring
up, terrace after terrace, 300 metres into the air. From where Winston
stood it was just possible to read, picked out on its white face in
elegant lettering, the three slogans of the Party:


  WAR IS PEACE
  FREEDOM IS SLAVERY
  IGNORANCE IS STRENGTH

Usarás los modelos predeterminados en este pipeline, así que el argumento modules del método process no hará falta.

# procesa el archivo a través del pipeline según lo arriba descrito
process_output = pipeline.process(
    local_file_path=data_dir + "input/1984_corto.txt",  # la ruta de archivo inicial en la que yace el archivo de entrada
    local_save_directory=data_dir + "output",  # el directorio local en el que se guardará el archivo de salida
    expire_time=60 * 30,  # data de este proceso se eliminará del sistema Krixik en 30 minutos
    wait_for_process=True,  # espera que el proceso termine antes de devolver control del IDE al usuario
    verbose=False,  # no mostrar actualizaciones de proceso al ejecutar el código
)

Reproduce la salida de este proceso con el siguiente código. Para aprender más sobre cada componente de la salida, estudia la documentación del método process.

Dado que la salida de este modelo/módulo es un archivo de base de datos FAISS, process_output se muestra como "null". Sin embargo, el archivo de salida se ha guardado en la ubicación indicada bajo process_output_files. El file_id del archivo procesado es el prefijo del nombre del archivo de salida en esta ubicación.

# nítidamente reproduce la salida de este proceso
print(json.dumps(process_output, indent=2))
{
  "status_code": 200,
  "pipeline": "multi_basic_semantic_search",
  "request_id": "46c2fed2-ed7b-4046-8fa9-3f6143391699",
  "file_id": "3c9f01fe-1a93-43f4-b87c-5270225a0575",
  "message": "SUCCESS - output fetched for file_id 3c9f01fe-1a93-43f4-b87c-5270225a0575.Output saved to location(s) listed in process_output_files.",
  "warnings": [],
  "process_output": null,
  "process_output_files": [
    "../../../data/output/3c9f01fe-1a93-43f4-b87c-5270225a0575.faiss"
  ]
}

Busqueda Semantica

El método semantic_search de Krixik habilita búsqueda semántica sobre documentos procesados a través de ciertos pipelines. Dado que el método semantic_search hace embedding (encaje léxico) con la consulta y luego lleva a cabo la búsqueda, solo se puede usar con pipelines que de manera consecutiva contienen los módulos text-embedder (encaje léxico) y vector-db (base de datos vectorial).

Ya que tu pipeline satisface esta condición tiene acceso al método semantic_search. Úsalo de la siguiente manera para consultar el texto con lengua natural:

# haz búsqueda semántica sobre el texto procesado por el pipeline
semantic_output = pipeline.semantic_search(query="it was cold night", file_ids=[process_output["file_id"]])

# nítidamente reproduce la salida de esta búsqueda
print(json.dumps(semantic_output, indent=2))
{
  "status_code": 200,
  "request_id": "a6ee5fb9-f652-4ffa-a1bd-a824eb3f9079",
  "message": "Successfully queried 1 user file.",
  "warnings": [],
  "items": [
    {
      "file_id": "3c9f01fe-1a93-43f4-b87c-5270225a0575",
      "file_metadata": {
        "file_name": "krixik_generated_file_name_rnhqfhzbgh.txt",
        "symbolic_directory_path": "/etc",
        "file_tags": [],
        "num_vectors": 50,
        "created_at": "2024-06-05 14:51:02",
        "last_updated": "2024-06-05 14:51:02"
      },
      "search_results": [
        {
          "snippet": "Outside, even through the shut window-pane, the world looked cold.",
          "line_numbers": [
            32,
            33
          ],
          "distance": 0.232
        },
        {
          "snippet": "It was a bright cold day in April, and the clocks were striking thirteen.",
          "line_numbers": [
            1
          ],
          "distance": 0.236
        },
        {
          "snippet": "His hair was very fair, his face\nnaturally sanguine, his skin roughened by coarse soap and blunt razor\nblades and the cold of the winter that had just ended.",
          "line_numbers": [
            29,
            30,
            31
          ],
          "distance": 0.324
        },
        {
          "snippet": "Down in\nthe street little eddies of wind were whirling dust and torn paper into\nspirals, and though the sun was shining and the sky a harsh blue, there\nseemed to be no colour in anything, except the posters that were plastered\neverywhere.",
          "line_numbers": [
            33,
            34,
            35,
            36,
            37
          ],
          "distance": 0.33
        },
        {
          "snippet": "It was the police patrol, snooping into people's\nwindows.",
          "line_numbers": [
            44,
            45
          ],
          "distance": 0.352
        }
      ]
    }
  ]
}