recursive summarization
Multi-Module Pipeline: Recursive Summarization
🇨🇴 Versión en español de este documento
A practical way to achieve short and abstract (but representative) summaries of long or medium-length documents is to apply summarization recursively. This concept was discussed in our overview of the single-module summarize pipeline, where we applied a single summarize module pipeline several times to create terser and terser summary representations of an input text.
Recursively summarizing thus facilitates a quicker understanding of key ideas across different levels of detail. This approach is useful for creating hierarchical summaries, optimizing information retrieval, and providing varying levels of detail depending on user needs or application requirements.
In this document we reproduce the same result via a pipeline consisting of multiple summarize modules in immediate succession. Processing files through this pipeline recursively summarizes with a single pipeline invocation. If you require further summarization, you can build a similar pipeline with more summarize modules in it.
The document is divided into the following sections:
Pipeline Setup
To achieve what we've described above, let's set up a pipeline consisting of three sequential summarize modules.
We do this by leveraging the create_pipeline method, as follows:
# create a pipeline as detailed above
pipeline = krixik.create_pipeline(name="multi_recursive_summarization", module_chain=["summarize", "summarize", "summarize"])
Processing an Input File
A pipeline's valid input formats are determined by its first module—in this case, a summarize module. Therefore, this pipeline only accepts textual document inputs.
Let's take a quick look at a short test file before processing.
# examine contents of input file
with open(data_dir + "input/1984_short.txt", "r") as file:
print(file.read())
It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape the
vile wind, slipped quickly through the glass doors of Victory Mansions,
though not quickly enough to prevent a swirl of gritty dust from entering
along with him.
The hallway smelt of boiled cabbage and old rag mats. At one end of it a
coloured poster, too large for indoor display, had been tacked to the wall.
It depicted simply an enormous face, more than a metre wide: the face of a
man of about forty-five, with a heavy black moustache and ruggedly handsome
features. Winston made for the stairs. It was no use trying the lift. Even
at the best of times it was seldom working, and at present the electric
current was cut off during daylight hours. It was part of the economy drive
in preparation for Hate Week. The flat was seven flights up, and Winston,
who was thirty-nine and had a varicose ulcer above his right ankle, went
slowly, resting several times on the way. On each landing, opposite the
lift-shaft, the poster with the enormous face gazed from the wall. It was
one of those pictures which are so contrived that the eyes follow you about
when you move. BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
Inside the flat a fruity voice was reading out a list of figures which had
something to do with the production of pig-iron. The voice came from an
oblong metal plaque like a dulled mirror which formed part of the surface
of the right-hand wall. Winston turned a switch and the voice sank
somewhat, though the words were still distinguishable. The instrument
(the telescreen, it was called) could be dimmed, but there was no way of
shutting it off completely. He moved over to the window: a smallish, frail
figure, the meagreness of his body merely emphasized by the blue overalls
which were the uniform of the party. His hair was very fair, his face
naturally sanguine, his skin roughened by coarse soap and blunt razor
blades and the cold of the winter that had just ended.
Outside, even through the shut window-pane, the world looked cold. Down in
the street little eddies of wind were whirling dust and torn paper into
spirals, and though the sun was shining and the sky a harsh blue, there
seemed to be no colour in anything, except the posters that were plastered
everywhere. The black-moustachio'd face gazed down from every commanding
corner. There was one on the house-front immediately opposite. BIG BROTHER
IS WATCHING YOU, the caption said, while the dark eyes looked deep into
Winston's own. Down at street level another poster, torn at one corner,
flapped fitfully in the wind, alternately covering and uncovering the
single word INGSOC. In the far distance a helicopter skimmed down between
the roofs, hovered for an instant like a bluebottle, and darted away again
with a curving flight. It was the police patrol, snooping into people's
windows. The patrols did not matter, however. Only the Thought Police
mattered.
Behind Winston's back the voice from the telescreen was still babbling away
about pig-iron and the overfulfilment of the Ninth Three-Year Plan. The
telescreen received and transmitted simultaneously. Any sound that Winston
made, above the level of a very low whisper, would be picked up by it,
moreover, so long as he remained within the field of vision which the metal
plaque commanded, he could be seen as well as heard. There was of course
no way of knowing whether you were being watched at any given moment. How
often, or on what system, the Thought Police plugged in on any individual
wire was guesswork. It was even conceivable that they watched everybody all
the time. But at any rate they could plug in your wire whenever they wanted
to. You had to live--did live, from habit that became instinct--in the
assumption that every sound you made was overheard, and, except in
darkness, every movement scrutinized.
Winston kept his back turned to the telescreen. It was safer; though, as he
well knew, even a back can be revealing. A kilometre away the Ministry of
Truth, his place of work, towered vast and white above the grimy landscape.
This, he thought with a sort of vague distaste--this was London, chief
city of Airstrip One, itself the third most populous of the provinces of
Oceania. He tried to squeeze out some childhood memory that should tell him
whether London had always been quite like this. Were there always these
vistas of rotting nineteenth-century houses, their sides shored up with
baulks of timber, their windows patched with cardboard and their roofs
with corrugated iron, their crazy garden walls sagging in all directions?
And the bombed sites where the plaster dust swirled in the air and the
willow-herb straggled over the heaps of rubble; and the places where the
bombs had cleared a larger patch and there had sprung up sordid colonies
of wooden dwellings like chicken-houses? But it was no use, he could not
remember: nothing remained of his childhood except a series of bright-lit
tableaux occurring against no background and mostly unintelligible.
The Ministry of Truth--Minitrue, in Newspeak [Newspeak was the official
language of Oceania. For an account of its structure and etymology see
Appendix.]--was startlingly different from any other object in sight. It
was an enormous pyramidal structure of glittering white concrete, soaring
up, terrace after terrace, 300 metres into the air. From where Winston
stood it was just possible to read, picked out on its white face in
elegant lettering, the three slogans of the Party:
WAR IS PEACE
FREEDOM IS SLAVERY
IGNORANCE IS STRENGTH
With the process method we run this input through the pipeline. We use the default model for the summarize module each time, so the modules argument doesn't have to be leveraged.
# process the file through the pipeline, as described above
process_output = pipeline.process(
local_file_path=data_dir + "input/1984_short.txt", # the initial local filepath where the input file is stored
local_save_directory=data_dir + "output", # the local directory that the output file will be saved to
expire_time=60 * 30, # process data will be deleted from the Krixik system in 30 minutes
wait_for_process=True, # wait for process to complete before returning IDE control to user
verbose=False,
) # do not display process update printouts upon running code
The output of this process is printed below. To learn more about each component of the output, review documentation for the process method.
The output text file itself has been saved to the location noted in the process_output_files key. The file_id of the processed input is used as a filename prefix for the output file.
# nicely print the output of this process
print(json.dumps(process_output, indent=2))
{
"status_code": 200,
"pipeline": "multi_recursive_summarization",
"request_id": "166d7e3a-ce29-4c82-9f06-17ca2016906e",
"file_id": "33d00458-6d3d-4a43-92c4-e7027c908c38",
"message": "SUCCESS - output fetched for file_id 33d00458-6d3d-4a43-92c4-e7027c908c38.Output saved to location(s) listed in process_output_files.",
"warnings": [],
"process_output": null,
"process_output_files": [
"../../../data/output/33d00458-6d3d-4a43-92c4-e7027c908c38.txt"
]
}
To confirm that everything went as it should have, let's load in the text file output from process_output_files:
# load in process output from file
with open(process_output["process_output_files"][0], "r") as file:
print(file.read())
Winston Smith walked through the glass doors of Victory Mansions. The hallway
smelled of boiled cabbage and old rag mats. A kilometre away, his
place of work, the Ministry of Truth, towered vast and white.