Enhancing Large Language Models with Self-Reflective Retrieval-Augmented Generation (Self-RAG)


Let’s walk through the implementation of Self-RAG in LlamaIndex, step by step.
1. User Query Input
Description: The process begins when a user submits a query to the system. Queries can range from simple factual questions to complex, multi-faceted requests.
Code Integration: Handled by the custom_query method in the SelfRAGQueryEngine class.
def custom_query(self, query_str: str) -> Response:
    """Run self-RAG."""
    response = self.llm(prompt=_format_prompt(query_str), **_GENERATE_KWARGS)
    answer = response["choices"][0]["text"]
    source_nodes = []
    # Further processing...
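Before stepping through the phases, here is a minimal setup sketch showing how the engine might be wired up. The data directory, model path, and constructor arguments are placeholders and may differ across LlamaIndex versions:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
# SelfRAGQueryEngine ships with the llama-index-packs-self-rag package;
# the exact import path may vary by version.
from llama_index.packs.self_rag import SelfRAGQueryEngine

# Build a simple vector index over local documents and expose a retriever.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=5)

# The Self-RAG model is a llama.cpp checkpoint fine-tuned to emit
# reflection tokens such as [Retrieval] and [Relevant].
query_engine = SelfRAGQueryEngine(
    model_path="./selfrag_llama2_7b.gguf",  # placeholder path
    retriever=retriever,
    verbose=True,
)
response = query_engine.custom_query("Who won the 2020 US presidential election?")
print(response)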
2. Initial Response Generation
Description: Upon receiving the query, the LLM generates a preliminary response from its internal (parametric) knowledge. During this phase, the model signals whether additional information is needed by emitting a special retrieval token.
Code Integration: Generating the initial response and checking for the retrieval token.
response = self.llm(prompt=_format_prompt(query_str), **_GENERATE_KWARGS)
answer = response["choices"][0]["text"]
if "[Retrieval]" in answer:
    # Proceed to the retrieval phase
    ...
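For context, _format_prompt wraps the query in the instruction format the Self-RAG checkpoints were trained on. A sketch based on the original Self-RAG repository (the exact template shipped with the pack may differ slightly):

def _format_prompt(input: str, paragraph: str = None) -> str:
    # Instruction-tuned prompt format expected by Self-RAG checkpoints.
    prompt = f"### Instruction:\n{input}\n\n### Response:\n"
    if paragraph is not None:
        # Evidence goes between <paragraph> tags after a [Retrieval]
        # token, so the model can critique it in the next pass.
        prompt += f"[Retrieval]<paragraph>{paragraph}</paragraph>"
    return prompt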
3. Decision to Retrieve External Information
Description: The presence of a retrieval token indicates that the initial response may benefit from additional external information. The system decides whether to proceed with retrieval based on this token.
Code Integration: Conditional check for retrieval token.
if "[Retrieval]" in answer:
if self.verbose:
print_text("Retrieval required\n", color="blue")
documents = self.retriever.retrieve(query_str)
...
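The pack makes this decision by simple string matching on the generated text. The Self-RAG paper also describes a softer, threshold-based variant; the following hypothetical helper (should_retrieve is not in the pack) shows what that could look like using llama.cpp-style log-probabilities:

import math

def should_retrieve(top_logprobs: dict, threshold: float = 0.2) -> bool:
    """Hypothetical threshold-based retrieval decision (sketch).

    top_logprobs maps candidate tokens to log-probabilities at the
    position where the model chooses between retrieval tokens.
    """
    p_ret = math.exp(top_logprobs.get("[Retrieval]", float("-inf")))
    p_no = math.exp(top_logprobs.get("[No Retrieval]", float("-inf")))
    total = p_ret + p_no
    # Retrieve only when [Retrieval] carries enough normalized probability.
    return total > 0 and (p_ret / total) > threshold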
4. Retrieval of Relevant Documents
Description: The system searches the indexed corpus (or an external database) and retrieves the top-K documents most pertinent to the user’s query. In this implementation, relevance is determined by the retriever’s similarity scores.
Code Integration: Retrieving documents using the retriever.
documents = self.retriever.retrieve(query_str)
paragraphs = [
    _format_prompt(query_str, document.node.text) for document in documents
]
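Assuming the _format_prompt sketch above, each entry in paragraphs pairs the original query with one candidate document wrapped in evidence tags:

docs = retriever.retrieve("Who won the 2020 US presidential election?")
paragraphs = [
    _format_prompt("Who won the 2020 US presidential election?", d.node.text)
    for d in docs
]
print(paragraphs[0])
# ### Instruction:
# Who won the 2020 US presidential election?
#
# ### Response:
# [Retrieval]<paragraph>...retrieved document text...</paragraph>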
5. Evaluation of Retrieved Documents
Description: Each retrieved document is evaluated to determine its relevance and how well it supports the initial response. This involves generating reflection tokens that critique each document’s utility.
Code Integration: Evaluation is performed in the _run_critic method.
def _run_critic(self, paragraphs: List[str]) -> CriticOutput:
    """Run the critic pass over each evidence paragraph and score it."""
    paragraphs_final_score = {}
    llm_response_text = {}
    source_nodes = []  # a list, since NodeWithScore objects are appended below
    for p_idx, paragraph in enumerate(paragraphs):
        pred = self.llm(paragraph, **self.generate_kwargs)
        llm_response_text[p_idx] = pred["choices"][0]["text"]
        logprobs = pred["choices"][0]["logprobs"]
        pred_log_probs = logprobs["top_logprobs"]
        # Reflection-token scores: relevance, support, and usefulness.
        isRel_score = _relevance_score(pred_log_probs[0])
        isSup_score = _is_supported_score(logprobs["tokens"], pred_log_probs)
        isUse_score = _is_useful_score(logprobs["tokens"], pred_log_probs)
        # Weighted combination; usefulness is down-weighted by half.
        paragraphs_final_score[p_idx] = (
            isRel_score + isSup_score + 0.5 * isUse_score
        )
        source_nodes.append(
            NodeWithScore(
                node=TextNode(text=paragraph, id_=str(p_idx)),
                score=isRel_score,
            )
        )
    return CriticOutput(llm_response_text, paragraphs_final_score, source_nodes)
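Each scoring helper normalizes the probability mass of a small set of reflection tokens. As an illustration, a relevance scorer could look like the following (a sketch assuming the critic’s first generated token chooses between [Relevant] and [Irrelevant]; the pack’s actual helpers may differ):

import math

_REL_TOKENS = ("[Relevant]", "[Irrelevant]")

def _relevance_score(top_logprobs: dict) -> float:
    # Normalize the probability of [Relevant] against [Irrelevant] at the
    # first generated position; tokens absent from the dict get zero mass.
    probs = {t: math.exp(top_logprobs.get(t, float("-inf"))) for t in _REL_TOKENS}
    total = sum(probs.values())
    return probs["[Relevant]"] / total if total > 0 else 0.0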
6. Selection of Supporting Documents
Description: Based on the evaluations, the system selects the most pertinent documents to incorporate into the final response. Typically, the document with the highest relevance score is prioritized, but the system can integrate information from multiple documents if beneficial.
Code Integration: Selecting the best paragraph based on final scores.
critic_output = self._run_critic(paragraphs)
paragraphs_final_score = critic_output.paragraphs_final_score
llm_response_per_paragraph = critic_output.llm_response_per_paragraph
best_paragraph_id = max(
    paragraphs_final_score, key=paragraphs_final_score.get
)
answer = llm_response_per_paragraph[best_paragraph_id]
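If blending several pieces of evidence is beneficial, a hypothetical extension (not in the pack) could keep the top-N paragraphs instead of only the argmax:

# Hypothetical top-N variant (sketch): combine the answers generated
# from the three highest-scoring paragraphs.
top_ids = sorted(
    paragraphs_final_score, key=paragraphs_final_score.get, reverse=True
)[:3]
answer = "\n".join(llm_response_per_paragraph[i] for i in top_ids)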
7. Final Response Generation
Description: The LLM generates a refined and comprehensive response by incorporating insights from the selected documents. The response is post-processed to remove any control tokens or unwanted characters before being returned to the user.
Code Integration: Post-processing and returning the final response.
answer = _postprocess_answer(answer)
if self.verbose:
    print_text(f"Final answer: {answer}\n", color="green")
return Response(response=str(answer), source_nodes=source_nodes)
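As a sketch of what the post-processing step could look like, the helper below strips the control tokens defined in the Self-RAG paper (the pack’s actual _postprocess_answer may differ):

import re

# Reflection/control tokens defined by the Self-RAG paper.
_CTRL_TOKEN_RE = re.compile(
    r"\[(?:Retrieval|No Retrieval|Relevant|Irrelevant|Fully supported|"
    r"Partially supported|No support / Contradictory|Utility:[1-5])\]"
)

def _postprocess_answer(answer: str) -> str:
    answer = _CTRL_TOKEN_RE.sub("", answer)      # drop control tokens
    return answer.replace("</s>", "").strip()    # drop EOS artifacts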
8. Iterative Retrieval (If Necessary)
Description: If the initial retrieval and integration do not fully address the user’s query, the system can perform additional rounds of retrieval and refinement. This iterative process helps fill gaps and resolve ambiguities in the response.
Note: While the concept of iterative retrieval is proposed, current implementations like LlamaIndex may not fully support this feature yet.
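To make the idea concrete, a conceptual outer loop might look like this. This is a sketch only, not the pack’s behavior: iterative_self_rag, the round limit, and the query-augmentation strategy are all assumptions.

def iterative_self_rag(engine, query_str: str, max_rounds: int = 3) -> str:
    """Conceptual sketch of iterative retrieval (not in the pack)."""
    query = query_str
    answer = ""
    for _ in range(max_rounds):
        raw = engine.llm(prompt=_format_prompt(query), **_GENERATE_KWARGS)
        answer = raw["choices"][0]["text"]
        if "[Retrieval]" not in answer:
            break  # the model is satisfied without more evidence
        # Retrieve, critique, and fold the best evidence back into the query.
        docs = engine.retriever.retrieve(query)
        paragraphs = [_format_prompt(query, d.node.text) for d in docs]
        critic = engine._run_critic(paragraphs)
        best = max(critic.paragraphs_final_score,
                   key=critic.paragraphs_final_score.get)
        query = f"{query_str}\n\nEvidence so far: {critic.llm_response_per_paragraph[best]}"
    return _postprocess_answer(answer)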
