Enhancing Large Language Models with Self-Reflective Retrieval-Augmented Generation (Self-RAG)


Let’s walk through the implementation of Self-RAG in LlamaIndex, step by step.
1. User Query Input
Description: The process begins when a user submits a query to the system. Queries can range from simple factual questions to complex, multi-faceted requests.
Code Integration: Handled by the custom_query method in the SelfRAGQueryEngine class.
def custom_query(self, query_str: str) -> Response:
    """Run self-RAG."""
    response = self.llm(prompt=_format_prompt(query_str), **_GENERATE_KWARGS)
    answer = response["choices"][0]["text"]
    source_nodes = []
    # Further processing...
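Before stepping through the phases, here is a minimal setup sketch showing how the engine might be wired up. The data directory, model path, and constructor arguments are placeholders and may differ across LlamaIndex versions:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
# SelfRAGQueryEngine ships with the llama-index-packs-self-rag package;
# the exact import path may vary by version.
from llama_index.packs.self_rag import SelfRAGQueryEngine

# Build a simple vector index over local documents and expose a retriever.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=5)

# The Self-RAG model is a llama.cpp checkpoint fine-tuned to emit
# reflection tokens such as [Retrieval] and [Relevant].
query_engine = SelfRAGQueryEngine(
    model_path="./selfrag_llama2_7b.gguf",  # placeholder path
    retriever=retriever,
    verbose=True,
)
response = query_engine.custom_query("Who won the 2020 US presidential election?")
print(response)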
2. Initial Response Generation
Description: Upon receiving the query, the LLM generates a preliminary response from its internal (parametric) knowledge. During this phase, the model signals whether additional information is needed by emitting a special retrieval token.
Code Integration: Generating the initial response and checking for the retrieval token.
response = self.llm(prompt=_format_prompt(query_str), **_GENERATE_KWARGS)
answer = response["choices"][0]["text"]
if "[Retrieval]" in answer:
    # Proceed to the retrieval phase
    ...
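For context, _format_prompt wraps the query in the instruction format the Self-RAG checkpoints were trained on. A sketch based on the original Self-RAG repository (the exact template shipped with the pack may differ slightly):

def _format_prompt(input: str, paragraph: str = None) -> str:
    # Instruction-tuned prompt format expected by Self-RAG checkpoints.
    prompt = f"### Instruction:\n{input}\n\n### Response:\n"
    if paragraph is not None:
        # Evidence goes between <paragraph> tags after a [Retrieval]
        # token, so the model can critique it in the next pass.
        prompt += f"[Retrieval]<paragraph>{paragraph}</paragraph>"
    return prompt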
3. Decision to Retrieve External Information
Description: The presence of a retrieval token indicates that the initial response may benefit from additional external information. The system decides whether to proceed with retrieval based on this token.
Code Integration: Conditional check for retrieval token.
if "[Retrieval]" in answer:
if self.verbose:
print_text("Retrieval required\n", color="blue")
documents = self.retriever.retrieve(query_str)
...
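The pack makes this decision by simple string matching on the generated text. The Self-RAG paper also describes a softer, threshold-based variant; the following hypothetical helper (should_retrieve is not in the pack) shows what that could look like using llama.cpp-style log-probabilities:

import math

def should_retrieve(top_logprobs: dict, threshold: float = 0.2) -> bool:
    """Hypothetical threshold-based retrieval decision (sketch).

    top_logprobs maps candidate tokens to log-probabilities at the
    position where the model chooses between retrieval tokens.
    """
    p_ret = math.exp(top_logprobs.get("[Retrieval]", float("-inf")))
    p_no = math.exp(top_logprobs.get("[No Retrieval]", float("-inf")))
    total = p_ret + p_no
    # Retrieve only when [Retrieval] carries enough normalized probability.
    return total > 0 and (p_ret / total) > threshold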
4. Retrieval of Relevant Documents
Description: The system searches the indexed corpus (or an external database) and retrieves the top-K documents most pertinent to the user’s query. In this implementation, relevance is determined by the retriever’s similarity scores.
Code Integration: Retrieving documents using the retriever.
documents = self.retriever.retrieve(query_str)
paragraphs = [
    _format_prompt(query_str, document.node.text) for document in documents
]
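Assuming the _format_prompt sketch above, each entry in paragraphs pairs the original query with one candidate document wrapped in evidence tags:

docs = retriever.retrieve("Who won the 2020 US presidential election?")
paragraphs = [
    _format_prompt("Who won the 2020 US presidential election?", d.node.text)
    for d in docs
]
print(paragraphs[0])
# ### Instruction:
# Who won the 2020 US presidential election?
#
# ### Response:
# [Retrieval]<paragraph>...retrieved document text...</paragraph>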
5. Evaluation of Retrieved Documents
Description: Each retrieved document is evaluated to determine its relevance and how well it supports the initial response. This involves generating reflection tokens that critique each document’s utility.
Code Integration: Evaluation is performed in the _run_critic method.
def _run_critic(self, paragraphs: List[str]) -> CriticOutput:
    """Run the critic pass over each evidence paragraph and score it."""
    paragraphs_final_score = {}
    llm_response_text = {}
    source_nodes = []  # a list, since NodeWithScore objects are appended below
    for p_idx, paragraph in enumerate(paragraphs):
        pred = self.llm(paragraph, **self.generate_kwargs)
        llm_response_text[p_idx] = pred["choices"][0]["text"]
        logprobs = pred["choices"][0]["logprobs"]
        pred_log_probs = logprobs["top_logprobs"]
        # Reflection-token scores: relevance, support, and usefulness.
        isRel_score = _relevance_score(pred_log_probs[0])
        isSup_score = _is_supported_score(logprobs["tokens"], pred_log_probs)
        isUse_score = _is_useful_score(logprobs["tokens"], pred_log_probs)
        # Weighted combination; usefulness is down-weighted by half.
        paragraphs_final_score[p_idx] = (
            isRel_score + isSup_score + 0.5 * isUse_score
        )
        source_nodes.append(
            NodeWithScore(
                node=TextNode(text=paragraph, id_=str(p_idx)),
                score=isRel_score,
            )
        )
    return CriticOutput(llm_response_text, paragraphs_final_score, source_nodes)
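Each scoring helper normalizes the probability mass of a small set of reflection tokens. As an illustration, a relevance scorer could look like the following (a sketch assuming the critic’s first generated token chooses between [Relevant] and [Irrelevant]; the pack’s actual helpers may differ):

import math

_REL_TOKENS = ("[Relevant]", "[Irrelevant]")

def _relevance_score(top_logprobs: dict) -> float:
    # Normalize the probability of [Relevant] against [Irrelevant] at the
    # first generated position; tokens absent from the dict get zero mass.
    probs = {t: math.exp(top_logprobs.get(t, float("-inf"))) for t in _REL_TOKENS}
    total = sum(probs.values())
    return probs["[Relevant]"] / total if total > 0 else 0.0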
6. Selection of Supporting Documents
Description: Based on the evaluations, the system selects the most pertinent documents to incorporate into the final response. Typically, the document with the highest relevance score is prioritized, but the system can integrate information from multiple documents if beneficial.
Code Integration: Selecting the best paragraph based on final scores.
critic_output = self._run_critic(paragraphs)
paragraphs_final_score = critic_output.paragraphs_final_score
llm_response_per_paragraph = critic_output.llm_response_per_paragraph
best_paragraph_id = max(
    paragraphs_final_score, key=paragraphs_final_score.get
)
answer = llm_response_per_paragraph[best_paragraph_id]
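If blending several pieces of evidence is beneficial, a hypothetical extension (not in the pack) could keep the top-N paragraphs instead of only the argmax:

# Hypothetical top-N variant (sketch): combine the answers generated
# from the three highest-scoring paragraphs.
top_ids = sorted(
    paragraphs_final_score, key=paragraphs_final_score.get, reverse=True
)[:3]
answer = "\n".join(llm_response_per_paragraph[i] for i in top_ids)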
7. Final Response Generation
Description: The LLM generates a refined and comprehensive response by incorporating insights from the selected documents. The response is post-processed to remove any control tokens or unwanted characters before being returned to the user.
Code Integration: Post-processing and returning the final response.
answer = _postprocess_answer(answer)
if self.verbose:
    print_text(f"Final answer: {answer}\n", color="green")
return Response(response=str(answer), source_nodes=source_nodes)
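As a sketch of what the post-processing step could look like, the helper below strips the control tokens defined in the Self-RAG paper (the pack’s actual _postprocess_answer may differ):

import re

# Reflection/control tokens defined by the Self-RAG paper.
_CTRL_TOKEN_RE = re.compile(
    r"\[(?:Retrieval|No Retrieval|Relevant|Irrelevant|Fully supported|"
    r"Partially supported|No support / Contradictory|Utility:[1-5])\]"
)

def _postprocess_answer(answer: str) -> str:
    answer = _CTRL_TOKEN_RE.sub("", answer)      # drop control tokens
    return answer.replace("</s>", "").strip()    # drop EOS artifacts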
8. Iterative Retrieval (If Necessary)
Description: If the initial retrieval and integration do not fully address the user’s query, the system can perform additional rounds of retrieval and refinement. This iterative process helps fill gaps and resolve ambiguities in the response.
Note: While the concept of iterative retrieval is proposed, current implementations like LlamaIndex may not fully support this feature yet.
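To make the idea concrete, a conceptual outer loop might look like this. This is a sketch only, not the pack’s behavior: iterative_self_rag, the round limit, and the query-augmentation strategy are all assumptions.

def iterative_self_rag(engine, query_str: str, max_rounds: int = 3) -> str:
    """Conceptual sketch of iterative retrieval (not in the pack)."""
    query = query_str
    answer = ""
    for _ in range(max_rounds):
        raw = engine.llm(prompt=_format_prompt(query), **_GENERATE_KWARGS)
        answer = raw["choices"][0]["text"]
        if "[Retrieval]" not in answer:
            break  # the model is satisfied without more evidence
        # Retrieve, critique, and fold the best evidence back into the query.
        docs = engine.retriever.retrieve(query)
        paragraphs = [_format_prompt(query, d.node.text) for d in docs]
        critic = engine._run_critic(paragraphs)
        best = max(critic.paragraphs_final_score,
                   key=critic.paragraphs_final_score.get)
        query = f"{query_str}\n\nEvidence so far: {critic.llm_response_per_paragraph[best]}"
    return _postprocess_answer(answer)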
