How Amazon Revolutionized Finance Operations with Generative AI
Amazon Finance Operations has long relied on Accounts Payable (AP) and Accounts Receivable (AR) analysts to handle customer queries. These queries, which arrive through email, internal tools, or phone, often require analysts to sift through extensive policy documents or consult subject matter experts (SMEs). The process is thorough but slow: resolution can take hours or even days, especially for new hires who lack immediate access to the necessary information. To remove this bottleneck, Amazon Finance Automation built a generative AI-powered question-and-answer (Q&A) chat assistant on Amazon Bedrock.
The assistant uses a large language model (LLM) to give analysts fast, accurate answers to customer queries, significantly reducing response times. By integrating the assistant into their workflow, Amazon has streamlined operations and improved analyst productivity. Here’s a closer look at how the solution was built and the impact it has had.
Solution Overview
The foundation of this solution is a Retrieval Augmented Generation (RAG) pipeline running on Amazon Bedrock. When a user submits a query, the pipeline retrieves relevant documents from a knowledge base and passes them to the LLM, which generates a response grounded in the retrieved context. This keeps responses both accurate and contextually relevant. A simplified sketch of the flow appears below.
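To make the flow concrete, here is a minimal sketch of such a pipeline, assuming an OpenSearch index already populated with embedded policy documents and the AWS SDK for Python (boto3). The index name, field names, model IDs, and helper functions are illustrative assumptions, not Amazon's actual implementation.

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")  # Amazon Bedrock runtime client


def embed(text: str) -> list[float]:
    """Embed text with an Amazon Titan embeddings model (model ID illustrative)."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]


def retrieve(opensearch_client, query: str, k: int = 5) -> list[str]:
    """k-NN search over the document index (index and field names are assumptions)."""
    body = {
        "size": k,
        "query": {"knn": {"embedding": {"vector": embed(query), "k": k}}},
    }
    hits = opensearch_client.search(index="finance-policies", body=body)
    return [hit["_source"]["text"] for hit in hits["hits"]["hits"]]


def answer(question: str, contexts: list[str]) -> str:
    """Generate an answer grounded in the retrieved context."""
    context_block = "\n\n".join(contexts)
    prompt = (
        "Answer the question using only the context below. "
        'If the context is insufficient, reply "I don\'t know".\n\n'
        f"Context:\n{context_block}\n\nQuestion: {question}"
    )
    resp = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative FM choice
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]
```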
The architecture of the solution includes several key components:
- Knowledge Base: Amazon OpenSearch Service was used as the vector store for embedding documents. Multiple Amazon finance policy documents were processed and indexed into this knowledge base. Plans are underway to migrate to Amazon Bedrock Knowledge Bases for enhanced scalability and reduced cluster management.
- Embedding Model: The Amazon Titan Multimodal Embeddings G1 model was employed because it delivered higher retrieval accuracy than the other embedding models the team evaluated.
- Generator Model: A foundation model from Amazon Bedrock was used to deliver precise and quick answers.
- Diversity Ranker: This component reorders retrieved results so they are not dominated by a single document or section.
- Lost in the Middle Ranker: This component places the most relevant results at the positions in the prompt where the LLM attends to them best, improving response quality (see the reordering sketch after this list).
- Guardrails: Amazon Bedrock Guardrails were implemented to detect personally identifiable information (PII) and prevent prompt injection attacks.
- Validation Engine: This engine removes PII and ensures that generated answers align with the retrieved context. If the context is insufficient, the system responds with “I don’t know” to avoid inaccuracies.
- Chat Assistant UI: Built using Streamlit, this user-friendly interface facilitates seamless interaction between analysts and the AI assistant.
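Research on long-context LLMs suggests that models attend most reliably to the beginning and end of a prompt, so a Lost in the Middle ranker places the strongest passages at the edges and lets the weakest fall in the middle. Below is a minimal sketch of that reordering; the function name is illustrative, and this is not Amazon's implementation.

```python
def lost_in_the_middle_reorder(passages: list[str]) -> list[str]:
    """Reorder passages (sorted best-first on input) so the top results sit at
    the start and end of the prompt, where the LLM attends most reliably."""
    front, back = [], []
    for rank, passage in enumerate(passages):
        if rank % 2 == 0:
            front.append(passage)  # ranks 1, 3, 5, ... fill the front
        else:
            back.append(passage)   # ranks 2, 4, 6, ... fill the back
    return front + back[::-1]


# Example: the two strongest passages end up first and last.
print(lost_in_the_middle_reorder(["r1", "r2", "r3", "r4", "r5"]))
# ['r1', 'r3', 'r5', 'r4', 'r2']
```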
Evaluating RAG Performance
Accuracy is paramount for the success of the chat assistant. Initially, the system achieved a 49% accuracy rate, which fell short of expectations. To address this, Amazon adopted an automated performance evaluation approach:
- Testing Data: A dataset of 100 questions was created, covering various sources such as policy documents and engineering SOPs. Each question was paired with an expected answer (manually labeled by SMEs) and the bot’s generated answer.
- NLP Scores: Conventional metrics such as ROUGE and METEOR were tried first, but their scores deviated from human evaluations by roughly 30%.
- LLM-Based Scoring: Using a foundation model (FM) from Amazon Bedrock as a judge, the team designed prompts to evaluate accuracy, acceptability, and factualness. This method reduced the deviation from human analysis to just 5% (a sketch of the pattern follows this list).
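Below is a minimal sketch of this LLM-as-judge pattern using the Bedrock Converse API; the rubric wording, JSON output format, and model ID are illustrative assumptions rather than the team's actual evaluation prompt.

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

JUDGE_PROMPT = """You are grading a Q&A assistant for finance analysts.
Question: {question}
Expected answer (SME-labeled): {expected}
Generated answer: {generated}

Rate the generated answer against the expected answer and respond with
JSON only: {{"accuracy": <0.0-1.0>, "acceptable": <true|false>, "factual": <true|false>}}"""


def judge(question: str, expected: str, generated: str) -> dict:
    """Ask a Bedrock foundation model to grade a single Q&A pair."""
    resp = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative choice
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_PROMPT.format(
                question=question, expected=expected, generated=generated)}],
        }],
        inferenceConfig={"temperature": 0},  # deterministic grading
    )
    return json.loads(resp["output"]["message"]["content"][0]["text"])


# Aggregate over the labeled test set, e.g.:
# accuracy = sum(judge(q, e, g)["accuracy"] for q, e, g in test_set) / len(test_set)
```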
Amazon also used the RAG evaluation tool for Amazon Bedrock Knowledge Bases to assess metrics such as context relevance, correctness, and responsible AI factors.
Improving RAG Accuracy
Through iterative improvements, Amazon raised the accuracy of the RAG pipeline from 49% to 86%. Here’s how:
1. Document Semantic Chunking (49% to 64%)
Initially, incomplete contexts were a major issue because the segmentation algorithm used fixed chunk sizes without considering document boundaries. To resolve this, Amazon implemented a new segmentation approach using QUILL Editor, Amazon Titan Text Embeddings, and OpenSearch Service. This method preserved logical document structures and improved the accuracy of context retrieval; a simplified sketch of the underlying idea follows.
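The published details of this pipeline are limited, so the sketch below shows only the general idea of embedding-based semantic chunking: keep appending paragraphs to the current chunk while adjacent paragraphs stay similar, and start a new chunk when similarity drops. The threshold value and function names are illustrative assumptions.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_chunks(
    paragraphs: list[str],
    embed: Callable[[str], list[float]],  # e.g. the Titan embed() sketched earlier
    threshold: float = 0.75,              # illustrative value; would need tuning
) -> list[str]:
    """Split on semantic boundaries instead of fixed sizes: a new chunk starts
    wherever an adjacent pair of paragraphs falls below the similarity threshold."""
    chunks: list[str] = []
    current = [paragraphs[0]]
    prev_vec = embed(paragraphs[0])
    for para in paragraphs[1:]:
        vec = embed(para)
        if cosine(prev_vec, vec) < threshold:
            chunks.append("\n".join(current))  # close the current chunk
            current = []
        current.append(para)
        prev_vec = vec
    chunks.append("\n".join(current))
    return chunks
```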
2. Prompt Engineering (64% to 76%)
Prompt engineering played a critical role in enhancing the LLM’s performance. Key strategies included:
- Preventing the LLM from generating responses when no relevant context was available.
- Encouraging more comprehensive responses after user feedback indicated that answers were too brief.
- Enabling both concise and detailed answers based on user needs.
- Incorporating citations to validate the LLM’s responses.
- Introducing chain-of-thought (CoT) reasoning to improve coherence and reduce hallucinations.
These improvements were implemented through meta-prompting, which tailors the prompt to the specific task; a condensed example template follows.
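The template below illustrates how those strategies can be combined into a single grounded-answer prompt. The wording is an assumption for illustration, not Amazon's production prompt.

```python
PROMPT_TEMPLATE = """You are an assistant for Amazon Finance Operations analysts.

Rules:
1. Answer ONLY from the numbered context passages below.
2. If the passages do not contain the answer, reply exactly: "I don't know".
3. Reason step by step (chain of thought) before giving the final answer.
4. Match the requested style: "concise" for a short direct answer,
   "detailed" for a comprehensive answer covering all relevant policy points.
5. Cite the supporting passage numbers in square brackets, for example [1][3].

Context passages:
{contexts}

Answer style: {style}
Question: {question}"""


def build_prompt(question: str, contexts: list[str], style: str = "concise") -> str:
    """Number the retrieved passages so the model can cite them, then fill the template."""
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return PROMPT_TEMPLATE.format(contexts=numbered, style=style, question=question)
```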
3. Amazon Titan Text Embeddings Model (76% to 86%)
Despite the earlier improvements, relevance scores for retrieved contexts remained suboptimal. By adopting the Amazon Titan Text Embeddings G1 model, Amazon increased the relevance of retrieved contexts from 55–65% to 75–80%, further boosting overall accuracy. A sketch of how such a relevance measurement might look follows.
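Swapping the embedding model is effectively a one-line change to the embed() helper sketched earlier, and context relevance can be approximated with a simple query-to-passage similarity. The model ID default and the metric below are illustrative assumptions, not the team's actual measurement.

```python
import json
import math

import boto3

bedrock = boto3.client("bedrock-runtime")


def embed(text: str, model_id: str = "amazon.titan-embed-text-v1") -> list[float]:
    """Same Titan call as the earlier sketches; the model ID is the only change
    needed to trial a different embedding model."""
    resp = bedrock.invoke_model(modelId=model_id, body=json.dumps({"inputText": text}))
    return json.loads(resp["body"].read())["embedding"]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_context_relevance(query: str, contexts: list[str]) -> float:
    """Rough proxy for context relevance: the average cosine similarity between
    the query and each retrieved passage."""
    q = embed(query)
    return sum(cosine(q, embed(c)) for c in contexts) / len(contexts)
```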
Conclusion
Amazon’s journey to develop a generative AI Q&A chat assistant has been marked by continuous innovation and improvement. By leveraging a RAG pipeline and LLMs on Amazon Bedrock, the team addressed challenges like hallucinations, document ingestion issues, and context retrieval inaccuracies. The result is a highly efficient system that has transformed how Amazon Finance Operations handles customer queries, achieving an accuracy rate of 86%.
This solution serves as a blueprint for organizations looking to implement similar AI-driven tools to enhance productivity and streamline operations.
About the Authors
Soheb Moin: A Software Development Engineer at Amazon, Soheb specializes in generative AI and big data analytics. Outside of work, he enjoys traveling, playing badminton, and chess.
Nitin Arora: A Senior Software Development Manager at Amazon, Nitin has over 19 years of experience in building scalable software. He enjoys music and reading in his spare time.
Yunfei Bai: A Principal Solutions Architect at AWS, Yunfei designs AI/ML solutions and has a PhD in Electronic and Electrical Engineering. He enjoys reading and music.
Kumar Satyen Gaurav: A Software Development Manager at Amazon, Kumar has 16 years of expertise in big data analytics. He enjoys reading, traveling, and chess.
Mohak Chugh: A Software Development Engineer at Amazon, Mohak works on RAG-based chatbots and enjoys playing the piano and performing with his band.
Parth Bavishi: A Senior Product Manager at Amazon, Parth leads generative AI initiatives and enjoys volleyball and reading.