AJ - AWS Certified Generative AI Developer - Professional (AIP-C01) Exam Handout
Author: Ayyanar Jeyakrishnan | Exam Duration: 4 Hours | Questions: 85

Table of Contents
1. Exam Overview
2. Amazon Q Family
3. AWS AI/ML Services
4. Prompting Techniques
5. Getting Started with Amazon Bedrock
6. Fine-Tuning, Continued Pre-Training & Distillation
7. Inference, Throughput & Monitoring
8. Bedrock Knowledge Bases & RAG
9. Bedrock Agents & Strands SDK
10. Model Evaluation
11. Security, Responsible AI & Guardrails
12. Developing GenAI Applications - Best Practices
Quick Reference: Key Numbers to Remember
1. Exam Overview
| Attribute | Detail |
|---|---|
| Exam Code | AIP-C01 |
| Duration | 4 hours (240 minutes) |
| Questions | 85 |
| Format | Multiple choice, multiple response |
Exam Domains
| Domain |
|---|
| Content Domain 1: Foundation Model Integration, Data Management, and Compliance |
| Content Domain 2: Implementation and Integration |
| Content Domain 3: AI Safety, Security, and Governance |
| Content Domain 4: Operational Efficiency and Optimization for GenAI Applications |
| Content Domain 5: Testing, Validation, and Troubleshooting |
Ref: AWS Certified Generative AI Developer - Professional | Exam Guide PDF
2. Amazon Q Family
Amazon Q Developer (formerly CodeWhisperer)
AI-powered code generation, debugging, and transformation
Supports 15+ programming languages
IDE integration (VS Code, JetBrains, AWS Cloud9)
Code security scanning and vulnerability detection
/transform command for Java code modernization (e.g., Java 8 to 17)
Amazon Q Business
Enterprise RAG assistant connecting 40+ data sources
Integrations: Salesforce, ServiceNow, SharePoint, Slack, Gmail, Atlassian, MS 365, S3
Permission-aware: Respects ACLs from identity providers
Personalized responses based on IdP data (department, role, etc.)
Q Apps: Convert conversations into lightweight task automation apps (Pro tier)
Plugins: JIRA, ServiceNow, Zendesk, custom OpenAPI plugins
Browser extension: Chrome, Firefox, Slack, Teams, Word, Outlook
Security & Access:
IdP support: Okta, Google Identity, Entra, IAM Identity Center
ACLs ingested from IdP service
All data stays within region; no data used for training
Retrievers:
Native retriever (all integrations)
Existing retriever (Amazon Kendra)
Index provisioning: Enterprise (1M docs, multi-AZ) | Starter (100K docs, single-AZ)
Admin Controls & Guardrails:
Restrict responses to enterprise sources only (or fallback to LLM knowledge)
Topic restrictions, blocked words
Data handling and response generation policies
CloudWatch Metrics: AWS/QBusiness namespace - DocumentsIndexed, ThumbsUpCount, ThumbsDownCount
Amazon Q in QuickSight
Natural language to dashboard generation
Business review story generation
Multi-source data unification for insights
Amazon Q in Connect (Customer Service)
Real-time agent assistance for contact centers
Automated response suggestions and knowledge search
Ref: Amazon Q Business | Q Business Features | Q Business Integrations
3. AWS AI/ML Services
AWS HealthScribe
HIPAA-compliant automatic speech recognition (ASR)
Generates clinical notes from patient-clinician conversations
Extracts structured medical data
Supports clinical documentation workflows
Amazon Comprehend
NLP service for text analysis
Entity recognition, sentiment analysis, key phrase extraction
Topic modeling, language detection
Custom entity recognition and classification models
Amazon Comprehend Medical
Specialized NLP for healthcare text
Extracts medical entities: medications, conditions, dosages, procedures
HIPAA eligible service
Identifies PHI (Protected Health Information)
ICD-10 and RxNorm ontology linking
4. Prompting Techniques
This section is heavily tested. Know each technique, when to use it, and how it differs from others.
Chain-of-Thought (CoT) Prompting
Ask the model to reason step-by-step before giving a final answer
Most useful for math, logic, and multi-step reasoning
Zero-shot CoT: Add "Let's think step by step" to the prompt
Few-shot CoT: Provide examples with reasoning chains
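Example zero-shot CoT prompt: "A train covers 60 km in 45 minutes. What is its average speed in km/h? Let's think step by step." The cue pushes the model to show intermediate steps (45 min = 0.75 h; 60 / 0.75 = 80 km/h) before committing to a final answer.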
ReAct (Reasoning + Acting)
Combines reasoning (CoT) with tool calls in an interleaved loop
Pattern: Thought -> Action -> Observation -> Thought -> ...
Foundation of the agent loop and deep research patterns
Model reasons about what to do, takes an action (API call, KB search), observes result, then plans next step
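An illustrative single ReAct iteration (the tool name and values here are hypothetical):
Thought: I need the current stock level for SKU-123 before I can answer.
Action: get_inventory(sku="SKU-123")
Observation: {"sku": "SKU-123", "on_hand": 42}
Thought: 42 units on hand - I can answer directly now.
Final Answer: There are 42 units of SKU-123 in stock.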
Tree of Thought (ToT)
Explores multiple reasoning paths simultaneously (branching tree)
Uses search algorithms (BFS/DFS) for systematic exploration
Enables lookahead and backtracking - if one path fails, try another
Best for problems with many possible solutions
Maieutic Prompting
Iterative explanation technique inspired by Socratic method
Model generates explanations, then critiques its own reasoning
Goal: do not leave inconsistencies - resolve contradictions
Related to the 5 Whys technique - keep asking "why" to reach root cause
Complexity-Based Prompting
Generate multiple CoT reasoning chains in parallel
Select the most common conclusion across chains
Filters out outlier/incorrect reasoning paths
Effective for ambiguous problems
Least-to-Most Prompting
List subproblems first, then solve from simplest upward
Decompose complex tasks into ordered subtasks
Each solution feeds into the next, building toward final answer
Self-Refine Prompting
Tell the model to iterate over its own output
Produce initial solution -> Critique it -> Produce improved version
Repeat until quality threshold is met
Directional Stimulus Prompting
Provide hints, cues, or keywords in the prompt
Guide the model toward desired output without explicit answers
Useful for steering generation direction
Prompt Chaining
Break tasks into subtasks, chain prompts sequentially
Output of one prompt becomes input for the next
Each step performs a transformation
Example: Extract -> Summarize -> Translate -> Format
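A minimal chaining sketch in Python using the Converse API (modelId and prompts are illustrative, not the only pattern):

import boto3

client = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # illustrative model choice

def ask(prompt: str) -> str:
    """Single-turn helper: send one prompt, return the model's text."""
    resp = client.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]

document = "..."  # source text
facts = ask(f"Extract the key facts from this text:\n{document}")   # Extract
summary = ask(f"Summarize these facts in two sentences:\n{facts}")  # Summarize
german = ask(f"Translate this summary to German:\n{summary}")       # Translate
final = ask(f"Format this as a bulleted list:\n{german}")           # Format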
Ref: Prompt Engineering Guide | Chain-of-Thought Prompting | Prompt Chaining
5. Getting Started with Amazon Bedrock
Model Evaluation Before Release
| Method | Use Case |
|---|---|
| Bedrock Model Evaluation | Batch dataset, detailed scores across metrics |
| Playground Compare | Single prompt, two models side-by-side, token controls, latency |
Prompt Management
Version control for prompts - track changes, audit trail
A/B testing across prompt versions
Parameterized templates - reusable prompts with variables
KMS encryption for prompt security
Bedrock Flows
Drag-and-drop visual builder for GenAI workflows
Connect: Knowledge Bases, Prompts, Lambda functions
Example flow:
[User Input] -> [KB Search] -> [LLM Processing] -> [Output]
Frameworks and Tools
| Framework | Best For |
|---|---|
| LangChain | Chatbots, agents, chains, tool integration |
| LlamaIndex | Data retrieval, processing, RAG pipelines |
Bedrock Runtime API Response Structure
{
  "message": { "role": "assistant", "content": [...] },
  "stopReason": "end_turn",
  "usage": { "inputTokens": X, "outputTokens": Y }
}
Converse API (Preferred)
Unified structure regardless of model used (no model-specific formatting)
Supports: tools, guardrails, system prompts, text + image
Temperature, topP, maxTokens controls
Use try-catch blocks for error handling
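A minimal Converse call with that error handling might look like this (region and modelId are illustrative):

import boto3
from botocore.exceptions import ClientError

client = boto3.client("bedrock-runtime", region_name="us-east-1")  # assumed region

try:
    response = client.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": "Summarize RAG in one sentence."}]}],
        system=[{"text": "You are a concise technical assistant."}],
        inferenceConfig={"temperature": 0.2, "topP": 0.9, "maxTokens": 256},
    )
    print(response["output"]["message"]["content"][0]["text"])
    print(response["usage"])  # inputTokens / outputTokens for cost tracking
except ClientError as err:
    # e.g. ThrottlingException -> back off and retry; ValidationException -> malformed input
    print(err.response["Error"]["Code"], err)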
Common Bedrock Errors
| Error | Root Cause |
|---|---|
| Service Quota Exceeded | Account limits reached |
| ThrottlingException | Too many requests per second |
| Data Issues | Training/validation/output data problems |
| Token Count Exceeded | Input or output too long |
| Malformed Input | Doesn't match model's expected format |
| Internal Server Errors | AWS-side issues |
Ref: Amazon Bedrock | Converse API
6. Fine-Tuning, Continued Pre-Training & Distillation
When to Use What
Prompt Engineering & RAG fall short?
        |
  Yes --+-- Need domain knowledge? --------> Continued Pre-Training (CPT)
        |
        +-- Need task-specific skill? -----> Fine-Tuning
        |
        +-- Need smaller/cheaper model? ---> Distillation
Fine-Tuning
Input: Small, labeled dataset (prompt-completion pairs)
Pros: Quick, cheap, small data requirements
Cons: Easy to overfit!
Use cases: Sentiment analysis, text summarization, chatbots, classification
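A hedged sketch of starting a fine-tuning job via the CreateModelCustomizationJob API (names, ARNs, bucket paths, and hyperparameter keys are illustrative and model-dependent):

import boto3

bedrock = boto3.client("bedrock")

bedrock.create_model_customization_job(
    jobName="sentiment-ft-001",
    customModelName="sentiment-classifier-v1",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",  # placeholder
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="FINE_TUNING",  # or "CONTINUED_PRE_TRAINING"
    trainingDataConfig={"s3Uri": "s3://my-bucket/train.jsonl"},  # prompt-completion pairs
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},
    hyperParameters={"epochCount": "2", "learningRate": "0.00001"},  # keys vary by model
)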
PEFT (Parameter-Efficient Fine-Tuning) Techniques
| Technique | Description |
|---|---|
| LoRA | Train a small subset of parameters via low-rank matrices |
| QLoRA | LoRA + quantization for memory efficiency |
| Prefix Tuning | Prepend trainable prefix vectors to each model layer |
| Prompt Tuning | Inject learnable soft prompts at the input layer only |
| P-Tuning | Automated prompt training with neural networks |
| RLHF | Reinforcement learning from human feedback |
| Multi-task Fine-tuning | Train on multiple tasks simultaneously |
Continued Pre-Training (CPT)
Input: Large, unlabeled domain-specific corpus
Extends model's foundational knowledge
Use cases: Scientific papers, legal documents, financial reports, news articles
Model Distillation (Bedrock Distillation Service)
Transfer knowledge from teacher model (large) to student model (small)
Example: Llama 70B -> Llama 8B
Sources: Custom prompts, prompts + completions, or invocation logs
Fine-tuning with labels generated by teacher model
Recommended for specific domains
Custom Model Validation Results
| Metric | Description |
|---|---|
| step_number | Single pass over one training batch |
| epoch_number | All steps in one epoch |
| validation_loss | Lower = model better fits validation data |
| validation_perplexity | How well the model predicts token sequences (lower = better) |
Ref: Bedrock Fine-Tuning | CPT | PEFT Techniques Blog
7. Inference, Throughput & Monitoring
Inference Options
| Option | Details | Savings |
|---|---|---|
| On-Demand | Pay per token, no commitment | Baseline |
| Provisioned Throughput | Purchase Model Units (MU), hourly rate | 40-60% savings |
| Batch Inference | Queue jobs for async processing | ~50% savings |
| Cross-Region Inference | Route to other regions for capacity | No extra data transfer charges |
Provisioned Throughput Details
1 MU = X input tokens + Y output tokens per minute (model-dependent)
Commitment: 1 month, 6 months, or no commitment
Burst capacity covered by on-demand
Per region only - does not work with cross-region inference
Cross-Region Inference
Same price as on-demand in primary region
No extra charges for data transfer
Logs remain in source region
CloudWatch and CloudTrail record in source region
CloudWatch KPIs for Bedrock
| Metric | Use |
|---|---|
| Invocations | Track usage volume |
| InvocationLatency | Detect performance degradation |
| ClientErrors | Prompt/UI issues |
| ServerErrors | Stability, capacity issues |
| InputTokenCount / OutputTokenCount | Cost monitoring |
| Throttles | Key indicator that you need Provisioned Throughput |
Invocation Logging
Set up in Bedrock console for all models in account
Destinations: S3, CloudWatch Logs, or both
Options for PII masking
Use for: Auditing, pattern analytics, troubleshooting (CW Logs Insights)
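For example, a CloudWatch Logs Insights query over the invocation log group to break usage down by model (field names assume the standard invocation-log schema; verify against your log group):

fields @timestamp, modelId
| filter operation = "Converse"
| stats count(*) as calls by modelId
| sort calls desc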
CloudTrail Data Events for Bedrock
InvokeModel, InvokeFlow, InvokeAgent, Retrieve (KB), ApplyGuardrail
Integrate with GuardDuty for threat detection
Monitoring Best Practices
Performance:
Establish baseline metrics (2-week observability period recommended)
Proactive alerting on deviations (e.g., 5% error increase in 5 minutes)
Track model-specific metrics: coherence, perplexity
Monitor usage against quotas and throttling
Cost:
Invocation logs for usage patterns
Optimize prompts to reduce token usage
Cost allocation tags + budgets + anomaly detection
Consider batch inference for non-real-time workloads
Security:
Audit API access via CloudTrail
GuardDuty for automated threat scans
Monitor CW Logs for PII exposure
Enforce compliance with Guardrails + AWS Config
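A sketch of proactive alerting on throttling, per the guidance above (threshold, model ID, and SNS topic are illustrative):

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="bedrock-throttles-high",
    Namespace="AWS/Bedrock",
    MetricName="Throttles",  # key signal that Provisioned Throughput may be needed
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    Statistic="Sum",
    Period=300,              # 5-minute window
    EvaluationPeriods=1,
    Threshold=10,            # tune against your 2-week baseline
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
)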
Ref: Cross-Region Inference | Bedrock Monitoring | Invocation Logging
8. Bedrock Knowledge Bases & RAG
Knowledge Base Configuration
KB Params: Name, description, tags, IAM role, query engine, log deliveries (NOT inference logging)
Data Source Params:
Name, location (S3 URI - must be in same region)
Parsing strategy: Text | Foundation Model | Data Automation
Chunking strategy (semantic, fixed-size, hierarchical)
Transformation Lambda for custom chunking/metadata
Embedding model + vector store selection
Retrieval Configurations
| Setting | Options |
|---|---|
| Search Type | Semantic or Hybrid (text + semantic) |
| Max Results | Configurable |
| Inference Params | Temperature, top-p, top-k, max tokens |
| Prompt Template | System prompt customization |
| Guardrails | Attach guardrail ID |
| Reranking | Improve relevance ordering |
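A minimal RetrieveAndGenerate sketch wiring several of these settings together (knowledgeBaseId and modelArn are placeholders):

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

resp = agent_runtime.retrieve_and_generate(
    input={"text": "What is our refund policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB123EXAMPLE",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5,            # Max Results
                    "overrideSearchType": "HYBRID",  # text + semantic
                }
            },
        },
    },
)
print(resp["output"]["text"])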
Structured Data Retrieval
Query is translated into SQL for structured data sources
Natural language to SQL generation
KB Best Practices
High quality data - clean, well-structured documents
Chunking strategy - align with your query patterns
Feedback loops from users for continuous improvement
KB evaluation - use LLM-as-a-judge
Plan for scalability - monitor index growth
Responsible AI - regular audits for biases, relevance, accuracy
Logging - S3, CW Logs, Firehose
UX - clear UI, fast response time, multimodal support
Use reranking to improve result relevance
RAG Evaluation Metrics
| Category | Metrics |
|---|---|
| Query | Answer Relevancy - does the answer address the input query |
| Retrieval | Context Precision, Context Recall, Context Entity Recall |
| Generation | Faithfulness, Correctness, Coherence |
| Overall | Completeness, Harmfulness, Answer Refusal, Stereotyping |
RAGAS Interpretation
Faithfulness - answers grounded in retrieved context (hallucination detection)
Relevancy - answers address the question, no redundancy
Precision - relevant docs ranked higher
Recall - all relevant context retrieved vs ground truth
Entity Recall - entities in context vs ground truth
Answer Similarity - semantic comparison of answer vs ground truth
RAG Eval Best Practices
Diverse question sets covering various topics
Balance automatic and human evaluation
Iterative improvement: adjust chunking, reranking strategies
Domain-specific metrics
Regular re-evaluation as Knowledge Base grows
Ref: Bedrock Knowledge Bases | RAG Evaluation | KB How It Works
9. Bedrock Agents & Strands SDK
Agent Development Lifecycle
BUILD-TIME (Setup)                 RUNTIME (Execution)
+-----------------------+          +---------------------------+
| Select FM             |          | Pre-process               |
| Write Instructions    |  ----->  |   Validate user input     |
| Attach Action Groups  |          | Orchestrate               |
| Connect KBs           |          |   Think -> KB -> Actions  |
+-----------------------+          | Post-process              |
                                   |   Format response         |
                                   +---------------------------+
Agent Orchestration Flow
User sends input/query
FM receives input + context + system prompt
FM breaks down input into sequence of steps
For each step: execute API or query KB
Based on results, plan next action
Output final answer
Orchestration Customization
Customize pre-processing, orchestration, post-processing prompts
Parse using Lambda for dynamically changing prompts
Keep prompts: clear, concise, aligned with agent's capabilities
Action Groups
Multiple can be attached per agent
Max 3 functions per group
Lambda handler pattern:
agent = event["agent"]                                          # agent metadata
actionGroup = event["actionGroup"]                              # invoked action group
function = event["function"]                                    # function within the group
params = {p["name"]: p["value"] for p in event["parameters"]}   # flatten parameter list
session = event["sessionAttributes"]                            # cross-turn session state
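The handler then wraps its result in the agent response envelope; a hedged sketch of the expected shape for function-based action groups (verify messageVersion against the current Bedrock Agents docs):

def build_response(event, result_text):
    # Envelope Bedrock Agents expects back from the Lambda
    return {
        "messageVersion": event.get("messageVersion", "1.0"),  # echo what the agent sent
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {
                "responseBody": {"TEXT": {"body": result_text}}
            },
        },
    }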
Agent Performance Optimization
Tune: temperature, topK, length penalty
Customize advanced prompts (pre/post-processing, orchestration)
Continuous monitoring + user feedback
Track conversational metrics
Strands Agents SDK
Open-source framework from AWS for building production-ready agents
Three core components: Model Provider, System Prompt, Toolbelt
Native integration with Bedrock Guardrails, KBs, and AgentCore
MCP (Model Context Protocol) support
Observability with OpenTelemetry
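A minimal Strands sketch (the word_count tool is hypothetical; install with pip install strands-agents):

from strands import Agent, tool

@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

agent = Agent(
    system_prompt="You are a helpful writing assistant.",  # System Prompt
    tools=[word_count],                                    # Toolbelt
)  # Model Provider defaults to Bedrock

result = agent("How many words are in 'hello wonderful world'?")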
Strands vs Bedrock Agents
| Aspect | Strands SDK | Bedrock Agents |
|---|---|---|
| Control | Complete control of architecture | Managed, serverless |
| Configuration | Code-based | Console-based |
| Deployment | Self-managed or AgentCore | Fully managed |
| Flexibility | Maximum customization | Opinionated patterns |
| Multi-agent | Graph, Swarm, Workflow patterns | Built-in multi-agent collaboration |
10. Model Evaluation
Why Evaluate?
Quality assurance and performance benchmarking
Bias detection and fairness assessment
Comparative analysis (models or versions)
Continuous improvement guidance (training, fine-tuning)
Trust and transparency for stakeholders
Regulatory compliance (EU AI Act)
Resource optimization (is the model too large? need fine-tune?)
Evaluation Types on Bedrock
| Type | Description |
|---|---|
| Automatic (Programmatic) | Predefined metrics: accuracy, robustness, toxicity |
| Model-as-a-Judge | Select metrics and judge model; tasks: text gen, summary, QA, classification |
| Human-Based | Customized UI, form teams, flexible subjective evaluation |
Key Evaluation Metrics
| Metric | What It Measures | Scale |
|---|---|---|
| Perplexity | How well model predicts the completion | Lower = better (perplexity of 10 ≈ as uncertain as choosing among 10 equally likely tokens) |
| BLEU | Translation quality | 0-1 (1 = perfect match) |
| ROUGE-n | N-gram overlap between prediction and reference | 0-1 (1 = best) |
| Coherence / Fluency | Logical flow of output | May need human eval |
| BERTScore | Semantic similarity | Higher = better |
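A tiny worked example of the perplexity row: perplexity is exp of the average negative log-likelihood per token.

import math

token_probs = [0.5, 0.25, 0.1]  # model's probability for each observed token
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)  # ~4.3: as uncertain as ~4 equally likely tokens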
Task-Specific Metrics
| Task | Metrics |
|---|---|
| QA | Exact Match, F1 |
| Classification | Accuracy, Precision, Recall, F1 |
| Translation | BLEU |
| Summarization | ROUGE |
Built-in Evaluation Datasets
| Dataset | Use Case |
|---|---|
| TriviaQA | Question Answering |
| Natural Questions | Question Answering |
| WikiText 2 | Robustness |
| Real Toxicity | Toxicity detection |
| Gigaword | Summarization |
| E-Commerce Clothing Reviews | Text Classification |
Human Evaluation Setup
Define metrics with descriptions and rating methods
Thumbs up/down
Likert scale (5-star)
Freeform feedback (text field)
Number of workers per prompt
Set up CORS on the S3 bucket
Teams via SageMaker Ground Truth private workforce (Cognito or OIDC)
Optional SNS notifications for new tasks
Human Eval Analysis
Overview dashboard: Aggregate scores, rating distribution
Inter-rater agreement: Consistency across workers
Sample analysis: Individual samples with ratings
Comparative: Between models, human vs automatic
Action items: Prompt refinement, fine-tuning, bias mitigation
LLM-based Quality Assessment (RAG)
Faithfulness - detect hallucinations
Relevancy - penalize redundancy, incomplete answers
Context Precision - relevant documents ranked higher
Context Recall - context retrieved vs ground truth
Context Entity Recall - entities retrieved vs ground truth
Answer Similarity - semantic comparison vs ground truth
Correctness - accuracy of answer vs ground truth
Evaluation Limitations
No ground truth for creative tasks
Contextual dependency and subjectivity
Difficulty evaluating ethics and biases
Factual accuracy (hallucinations)
Consistency across interactions
Adversarial robustness
Need for new evaluation datasets as LLMs improve
Ref: Bedrock Evaluations | Evaluation Metrics | Model Evaluation Blog
11. Security, Responsible AI & Guardrails
Data Protection
No prompts or responses are used to train models
Separate deployment accounts per model provider per region
Provider isolation: Anthropic can't read prompts; Llama hosted separately from Claude
Encryption
TLS in transit
VPC Endpoints with private IPs
KMS encryption for: prompts, custom models, guardrails
IAM
Fine-grained policies
Service roles for Bedrock, agents, KBs
Compliance
Logging and monitoring (CloudTrail, CloudWatch)
SOC, ISO, HIPAA, GDPR
Amazon Bedrock Guardrails
| Feature | Description |
|---|---|
| Content Filtering | Block harmful, offensive content by category and severity |
| Denied Topics | Define off-limits subjects |
| Word Filters | Block specific words and phrases |
| PII Detection | Identify and redact personally identifiable information |
| Contextual Grounding | Check response faithfulness to source material |
| Automated Reasoning Checks | Mathematical logic to verify factual accuracy (up to 99%) |
Automated Reasoning Checks
Uses formal logic (not statistical methods) to detect hallucinations
Suggests corrections and highlights unstated assumptions
Validates AI responses against defined business rules
Critical for regulated industries (finance, healthcare, legal)
Currently in detection mode
Guardrails Integration
Apply to Bedrock models, agents, and KB responses
Synchronous mode: scans before response (adds latency)
Asynchronous mode: scans in parallel to streaming (small risk of brief inappropriate content)
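A standalone check via the ApplyGuardrail API might look like this (guardrail ID and version are placeholders):

import boto3

runtime = boto3.client("bedrock-runtime")

resp = runtime.apply_guardrail(
    guardrailIdentifier="gr-abc123example",
    guardrailVersion="1",
    source="OUTPUT",  # evaluate a model response; use "INPUT" for user prompts
    content=[{"text": {"text": "Candidate response to check..."}}],
)
if resp["action"] == "GUARDRAIL_INTERVENED":
    print("Blocked/redacted:", resp["outputs"])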
Ref: Bedrock Guardrails | Automated Reasoning Checks | Responsible AI Blog
12. Developing GenAI Applications - Best Practices
Design Decision Tree
Task Requirements Analysis
    |
    +-- No external data needed? ------> Prompt Engineering
    |
    +-- Need external/real-time data? -> RAG + Knowledge Bases
    |
    +-- Domain-specific knowledge? ----> RAG + PEFT Fine-Tuning
    |
    +-- Real-time actions needed? -----> Agents + Streaming
    |
    +-- External API integration? -----> Agents with Action Groups
Model Routing
Bedrock Intelligent Prompt Routing: routes within a single model family (e.g., a large and a small Llama)
- Only predefined model pairs
Custom Router: LangChain or Lambda-based
- Added latency but better cost control
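A hedged sketch of a Lambda-style custom router: short prompts go to a cheaper model (the 50-word cutoff and model IDs are illustrative):

import boto3

client = boto3.client("bedrock-runtime")
CHEAP_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"
STRONG_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def route(prompt: str) -> str:
    # Naive complexity heuristic; real routers may classify intent instead
    model_id = CHEAP_MODEL if len(prompt.split()) < 50 else STRONG_MODEL
    resp = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]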
Token Streaming
Reduces time-to-first-token for users
Works with Amazon Connect for voice AI
Response caching improves repeated query performance
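A minimal ConverseStream sketch that prints tokens as they arrive (modelId illustrative):

import boto3

client = boto3.client("bedrock-runtime")

resp = client.converse_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Explain RAG briefly."}]}],
)
for event in resp["stream"]:
    if "contentBlockDelta" in event:  # incremental text chunk
        print(event["contentBlockDelta"]["delta"]["text"], end="", flush=True)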
Guardrails with Streaming
Synchronous: Adds latency, scans before delivery
Asynchronous: Scans in parallel, small risk of brief inappropriate content
Cost Optimization Priority
Optimize prompts first - clarity, minimize output, specify format precisely
Provisioned Throughput - 40-60% savings for steady workloads
Batch Inference - 50% savings for non-real-time
Prompt caching - reduce redundant computation
Model routing - send simple queries to cheaper models
Performance Optimization
Define SLAs for response time, latency, alerting
Autoscaling based on CPU, memory, request queue size
Multi-level caching: app-level, response, query results
Monitor cache hit ratio
Load balancing across endpoints
SageMaker JumpStart Best Practices
Select model closest to your use case
Consider cost, size, performance, licensing
Use multi-model endpoints and autoscaling
Spot training for cost savings
A/B testing with SageMaker Experiments
SageMaker Pipelines for end-to-end workflows
Feature Store for feature management
Version control for all models
Quality Assurance
Testing framework: Unit, integration, performance + AI-specialized tests
Human evaluation and A/B testing
Error tracing and user feedback collection
Content moderation and bias detection
Output validation against expected formats
Quick Reference: Key Numbers to Remember
| Item | Value |
|---|---|
| Exam passing score | 750/1000 |
| Max action group functions | 3 per group |
| Q Business Enterprise index | 1M docs, multi-AZ |
| Q Business Starter index | 100K docs, single-AZ |
| Q Business index unit | 20K documents |
| Provisioned Throughput savings | 40-60% vs on-demand |
| Batch inference savings | ~50% vs on-demand |
| Baseline monitoring period | 2 weeks recommended |
| Automated Reasoning accuracy | Up to 99% |
| BLEU perfect score | 1.0 |
| ROUGE perfect score | 1.0 |





