
The Agent Migration Testing Checklist: Verify Before You Cut Over

I have seen more migrations fail during verification than during transfer. The data moved fine. The config files copied without errors. The agent process started successfully on the new server. And then someone asked it a question about a customer it had been talking to for three months, and it had no idea what they were talking about.

A migration without thorough testing is a gamble. This checklist exists so you do not have to rely on luck.

## Pre-migration snapshot

Before you touch anything, create a reference point. This is not the same as a backup (you should have that too). A snapshot is a record of how your agent behaves right now, so you can compare it to how the agent behaves after migration.

**Capture these baselines:**

- **Response samples.** Send 20-30 representative prompts to your agent and record the responses verbatim. Cover different skill areas, different conversation types, and edge cases you know about. These become your behavioral comparison set.
- **Memory inventory.** Count the total number of memory entries by type (long-term, episodic, user profiles). Record the most recent entry timestamp. Note the total storage size.
- **Skill manifest.** Export the complete list of installed skills with version numbers. Record which skills have local state files and where those files live.
- **Integration endpoints.** Document every external service connection: URL, authentication method, last successful interaction timestamp.
- **Performance baselines.** Record average and P95 response times for typical queries. Note memory usage and CPU utilization during normal operation.

Store this snapshot somewhere outside both the source and destination environments. A git repo, a shared drive, a local machine. You will need it for comparison after migration, and it must not depend on either server being available.
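A snapshot like this is easy to assemble as a single JSON file. The sketch below is a minimal example; the field names, sample data, and `snapshot_digest` helper are illustrative assumptions, not a ClawSail API:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_snapshot(response_samples, memory_counts, skills, integrations, perf):
    """Assemble a pre-migration snapshot as one JSON-serializable dict."""
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "response_samples": response_samples,  # prompt -> verbatim response
        "memory_counts": memory_counts,        # memory type -> entry count
        "skills": skills,                      # skill name -> version
        "integrations": integrations,          # service -> endpoint metadata
        "performance": perf,                   # avg / P95 response times
    }

def snapshot_digest(snapshot):
    """Stable hash so accidental edits to the snapshot file are detectable."""
    payload = json.dumps(snapshot, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

# Hypothetical sample values for illustration only.
snap = build_snapshot(
    response_samples={"Who is customer X?": "Customer X is ..."},
    memory_counts={"long_term": 12400, "episodic": 8300, "user_profiles": 512},
    skills={"crm": "2.1.0", "scheduler": "1.4.2"},
    integrations={"crm_api": {"url": "https://crm.example.com", "auth": "oauth"}},
    perf={"avg_ms": 850, "p95_ms": 2100},
)
with open("pre_migration_snapshot.json", "w") as f:
    json.dump(snap, f, indent=2)
```

Storing the digest alongside the file lets you confirm later that the comparison baseline itself was not modified between capture and verification.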

## Memory verification tests

Memory is the most failure-prone component of a migration. Test it systematically.

**Record count match.** Query the destination's memory store and compare entry counts against your snapshot. Long-term memory entries, episodic records, user profiles — every category should match within a small tolerance. If your agent had 12,400 long-term memories on the source and the destination shows 12,380, find those 20 missing entries before proceeding.
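The count comparison can be scripted against your snapshot. This is a sketch assuming per-category counts are available as plain dicts; note how the 12,380-vs-12,400 example above exceeds a 0.1% tolerance and gets flagged:

```python
def count_mismatches(source_counts, dest_counts, tolerance=0.001):
    """Flag memory categories whose destination count drifts beyond tolerance."""
    problems = {}
    for category, src in source_counts.items():
        dst = dest_counts.get(category, 0)
        # Relative drift; an empty source category must also be empty on dest.
        drift = abs(src - dst) / src if src else (1.0 if dst else 0.0)
        if drift > tolerance:
            problems[category] = {"source": src, "destination": dst}
    return problems

source = {"long_term": 12400, "episodic": 8300, "user_profiles": 512}
dest = {"long_term": 12380, "episodic": 8300, "user_profiles": 512}
print(count_mismatches(source, dest))  # long_term drifts ~0.16%, above 0.1%
```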

**Content integrity.** Select 50 memory entries at random and compare their content between source and destination. Check for encoding issues (unicode characters mangled during transfer), truncation (entries cut off mid-sentence), and timestamp accuracy (dates shifted by timezone conversion). Even one corrupted entry suggests a systematic issue that affected others.
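The random sampling step is worth seeding so reruns audit the same entries. A minimal sketch, assuming memory entries can be loaded as dicts keyed by ID with `content` and `timestamp` fields (both assumptions, not a fixed schema):

```python
import random

def sample_integrity_check(source_store, dest_store, sample_size=50, seed=42):
    """Compare a fixed random sample of memory entries between the two stores."""
    rng = random.Random(seed)  # fixed seed: reruns audit the same sample
    ids = rng.sample(sorted(source_store), min(sample_size, len(source_store)))
    corrupted = []
    for entry_id in ids:
        src = source_store[entry_id]
        dst = dest_store.get(entry_id)
        if (dst is None
                or src["content"] != dst["content"]        # mangling / truncation
                or src["timestamp"] != dst["timestamp"]):  # timezone shifts
            corrupted.append(entry_id)
    return corrupted
```

Any nonempty result warrants widening the sample, per the point above that one corrupted entry usually indicates a systematic problem.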

**Retrieval accuracy.** The agent does not just store memories. It retrieves them based on context. Ask the migrated agent questions that should trigger specific memory recall. "What did customer X ask about last month?" should return the same information on both instances. If retrieval fails, the memory data might be present but the search index might be missing or corrupted.

**Recency verification.** Confirm that the most recent memories transferred. Migrations sometimes miss the last few entries if the source was still writing during export. Check that the newest memory timestamp on the destination matches the newest timestamp in your snapshot.

## Skill execution tests

Every installed skill needs a functional test on the destination.

**Manifest check.** Compare the installed skill list on the destination against your snapshot. Every skill should be present with the same version number. Missing skills or version mismatches indicate an incomplete transfer.
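The manifest comparison reduces to a set difference plus a version check. A sketch, assuming both manifests are available as name-to-version dicts (the skill names below are hypothetical):

```python
def manifest_diff(source_skills, dest_skills):
    """Report skills missing on the destination and version mismatches."""
    missing = sorted(set(source_skills) - set(dest_skills))
    mismatched = {
        name: {"source": ver, "destination": dest_skills[name]}
        for name, ver in source_skills.items()
        if name in dest_skills and dest_skills[name] != ver
    }
    return missing, mismatched

source = {"crm": "2.1.0", "scheduler": "1.4.2", "web_search": "3.0.1"}
dest = {"crm": "2.1.0", "scheduler": "1.3.9"}
print(manifest_diff(source, dest))  # web_search missing, scheduler downgraded
```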

**Basic execution.** Trigger each skill with a simple request and verify it completes without errors. A CRM skill should fetch a known customer record. A scheduling skill should create a test appointment. A web search skill should return results for a known query.

**State verification.** Skills that maintain local state (caches, databases, indexes) need their state verified. If your CRM skill cached 500 customer records on the source, the destination should have the same 500 records. If a research skill indexed 200 documents, those 200 documents should be searchable on the destination.

**Error handling.** Deliberately trigger an error condition for each skill (invalid input, network timeout simulation, missing data) and verify the agent handles it gracefully. Migration can break error handling paths that work fine for happy-path requests.

## Integration connectivity checks

External integrations are the most commonly overlooked failure point in a migration.

**Outbound connections.** For every external API your agent calls (LLM providers, databases, third-party services), verify the destination can reach the endpoint. Network configuration, firewall rules, and IP allowlists differ between environments. An API that was reachable from your old server might be blocked from the new one.
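A first-pass reachability sweep can be done with a plain TCP connect, run from the destination server. This only proves the network path exists, not that authentication works; the endpoint list below is a hypothetical example:

```python
import socket

def endpoint_reachable(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, DNS failure, etc.
        return False

# Run from the destination server against every outbound dependency.
endpoints = [("api.anthropic.com", 443), ("crm.example.com", 443)]
for host, port in endpoints:
    status = "reachable" if endpoint_reachable(host, port) else "BLOCKED"
    print(f"{host}:{port} -> {status}")
```

Follow it with authenticated test calls through each skill, since firewalls and allowlists can pass a bare TCP handshake while an application-layer proxy still blocks the traffic.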

**Inbound webhooks.** For every service that sends data to your agent, verify the webhook URL has been updated and the service can reach the destination. Send a test payload from each webhook provider and confirm the agent receives and processes it.

**Authentication.** OAuth tokens, API keys, and session credentials all need verification. Some OAuth tokens are bound to specific redirect URLs that change with migration. Some API keys are IP-restricted. Test every authenticated connection, not just the ones you think might break.

**Callback URLs.** If your agent provides callback URLs to external services (for async operations, status updates, or OAuth flows), verify those URLs point to the destination and are reachable.

## Behavioral comparison

This is the test most teams skip, and it is the most important one.

Send the same set of prompts you captured in your pre-migration snapshot to both the source and destination agents. Compare the responses. They will not be word-for-word identical because LLM responses have inherent randomness. But they should demonstrate the same knowledge, the same preferences, and the same capabilities.

**Red flags to watch for:**

- The destination agent does not reference information the source agent knew. This indicates missing memory.
- The destination agent uses a different tone or personality. This indicates a config mismatch in SOUL.md or personality settings.
- The destination agent fails on a skill the source agent handled. This indicates a missing or misconfigured skill.
- The destination agent responds noticeably slower. This indicates a performance issue in the new environment.

Run this comparison across at least 30 prompts covering your agent's primary use cases. If more than 10% show meaningful divergence, investigate before cutting over.
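The divergence rate can be computed mechanically. The sketch below uses `difflib.SequenceMatcher` as a crude lexical proxy; an embedding-based similarity would be more robust for paraphrased responses, but this catches gross divergence (missing knowledge, wrong persona) with no dependencies. The threshold value is an assumption to tune against your own prompts:

```python
from difflib import SequenceMatcher

def divergence_rate(response_pairs, threshold=0.6):
    """Fraction of (source, destination) response pairs whose lexical
    similarity falls below the threshold."""
    diverged = sum(
        1 for src, dst in response_pairs
        if SequenceMatcher(None, src.lower(), dst.lower()).ratio() < threshold
    )
    return diverged / len(response_pairs)

pairs = [
    ("Customer X asked about billing last month.",
     "Customer X asked about billing last month."),   # same knowledge
    ("Customer X asked about billing last month.",
     "I have no record of customer X."),               # missing memory
]
print(divergence_rate(pairs))
```

Against the 10% bar above: with 30 prompt pairs, a rate above 0.10 (four or more diverging pairs) means investigate before cutting over.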

## Performance benchmarks

Migration to a different server or provider means different hardware, different network topology, and different noisy-neighbor characteristics. Performance testing is not optional.

**Response time comparison.** Send 100 identical requests to both instances and compare response time distributions. The destination should be within 20% of the source. Larger differences indicate resource constraints, misconfiguration, or network latency issues.
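The 20% rule is simple to encode once you have latency samples from both instances. A sketch using the standard library's `statistics.quantiles` for the P95 (the sample data is illustrative):

```python
import statistics

def latency_check(source_ms, dest_ms, slack=0.20):
    """Compare destination latency distribution to source within a slack factor."""
    def p95(samples):
        return statistics.quantiles(samples, n=20)[-1]  # 95th percentile
    return {
        "avg_ok": statistics.mean(dest_ms) <= statistics.mean(source_ms) * (1 + slack),
        "p95_ok": p95(dest_ms) <= p95(source_ms) * (1 + slack),
    }
```

Comparing both the mean and the P95 matters: a destination can match on average while its tail latency, the requests users actually complain about, has doubled.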

**Concurrent load testing.** If your agent handles multiple conversations simultaneously, test that on the destination. Send 10, 20, 50 concurrent requests and verify response times stay reasonable. The new server might have different CPU or memory limits that cap concurrency lower than expected.
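A thread-pool harness is enough for this kind of concurrency probe. In the sketch below the request function is a stand-in `time.sleep`; swap in a real HTTP call to the destination agent:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(send_request, concurrency, total_requests):
    """Time total_requests dispatched through `concurrency` parallel workers;
    returns per-request latencies in milliseconds, sorted ascending."""
    def timed(_):
        start = time.perf_counter()
        send_request()
        return (time.perf_counter() - start) * 1000.0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return sorted(pool.map(timed, range(total_requests)))

# Stand-in for a real agent request; replace with a call to the destination.
latencies = load_test(lambda: time.sleep(0.01), concurrency=10, total_requests=50)
print(f"min={latencies[0]:.1f}ms max={latencies[-1]:.1f}ms")
```

Run it at 10, 20, and 50 workers and watch how the tail of the sorted latencies grows; a sharp knee at a given concurrency level usually marks the new server's resource ceiling.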

**Memory usage under load.** Monitor RAM consumption on the destination during load testing. A memory leak that was manageable on the old server (which had 32GB RAM) might become critical on the new server (which has 16GB).

## The 48-hour parallel run

After all automated tests pass, run both instances in parallel for 48 hours with real traffic. The source continues to serve users. The destination receives a mirror of the traffic and processes it independently. Compare outputs and metrics continuously.

Why 48 hours? Because some issues only surface under sustained operation. A slow memory leak. A scheduled task that runs daily and fails on the new server. A weekend traffic pattern that differs from weekday load. Two full days catches most time-dependent issues.

During the parallel run, monitor:

- Error rates on both instances (destination should be equal or lower)
- Response time trends (looking for gradual degradation on destination)
- Memory and CPU usage trends (looking for leaks or resource exhaustion)
- Integration health (all webhooks and API calls succeeding)

## Signoff criteria

Do not cut over until every item passes:

- [ ] Memory entry counts match within 0.1% tolerance
- [ ] 50 random memory entries verified for content integrity
- [ ] Memory retrieval returns correct results for 20 known queries
- [ ] All installed skills execute successfully
- [ ] All external integrations connect and authenticate
- [ ] All inbound webhooks deliver and process correctly
- [ ] Behavioral comparison shows less than 10% divergence across 30 prompts
- [ ] Response times within 20% of source baseline
- [ ] No memory leaks or resource exhaustion during 48-hour parallel run
- [ ] Rollback tested and confirmed working
- [ ] Monitoring and alerting configured on destination
- [ ] Team signoff from at least one engineer who did not perform the migration

That last point matters. Fresh eyes catch things the migration engineer has been staring at for too long. A second engineer reviewing the test results and signing off adds a meaningful safety layer.

## When to abort

Not every migration succeeds, and that is fine. Abort and reschedule if:

- Memory entry counts differ by more than 1%
- Behavioral comparison shows divergence on more than 20% of prompts
- Any critical integration fails to connect after troubleshooting
- Performance degradation exceeds 50% under normal load
- The 48-hour parallel run surfaces recurring errors
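These thresholds can live in a small gate script so the abort decision is mechanical rather than a judgment call at 2 a.m. A sketch with hypothetical metric names mirroring the criteria above:

```python
# Hypothetical abort thresholds mirroring the criteria above.
ABORT_THRESHOLDS = {
    "memory_count_drift": 0.01,       # >1% entry count difference
    "behavioral_divergence": 0.20,    # >20% of prompts diverging
    "performance_degradation": 0.50,  # >50% slower under normal load
}

def abort_reasons(metrics, critical_integration_failed=False):
    """Return the list of abort criteria the observed metrics violate."""
    reasons = [name for name, limit in ABORT_THRESHOLDS.items()
               if metrics.get(name, 0.0) > limit]
    if critical_integration_failed:
        reasons.append("critical_integration_failed")
    return reasons
```

A nonempty result means stop, diagnose, and reschedule; an empty one is necessary but not sufficient, since the parallel-run and signoff checks still apply.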

Aborting a migration is not a failure. Cutting over with unresolved issues is. Your source environment is still running. Take the time to diagnose, fix, and try again. ClawSail's migration reports give you specific diagnostics for every failed check, so the second attempt starts with clear information about what went wrong.
