The Problem
You're running a long-running task in Ansible using async and poll, but the task fails unexpectedly. The error message might look like:
TASK [Run long database migration] ********************************************
fatal: [db-server]: FAILED! => {"ansible_job_id": "123456789012.45678", "attempts": 30, "changed": true, "finished": 0, "msg": "The async task failed to complete within the specified time", "started": 1}

Or sometimes more cryptically:

{"failed": true, "msg": "job j123456789012.45678 not found or invalid", "stderr": "", "stdout": ""}

This happened to me during a production database migration that was expected to take 20 minutes. The async task kept failing after just 2 minutes, and I couldn't figure out why.
Why This Happens
Async task failures typically stem from three root causes:
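Timeout configuration mismatch - The async parameter sets the maximum runtime in seconds, while poll sets how often Ansible checks the job's status. If async is shorter than the command's real runtime, Ansible stops waiting and reports a failure even though nothing is wrong with the command itself. A very small poll value adds needless connection overhead on the remote host; a very large one only delays the moment Ansible notices that the job has finished.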
Job file cleanup - Ansible stores async job information in ~/.ansible_async/ on the remote host. If this directory gets cleaned up or the job file is deleted, subsequent status checks fail.
Shell environment issues - Long-running commands that depend on specific environment variables or shell settings may fail when run asynchronously because Ansible executes them in a non-interactive shell.
Diagnosing the Issue
Start by checking what's actually happening with your async job:
# Run the playbook with extreme verbosity
ansible-playbook playbook.yml -vvvv --tags "async-task"

Look for the ansible_job_id in the output. Then SSH into the target server and inspect the job directly:
```bash
# SSH to the target server
ssh db-server

# Check the async job directory
ls -la ~/.ansible_async/

# View the specific job file
cat ~/.ansible_async/j123456789012.45678

# Check if your command is still running
ps aux | grep -E "(ansible|your-command)"
```
The job file contains JSON output from the async task. If the file exists but shows an error, the command itself failed. If the file doesn't exist, it was cleaned up prematurely.
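The three outcomes described above (file missing, job still running, command failed) can be told apart with a small helper run on the target server. This is a minimal sketch, not an Ansible tool: the function name `job_state` is hypothetical, and it assumes the job file holds a JSON object using the `finished` and `failed` keys seen in the error output earlier. Matching JSON with grep is brittle, so treat the result as a hint rather than a verdict.

```bash
# job_state FILE: classify an async job file (hypothetical helper).
# Assumes the file contains JSON with "finished" and/or "failed" keys.
job_state() {
    job_file="$1"
    if [ ! -f "$job_file" ]; then
        # No file: it was cleaned up, or the job never started
        echo "missing"
    elif grep -Eq '"finished"[[:space:]]*:[[:space:]]*0' "$job_file"; then
        # The job has started but has not reported a result yet
        echo "running"
    elif grep -Eq '"failed"[[:space:]]*:[[:space:]]*(1|true)' "$job_file"; then
        # A result was written and it is marked as failed
        echo "failed"
    else
        echo "finished"
    fi
}
```

For example, `job_state ~/.ansible_async/j123456789012.45678` prints one of `missing`, `running`, `failed`, or `finished`.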
The Fix
Fix 1: Correct Async/Poll Configuration
The most common issue is incorrect timing. Here's the proper way to configure async tasks:
```yaml
- name: Run database migration
  ansible.builtin.command:
    cmd: /opt/app/migrate.sh
  async: 3600  # 1 hour maximum runtime
  poll: 30     # Check every 30 seconds
  register: migration_result

- name: Verify migration completed
  ansible.builtin.debug:
    msg: "Migration stdout: {{ migration_result.stdout }}"
  when: migration_result.finished == 1
```
Rule of thumb: Set async to at least 3x your expected runtime. Set poll based on task type:
- File operations: poll: 5
- Network transfers: poll: 10
- Database operations: poll: 30
- System updates: poll: 60
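As a worked example of the rule of thumb above, the arithmetic can be sketched in a few lines of shell. The function names `suggest_async` and `suggest_poll`, the category keywords, and the fallback value of 15 are all hypothetical choices made for illustration; only the 3x multiplier and the four poll values come from the list above.

```bash
# suggest_async SECONDS: apply the "at least 3x expected runtime" rule.
suggest_async() {
    echo $(( $1 * 3 ))
}

# suggest_poll TYPE: map a task type to the poll interval listed above.
suggest_poll() {
    case "$1" in
        file)     echo 5 ;;
        network)  echo 10 ;;
        database) echo 30 ;;
        system)   echo 60 ;;
        *)        echo 15 ;;  # assumed fallback, not from the list above
    esac
}
```

For a migration expected to take 20 minutes (1200 seconds), `suggest_async 1200` prints 3600 and `suggest_poll database` prints 30.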
Fix 2: Handle Fire-and-Forget Tasks
For tasks where you don't need to wait for completion, use poll: 0:
```yaml
- name: Start background backup
  # shell (not command) is needed for the output redirection.
  # No nohup or trailing &: async already detaches the job, and
  # backgrounding it would make the job report "finished" immediately.
  ansible.builtin.shell: /opt/app/backup.sh > /var/log/backup.log 2>&1
  async: 7200  # Maximum runtime; must cover the whole backup
  poll: 0      # Don't wait at all
  register: backup_job

- name: Check backup status later
  ansible.builtin.async_status:
    jid: "{{ backup_job.ansible_job_id }}"
  register: backup_status
  until: backup_status.finished
  retries: 120  # Check for up to 2 hours
  delay: 60     # Every minute
```
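The status check above waits for roughly retries x delay seconds in total: 120 retries at a 60-second delay cover about 2 hours. When choosing these values for a different timeout, the retry count can be derived with a rounded-up division. A minimal sketch; the function name `retries_for` is hypothetical.

```bash
# retries_for TIMEOUT DELAY: number of retries needed so that
# retries * delay is at least TIMEOUT seconds (integer division, rounded up).
retries_for() {
    timeout="$1"
    delay="$2"
    echo $(( (timeout + delay - 1) / delay ))
}
```

`retries_for 7200 60` prints 120, which matches the values used in the status-check task.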
Fix 3: Preserve Environment Variables
If your command needs specific environment settings:
- name: Run async task with environment
  ansible.builtin.shell: |
    source /etc/profile.d/app.sh
    export PATH=$PATH:/opt/custom/bin
    /opt/app/long-running-task.sh
  args:
    executable: /bin/bash
  async: 1800
  poll: 15
  environment:
    JAVA_HOME: /usr/lib/jvm/java-11
    APP_ENV: production

Fix 4: Check Job Status Properly
Always verify the job actually completed:
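```yaml
- name: Run async task
  ansible.builtin.command:
    cmd: /opt/app/process.sh
  async: 600
  # poll: 0 starts the job and returns immediately. With poll > 0, Ansible
  # removes the job file once the task finishes, so a later async_status
  # call would fail with "job not found".
  poll: 0
  register: job_result

- name: Explicitly check async job status
  ansible.builtin.async_status:
    jid: "{{ job_result.ansible_job_id }}"
  register: job_status
  until: job_status.finished
  retries: 120  # 120 x 5 seconds covers the 600-second async limit
  delay: 5
  # rc is only present once the job has finished
  failed_when: job_status.rc | default(0) != 0
```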
Verifying the Fix
After making changes, verify everything works:
```bash
# Test with a simple async task first
ansible all -m ansible.builtin.command -a "sleep 5" -B 30 -P 2

# Then test your actual playbook
ansible-playbook playbook.yml --check

# Finally, run for real with verbose output
ansible-playbook playbook.yml -vv
```
Prevention
Add validation to your playbooks to catch async issues early:
- name: Validate async configuration
  ansible.builtin.assert:
    that:
      - async_value | default(0) > poll_value | default(0)
      - async_value | default(0) >= expected_runtime | default(60)
    fail_msg: "Async value must be greater than poll value and expected runtime"
    success_msg: "Async configuration validated"

Create a wrapper role for async tasks with sensible defaults:
```yaml
# roles/async_task/tasks/main.yml
- name: Run async task safely
  block:
    - name: Execute async command
      ansible.builtin.command:
        cmd: "{{ async_cmd }}"
      async: "{{ async_timeout | default(1800) }}"
      # Don't poll here: a polled task's job file is removed when it
      # finishes, which would break the async_status check below.
      poll: 0
      register: async_result

    - name: Verify task completed
      ansible.builtin.async_status:
        jid: "{{ async_result.ansible_job_id }}"
      register: job_status
      until: job_status.finished
      # Retry every async_poll seconds until async_timeout is covered
      retries: "{{ (async_timeout | default(1800) | int) // (async_poll | default(15) | int) }}"
      delay: "{{ async_poll | default(15) }}"

  rescue:
    - name: Log async failure
      ansible.builtin.debug:
        msg: "Async task failed: {{ async_result }}"
```