# Debug Pipeline or Model

Systematically debug and fix issues with Axelera Metis pipelines, models, or inference.

## Important Context

**This command debugs AI deployments on Axelera Metis AI Processing Units (APUs).**
All debugging targets Metis hardware and uses the Voyager SDK framework.

## Instructions

Debug the specified issue: **$ARGUMENTS**

### Step 0: Data Source & Environment Selection

**FIRST**, use the `AskUserQuestion` tool to determine the setup:

```
Question: "Which Voyager SDK setup would you like to use for debugging?"
Options:
1. "ChipOS Knowledge Base Only" - Use RAG for research and troubleshooting guidance only (no local execution)
2. "ChipOS Knowledge Base + Local SDK" - Use RAG for troubleshooting guidance, debug on local SDK
3. "Local Voyager SDK Repository" - Use an existing local installation only
4. "Clone Fresh Repository" - Clone latest Voyager SDK for clean debugging environment
```

**Based on selection:**

**If ChipOS Knowledge Base Only:**
- Search for relevant debugging info and documentation:
  ```python
  mcp__chipos__rag_search_knowledge_base(
      query="debug troubleshoot error",
      source_id="github.com/axelera-ai-hub/voyager-sdk",
      match_count=5
  )
  mcp__chipos__rag_search_code_examples(
      query="error handling debugging",
      source_id="github.com/axelera-ai-hub/voyager-sdk",
      match_count=5
  )
  ```
- Provide research-based guidance and documentation
- No local SDK execution required - purely informational

**If ChipOS Knowledge Base + Local SDK:**
- Search for relevant debugging info:
  ```python
  mcp__chipos__rag_search_knowledge_base(
      query="debug troubleshoot error",
      source_id="github.com/axelera-ai-hub/voyager-sdk",
      match_count=5
  )
  ```
- Ask for local SDK path for actual debugging
- Set the environment variables as shown under **If Local Repository** below

**If Local Repository:**
- Ask for the repository path: "What is the path to your Voyager SDK installation?"
- Set environment:
  ```bash
  export AXELERA_FRAMEWORK=/path/to/voyager-sdk
  export AXELERA_RUNTIME_DIR=$AXELERA_FRAMEWORK/runtime
  source $AXELERA_FRAMEWORK/venv/bin/activate
  ```

**If Clone Fresh:**
- Clone and install for clean environment:
  ```bash
  git clone https://github.com/axelera-ai-hub/voyager-sdk.git
  cd voyager-sdk
  ./install.sh --all --media
  source venv/bin/activate
  export AXELERA_FRAMEWORK=$(pwd)
  ```

### Step 0.5: ChipOS Project & Task Integration

**Check if project context is already set (from previous commands or `/init-chipos`).**

If NO project context exists, ask:
```
Question: "Would you like to track this debugging work in ChipOS?"
Options:
1. "Yes, use existing project" - Select from available ChipOS projects
2. "Yes, create new project" - Create a new project for this work
3. "No, skip task tracking" - Proceed without ChipOS task management
```

**If using existing project:**
```python
# List available projects
mcp__chipos__list_projects()
# User selects project, then create task
```

**If creating new project:**
```python
mcp__chipos__create_project(
    title="Voyager SDK Debug - [Issue Description]",
    description="Debugging session for: $ARGUMENTS",
    github_repo="https://github.com/axelera-ai-hub/voyager-sdk"
)
```

**Create debugging task:**
```python
mcp__chipos__create_task(
    project_id="[PROJECT_ID]",
    title="Debug: [Brief issue description from $ARGUMENTS]",
    description="Investigate and fix: $ARGUMENTS",
    assignee="AI IDE Agent",
    feature="debugging",
    task_order=10
)

# Mark task as in-progress
mcp__chipos__update_task(task_id="[TASK_ID]", status="doing")
```

**Throughout debugging, update task status:**
- When issue is identified: Add findings to task description
- When fix is implemented: Update status to "review"
- When user confirms fix works: Update status to "done"
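
The status progression above ("doing", then "review", then "done") can be sketched as a small transition table. This is an illustrative sketch only: `ALLOWED_TRANSITIONS` and `next_status` are hypothetical names, not part of the ChipOS tools, and the step back from "review" to "doing" is an assumption for the case where the user reports that the fix did not work.

```python
# Allowed task status transitions, taken from the workflow described above.
# The table is an illustrative assumption, not part of the ChipOS API.
ALLOWED_TRANSITIONS = {
    "doing": {"review"},
    "review": {"done", "doing"},  # back to "doing" if the fix is rejected (assumption)
    "done": set(),
}


def next_status(current: str, requested: str) -> str:
    """Return the requested status if the transition is allowed, else raise ValueError."""
    if requested not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move task from {current!r} to {requested!r}")
    return requested


if __name__ == "__main__":
    status = "doing"
    for step in ("review", "done"):
        status = next_status(status, step)
        print(status)
```

Checking the transition before calling `mcp__chipos__update_task` helps avoid marking a task "done" before the user has confirmed the fix.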

### Step 1: Issue Classification and Debugging

1. **Issue Classification**
   Identify the type of issue:
   - **Deployment Error**: Model fails to compile/deploy
   - **Runtime Error**: Inference fails or crashes
   - **Accuracy Issue**: Model produces incorrect results
   - **Performance Issue**: Low throughput or high latency
   - **Display Issue**: Visualization problems
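
As a rough first pass, the five categories above could be matched against the issue text by keyword. The sketch below is only a starting point: the keyword lists and the `classify_issue` helper are assumptions made for illustration, and an ambiguous report still needs human judgment.

```python
# Keyword lists are illustrative assumptions; extend them with terms from real reports.
CATEGORY_KEYWORDS = {
    "Deployment Error": ("compile", "deploy", "quantiz", "calibration", "unsupported operator"),
    "Runtime Error": ("crash", "segfault", "exception", "hang"),
    "Accuracy Issue": ("accuracy", "incorrect", "wrong class", "misdetect"),
    "Performance Issue": ("fps", "throughput", "latency", "slow"),
    "Display Issue": ("display", "window", "opengl", "render", "visualiz"),
}


def classify_issue(description: str) -> str:
    """Return the category with the most keyword hits, or "Unclassified" if none match."""
    text = description.lower()
    best_category, best_hits = "Unclassified", 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = sum(1 for keyword in keywords if keyword in text)
        if hits > best_hits:  # on a tie, the earlier category wins
            best_category, best_hits = category, hits
    return best_category


if __name__ == "__main__":
    print(classify_issue("Model fails to compile with unsupported operator"))
    print(classify_issue("Low FPS and high latency on a USB camera"))
```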

2. **Environment Diagnostics**
   Run initial diagnostics:
   ```bash
   # Check environment
   echo "AXELERA_FRAMEWORK: $AXELERA_FRAMEWORK"
   echo "AXELERA_RUNTIME_DIR: $AXELERA_RUNTIME_DIR"
   echo "PYTHONPATH: $PYTHONPATH"

   # Check hardware
   axdevice list

   # Check Python environment
   python --version
   pip list | grep -i axelera
   ```
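
The environment check above can also be scripted so that a missing variable is reported explicitly instead of printing an empty string. This is a minimal sketch, assuming both variables are expected to point at existing directories; `check_environment` is a hypothetical helper, not part of the Voyager SDK.

```python
import os
from pathlib import Path

# Variable names come from the diagnostics above; treating both as directories is an assumption.
REQUIRED_VARS = ("AXELERA_FRAMEWORK", "AXELERA_RUNTIME_DIR")


def check_environment(env=None):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks consistent"]:
        print(line)
```

Passing a plain dictionary instead of `os.environ` makes it easy to try the check against a proposed configuration before exporting anything.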

3. **Deployment Debugging**

   **Quantization Issues:**
   ```bash
   # Run quantize-only to isolate issues
   ./deploy.py <model> --mode quantize

   # Enable debug quantization
   ./deploy.py <model> --mode quantize_debug

   # Check calibration data
   ls -la <calibration-data-path>
   ```

   **Compilation Issues:**
   ```bash
   # Verbose compilation logging
   ./deploy.py <model> --log-level debug

   # Check ONNX model validity
   python -c "import onnx; onnx.checker.check_model('<model>.onnx')"

   # Verify model inputs/outputs
   python -c "import onnx; m = onnx.load('<model>.onnx'); print(m.graph.input, m.graph.output)"
   ```

   **Common deployment errors:**
   - "Unsupported operator": Check the ONNX opset version; the model may need modification
   - "Out of memory": Reduce model size or batch size
   - "Calibration failed": Check calibration data format and quantity

4. **Runtime Debugging**

   **Inference Crashes:**
   ```bash
   # Run with verbose logging
   ./inference.py <model> <source> --log-level debug

   # Check GStreamer pipeline
   GST_DEBUG=3 ./inference.py <model> <source>

   # Check for segfaults
   gdb --args python inference.py <model> <source>
   ```

   **Stream Issues:**
   ```bash
   # Test input source independently
   gst-launch-1.0 uridecodebin uri=file:///path/to/video.mp4 ! fakesink

   # Check USB camera
   v4l2-ctl --list-devices
   gst-launch-1.0 v4l2src device=/dev/video0 ! fakesink
   ```

5. **Accuracy Debugging**

   **Evaluate Model Accuracy:**
   ```bash
   # Run accuracy evaluation
   ./inference.py <model> dataset --no-display

   # Compare with reference implementation
   # Check preprocessing matches training
   # Verify postprocessing/decoder settings
   ```

   **Common accuracy issues:**
   - Wrong normalization (mean/std values)
   - Incorrect input resolution
   - Wrong color space (RGB vs BGR)
   - Decoder threshold settings

   **Debug preprocessing:**
   ```yaml
   # Add to pipeline YAML for debugging
   preprocessing:
     - type: AxDebugSave
       path: /tmp/preprocessed_input.npy
   ```
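
To see why the normalization and color-space items under common accuracy issues matter, the sketch below normalizes a single pixel in RGB and in BGR order and compares the results. The mean and standard deviation are the widely used ImageNet values, chosen here only as an example; the helper names are hypothetical, and the values the model was actually trained with should be substituted.

```python
# ImageNet statistics, used purely as an example (assumption); use the model's training values.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(pixel, mean=MEAN, std=STD):
    """Scale an 8-bit 3-channel pixel to [0, 1], then apply per-channel mean/std normalization."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(pixel, mean, std))


def max_abs_diff(a, b):
    """Largest per-channel absolute difference between two normalized pixels."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    rgb = (200, 100, 50)
    bgr = rgb[::-1]  # same pixel with the channel order swapped
    print("RGB order:", normalize_pixel(rgb))
    print("BGR order:", normalize_pixel(bgr))
    print("max difference:", max_abs_diff(normalize_pixel(rgb), normalize_pixel(bgr)))
```

A large difference between the two orderings for a strongly colored pixel is a hint that a channel swap alone could explain degraded accuracy.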

6. **Performance Debugging**

   **Measure Throughput:**
   ```bash
   # Benchmark mode
   ./inference.py <model> <source> --benchmark

   # Save performance tracers
   ./inference.py <model> <source> --save-tracers perf.csv --show-stats
   ```

   **Performance Metrics:**
   - System throughput: End-to-end FPS
   - Device throughput: Metis-only FPS
   - CPU utilization: Host CPU usage
   - Latency: Frame processing time

   **Common performance issues:**
   - CPU bottleneck: Check preprocessing complexity
   - Memory bandwidth: Check tensor sizes
   - Pipeline stalls: Check synchronization

7. **Display Debugging**

   **No Display:**
   ```bash
   # Check display availability
   echo $DISPLAY

   # Try different renderers
   ./inference.py <model> <source> --display cv
   ./inference.py <model> <source> --display console
   ./inference.py <model> <source> --no-display
   ```

   **OpenGL Issues:**
   ```bash
   # Check OpenGL support
   glxinfo | head -20

   # Fallback to OpenCV
   ./inference.py <model> <source> --display cv
   ```

8. **YAML Pipeline Debugging**

   **Validate YAML:**
   ```python
   import yaml
   with open('ax_models/path/to/model.yaml') as f:
       config = yaml.safe_load(f)
       print(yaml.dump(config, default_flow_style=False))
   ```

   **Check operator configuration:**
   - Verify operator types are valid
   - Check parameter types and ranges
   - Ensure input/output connections are correct

9. **GStreamer Pipeline Debugging**
   ```bash
   # List available elements
   gst-inspect-1.0 | grep ax

   # Check specific element
   gst-inspect-1.0 axinference

   # Debug pipeline construction
   GST_DEBUG=4 ./inference.py <model> <source>

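   # Generate pipeline graphs (written only when the application dumps them;
   # file names vary, so list the directory instead of assuming pipeline.dot)
   GST_DEBUG_DUMP_DOT_DIR=/tmp ./inference.py <model> <source>
   ls -t /tmp/*.dot
   dot -Tpng /tmp/<dumped-file>.dot -o pipeline.png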
   ```

10. **Model-Specific Debugging**

    **YOLO Models:**
    - Check anchor configuration
    - Verify class count matches dataset
    - Check stride configuration

    **Classification Models:**
    - Verify input normalization
    - Check class label mapping
    - Verify top-k settings

    **Segmentation Models:**
    - Check output resolution
    - Verify class IDs
    - Check threshold settings

11. **Logging and Tracing**
    ```bash
    # Enable all logging
    ./inference.py <model> <source> --log-level trace

    # Save logs to file
    ./inference.py <model> <source> --log-level debug 2>&1 | tee debug.log

    # Enable GStreamer tracing (tracer output is emitted via the GST_TRACER debug category)
    GST_DEBUG="GST_TRACER:7" GST_TRACERS="latency;stats" ./inference.py <model> <source>
    ```
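
Once a log such as `debug.log` has been saved, a short script can pull out the lines most likely to matter. The pattern below is an assumption about what the tools emit (standard Python logging level names and traceback headers) and may need adjusting; `find_problem_lines` is a hypothetical helper.

```python
import re
import sys

# The pattern is an assumption about the log format; adjust it to the actual output.
PROBLEM_PATTERN = re.compile(r"\b(ERROR|CRITICAL|WARNING|Traceback)\b")


def find_problem_lines(lines):
    """Return (line_number, text) pairs for lines that match PROBLEM_PATTERN."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PROBLEM_PATTERN.search(line)
    ]


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "debug.log"
    try:
        with open(path, errors="replace") as log_file:
            for number, text in find_problem_lines(log_file):
                print(f"{path}:{number}: {text}")
    except OSError as exc:
        print(f"could not read {path}: {exc}")
```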

12. **Collecting Debug Information**
    For reporting issues, collect:
    ```bash
    # System info
    uname -a
    cat /etc/os-release
    python --version

    # SDK version
    head -5 "$AXELERA_FRAMEWORK/RELEASE_NOTES.md"

    # Hardware info
    axdevice list
    lspci | grep -i axelera

    # Error logs
    ./deploy.py <model> --log-level debug 2>&1 | tee deploy_debug.log
    ```
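
The collection steps above can be gathered into one report so that a command missing on the host (for example `axdevice` on a machine without the SDK) is noted rather than aborting the run. This is a sketch under that assumption: `run_command` and `collect_report` are hypothetical helpers, and only a subset of the commands listed above is included.

```python
import shutil
import subprocess

# A subset of the commands listed above; axdevice is only present with the SDK installed.
COMMANDS = [
    ["uname", "-a"],
    ["python", "--version"],
    ["axdevice", "list"],
]


def run_command(argv, timeout=30):
    """Run one command and return its combined output, or a note if it cannot be run."""
    if shutil.which(argv[0]) is None:
        return f"{argv[0]}: not found on PATH"
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"failed to run: {exc}"
    return (result.stdout + result.stderr).strip()


def collect_report(commands=COMMANDS):
    """Return one text report with a section per command."""
    sections = []
    for argv in commands:
        sections.append(f"$ {' '.join(argv)}\n{run_command(argv)}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(collect_report())
```

Redirecting the output to a file gives a single attachment for an issue report.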

13. **Resolution Steps**
    After identifying the issue:
    1. Document the root cause
    2. Implement fix (code, configuration, or environment)
    3. Test the fix thoroughly
    4. Verify no regression in other functionality
    5. Update documentation if needed

14. **ChipOS Task Completion**

    **If ChipOS task tracking is enabled:**

    When debugging is complete:
    ```python
    # Update task with findings and solution
    mcp__chipos__update_task(
        task_id="[TASK_ID]",
        description="Original issue: $ARGUMENTS\n\nRoot Cause: [identified cause]\n\nSolution: [fix applied]",
        status="review"
    )
    ```

    Ask user to confirm the fix works, then:
    ```python
    # Mark task as done
    mcp__chipos__update_task(task_id="[TASK_ID]", status="done")
    ```

    **Summary Report:**
    Provide a summary including:
    - Issue description
    - Root cause identified
    - Solution implemented
    - Files modified
    - ChipOS task ID for reference
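
The description string passed to `mcp__chipos__update_task` above can be assembled from the summary fields so that the layout stays consistent between sessions. This is a sketch only: `format_task_description` is a hypothetical helper, and the "Files modified" section is an addition that mirrors the summary list rather than anything the ChipOS tools require.

```python
def format_task_description(issue, root_cause, solution, files_modified=()):
    """Build the task description in the layout used by the update_task call above."""
    lines = [
        f"Original issue: {issue}",
        "",
        f"Root Cause: {root_cause}",
        "",
        f"Solution: {solution}",
    ]
    if files_modified:
        # Optional section, added for illustration (assumption)
        lines += ["", "Files modified:"] + [f"- {path}" for path in files_modified]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_task_description(
        issue="[issue from $ARGUMENTS]",
        root_cause="[identified cause]",
        solution="[fix applied]",
        files_modified=["[path/to/changed/file]"],
    ))
```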
