release: v1.3.0 - TKG node type renaming

Changes:
- Rust: face_trace → face_track (45 occurrences in 8 files)
- Rust: gaze_trace → gaze_track, lip_trace → lip_track
- Python: tkg_builder.py unified + pipeline_checklist.py fixed
- Swift: swift_hand.swift hand state detection (empty vs holding)

Node type changes:
- face_trace → face_track
- person_trace → body_track
- gaze_trace → gaze_track
- lip_trace → lip_track
- hand_trace → hand_track
- speaker → speaker_segment
- object → detected_object
- text_trace → text_region

Migration:
- PUBLIC schema: 12970 + 892 + 305 rows updated

97
docs_v1.0/M4_workspace/2026-06-21_job_status_fix.md
Normal file
@@ -0,0 +1,97 @@
---
title: Job Status Sync Fix - Historical Processor Results Issue
version: 1.0
date: 2026-06-21
author: OpenCode
status: resolved
---

# Job Status Sync Fix - Historical Processor Results Issue

## Problem Summary

The Production Worker marked jobs as 'failed' even when every processor in the current request completed successfully.

## Root Cause

### Location: `src/worker/job_worker.rs:1070`

```rust
let any_failed = results
    .iter()
    .any(|r| matches!(r.status, ProcessorJobStatus::Failed));
```

### Logic Defect

- Checked **all historical processor_results** (results=8)
- If **any historical processor failed** → job marked as failed
- **Ignored job_processors** (current request processors)

### Example Case

Job ID 63:
- Historical: asr, yolo, face, ocr, pose, mediapipe, appearance (all failed)
- Current: cut (completed)
- Result: `any_failed=true` → job status='failed' ❌

## Fix Implementation

### Modified Code (lines 1070-1110)

```rust
// Before: every stored result counts, including historical ones
let any_failed = results
    .iter()
    .any(|r| matches!(r.status, ProcessorJobStatus::Failed));

// After: only results for processors in the current request count
let any_failed = results
    .iter()
    .filter(|r| job_processors.contains(&r.processor_type.as_str().to_string()))
    .any(|r| matches!(r.status, ProcessorJobStatus::Failed));
```

### Key Changes

1. Added filter for `job_processors` parameter
2. Only checks processors in current request
3. Ignores historical failed processors

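The before/after behavior can be reproduced in isolation with the Job ID 63 data from the example case. This is a minimal sketch, not the worker's actual code: `ProcessorResult`, `any_failed_unscoped`, `any_failed_scoped`, and `job_63_results` are hypothetical names, and `processor_type` is simplified to a plain `String` (the real type appears to expose `as_str()`).

```rust
// Hypothetical stand-ins for the worker's types; the real definitions in
// src/worker/job_worker.rs may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProcessorJobStatus {
    Completed,
    Failed,
}

#[derive(Debug, Clone)]
pub struct ProcessorResult {
    pub processor_type: String, // assumption: simplified to a String here
    pub status: ProcessorJobStatus,
}

/// Old behavior: any failed row, historical or current, fails the job.
pub fn any_failed_unscoped(results: &[ProcessorResult]) -> bool {
    results
        .iter()
        .any(|r| matches!(r.status, ProcessorJobStatus::Failed))
}

/// Fixed behavior: only rows for processors in the current request count.
pub fn any_failed_scoped(results: &[ProcessorResult], job_processors: &[String]) -> bool {
    results
        .iter()
        .filter(|r| job_processors.iter().any(|p| p == &r.processor_type))
        .any(|r| matches!(r.status, ProcessorJobStatus::Failed))
}

/// Rebuilds the Job ID 63 case: seven historical failures plus a completed `cut`.
pub fn job_63_results() -> Vec<ProcessorResult> {
    let failed = ["asr", "yolo", "face", "ocr", "pose", "mediapipe", "appearance"];
    let mut results: Vec<ProcessorResult> = failed
        .iter()
        .map(|name| ProcessorResult {
            processor_type: name.to_string(),
            status: ProcessorJobStatus::Failed,
        })
        .collect();
    results.push(ProcessorResult {
        processor_type: "cut".to_string(),
        status: ProcessorJobStatus::Completed,
    });
    results
}
```

With `job_processors = ["cut"]`, `any_failed_unscoped(&job_63_results())` is true while `any_failed_scoped(&job_63_results(), &job_processors)` is false, which matches the defect and the fix described above.
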
## Verification Results

### Production (3002) After Fix
```
Found 1 pending jobs ✅
Processing job: 53090f160138fd4a01d62edf8395c6a0 (63) ✅
Processor cut output file exists, marking completed ✅
Job status: running ✅ (not failed)
```

### Playground (3003) Comparison
- Playground had fewer historical results
- Jobs processed successfully before fix
- Dev schema works normally

## Deployment

### Binary
- Compiled: Jun 21 14:35
- Worker restart: PID 28623
- Logs: `logs/worker_3002_fixed.log`

### Test Command
```bash
curl -X POST "http://localhost:3002/api/v1/file/53090f160138fd4a01d62edf8395c6a0/process" \
  -H "Content-Type: application/json" \
  -d '{"processors": ["cut"]}'
```

## Lessons Learned

1. **Job lifecycle should be scoped to request**: Only check processors in current request
2. **Historical data pollution**: Failed attempts can pollute job status logic
3. **Filter early**: Apply filters before checking status to avoid false positives

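One way to apply the first and third lessons together is to derive the whole job status from a request-scoped view of the results. The sketch below is an assumption about how that could look, not the worker's implementation: `JobStatus`, `derive_job_status`, and the `Pending` variant are hypothetical, and the real worker may define completion differently.

```rust
use std::collections::HashSet;

// Hypothetical types for illustration; not taken from job_worker.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProcessorJobStatus {
    Pending,
    Completed,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Running,
    Completed,
    Failed,
}

pub struct ProcessorResult {
    pub processor_type: String,
    pub status: ProcessorJobStatus,
}

/// Derives a job status from the processors named in the current request only.
/// An empty request is treated as completed in this sketch.
pub fn derive_job_status(results: &[ProcessorResult], job_processors: &[&str]) -> JobStatus {
    let requested: HashSet<&str> = job_processors.iter().copied().collect();

    // Filter first, so historical rows never reach the status checks.
    let current: Vec<&ProcessorResult> = results
        .iter()
        .filter(|r| requested.contains(r.processor_type.as_str()))
        .collect();

    if current
        .iter()
        .any(|r| r.status == ProcessorJobStatus::Failed)
    {
        return JobStatus::Failed;
    }

    // Completed only when every requested processor has a completed result.
    let completed: HashSet<&str> = current
        .iter()
        .filter(|r| r.status == ProcessorJobStatus::Completed)
        .map(|r| r.processor_type.as_str())
        .collect();

    if completed.len() == requested.len() {
        JobStatus::Completed
    } else {
        JobStatus::Running
    }
}
```

Collecting the requested names into a `HashSet` keeps the membership check cheap and avoids allocating a new `String` per result.
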
## Related Files
- `src/worker/job_worker.rs:1070-1110` (fixed)
- `src/worker/job_worker.rs:1407` (any_failed handling)
- `logs/worker_3002_fixed.log` (verification)

84
docs_v1.0/M4_workspace/2026-06-21_job_status_sync_issue.md
Normal file
@@ -0,0 +1,84 @@
---
title: PostgreSQL Job Status Sync Issue
version: 1.0
date: 2026-06-21
author: OpenCode
status: identified
---

# PostgreSQL Job Status Sync Issue

## Problem Description

The Production Worker (3002) cannot find pending jobs even though the UPDATE that sets them to 'pending' succeeds.

## Evidence

### Server Logs
```
UPDATE monitor_jobs SET processors = ..., status = 'pending' WHERE uuid = '...'
rows_affected=1 ✅
elapsed=565.917µs
```

### PostgreSQL Query Timeline
1. **Trigger at 06:04:39**: UPDATE executed (rows_affected=1)
2. **Query at 06:04:41** (Python): status='pending' ✅
3. **Query at 06:06**: status='failed' ❌ (reverted)
4. **Worker SELECT at 06:04-06:07**: rows_returned=0 ❌

### Key Findings
- Server UPDATE succeeds (rows_affected=1)
- PostgreSQL briefly shows 'pending' (confirmed 2 seconds later)
- Status reverts to 'failed' shortly afterwards (observed by 06:06)
- Worker SELECT never finds pending jobs

## Hypotheses

1. **Another process resets status**: An unknown mechanism changes the status back to 'failed'
2. **Job lifecycle logic**: The job processing framework may contain logic that re-marks previously failed jobs as 'failed'
3. **Connection pool transaction issue**: The UPDATE happens in one transaction and is reverted in another
4. **Worker health check**: Only affects rows WHERE status='running', not pending jobs

## Configuration Verified
- Server schema: `public` ✅
- Worker schema: `public` ✅
- monitor_jobs.uuid: VARCHAR(32) ✅
- All uuids: 32 characters ✅
- Worker binary: Jun 21 13:20 (latest) ✅
- Server binary: Jun 21 13:20 (latest) ✅

## Testing Done
1. Restarted Server (3002, PID 65718)
2. Restarted Worker (PID 88674)
3. Triggered processing for multiple files
4. Direct PostgreSQL queries via Python
5. API verification: /api/v1/files, /health, /api/v1/jobs

## Current Status

**Production (3002)**:
- Server: Running ✅
- Worker: Running ✅
- Jobs: 8 total (6 failed, 1 completed)
- Processing: Blocked ❌

**Playground (3003)**:
- Server: Running ✅
- Worker: Running ✅
- Not tested yet

## Next Steps

1. **Test in Playground**: Compare job lifecycle in dev schema
2. **Find reset mechanism**: Search for code that resets job status to 'failed'
3. **Check job lifecycle**: Review job_worker.rs for failed job handling logic
4. **Test new job registration**: Register fresh video and trigger processing

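For the "find reset mechanism" step, one possible debugging aid is to tag every status write with the code path that issued it and then look for the path that moves a job from 'pending' back to 'failed'. The sketch below assumes such a log can be collected; `StatusChange` and `pending_to_failed_writers` are hypothetical names and do not exist in the codebase.

```rust
/// One observed write to a job's status column (hypothetical record shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StatusChange {
    /// Label for the code path that issued the UPDATE (an assumption of this sketch).
    pub writer: String,
    pub from: String,
    pub to: String,
}

/// Returns, in first-seen order and without duplicates, the writers that
/// moved a job from 'pending' straight to 'failed'.
pub fn pending_to_failed_writers(log: &[StatusChange]) -> Vec<&str> {
    let mut writers: Vec<&str> = Vec::new();
    for change in log {
        let is_revert = change.from == "pending" && change.to == "failed";
        if is_revert && !writers.contains(&change.writer.as_str()) {
            writers.push(change.writer.as_str());
        }
    }
    writers
}
```

Feeding it the transitions recorded around 06:04-06:06 would narrow the search to the writers that actually perform the revert, rather than every place that can set 'failed'.
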
## Related Files
- `src/api/processing.rs`: trigger_processing UPDATE (line 271)
- `src/worker/job_worker.rs`: Worker polling and health check (lines 95-115)
- `src/core/db/postgres_db.rs`: list_monitor_jobs_by_status (line 1720)
- `logs/momentry_3002.log`: Server UPDATE logs
- `logs/worker_3002_new.log`: Worker SELECT logs
