release: v1.3.0 - TKG node type renaming
Changes:
- Rust: face_trace → face_track (45 occurrences in 8 files)
- Rust: gaze_trace → gaze_track, lip_trace → lip_track
- Python: tkg_builder.py unified + pipeline_checklist.py fixed
- Swift: swift_hand.swift hand state detection (empty vs holding)

Node type changes:
- face_trace → face_track
- person_trace → body_track
- gaze_trace → gaze_track
- lip_trace → lip_track
- hand_trace → hand_track
- speaker → speaker_segment
- object → detected_object
- text_trace → text_region

Migration:
- PUBLIC schema: 12970 + 892 + 305 rows updated
@@ -127,13 +127,15 @@ curl -s "$API/api/v1/file/$FILE_UUID/probe" -H "X-API-Key: $KEY"
|
||||
|
||||
---
|
||||
|
||||
### `GET /api/v1/progress/:file_uuid`
|
||||
### `POST /api/v1/progress/:file_uuid`
|
||||
|
||||
**Auth**: Required
|
||||
**Scope**: file-level
|
||||
|
||||
Get real-time processing progress for a file via Redis pub/sub. Includes per-processor status, current/total frames, ETA, and system resource stats.
|
||||
|
||||
**Note**: This endpoint uses the **POST** method, not GET. Progress data is stored in Redis as a hash, and a POST request retrieves the latest state.
|
||||
|
||||
#### Pipeline Order
|
||||
|
||||
| Order | Processor | Dependencies | Description |
|
||||
@@ -154,7 +156,7 @@ All processors except `story` and `5w1h` run concurrently when their dependencie
|
||||
#### Example
|
||||
|
||||
```bash
|
||||
curl -s "$API/api/v1/progress/$FILE_UUID" -H "X-API-Key: $KEY" | jq '{overall_progress, processors: [.processors[] | {processor_type, status}]}'
|
||||
curl -s -X POST "$API/api/v1/progress/$FILE_UUID" -H "X-API-Key: $KEY" | jq '{overall_progress, processors: [.processors[] | {name, status}]}'
|
||||
```
|
||||
|
||||
#### Response (200)
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
title: Rule 2 TKG Relationship Chunks V1.0
|
||||
version: 1.0
|
||||
date: 2026-06-20
|
||||
version: 1.1
|
||||
date: 2026-06-22
|
||||
author: OpenCode
|
||||
status: approved
|
||||
---
|
||||
@@ -18,13 +18,26 @@ Rule 2 creates **relationship chunks** by converting TKG edges into searchable,
|
||||
|
||||
**Key Change:** Original Rule 2 (YOLO frame objects) is deprecated due to COCO classes being too generic. New Rule 2 focuses on TKG relationships.
|
||||
|
||||
## Node Types (V2.0 - Intuitive Naming)
|
||||
|
||||
| Old Name | New Name | Description | external_id Format |
|
||||
|----------|----------|-------------|-------------------|
|
||||
| `face_trace` | `face_track` | Face tracking across frames | `face_track_1` |
|
||||
| `person_trace` | `body_track` | Body appearance tracking | `body_track_0` |
|
||||
| `gaze_trace` | `gaze_track` | Gaze direction sequence | `gaze_track_1` |
|
||||
| `lip_trace` | `lip_track` | Lip sync sequence | `lip_track_1` |
|
||||
| `hand_trace` | `hand_track` | Hand state sequence | `hand_track_0` |
|
||||
| `speaker` | `speaker_segment` | Speaker segment | `speaker_01` |
|
||||
| `object` | `detected_object` | YOLO detected object | `car`, `phone` |
|
||||
| `text_trace` | `text_region` | OCR text region | `text_1` |
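
The renaming in the table above can be expressed as a lookup. This is a minimal sketch, not the project's migration code: the dictionary is copied from the table, while `rename_node_type`, `migrate_node`, and the node dict shape are hypothetical names introduced for illustration.

```python
# Old node_type -> new node_type, copied from the table above.
NODE_TYPE_RENAMES = {
    "face_trace": "face_track",
    "person_trace": "body_track",
    "gaze_trace": "gaze_track",
    "lip_trace": "lip_track",
    "hand_trace": "hand_track",
    "speaker": "speaker_segment",
    "object": "detected_object",
    "text_trace": "text_region",
}


def rename_node_type(node_type: str) -> str:
    """Return the new name for a legacy node type; other values pass through."""
    return NODE_TYPE_RENAMES.get(node_type, node_type)


def migrate_node(node: dict) -> dict:
    """Return a copy of a node dict (hypothetical shape) with node_type renamed."""
    return {**node, "node_type": rename_node_type(node["node_type"])}
```

Because already-renamed values pass through unchanged, running the mapping twice over the same rows is harmless.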
|
||||
|
||||
## Data Flow
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────┐
|
||||
│ UPSTREAM: TKG Builder │
|
||||
│ │
|
||||
│ tkg_nodes: face_trace, speaker, object, etc. │
|
||||
│ tkg_nodes: face_track, speaker_segment, detected_object │
|
||||
│ tkg_edges: speaker_face, mutual_gaze, co_occurs, etc. │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────┘
|
||||
@@ -42,7 +55,7 @@ Rule 2 creates **relationship chunks** by converting TKG edges into searchable,
|
||||
│ ├─ Query tkg_edges by type (priority order) │
|
||||
│ ├─ For each edge: │
|
||||
│ │ ├─ Resolve source_node / target_node │
|
||||
│ │ ├─ Resolve identity names (if face_trace) │
|
||||
│ │ ├─ Resolve identity names (if face_track) │
|
||||
│ │ ├─ Build context JSON │
|
||||
│ │ ├─ call_llm(context) → text_content │
|
||||
│ │ └─ INSERT INTO chunk (chunk_type='relationship') │
|
||||
@@ -68,12 +81,12 @@ Rule 2 creates **relationship chunks** by converting TKG edges into searchable,
|
||||
|
||||
| Priority | Edge Type | Description | Example Output |
|
||||
|----------|-----------|-------------|----------------|
|
||||
| P0 | `speaker_face` | Speaker ↔ Face trace | "SPEAKER_01 以 Cary Grant 的身份說話,從 frame 100 到 350" |
|
||||
| P0 | `mutual_gaze` | Two face traces looking at each other | "Cary Grant 和 Grace Kelly 互相看對方 24 幀,起始於 frame 450" |
|
||||
| P1 | `face_face` | Two face traces co-occurring | "Cary Grant 和 Grace Kelly 同框 180 幀" |
|
||||
| P1 | `co_occurs` | Object ↔ Object co-occurrence | "物件 'car' 和 'person' 在同一畫面出現 60 幀" |
|
||||
| P2 | `has_appearance` | Face trace ↔ Appearance trace | "Cary Grant 穿著藍色上衣,戴眼鏡" |
|
||||
| P2 | `wears` | Face trace ↔ Accessory | "Cary Grant 戴帽子,信心值 0.82" |
|
||||
| P0 | `speaker_face` | Speaker ↔ Face track | "SPEAKER_01 以 Cary Grant 的身份說話,從 frame 100 到 350" |
|
||||
| P0 | `mutual_gaze` | Two face tracks looking at each other | "Cary Grant 和 Grace Kelly 互相看對方 24 幀,起始於 frame 450" |
|
||||
| P1 | `face_face` | Two face tracks co-occurring | "Cary Grant 和 Grace Kelly 同框 180 幀" |
|
||||
| P1 | `co_occurs` | Detected object ↔ Detected object co-occurrence | "物件 'car' 和 'person' 在同一畫面出現 60 幀" |
|
||||
| P2 | `has_appearance` | Face track ↔ Body track | "Cary Grant 穿著藍色上衣,戴眼鏡" |
|
||||
| P2 | `wears` | Face track ↔ Accessory | "Cary Grant 戴帽子,信心值 0.82" |
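
To illustrate "query by type in priority order", here is a small sketch that sorts edges by the P0/P1/P2 levels from the table. The `edge_type` key and the choice to sort unknown edge types last are assumptions, not taken from the Rust implementation.

```python
# Priority levels from the table above (0 = P0, processed first).
EDGE_PRIORITY = {
    "speaker_face": 0,
    "mutual_gaze": 0,
    "face_face": 1,
    "co_occurs": 1,
    "has_appearance": 2,
    "wears": 2,
}

UNKNOWN_PRIORITY = 99  # assumption: unlisted edge types are handled last


def order_edges(edges: list) -> list:
    """Sort edge dicts by priority; sorted() is stable, so ties keep input order."""
    return sorted(edges, key=lambda e: EDGE_PRIORITY.get(e["edge_type"], UNKNOWN_PRIORITY))
```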
|
||||
|
||||
## Chunk Data Structure
|
||||
|
||||
@@ -85,15 +98,15 @@ Rule 2 creates **relationship chunks** by converting TKG edges into searchable,
|
||||
"edge_id": 123,
|
||||
"source_node": {
|
||||
"id": 45,
|
||||
"node_type": "speaker",
|
||||
"external_id": "SPEAKER_01",
|
||||
"node_type": "speaker_segment",
|
||||
"external_id": "speaker_01",
|
||||
"label": "SPEAKER_01"
|
||||
},
|
||||
"target_node": {
|
||||
"id": 67,
|
||||
"node_type": "face_trace",
|
||||
"external_id": "trace_5",
|
||||
"label": "Face Trace 5",
|
||||
"node_type": "face_track",
|
||||
"external_id": "face_track_5",
|
||||
"label": "Face Track 5",
|
||||
"identity_name": "Cary Grant"
|
||||
},
|
||||
"properties": {
|
||||
@@ -157,21 +170,21 @@ LLM-generated natural language description in Traditional Chinese:
|
||||
### speaker_face Edge
|
||||
|
||||
```rust
|
||||
// Source: speaker node
|
||||
// Target: face_trace node
|
||||
// Source: speaker_segment node
|
||||
// Target: face_track node
|
||||
// Properties: first_frame, last_frame, lip_sync_confidence
|
||||
|
||||
let text_content = call_llm(format!(
|
||||
"SPEAKER {} 對應 face trace {},身份 {},frame {}-{}",
|
||||
speaker_id, trace_id, identity_name, first_frame, last_frame
|
||||
"SPEAKER {} 對應 face track {},身份 {},frame {}-{}",
|
||||
speaker_id, track_id, identity_name, first_frame, last_frame
|
||||
));
|
||||
```
|
||||
|
||||
### mutual_gaze Edge
|
||||
|
||||
```rust
|
||||
// Source: face_trace node A
|
||||
// Target: face_trace node B
|
||||
// Source: face_track node A
|
||||
// Target: face_track node B
|
||||
// Properties: first_frame, gaze_frame_count, yaw_a_avg, yaw_b_avg
|
||||
|
||||
let text_content = call_llm(format!(
|
||||
@@ -183,8 +196,8 @@ let text_content = call_llm(format!(
|
||||
### has_appearance Edge
|
||||
|
||||
```rust
|
||||
// Source: face_trace node
|
||||
// Target: appearance_trace node
|
||||
// Source: face_track node
|
||||
// Target: body_track node
|
||||
// Properties: clothing colors, accessories
|
||||
|
||||
let text_content = call_llm(format!(
|
||||
@@ -232,4 +245,5 @@ let text_content = call_llm(format!(
|
||||
|
||||
| Version | Date | Author | Change |
|
||||
|---------|------|--------|--------|
|
||||
| 1.1 | 2026-06-22 | OpenCode | Node type renaming: face_trace→face_track, person_trace→body_track, etc. |
|
||||
| 1.0 | 2026-06-20 | OpenCode | Initial design: TKG edges → relationship chunks |
|
||||
179
docs_v1.0/DESIGN/Redis_Prefix_Configuration.md
Normal file
@@ -0,0 +1,179 @@
|
||||
---
|
||||
title: Redis Prefix Configuration
|
||||
version: 1.0
|
||||
date: 2026-06-21
|
||||
author: momentry_core development
|
||||
status: active
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Momentry Core uses Redis key prefixes to isolate namespaces between Production and Playground environments. This prevents cross-contamination of job queues, progress data, and cache entries.
|
||||
|
||||
## Environment Configuration
|
||||
|
||||
| Environment | Port | Redis Prefix | Config File |
|
||||
|-------------|------|--------------|-------------|
|
||||
| **Production** | 3002 | `momentry:` | `.env` (default) |
|
||||
| **Playground** | 3003 | `momentry_dev:` | `.env.development` |
|
||||
|
||||
### Configuration
|
||||
|
||||
```bash
|
||||
# Production (.env)
|
||||
MOMENTRY_REDIS_PREFIX=momentry: # Default if not set
|
||||
|
||||
# Playground (.env.development)
|
||||
MOMENTRY_REDIS_PREFIX=momentry_dev:
|
||||
```
|
||||
|
||||
## Redis Key Structure
|
||||
|
||||
All Redis keys follow this pattern:
|
||||
|
||||
```
|
||||
{prefix}{key_type}:{identifier}
|
||||
```
|
||||
|
||||
### Key Types
|
||||
|
||||
| Key Type | Pattern | Example |
|
||||
|----------|---------|---------|
|
||||
| Job | `{prefix}job:{file_uuid}` | `momentry:job:abc123...` |
|
||||
| Progress | `{prefix}progress:{file_uuid}` | `momentry:progress:abc123...` |
|
||||
| Processor | `{prefix}job:{file_uuid}:processor:{type}` | `momentry:job:abc123:processor:face` |
|
||||
| Health | `{prefix}health` | `momentry:health` |
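
The key patterns in the table can be captured in a few helpers. This is an illustrative sketch only; the function names are hypothetical and the worker builds its keys in Rust.

```python
def job_key(prefix: str, file_uuid: str) -> str:
    """{prefix}job:{file_uuid}"""
    return f"{prefix}job:{file_uuid}"


def progress_key(prefix: str, file_uuid: str) -> str:
    """{prefix}progress:{file_uuid}"""
    return f"{prefix}progress:{file_uuid}"


def processor_key(prefix: str, file_uuid: str, processor_type: str) -> str:
    """{prefix}job:{file_uuid}:processor:{type}"""
    return f"{job_key(prefix, file_uuid)}:processor:{processor_type}"


def health_key(prefix: str) -> str:
    """{prefix}health"""
    return f"{prefix}health"
```

Note that the prefix already ends in `:`, so the helpers concatenate it directly rather than inserting another separator.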
|
||||
|
||||
## Namespace Isolation
|
||||
|
||||
### Production vs Playground
|
||||
|
||||
**Production (3002)**:
|
||||
- Jobs created by production API → `momentry:job:*`
|
||||
- Worker must run with production prefix
|
||||
- Production worker sees only production jobs
|
||||
|
||||
**Playground (3003)**:
|
||||
- Jobs created by playground API → `momentry_dev:job:*`
|
||||
- Worker must run with playground prefix
|
||||
- Playground worker sees only playground jobs
|
||||
|
||||
### Cross-Namespace Access
|
||||
|
||||
❌ **Cannot access**:
|
||||
- Production API cannot see playground jobs
|
||||
- Playground API cannot see production jobs
|
||||
- Worker with wrong prefix will not process jobs
|
||||
|
||||
✅ **Design intent**:
|
||||
- Complete isolation between environments
|
||||
- No accidental cross-contamination
|
||||
- Safe testing in playground without affecting production
|
||||
|
||||
## Worker Configuration
|
||||
|
||||
Workers must match the Redis prefix of the server that creates jobs:
|
||||
|
||||
```bash
|
||||
# Production worker
|
||||
./target/release/momentry worker
|
||||
# Uses: momentry: prefix (default)
|
||||
|
||||
# Playground worker
|
||||
./target/debug/momentry_playground worker
|
||||
# Uses: momentry_dev: prefix (from .env.development)
|
||||
```
|
||||
|
||||
### Worker Redis Connection
|
||||
|
||||
Workers read Redis prefix from environment:
|
||||
|
||||
1. Check `MOMENTRY_REDIS_PREFIX` environment variable
|
||||
2. If not set, use default prefix:
|
||||
- `momentry` binary → `momentry:`
|
||||
- `momentry_playground` binary → `momentry_dev:`
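
The lookup order above could be sketched as follows. The function name is hypothetical, and treating an empty `MOMENTRY_REDIS_PREFIX` as "not set" is an assumption of this sketch.

```python
import os

# Per-binary defaults from the list above.
BINARY_DEFAULT_PREFIX = {
    "momentry": "momentry:",
    "momentry_playground": "momentry_dev:",
}


def resolve_redis_prefix(binary_name: str, env=None) -> str:
    """Prefer MOMENTRY_REDIS_PREFIX; otherwise fall back to the binary's default."""
    env = os.environ if env is None else env
    value = env.get("MOMENTRY_REDIS_PREFIX")
    if value:  # assumption: an empty value counts as unset
        return value
    return BINARY_DEFAULT_PREFIX[binary_name]
```

Passing `env` explicitly keeps the function testable without touching the real process environment.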
|
||||
|
||||
## Common Issues
|
||||
|
||||
### Issue: Jobs Not Being Processed
|
||||
|
||||
**Symptoms**:
|
||||
- API returns "Processing triggered"
|
||||
- Worker shows no activity
|
||||
- Redis job key created but not consumed
|
||||
|
||||
**Cause**: Worker running with wrong Redis prefix
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Check worker prefix
|
||||
redis-cli keys "momentry*"
|
||||
|
||||
# If jobs in momentry: namespace
|
||||
# Production worker needed
|
||||
./target/release/momentry worker
|
||||
|
||||
# If jobs in momentry_dev: namespace
|
||||
# Playground worker needed
|
||||
./target/debug/momentry_playground worker
|
||||
```
|
||||
|
||||
### Issue: Progress API Returns Empty
|
||||
|
||||
**Symptoms**:
|
||||
- Progress API returns empty response
|
||||
- Job exists but progress not visible
|
||||
|
||||
**Cause**: Progress key in different namespace
|
||||
|
||||
**Solution**:
|
||||
- Ensure worker prefix matches server prefix
|
||||
- Check Redis keys: `redis-cli keys "{prefix}progress:*"`
|
||||
|
||||
## Redis CLI Examples
|
||||
|
||||
```bash
|
||||
# List all production jobs
|
||||
redis-cli -a accusys keys "momentry:job:*"
|
||||
|
||||
# List all playground jobs
|
||||
redis-cli -a accusys keys "momentry_dev:job:*"
|
||||
|
||||
# Check progress for specific file (production)
|
||||
redis-cli -a accusys HGETALL "momentry:progress:{file_uuid}"
|
||||
|
||||
# Check progress for specific file (playground)
|
||||
redis-cli -a accusys HGETALL "momentry_dev:progress:{file_uuid}"
|
||||
|
||||
# Delete all production jobs (⚠️ destructive)
|
||||
redis-cli -a accusys keys "momentry:job:*" | xargs redis-cli -a accusys del
|
||||
|
||||
# Delete all playground jobs (⚠️ destructive)
|
||||
redis-cli -a accusys keys "momentry_dev:job:*" | xargs redis-cli -a accusys del
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always match worker to server**: Production worker for production server, playground worker for playground server
|
||||
|
||||
2. **Check Redis keys**: Before debugging worker issues, verify namespace alignment
|
||||
|
||||
3. **Document in AGENTS.md**: Update Redis prefix documentation when configuration changes
|
||||
|
||||
4. **Never mix namespaces**: Keep production and playground completely isolated
|
||||
|
||||
5. **Use environment variables**: Configure prefix via `.env` files, not hardcoded values
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- `docs_v1.0/DESIGN/Redis_Progress_Reporting_V1.0.md` - Progress reporting design
|
||||
- `docs_v1.0/M4_workspace/2026-06-21_issue_report.md` - Issue report with Redis prefix problem
|
||||
- `AGENTS.md` - Environment configuration reference
|
||||
|
||||
---
|
||||
|
||||
## Version History
|
||||
|
||||
| Version | Date | Changes |
|
||||
|---------|------|---------|
|
||||
| 1.0 | 2026-06-21 | Initial documentation for Redis prefix configuration |
|
||||
328
docs_v1.0/DESIGN/Worker_Health_Check_Mechanism.md
Normal file
@@ -0,0 +1,328 @@
|
||||
---
|
||||
title: Worker Health Check Mechanism
|
||||
version: 1.0
|
||||
date: 2026-06-21
|
||||
author: momentry_core development
|
||||
status: active
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Momentry Core worker processes can become stuck due to:
|
||||
- Redis connection timeouts
|
||||
- Job queue corruption
|
||||
- Long-running processor hangs
|
||||
- Resource exhaustion
|
||||
|
||||
This document describes health check mechanisms and recommended solutions.
|
||||
|
||||
## Current Architecture
|
||||
|
||||
### Worker Process
|
||||
|
||||
```
|
||||
momentry worker
|
||||
│
|
||||
├─→ Redis connection pool
|
||||
│ └─→ Poll job queue ({prefix}job:*)
|
||||
│
|
||||
├─→ Processor executor
|
||||
│ ├─→ Python scripts (timeout: configurable)
|
||||
│ └─→ Resource monitoring (CPU, memory, GPU)
|
||||
│
|
||||
└─→ Dynamic concurrency
|
||||
└─→ Adjust based on system resources
|
||||
```
|
||||
|
||||
### Worker Logs
|
||||
|
||||
Worker logs are stored in:
|
||||
- `logs/nohup_worker*.log` - Historical worker logs
|
||||
- `logs/momentry_3002.log` - Production server logs
|
||||
- `logs/momentry_3003.log` - Playground server logs
|
||||
|
||||
## Known Issues
|
||||
|
||||
### Issue: Worker Stuck (2026-06-21)
|
||||
|
||||
**Symptoms**:
|
||||
- Worker process running but no activity
|
||||
- Last log timestamp outdated (>17 hours old)
|
||||
- Jobs triggered but never processed
|
||||
- Redis keys created but not consumed
|
||||
|
||||
**Cause**: Worker process running for extended period without proper cleanup
|
||||
|
||||
**Resolution**:
|
||||
```bash
|
||||
# 1. Check worker status
|
||||
ps aux | grep momentry.*worker
|
||||
|
||||
# 2. Check last activity
|
||||
tail -20 logs/nohup_worker*.log
|
||||
|
||||
# 3. Kill stuck worker
|
||||
kill <PID>
|
||||
|
||||
# 4. Restart worker
|
||||
./target/release/momentry worker
|
||||
```
|
||||
|
||||
## Recommended Health Check Mechanisms
|
||||
|
||||
### 1. Worker Heartbeat
|
||||
|
||||
**Implementation**:
|
||||
- Worker writes heartbeat to Redis every 30 seconds
|
||||
- Heartbeat key: `{prefix}health`
|
||||
- Heartbeat value: `{timestamp, worker_pid, status}`
|
||||
|
||||
**Check**:
|
||||
```bash
|
||||
# Check worker heartbeat
|
||||
redis-cli -a accusys HGETALL "momentry:health"
|
||||
```
|
||||
|
||||
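**Expected output** (fields shown as JSON for readability; `redis-cli` itself prints alternating field and value lines):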
|
||||
```json
|
||||
{
|
||||
"timestamp": "1782015243",
|
||||
"worker_pid": "52908",
|
||||
"status": "active",
|
||||
"last_job": "abc123..."
|
||||
}
|
||||
```
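
A staleness check over the heartbeat hash might look like the sketch below. It assumes `timestamp` is a Unix epoch in seconds stored as a string (as in the sample output) and reuses the 60-second threshold from the restart example; the function names are hypothetical.

```python
import time


def heartbeat_age(heartbeat: dict, now=None):
    """Seconds since the last heartbeat, or None if the hash has no timestamp."""
    ts = heartbeat.get("timestamp")
    if ts is None:
        return None
    now = time.time() if now is None else now
    return now - int(ts)


def is_stale(heartbeat: dict, now=None, max_age: int = 60) -> bool:
    """A missing heartbeat counts as stale (assumption of this sketch)."""
    age = heartbeat_age(heartbeat, now)
    return age is None or age > max_age
```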
|
||||
|
||||
### 2. Automatic Restart
|
||||
|
||||
**Recommendation**: Implement automatic restart on inactivity timeout
|
||||
|
||||
```bash
|
||||
# Example: Restart worker if no heartbeat for 60 seconds
|
||||
# (To be implemented in worker code)
|
||||
|
||||
while true; do
  # Check heartbeat (an empty reply is treated as timestamp 0, i.e. stale)
  LAST_HEARTBEAT=$(redis-cli HGET momentry:health timestamp)
  CURRENT_TIME=$(date +%s)

  # -gt compares integers; ">" inside [ ] would redirect output to a file named 60
  if [ $((CURRENT_TIME - ${LAST_HEARTBEAT:-0})) -gt 60 ]; then
    echo "Worker stuck, restarting..."
    pkill -f "momentry worker"
    ./target/release/momentry worker &
  fi

  sleep 30
done
|
||||
```
|
||||
|
||||
### 3. Worker Status API
|
||||
|
||||
**Recommendation**: Add `/api/v1/worker/status` endpoint
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"worker_pid": 52908,
|
||||
"status": "active",
|
||||
"last_heartbeat": "2026-06-21T12:15:00Z",
|
||||
"jobs_processed": 42,
|
||||
"current_job": "abc123...",
|
||||
"uptime_seconds": 3600
|
||||
}
|
||||
```
|
||||
|
||||
### 4. Job Queue Monitoring
|
||||
|
||||
**Check for stuck jobs**:
|
||||
```bash
|
||||
# List all pending jobs
|
||||
redis-cli -a accusys keys "momentry:job:*"
|
||||
|
||||
# Check job timestamp
|
||||
redis-cli -a accusys HGET "momentry:job:{file_uuid}" created_at
|
||||
|
||||
# If job > 1 hour old without progress → stuck job
|
||||
```
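
The one-hour rule in the last comment can be written as a predicate. This sketch assumes `created_at` is a Unix epoch in seconds; the source does not state the field's format, and `is_stuck` is a hypothetical name.

```python
def is_stuck(created_at: int, now: int, has_progress: bool, max_age_seconds: int = 3600) -> bool:
    """True when a job is older than max_age_seconds and has reported no progress."""
    return (now - created_at) > max_age_seconds and not has_progress
```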
|
||||
|
||||
### 5. Resource Monitoring
|
||||
|
||||
**Worker logs include system stats**:
|
||||
```
|
||||
System: CPU idle=50.0%, Memory=31948MB/49152MB (35.0%), No GPU
|
||||
Dynamic concurrency: 2 (config: 2)
|
||||
```
|
||||
|
||||
**Monitor**:
|
||||
- CPU idle > 90% for extended period → worker not processing
|
||||
- Memory > 90% → resource exhaustion risk
|
||||
- GPU not available → GPU-dependent processors will fail
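
The three thresholds above, applied to a single sample, could be sketched like this. The function and message constants are hypothetical, and a real check would need several consecutive samples to cover the "extended period" condition.

```python
CPU_IDLE_WARNING = "CPU idle > 90%: worker may not be processing"
MEMORY_WARNING = "Memory > 90%: resource exhaustion risk"
GPU_WARNING = "GPU not available: GPU-dependent processors will fail"


def resource_warnings(cpu_idle_pct: float, memory_used_pct: float, gpu_available: bool) -> list:
    """Return the warnings triggered by one resource sample."""
    warnings = []
    if cpu_idle_pct > 90:
        warnings.append(CPU_IDLE_WARNING)
    if memory_used_pct > 90:
        warnings.append(MEMORY_WARNING)
    if not gpu_available:
        warnings.append(GPU_WARNING)
    return warnings
```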
|
||||
|
||||
## Monitoring Script
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# worker_health_monitor.sh
|
||||
|
||||
PREFIX="momentry:"
|
||||
REDIS_URL="redis://:accusys@localhost:6379"
|
||||
|
||||
while true; do
  echo "=== Worker Health Check ==="

  # Check worker process
  WORKER_PID=$(pgrep -f "momentry worker")
  if [ -z "$WORKER_PID" ]; then
    echo "❌ No worker process running"
    echo "Starting worker..."
    ./target/release/momentry worker &
    sleep 30  # wait before re-checking so a failed start does not spin
    continue
  fi

  echo "✅ Worker running (PID: $WORKER_PID)"

  # Check Redis heartbeat
  HEARTBEAT=$(redis-cli -a accusys HGET "${PREFIX}health" timestamp)
  if [ -n "$HEARTBEAT" ]; then
    AGE=$(( $(date +%s) - HEARTBEAT ))
    # -gt compares integers; ">" inside [ ] is a redirection, not a comparison
    if [ "$AGE" -gt 60 ]; then
      echo "⚠️ Worker heartbeat stale ($AGE seconds old)"
      echo "Restarting worker..."
      kill $WORKER_PID
      ./target/release/momentry worker &
    else
      echo "✅ Heartbeat recent ($AGE seconds old)"
    fi
  else
    echo "⚠️ No heartbeat found"
  fi

  # Check pending jobs
  JOBS=$(redis-cli -a accusys keys "${PREFIX}job:*" | wc -l)
  echo "Pending jobs: $JOBS"

  sleep 30
done
|
||||
```
|
||||
|
||||
## Preventive Measures
|
||||
|
||||
### 1. Regular Worker Restart
|
||||
|
||||
**Recommendation**: Restart worker daily to prevent accumulation
|
||||
|
||||
```bash
|
||||
# Daily restart at 3 AM
|
||||
# Add to crontab:
|
||||
0 3 * * * pkill -f "momentry worker" && sleep 5 && ./target/release/momentry worker &
|
||||
|
||||
# Or use systemd/launchd for automatic restart
|
||||
```
|
||||
|
||||
### 2. Timeout Configuration
|
||||
|
||||
**Set reasonable timeouts**:
|
||||
```bash
|
||||
# Environment variables
|
||||
MOMENTRY_ASR_TIMEOUT=3600 # 1 hour for ASR
|
||||
MOMENTRY_CUT_TIMEOUT=3600 # 1 hour for CUT
|
||||
MOMENTRY_DEFAULT_TIMEOUT=7200 # 2 hours default
|
||||
```
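
One way a worker could resolve these timeouts is sketched below. The `MOMENTRY_<PROCESSOR>_TIMEOUT` naming pattern is generalized from the two examples above and is an assumption, as is the function name; the 7200-second fallback mirrors the default shown.

```python
import os

FALLBACK_TIMEOUT_SECONDS = 7200  # matches MOMENTRY_DEFAULT_TIMEOUT above


def processor_timeout(processor: str, env=None) -> int:
    """Per-processor timeout, then MOMENTRY_DEFAULT_TIMEOUT, then the fallback."""
    env = os.environ if env is None else env
    specific = env.get(f"MOMENTRY_{processor.upper()}_TIMEOUT")
    if specific is not None:
        return int(specific)
    return int(env.get("MOMENTRY_DEFAULT_TIMEOUT", FALLBACK_TIMEOUT_SECONDS))
```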
|
||||
|
||||
### 3. Resource Limits
|
||||
|
||||
**Limit worker concurrency**:
|
||||
```bash
|
||||
# Worker flags:
#   --max-concurrent  max parallel processors
#   --poll-interval   poll interval in seconds
#   --batch-size      jobs processed per batch
# (Comments cannot follow a trailing backslash, so they are listed here.)
./target/release/momentry worker \
  --max-concurrent 6 \
  --poll-interval 10 \
  --batch-size 5
|
||||
```
|
||||
|
||||
### 4. Logging Enhancement
|
||||
|
||||
**Recommendation**: Add structured logging for job lifecycle
|
||||
|
||||
```rust
|
||||
// In job_worker.rs
|
||||
tracing::info!(
|
||||
job_id = %job.id,
|
||||
file_uuid = %file_uuid,
|
||||
status = "started",
|
||||
"Worker started job"
|
||||
);
|
||||
|
||||
tracing::info!(
|
||||
job_id = %job.id,
|
||||
duration_ms = elapsed,
|
||||
status = "completed",
|
||||
"Worker completed job"
|
||||
);
|
||||
```
|
||||
|
||||
## Troubleshooting Guide
|
||||
|
||||
### Step 1: Check Process
|
||||
|
||||
```bash
|
||||
ps aux | grep momentry.*worker
|
||||
```
|
||||
|
||||
Expected: One worker process per environment (production + playground)
|
||||
|
||||
### Step 2: Check Logs
|
||||
|
||||
```bash
|
||||
tail -50 logs/nohup_worker*.log
|
||||
```
|
||||
|
||||
Look for:
|
||||
- Last log timestamp
|
||||
- Error messages
|
||||
- Processor failures
|
||||
|
||||
### Step 3: Check Redis
|
||||
|
||||
```bash
|
||||
redis-cli -a accusys keys "momentry:job:*"
|
||||
redis-cli -a accusys HGETALL "momentry:health"
|
||||
```
|
||||
|
||||
Look for:
|
||||
- Pending jobs count
|
||||
- Heartbeat timestamp
|
||||
- Job creation timestamps
|
||||
|
||||
### Step 4: Check Resources
|
||||
|
||||
```bash
|
||||
top -pid <worker_pid>
|
||||
```
|
||||
|
||||
Look for:
|
||||
- CPU usage (should be active if processing)
|
||||
- Memory usage (should not exceed 80%)
|
||||
- Process state (should be running, not sleeping)
|
||||
|
||||
### Step 5: Restart Worker
|
||||
|
||||
```bash
|
||||
kill <worker_pid>
|
||||
./target/release/momentry worker
|
||||
```
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- `docs_v1.0/DESIGN/Redis_Prefix_Configuration.md` - Redis namespace configuration
|
||||
- `docs_v1.0/M4_workspace/2026-06-21_issue_report.md` - Worker stuck issue report
|
||||
- `AGENTS.md` - Worker configuration reference
|
||||
- `src/worker/job_worker.rs` - Worker implementation
|
||||
|
||||
---
|
||||
|
||||
## Version History
|
||||
|
||||
| Version | Date | Changes |
|
||||
|---------|------|---------|
|
||||
| 1.0 | 2026-06-21 | Initial documentation for worker health check mechanisms |
|
||||
97
docs_v1.0/M4_workspace/2026-06-21_job_status_fix.md
Normal file
@@ -0,0 +1,97 @@
|
||||
---
|
||||
title: Job Status Sync Fix - Historical Processor Results Issue
|
||||
version: 1.0
|
||||
date: 2026-06-21
|
||||
author: OpenCode
|
||||
status: resolved
|
||||
---
|
||||
|
||||
# Job Status Sync Fix - Historical Processor Results Issue
|
||||
|
||||
## Problem Summary
|
||||
|
||||
Production Worker marked jobs as 'failed' even when current processors completed successfully.
|
||||
|
||||
## Root Cause
|
||||
|
||||
### Location: `src/worker/job_worker.rs:1070`
|
||||
|
||||
```rust
|
||||
let any_failed = results
|
||||
.iter()
|
||||
.any(|r| matches!(r.status, ProcessorJobStatus::Failed));
|
||||
```
|
||||
|
||||
### Logic Defect
|
||||
- Checked **all historical processor_results** (results=8)
|
||||
- If **any historical processor failed** → job marked as failed
|
||||
- **Ignored job_processors** (current request processors)
|
||||
|
||||
### Example Case
|
||||
Job ID 63:
|
||||
- Historical: asr, yolo, face, ocr, pose, mediapipe, appearance (all failed)
|
||||
- Current: cut (completed)
|
||||
- Result: `any_failed=true` → job status='failed' ❌
|
||||
|
||||
## Fix Implementation
|
||||
|
||||
### Modified Code (lines 1070-1110)
|
||||
|
||||
```rust
|
||||
// Before
|
||||
let any_failed = results
|
||||
.iter()
|
||||
.any(|r| matches!(r.status, ProcessorJobStatus::Failed));
|
||||
|
||||
// After
|
||||
let any_failed = results
|
||||
.iter()
|
||||
.filter(|r| job_processors.contains(&r.processor_type.as_str().to_string()))
|
||||
.any(|r| matches!(r.status, ProcessorJobStatus::Failed));
|
||||
```
|
||||
|
||||
### Key Changes
|
||||
1. Added filter for `job_processors` parameter
|
||||
2. Only checks processors in current request
|
||||
3. Ignores historical failed processors
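
The effect of the filter can be reproduced outside the worker. The sketch below is a Python rendering of the same logic, not the Rust code itself; the dict shape and status strings are simplified assumptions, and the data mirrors the Job 63 example.

```python
def any_failed(results: list, job_processors: list) -> bool:
    """Only results for processors in the current request can fail the job."""
    current = set(job_processors)
    return any(
        r["status"] == "failed"
        for r in results
        if r["processor_type"] in current
    )


# Job 63: seven historical failures plus one completed "cut" run.
historical = ["asr", "yolo", "face", "ocr", "pose", "mediapipe", "appearance"]
results = [{"processor_type": p, "status": "failed"} for p in historical]
results.append({"processor_type": "cut", "status": "completed"})
```

With `job_processors = ["cut"]` the job is no longer marked failed, while passing every processor name reproduces the old behaviour.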
|
||||
|
||||
## Verification Results
|
||||
|
||||
### Production (3002) After Fix
|
||||
```
|
||||
Found 1 pending jobs ✅
|
||||
Processing job: 53090f160138fd4a01d62edf8395c6a0 (63) ✅
|
||||
Processor cut output file exists, marking completed ✅
|
||||
Job status: running ✅ (not failed)
|
||||
```
|
||||
|
||||
### Playground (3003) Comparison
|
||||
- Playground had fewer historical results
|
||||
- Jobs processed successfully before fix
|
||||
- Dev schema works normally
|
||||
|
||||
## Deployment
|
||||
|
||||
### Binary
|
||||
- Compiled: Jun 21 14:35
|
||||
- Worker restart: PID 28623
|
||||
- Logs: `logs/worker_3002_fixed.log`
|
||||
|
||||
### Test Command
|
||||
```bash
|
||||
curl -X POST "http://localhost:3002/api/v1/file/53090f160138fd4a01d62edf8395c6a0/process" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"processors": ["cut"]}'
|
||||
```
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
1. **Job lifecycle should be scoped to request**: Only check processors in current request
|
||||
2. **Historical data pollution**: Failed attempts can pollute job status logic
|
||||
3. **Filter early**: Apply filters before checking status to avoid false positives
|
||||
|
||||
## Related Files
|
||||
- `src/worker/job_worker.rs:1070-1110` (fixed)
|
||||
- `src/worker/job_worker.rs:1407` (any_failed handling)
|
||||
- `logs/worker_3002_fixed.log` (verification)
|
||||
|
||||
84
docs_v1.0/M4_workspace/2026-06-21_job_status_sync_issue.md
Normal file
@@ -0,0 +1,84 @@
|
||||
---
|
||||
title: PostgreSQL Job Status Sync Issue
|
||||
version: 1.0
|
||||
date: 2026-06-21
|
||||
author: OpenCode
|
||||
status: identified
|
||||
---
|
||||
|
||||
# PostgreSQL Job Status Sync Issue
|
||||
|
||||
## Problem Description
|
||||
|
||||
Production Worker (3002) cannot find pending jobs despite successful UPDATE operations.
|
||||
|
||||
## Evidence
|
||||
|
||||
### Server Logs
|
||||
```
|
||||
UPDATE monitor_jobs SET processors = ..., status = 'pending' WHERE uuid = '...'
|
||||
rows_affected=1 ✅
|
||||
elapsed=565.917µs
|
||||
```
|
||||
|
||||
### PostgreSQL Query Timeline
|
||||
1. **Trigger at 06:04:39**: UPDATE executed (rows_affected=1)
|
||||
2. **Query at 06:04:41** (Python): status='pending' ✅
|
||||
3. **Query at 06:06**: status='failed' ❌ (reverted)
|
||||
4. **Worker SELECT at 06:04-06:07**: rows_returned=0 ❌
|
||||
|
||||
### Key Findings
|
||||
- Server UPDATE succeeds (rows_affected=1)
|
||||
- PostgreSQL briefly shows 'pending' (confirmed 2 seconds later)
|
||||
- Status immediately reverts to 'failed'
|
||||
- Worker SELECT never finds pending jobs
|
||||
|
||||
## Hypotheses
|
||||
|
||||
1. **Another process resets status**: Unknown mechanism changing status back to 'failed'
|
||||
2. **Job lifecycle logic**: Job processing framework has logic that marks failed jobs back as failed
|
||||
3. **Connection pool transaction issue**: UPDATE happens in one transaction, reverted in another
|
||||
4. **Worker health check**: Only affects WHERE status='running', not pending jobs
|
||||
|
||||
## Configuration Verified
|
||||
- Server schema: `public` ✅
|
||||
- Worker schema: `public` ✅
|
||||
- monitor_jobs.uuid: VARCHAR(32) ✅
|
||||
- All uuids: 32 characters ✅
|
||||
- Worker binary: Jun 21 13:20 (latest) ✅
|
||||
- Server binary: Jun 21 13:20 (latest) ✅
|
||||
|
||||
## Testing Done
|
||||
1. Restarted Server (3002, PID 65718)
|
||||
2. Restarted Worker (PID 88674)
|
||||
3. Triggered processing for multiple files
|
||||
4. Direct PostgreSQL queries via Python
|
||||
5. API verification: /api/v1/files, /health, /api/v1/jobs
|
||||
|
||||
## Current Status
|
||||
|
||||
**Production (3002)**:
|
||||
- Server: Running ✅
|
||||
- Worker: Running ✅
|
||||
- Jobs: 8 total (6 failed, 1 completed)
|
||||
- Processing: Blocked ❌
|
||||
|
||||
**Playground (3003)**:
|
||||
- Server: Running ✅
|
||||
- Worker: Running ✅
|
||||
- Not tested yet
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. **Test in Playground**: Compare job lifecycle in dev schema
|
||||
2. **Find reset mechanism**: Search for code that resets job status to 'failed'
|
||||
3. **Check job lifecycle**: Review job_worker.rs for failed job handling logic
|
||||
4. **Test new job registration**: Register fresh video and trigger processing
|
||||
|
||||
## Related Files
|
||||
- `src/api/processing.rs`: trigger_processing UPDATE (line 271)
|
||||
- `src/worker/job_worker.rs`: Worker polling and health check (line 95-115)
|
||||
- `src/core/db/postgres_db.rs`: list_monitor_jobs_by_status (line 1720)
|
||||
- `logs/momentry_3002.log`: Server UPDATE logs
|
||||
- `logs/worker_3002_new.log`: Worker SELECT logs
|
||||
|
||||
206
docs_v1.0/issues_2026-06-21.md
Normal file
@@ -0,0 +1,206 @@
|
||||
# Issue Report: 2026-06-21
|
||||
|
||||
## Issue 1: Worker Process Stuck
|
||||
|
||||
### Description
|
||||
Worker process (PID 58279) started on Fri10PM was stuck and not processing new jobs. Last log entry dated 2026-06-20 06:52.
|
||||
|
||||
### Symptoms
|
||||
- Jobs triggered via API returned "Processing triggered" but never executed
|
||||
- Redis keys for new jobs were not created
|
||||
- Progress API returned empty response
|
||||
- Worker logs showed old timestamps
|
||||
|
||||
### Resolution
|
||||
- Killed stuck worker: `kill 58279`
|
||||
- Restarted worker: `cd /Users/accusys/momentry_core && ./target/release/momentry worker`
|
||||
- New worker PID: 52908
|
||||
|
||||
### Root Cause (Suspected)
|
||||
- Worker process running for extended period without proper cleanup
|
||||
- Possible Redis connection timeout or job queue corruption
|
||||
|
||||
### Recommendation
|
||||
- Add worker health check mechanism
|
||||
- Implement automatic worker restart on inactivity timeout
|
||||
- Add logging for job queue polling status
|
||||
|
||||
---
|
||||
|
||||
## Issue 2: Face/YOLO Processor Failure - Missing OpenCV
|
||||
|
||||
### Description
|
||||
Face and YOLO processors failed with `ModuleNotFoundError: No module named 'cv2'`
|
||||
|
||||
### Error Log
|
||||
```
|
||||
[ERROR] Processor face failed for job d8acb03870f0cc9b14e01f14a7bf24d6: Failed to run "/Users/accusys/momentry_core/scripts/face_processor.py"
|
||||
[ERROR] Processor yolo failed for job d8acb03870f0cc9b14e01f14a7bf24d6: Failed to run "/Users/accusys/momentry_core/scripts/yolo_processor.py"
|
||||
```
|
||||
|
||||
### Python Test Result
|
||||
```
|
||||
python3 /Users/accusys/momentry_core/scripts/face_processor.py --help
|
||||
Traceback (most recent call last):
|
||||
File ".../face_processor.py", line 25, in <module>
|
||||
import cv2
|
||||
ModuleNotFoundError: No module named 'cv2'
|
||||
```
|
||||
|
||||
### Resolution
|
||||
```bash
|
||||
pip3 install opencv-python
|
||||
```
|
||||
|
||||
### Recommendation
- Add Python dependency check in worker startup
- Document required Python packages in README
- Add `requirements.txt` with all processor dependencies

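A startup dependency check along the lines recommended above could look like the following minimal sketch. It uses the standard library's `importlib.util.find_spec` to test whether a module is importable without importing it. The `REQUIRED_MODULES` list and the `missing_modules` helper are hypothetical: only `cv2` is confirmed by this report, and the real processors may need more packages.

```python
import importlib.util

# Hypothetical list: only cv2 is confirmed as a dependency by this report.
REQUIRED_MODULES = ["cv2"]


def missing_modules(modules):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        # Exit non-zero so the worker can refuse to start with a clear message.
        raise SystemExit("Missing Python modules: " + ", ".join(missing))
    print("All processor dependencies found")
```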
---

## Issue 3: Redis Prefix Configuration Confusion

### Description
Two different Redis namespaces exist:
- `momentry:` - Production server (port 3002)
- `momentry_dev:` - Playground server (port 3003)

### Impact
- Jobs triggered on the production server are not visible to the playground worker
- Progress data is stored in different namespaces
- API proxy needs to match the correct prefix

### Current Setup
```
Production Server (port 3002): Redis prefix "momentry:"
Playground Server (port 3003): Redis prefix "momentry_dev:"
```

### Recommendation
- Document Redis prefix configuration clearly
- Add environment variable for Redis prefix selection
- Consider using same prefix for development simplicity

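Selecting the prefix through an environment variable, as recommended above, might be sketched as follows. The variable name `MOMENTRY_REDIS_PREFIX` and the helpers `redis_prefix` and `job_key` are assumptions made for illustration. The key layout `<prefix>job:<file_uuid>` follows the keys listed under "Redis Keys Created" in this report.

```python
import os

# The production prefix from this report, used when the variable is unset.
DEFAULT_PREFIX = "momentry:"


def redis_prefix(env=None):
    """Read the prefix from MOMENTRY_REDIS_PREFIX (hypothetical variable name)."""
    env = os.environ if env is None else env
    prefix = env.get("MOMENTRY_REDIS_PREFIX", DEFAULT_PREFIX)
    # Normalise so that "momentry_dev" and "momentry_dev:" behave the same.
    return prefix if prefix.endswith(":") else prefix + ":"


def job_key(file_uuid, env=None):
    """Build the job key for a file under the active prefix."""
    return f"{redis_prefix(env)}job:{file_uuid}"
```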
---

## Issue 4: Progress API Behavior

### Description
`GET /api/v1/progress/:file_uuid` returns an empty response when:
1. No job exists for the file
2. Job is complete (all processors finished)
3. Worker is stuck/not processing

### Expected Behavior (from docs)
```json
{
  "file_uuid": "...",
  "overall_progress": 71,
  "processors": [
    {"processor_type": "asr", "status": "complete", "progress": 100},
    {"processor_type": "yolo", "status": "running", "progress": 65}
  ]
}
```

### Actual Behavior
- Returns empty response (no output) when the job is complete or missing
- Frontend cannot distinguish between "not started" and "completed"

### Recommendation
- Return explicit status for completed jobs (e.g., `{"overall_progress": 100, "status": "completed"}`)
- Return 404 when job not found (file never processed)
- Add `status` field to response: `pending`, `running`, `completed`, `failed`

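The recommended response rules can be sketched in Python as below. This illustrates the decision logic only and is not the server's code: `build_progress_response` is a hypothetical helper, the per-processor values `failed` and `pending` are assumed (the documented example only shows `complete` and `running`), and averaging per-processor progress is an assumed way to compute `overall_progress` that may differ from the real formula.

```python
def build_progress_response(file_uuid, processors):
    """Return (http_status, body). processors is None when no job exists."""
    if processors is None:
        # File was never processed: answer explicitly instead of an empty body.
        return 404, {"file_uuid": file_uuid, "error": "job not found"}

    statuses = {p["status"] for p in processors}
    if "failed" in statuses:
        status = "failed"
    elif statuses and statuses <= {"complete"}:
        status = "completed"
    elif "running" in statuses or "complete" in statuses:
        status = "running"
    else:
        status = "pending"

    # Assumption: overall progress is the integer mean of processor progress.
    total = sum(p["progress"] for p in processors)
    overall = total // len(processors) if processors else 0
    return 200, {
        "file_uuid": file_uuid,
        "status": status,
        "overall_progress": overall,
        "processors": processors,
    }
```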
---

## Issue 5: Frontend Status Display Bug

### Description
Frontend showed the "處理中" (processing) status for the Gamma Carry file, but:
- Database status: `registered` (not processed)
- No job in Redis
- No progress data

### Cause
Frontend code sets `f.status = 'processing'` immediately after process trigger, without verifying job creation:

```typescript
// LibraryView.vue line 463
if (result.success) {
  f.status = 'processing'  // Sets status prematurely
  pollProgress(f.file_uuid)
}
```

### Impact
- User sees "processing" status but actual processing never started
- Misleading UI feedback

### Recommendation
- Verify job creation before setting status
- Check Redis job key existence
- Poll progress API and set status based on actual response
- Handle case when progress API returns empty (job not created)

---

## Test Results Summary

### File: Gamma Carry Saves the World..mp4
- UUID: `d8acb03870f0cc9b14e01f14a7bf24d6`
- Processing triggered: 2026-06-21 12:13

### Processor Results
| Processor | Status | Output |
|-----------|--------|--------|
| cut | ✓ Complete | 4825 frames |
| asr | ✓ Complete | 0 segments |
| face | ✗ Failed | Missing cv2 |
| yolo | ✗ Failed | Missing cv2 |
| ocr | - Not run | Dependency failed |
| pose | - Not run | Dependency failed |

### Redis Keys Created
```
momentry:job:d8acb03870f0cc9b14e01f14a7bf24d6
momentry:progress:d8acb03870f0cc9b14e01f14a7bf24d6
momentry:job:d8acb03870f0cc9b14e01f14a7bf24d6:processor:cut
momentry:job:d8acb03870f0cc9b14e01f14a7bf24d6:processor:asr
momentry:job:d8acb03870f0cc9b14e01f14a7bf24d6:processor:face
momentry:job:d8acb03870f0cc9b14e01f14a7bf24d6:processor:yolo
```

### API Test Results
| API | Status | Note |
|-----|--------|------|
| `POST /api/v1/file/:uuid/process` | ✓ Works | Job created |
| `GET /api/v1/file/:uuid/processor-counts` | ✓ Works | Returns correct counts |
| `GET /api/v1/progress/:uuid` | Partial | Empty when complete/missing |
| `GET /api/v1/jobs` | - Not tested | No response via proxy |

---

## Recommended Actions

### Immediate
1. Install OpenCV: `pip3 install opencv-python`
2. Add worker health monitoring
3. Fix progress API to return status for completed jobs

### Short-term
1. Add Python dependency validation in worker
2. Document Redis prefix configuration
3. Improve frontend status verification

### Long-term
1. Add `requirements.txt` for processor scripts
2. Implement worker auto-restart mechanism
3. Add comprehensive logging for job lifecycle
4. Create integration tests for processing pipeline

---

*Report generated: 2026-06-21 12:15*
*Reporter: momentry_studio development session*