Your task spec is at /tmp/legit-embedding-clip-spec.md — read it in full and execute it. Goal: make the worker's IMAGE embedding model selectable via the task's model_name (mirror the TEXT path), and add clip-vit-b-32 (OpenAI CLIP ViT-B/32, 512-dim,
codex 294 events 1 segments runpod-container-runtime
segment 1 of 1
Add selectable CLIP image model alongside default EVA02
Read the task spec and explored the repo, including models.py, preparation.py, embedding.py, text_models.py, the config, tests, and dependencies. Parameterized image model selection in models.py with a cache keyed by model_name, and added CLIP model loading via open_clip with 224px, CLIP-normalized transforms. Grouped image preprocessing by model in preparation.py so that EVA02 and CLIP batches use separate transforms and tensor shapes. Updated embedding.py to handle grouped payloads while staying backward compatible with single-path payloads. Added the open-clip-torch dependency to requirements.txt and setup.py and installed it, and added a clip-vit-b-32 entry to models-config.json. Added unit tests for model selection, transform dispatch, batch grouping, and embedding dimensions. Committed locally as 225d689 on the runpod-container-runtime branch.
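The grouping and backward-compatibility steps described above can be sketched roughly as follows. This is not the committed code: the payload shapes (an `items` list with an optional `model_name`, or a `groups` mapping), the function names, and the `"eva02"` default key are all assumptions made for illustration.

```python
from collections import defaultdict


def group_by_model(items, resolve):
    """Bucket items by resolved model name so each bucket can be stacked
    into its own tensor (different models may need different input shapes).

    `resolve` is a caller-supplied function mapping a raw model_name
    (possibly None) to a canonical name. Hypothetical helper.
    """
    groups = defaultdict(list)
    for item in items:
        groups[resolve(item.get("model_name"))].append(item)
    return dict(groups)


def normalize_payload(payload):
    """Accept both payload styles and always return {model_name: [items]}.

    Assumed shapes (not taken from the repo):
      legacy:  {"model_name": "...", "items": [...]}
      grouped: {"groups": {"model-a": [...], "model-b": [...]}}
    """
    if "groups" in payload:
        return {name: list(items) for name, items in payload["groups"].items()}
    # Legacy single-path payload: one implicit group.
    return {payload.get("model_name", "eva02"): list(payload["items"])}
```

Normalizing at the boundary means the embedding step only ever iterates over groups, so the legacy path needs no separate branch downstream.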
outcome
Local commit 225d689 on runpod-container-runtime branch with CLIP image embedding support implemented and tested.
next steps
- Rebuild and push the Docker image to GHCR via the publish workflow
key decisions
- Use open_clip.create_model_and_transforms for CLIP, keep only the visual tower, and L2-normalize its 512-dim output
- Group image preprocessing by resolved model_name in preparation.py so EVA02 and CLIP batches never share a tensor
- Keep backward compatibility in embedding.py: accept both legacy single-path payloads and new grouped payloads
- EVA02 remains the default image model; CLIP is only selected when model_name contains 'clip' or 'ViT-B-32'
- Use open-clip-torch>=2.24.0 for CLIP support
- Resolve CLIP aliases (clip-vit-b-32, ViT-B-32, etc.) to canonical name
- Cache image transforms per model name
- Group image batches by resolved model name before stacking tensors
open questions
None