SigLIP SoViT-400M/14 Vision Encoder 384px (GGUF)

GGUF conversion of google/siglip-so400m-patch14-384 for use with CrispEmbed.

  • Architecture: SigLIP SoViT-400M/14 vision encoder (shape-optimized)
  • Parameters: 428M
  • Output: 1152-dimensional L2-normalized embeddings
  • Input: 384x384 RGB image with SigLIP normalization (mean=0.5, std=0.5)
  • Patch size: 14
  • File size: ~1.6 GB
  • Source: google/siglip-so400m-patch14-384
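
The input normalization listed above (mean=0.5, std=0.5) can be illustrated with a minimal sketch. It assumes 8-bit channel values are first rescaled to [0, 1] and then normalized, which maps them to [-1, 1]. The function names are hypothetical and are not part of CrispEmbed, which presumably applies this step internally.

```python
def siglip_normalize(pixel, mean=0.5, std=0.5):
    """Map one 8-bit channel value (0..255) to the normalized range.

    Assumption: values are rescaled by 1/255 before mean/std normalization.
    """
    return (pixel / 255.0 - mean) / std


def normalize_image(rgb_rows):
    """Normalize an image given as rows of (r, g, b) tuples in 0..255."""
    return [
        [tuple(siglip_normalize(channel) for channel in px) for px in row]
        for row in rgb_rows
    ]


# A 1x1 "image": black, mid-gray, and white channel values.
tiny = [[(0, 128, 255)]]
normalized = normalize_image(tiny)
print(normalized)
```

With mean and std both 0.5, black maps to -1.0 and white to 1.0, so every channel lands in a symmetric range around zero.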

Usage

# Embed a single image
crispembed -m siglip-so400m-patch14-384 --image photo.jpg

# Batch processing
crispembed -m siglip-so400m-patch14-384 --image-dir ./photos/ --output embeddings.bin

About SigLIP SoViT-400M

SoViT-400M is the shape-optimized ViT backbone used by this SigLIP variant: its width, depth, and MLP size are chosen to balance compute for better efficiency. This conversion uses 384px input with patch size 14 and produces 1152-dimensional image embeddings.
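
As a rough illustration of what 384px input and patch size 14 imply, the sketch below computes the patch grid. It assumes non-overlapping patches (stride equal to patch size) with no padding; `patch_grid` is a hypothetical helper, not a CrispEmbed function.

```python
def patch_grid(image_size=384, patch_size=14):
    """Return (patches per side, total patches) for a square image.

    Assumption: stride equals patch size and no padding is applied,
    so pixels past the last full patch are not covered.
    """
    per_side = (image_size - patch_size) // patch_size + 1
    return per_side, per_side * per_side


per_side, total = patch_grid()
covered = per_side * 14  # pixels per side actually covered by patches
print(per_side, total, covered)
```

Since 384 is not a multiple of 14, the grid under these assumptions is 27 x 27 (729 patches) covering 378 of the 384 pixels per side.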

Notes

  • All output embeddings are L2-normalized.
  • This is a GGUF conversion; weights are numerically equivalent to the original HuggingFace model.
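
Because the embeddings are L2-normalized, the cosine similarity between two images reduces to a plain dot product. A minimal sketch with short stand-in vectors (real embeddings have 1152 dimensions; the helper names are hypothetical):

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (what the encoder output already is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Stand-in vectors for illustration only, not real model output.
emb_a = l2_normalize([3.0, 4.0])
emb_b = l2_normalize([4.0, 3.0])
similarity = dot(emb_a, emb_b)
print(similarity)
```

No extra normalization step is needed before comparing outputs, which keeps nearest-neighbor search to a matrix-vector product.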