RAG is very weak on relationships, to the point that they are basically nonexistent. This leads to RAG maps of sorts, but still in a prompt sense. For images, you basically want to train on the images rather than use RAG functionality. You would want an SLM (small language model) focused on your training data. But I think you will find the setup fairly difficult, which is why it is probably not worth it.
MaryXmas 0 points 3 months ago (Feb 19, 2025 18:49:22) (+0/-0)
You could use the select-and-edit feature in the DALL-E GPT and just ask it to edit the selection; that might work.
Cantaloupe 0 points 4 months ago (Feb 18, 2025 23:12:11) (+0/-0)
Essentially:
Images into CLIP or BLIP
Queries into embeddings
Retrieve based on similarity
Image-to-text model
BLIP-2, GPT-4V, PaLM-E
Descriptions into an LLM for processing
Extract features: OCR text, Tesseract?
Stable Diffusion
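To make the "queries into embeddings, retrieve based on similarity" part of that list concrete, here is a rough sketch in plain Python. It assumes you already have one embedding vector per image from something like CLIP or BLIP. The three-number vectors, the file names, and the `retrieve` helper are all made up for illustration; real embeddings would have hundreds of dimensions and would come from the model, not be typed in by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, image_index, top_k=2):
    """Return the top_k (name, score) pairs most similar to the query.

    image_index maps an image name to its embedding (hypothetical data;
    in practice these would be produced by an image encoder).
    """
    scored = [
        (cosine_similarity(query_embedding, emb), name)
        for name, emb in image_index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(name, score) for score, name in scored[:top_k]]


# Toy stand-ins for image embeddings (assumed values, not model output).
image_index = {
    "cat.png": [0.9, 0.1, 0.0],
    "dog.png": [0.1, 0.9, 0.0],
    "invoice.png": [0.0, 0.1, 0.9],
}

# Toy stand-in for the embedded text query.
query = [0.8, 0.2, 0.0]

results = retrieve(query, image_index)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The retrieved images (or their text descriptions from an image-to-text model) would then be what gets passed to the LLM in the later steps. For more than a few thousand images you would probably swap the linear scan for a vector index, but the ranking idea is the same.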