tldr: do a standard img2img workflow where you lay out a skeleton or skeleton or low-res version, and then turn it into the final high-quality photorealistic version, instead of trying to zeroshot it purely from a text prompt.