In this guide we’re going to make a website that generates rap songs from images.

Our website will be able to take an image like this:

[Image: a colorful picture of a duck and a cheetah]

and turn it into a sweet rap song like this:

Check out a live version of the site here, and the source code here.

We’ll rely on a few different third-party services:

  • Cloudflare Images for handling image upload and hosting.
  • OpenAI’s GPT-4 Vision model for getting a text description of the uploaded image.
  • Uberduck’s rap song generation API for generating the songs.
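At a high level, the three services form a pipeline: the uploaded image becomes a hosted URL, the URL becomes a text description, and the description becomes lyrics and audio. Here's a rough sketch of that data flow with stubbed-out stages (all the names and return values below are illustrative placeholders, not the real implementation):

```typescript
// Illustrative sketch of the pipeline; these stubs stand in for the real
// Cloudflare Images, GPT-4 Vision, and Uberduck calls described above.
type ImageUrl = string;
type Description = string;
type Song = { lyrics: string[][]; songUrl: string };

// Stage 1: upload the image and get back a hosted URL (Cloudflare Images).
const hostImage = async (image: Blob): Promise<ImageUrl> =>
  "https://example.com/image.png";

// Stage 2: describe the hosted image in text (GPT-4 Vision).
const describeImage = async (url: ImageUrl): Promise<Description> =>
  `a scene at ${url}`;

// Stage 3: turn the description into lyrics and audio (Uberduck).
const makeRap = async (desc: Description): Promise<Song> => ({
  lyrics: [[`Verse about: ${desc}`]],
  songUrl: "https://example.com/song.mp3",
});

// The whole app is just these three stages composed in order.
async function imageToRap(image: Blob): Promise<Song> {
  const url = await hostImage(image);
  const description = await describeImage(url);
  return makeRap(description);
}
```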

First, we create a new Next.js project:

npx create-next-app@latest image-to-rap

In the sample repo, I used TypeScript and the app router, which gives us access to the fancy new React Server Components functionality.
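The app router also unlocks React Server Actions, which will power our backend later on. If you haven't used them before: a server action is just an async function in a module marked with the `"use server"` directive, which React can invoke directly from a form submission. Here's a minimal illustrative sketch (not part of this app; the function name and state shape are made up):

```typescript
// actions.ts (hypothetical example file)
"use server";

// A minimal server action: an async function that receives the previous
// form state plus the submitted FormData, and returns the next state.
export async function echoAction(
  prevState: { message: string },
  formData: FormData
) {
  return { message: `You submitted: ${formData.get("name")}` };
}
```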

Next, create a .env file that contains credentials for interacting with the Cloudflare Images API, the GPT-4 Vision API, and the Uberduck API. It should look like this (make sure you replace the * characters with the actual credentials):
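The exact variable names depend on what your config code reads; as a guess at the shape, it might look something like:

```
CLOUDFLARE_ACCOUNT_ID=********
CLOUDFLARE_IMAGES_API_TOKEN=********
OPENAI_API_KEY=********
UBERDUCK_API_KEY=********
UBERDUCK_API_SECRET=********
```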


Now you can implement the frontend and backend of the site.

Our frontend is pretty simple: a web form with an image upload input and a select menu for setting the tone of the AI-generated lyrics. You can find the full component here on GitHub.

"use client";

import { useState } from "react";
import { useFormState } from "react-dom";

export default function Form() {
  const [imageSrc, setImageSrc] = useState<string | null>(null);
  const [state, formAction] = useFormState(generateRap, initialFormState);

  // Show a local preview of the selected image before the form is submitted.
  const handleImageChange: React.ChangeEventHandler<HTMLInputElement> = (
    event
  ) => {
    const file = event.target.files?.[0];
    if (file && file.type.match("image.*")) {
      const reader = new FileReader();
      reader.onload = (e) => setImageSrc(e.target?.result as string);
      reader.readAsDataURL(file);
    }
  };

  return (
    <form style={formStyle} action={formAction}>
      <label htmlFor="subjectImage">Input Image</label>
      <input
        type="file"
        id="subjectImage"
        name="image"
        accept="image/*"
        style={inputStyle}
        onChange={handleImageChange}
      />
      {imageSrc && <img src={imageSrc} alt="Preview" style={imageStyle} />}
      <label htmlFor="tone">Lyrics Tone</label>
      <select id="tone" name="tone" style={inputStyle}>
        <option value="happy">Happy</option>
        <option value="sad">Sad</option>
        <option value="angry">Angry</option>
        <option value="loving">Loving</option>
        <option value="sarcastic">Sarcastic</option>
      </select>
      <SubmitButton />
      <div className="grid grid-cols-2">
        {state.songUrl && <audio controls src={state.songUrl} />}
        <div>
          {state.lyrics[0].map((line: string, idx: number) => (
            <div key={idx}>{line}</div>
          ))}
        </div>
      </div>
    </form>
  );
}
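The component references an `initialFormState` object (passed to `useFormState`) and a `SubmitButton` helper, which typically uses `useFormStatus` to disable itself while the form is submitting; both live elsewhere in the repo. A plausible shape for the initial state, mirroring what `generateRap` returns:

```typescript
// Initial state for useFormState. The field names and types mirror the
// object returned by the generateRap server action.
const initialFormState = {
  message: "",
  lyrics: [[]] as string[][],
  songUrl: "",
};
```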

Note that the frontend uses a React Server Action in the form. This server action is our entire backend. Here’s what the code looks like (you can find the full file on GitHub here):

"use server";

export async function generateRap(
  currentState: { message: string; lyrics: string[][]; songUrl: string },
  formData: FormData
) {
  const image = formData.get("image");
  const tone = formData.get("tone");

  // Upload the image to Cloudflare Images.
  const cffd = new FormData();
  cffd.append("file", image as Blob);
  const imageResponse = await fetch(
    // cfAccountId and cfImagesAPIToken are loaded from the .env file.
    `https://api.cloudflare.com/client/v4/accounts/${cfAccountId}/images/v1`,
    {
      method: "POST",
      headers: new Headers({
        Authorization: `Bearer ${cfImagesAPIToken}`,
      }),
      body: cffd,
    }
  );
  const imageUrl = (await imageResponse.json()).result.variants[0];

  // Get a description of the image from the OpenAI API.
  // `openai` is an OpenAI SDK client created with the key from .env.
  const visionCompletion = await openai.chat.completions.create({
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: `What's in this image? Describe it in a ${tone} tone.`,
          },
          {
            type: "image_url",
            image_url: {
              url: imageUrl,
              detail: "low",
            },
          },
        ],
      },
    ],
    model: "gpt-4-vision-preview",
    max_tokens: 300,
  });
  const imageDescription = visionCompletion.choices[0].message.content;

  // Use the description of the image as input to the Uberduck rap
  // generation API, which authenticates with HTTP Basic auth.
  const udHeaders = new Headers({
    Authorization: `Basic ${Buffer.from(
      `${uberduckAPIKey}:${uberduckAPISecret}`
    ).toString("base64")}`,
    "Content-Type": "application/json",
  });
  const lyricsResponse = await fetch(`${uberduckAPI}/tts/lyrics`, {
    method: "POST",
    headers: udHeaders,
    body: JSON.stringify({
      backing_track: uberduckBackingTrackUUID,
      subject: `Write a rap about an image. Make the tone of the rap ${tone}. Here's what the image shows: ${imageDescription}`,
    }),
  });
  const lyrics = (await lyricsResponse.json()).lyrics;

  // Generate the song audio from the lyrics.
  const rapResponse = await fetch(`${uberduckAPI}/tts/freestyle`, {
    method: "POST",
    headers: udHeaders,
    body: JSON.stringify({
      backing_track: uberduckBackingTrackUUID,
      voicemodel_uuid: uberduckVoiceUUID,
      lyrics: lyrics,
    }),
  });
  const rapData = await rapResponse.json();

  return {
    message: imageDescription as string,
    lyrics: lyrics as string[][],
    songUrl: rapData.mix_url as string,
  };
}

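One detail worth calling out: Uberduck uses HTTP Basic auth, so the `Authorization` header in the action is just the string `Basic ` followed by the base64 encoding of the `key:secret` pair. A standalone sketch (the key and secret below are placeholders, not real credentials):

```typescript
// Build a Basic auth header value from an API key/secret pair.
// These values are placeholders; real credentials come from .env.
const key = "mykey";
const secret = "mysecret";
const authHeader = `Basic ${Buffer.from(`${key}:${secret}`).toString("base64")}`;
console.log(authHeader); // "Basic bXlrZXk6bXlzZWNyZXQ="
```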
And that’s it! In just a few steps, we’ve delivered a multimodal AI application that can generate a rap song from an image input.