In this guide we’re going to make a website that generates rap songs from images.

Our website will be able to take an image like this:

[Image: a colorful picture of a duck and a cheetah]

and turn it into a sweet rap song like this:

Check out a live version of the site here, and the source code here.

We’ll rely on a few different third-party services:

  • Cloudflare Images for handling image upload and hosting.
  • OpenAI’s GPT-4 Vision model for getting a text description of the uploaded image.
  • Uberduck’s rap song generation API for generating the songs.
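At a high level, the three services form a pipeline: the uploaded image becomes a hosted URL, the URL becomes a text description, and the description becomes lyrics and audio. Here's a rough sketch of that data flow with stubbed-out stages (all the names and return values below are illustrative placeholders, not the real implementation):

```typescript
// Illustrative sketch of the pipeline; these stubs stand in for the real
// Cloudflare Images, GPT-4 Vision, and Uberduck calls described above.
type ImageUrl = string;
type Description = string;
type Song = { lyrics: string[][]; songUrl: string };

// Stage 1: upload the image and get back a hosted URL (Cloudflare Images).
const hostImage = async (image: Blob): Promise<ImageUrl> =>
  "https://example.com/image.png";

// Stage 2: describe the hosted image in text (GPT-4 Vision).
const describeImage = async (url: ImageUrl): Promise<Description> =>
  `a scene at ${url}`;

// Stage 3: turn the description into lyrics and audio (Uberduck).
const makeRap = async (desc: Description): Promise<Song> => ({
  lyrics: [[`Verse about: ${desc}`]],
  songUrl: "https://example.com/song.mp3",
});

// The whole app is just these three stages composed in order.
async function imageToRap(image: Blob): Promise<Song> {
  const url = await hostImage(image);
  const description = await describeImage(url);
  return makeRap(description);
}
```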

First, we create a new Next.js project:

npx create-next-app@latest image-to-rap

In the sample repo, I used TypeScript and the app router, which gives us access to the fancy new React Server Components functionality.
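The app router also unlocks React Server Actions, which will power our backend later on. If you haven't used them before: a server action is just an async function in a module marked with the `"use server"` directive, which React can invoke directly from a form submission. Here's a minimal illustrative sketch (not part of this app; the function name and state shape are made up):

```typescript
// actions.ts (hypothetical example file)
"use server";

// A minimal server action: an async function that receives the previous
// form state plus the submitted FormData, and returns the next state.
export async function echoAction(
  prevState: { message: string },
  formData: FormData
) {
  return { message: `You submitted: ${formData.get("name")}` };
}
```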

Next, create a .env file that contains credentials for interacting with the Cloudflare Images API, the GPT-4 Vision API, and the Uberduck API. It should look like this (make sure you replace the * characters with the actual credentials):
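The exact variable names depend on what your config code reads; as a guess at the shape, it might look something like:

```
CLOUDFLARE_ACCOUNT_ID=********
CLOUDFLARE_IMAGES_API_TOKEN=********
OPENAI_API_KEY=********
UBERDUCK_API_KEY=********
UBERDUCK_API_SECRET=********
```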


Now you can implement the frontend and backend of the site.

Our frontend is pretty simple: a web form with an image upload input and a select menu for setting the tone of the AI-generated lyrics. You can find the full component here on GitHub.

"use client";

import { useState } from "react";
import { useFormState } from "react-dom";

export default function Form() {
  const [imageSrc, setImageSrc] = useState<string | null>(null);
  const [state, formAction] = useFormState(generateRap, initialFormState);

  // Show a local preview of the selected image before the form is submitted.
  const handleImageChange: React.ChangeEventHandler<HTMLInputElement> = (
    event
  ) => {
    const file = event.target.files?.[0];
    if (file && file.type.match("image.*")) {
      const reader = new FileReader();
      reader.onload = (e) => setImageSrc(e.target?.result as string);
      reader.readAsDataURL(file);
    }
  };

  return (
    <form style={formStyle} action={formAction}>
      <label htmlFor="subjectImage">Input Image</label>
      <input
        type="file"
        id="subjectImage"
        name="image"
        accept="image/*"
        style={inputStyle}
        onChange={handleImageChange}
      />
      {imageSrc && <img src={imageSrc} alt="Preview" style={imageStyle} />}
      <label htmlFor="tone">Lyrics Tone</label>
      <select id="tone" name="tone" style={inputStyle}>
        <option value="happy">Happy</option>
        <option value="sad">Sad</option>
        <option value="angry">Angry</option>
        <option value="loving">Loving</option>
        <option value="sarcastic">Sarcastic</option>
      </select>
      <SubmitButton />
      <div className="grid grid-cols-2">
        {state.songUrl && <audio controls src={state.songUrl} />}
        <div>
          {state.lyrics[0].map((line: string, idx: number) => (
            <div key={idx}>{line}</div>
          ))}
        </div>
      </div>
    </form>
  );
}
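The component references an `initialFormState` object (passed to `useFormState`) and a `SubmitButton` helper, which typically uses `useFormStatus` to disable itself while the form is submitting; both live elsewhere in the repo. A plausible shape for the initial state, mirroring what `generateRap` returns:

```typescript
// Initial state for useFormState. The field names and types mirror the
// object returned by the generateRap server action.
const initialFormState = {
  message: "",
  lyrics: [[]] as string[][],
  songUrl: "",
};
```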

Note that the frontend uses a React Server Action in the form. This server action is our entire backend. Here’s what the code looks like (you can find the full file on GitHub here):

"use server";

export async function generateRap(
  currentState: { message: string; lyrics: string[][]; songUrl: string },
  formData: FormData
) {
  const image = formData.get("image");
  const tone = formData.get("tone");

  // Upload the image to Cloudflare Images.
  const cffd = new FormData();
  cffd.append("file", image as Blob);
  const imageResponse = await fetch(
    // cfAccountId and cfImagesAPIToken are loaded from the .env file.
    `https://api.cloudflare.com/client/v4/accounts/${cfAccountId}/images/v1`,
    {
      method: "POST",
      headers: new Headers({
        Authorization: `Bearer ${cfImagesAPIToken}`,
      }),
      body: cffd,
    }
  );
  const imageUrl = (await imageResponse.json()).result.variants[0];

  // Get a description of the image from the OpenAI API.
  // `openai` is an OpenAI SDK client created with the key from .env.
  const visionCompletion = await openai.chat.completions.create({
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: `What's in this image? Describe it in a ${tone} tone.`,
          },
          {
            type: "image_url",
            image_url: {
              url: imageUrl,
              detail: "low",
            },
          },
        ],
      },
    ],
    model: "gpt-4-vision-preview",
    max_tokens: 300,
  });
  const imageDescription = visionCompletion.choices[0].message.content;

  // Use the description of the image as input to the Uberduck rap
  // generation API, which authenticates with HTTP Basic auth.
  const udHeaders = new Headers({
    Authorization: `Basic ${Buffer.from(
      `${uberduckAPIKey}:${uberduckAPISecret}`
    ).toString("base64")}`,
    "Content-Type": "application/json",
  });
  const lyricsResponse = await fetch(`${uberduckAPI}/tts/lyrics`, {
    method: "POST",
    headers: udHeaders,
    body: JSON.stringify({
      backing_track: uberduckBackingTrackUUID,
      subject: `Write a rap about an image. Make the tone of the rap ${tone}. Here's what the image shows: ${imageDescription}`,
    }),
  });
  const lyrics = (await lyricsResponse.json()).lyrics;

  // Generate the song audio from the lyrics.
  const rapResponse = await fetch(`${uberduckAPI}/tts/freestyle`, {
    method: "POST",
    headers: udHeaders,
    body: JSON.stringify({
      backing_track: uberduckBackingTrackUUID,
      voicemodel_uuid: uberduckVoiceUUID,
      lyrics: lyrics,
    }),
  });
  const rapData = await rapResponse.json();

  return {
    message: imageDescription as string,
    lyrics: lyrics as string[][],
    songUrl: rapData.mix_url as string,
  };
}

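One detail worth calling out: Uberduck uses HTTP Basic auth, so the `Authorization` header in the action is just the string `Basic ` followed by the base64 encoding of the `key:secret` pair. A standalone sketch (the key and secret below are placeholders, not real credentials):

```typescript
// Build a Basic auth header value from an API key/secret pair.
// These values are placeholders; real credentials come from .env.
const key = "mykey";
const secret = "mysecret";
const authHeader = `Basic ${Buffer.from(`${key}:${secret}`).toString("base64")}`;
console.log(authHeader); // "Basic bXlrZXk6bXlzZWNyZXQ="
```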
And that’s it! In just a few steps, we’ve delivered a multimodal AI application that can generate a rap song from an image input.