In this guide we’re going to make a website that generates rap songs from images.

Our website will be able to take an image like this:

Colorful image of a duck and cheetah

and turn it into a sweet rap song.

Check out a live version of the site here, and the source code here.

We’ll rely on a few different third-party services:

  • Cloudflare Images for handling image upload and hosting.
  • OpenAI’s GPT-4 Vision model for getting a text description of the uploaded image.
  • Uberduck’s rap song generation API for generating the songs.

First, we create a new Next.js project:

npx create-next-app@latest image-to-rap

In the sample repo, I used TypeScript and the App Router, which gives us access to the fancy new React Server Component functionality.

Next, create a .env file that contains credentials for interacting with the Cloudflare Images API, the GPT-4 Vision API, and the Uberduck API. It should look like this (make sure you replace the * characters with your actual credentials):

OPENAI_API_KEY=sk-****
CLOUDFLARE_ACCOUNT_ID=****
CLOUDFLARE_ACCOUNT_HASH=****
CLOUDFLARE_IMAGES_API_TOKEN=****
UBERDUCK_API_KEY=****
UBERDUCK_API_SECRET=****
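
These credentials get read via process.env in our backend code. Here’s a minimal sketch of the module-level setup that the server action shown later relies on. The constant names are chosen to match that snippet, while the Uberduck base URL, backing track UUID, and voice UUID are placeholders you’d fill in with values from your own Uberduck account:

"use server";

import OpenAI from "openai";

// The OpenAI client can pick up OPENAI_API_KEY from the environment on its
// own; we pass it explicitly here for clarity.
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Cloudflare Images credentials.
const cfAccountId = process.env.CLOUDFLARE_ACCOUNT_ID;
const cfImagesAPIToken = process.env.CLOUDFLARE_IMAGES_API_TOKEN;

// Uberduck credentials plus the assets used for song generation.
const uberduckAPIKey = process.env.UBERDUCK_API_KEY;
const uberduckAPISecret = process.env.UBERDUCK_API_SECRET;
const uberduckAPI = "https://api.uberduck.ai"; // base URL; check Uberduck's docs
const uberduckBackingTrackUUID = "YOUR_BACKING_TRACK_UUID"; // placeholder
const uberduckVoiceUUID = "YOUR_VOICE_UUID"; // placeholder

The "use server" directive at the top marks the file’s exported async functions as Server Actions, which is what lets the form on the frontend call the backend function directly.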

Now we can implement the frontend and backend of the site.

Our frontend is pretty simple: a web form with an image upload and a select menu for choosing the tone of the AI-generated lyrics. You can find the full component here on GitHub.

"use client";

import { useState } from "react";
import { useFormState } from "react-dom";
// The server action, initial form state, SubmitButton, and inline style
// objects live elsewhere in the repo; these import paths are illustrative.
import { generateRap } from "./actions";
import { initialFormState, SubmitButton } from "./form-helpers";
import { formStyle, inputStyle, imageStyle } from "./styles";

export default function Form() {
  const [imageSrc, setImageSrc] = useState<string | null>(null);
  const [state, formAction] = useFormState(generateRap, initialFormState);

  // Show a local preview of the selected image before the form is submitted.
  const handleImageChange: React.ChangeEventHandler<HTMLInputElement> = (
    event
  ) => {
    const file = event.target.files![0];
    if (file && file.type.match("image.*")) {
      const reader = new FileReader();
      reader.onload = (e) => setImageSrc(e.target!.result as string);
      reader.readAsDataURL(file);
    }
  };

  return (
    <form style={formStyle} action={formAction}>
      <div>
        <label htmlFor="subjectImage">Input Image</label>
        <input
          id="subjectImage"
          style={inputStyle}
          type="file"
          name="image"
          accept="image/*"
          onChange={handleImageChange}
        />
        {imageSrc && <img src={imageSrc} alt="Preview" style={imageStyle} />}
      </div>
      <div>
        <label htmlFor="tone">Lyrics Tone</label>
        <select id="tone" name="tone" style={inputStyle}>
          <option value="happy">Happy</option>
          <option value="sad">Sad</option>
          <option value="angry">Angry</option>
          <option value="loving">Loving</option>
          <option value="sarcastic">Sarcastic</option>
        </select>
      </div>
      <div>
        <SubmitButton />
      </div>
      <div className="grid grid-cols-2">
        {state.songUrl && (
          <div>
            <audio controls src={state.songUrl} />
          </div>
        )}
        <div>
          {state.lyrics[0].map((line: string, idx: number) => (
            <div key={idx}>{line}</div>
          ))}
        </div>
      </div>
    </form>
  );
}
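
The component references two small helpers that aren’t shown above: initialFormState, whose shape should match what the server action returns, and SubmitButton, which disables itself while the form is submitting. Here’s a minimal sketch of what they might look like (the code in the actual repo may differ; useFormStatus is React’s hook for reading the pending state of the enclosing form):

import { useFormStatus } from "react-dom";

// Matches the shape of the object returned by the generateRap server action.
// Starting lyrics as [[]] keeps state.lyrics[0].map(...) safe before the
// first submission.
export const initialFormState = {
  message: "",
  lyrics: [[]] as string[][],
  songUrl: "",
};

export function SubmitButton() {
  const { pending } = useFormStatus();
  return (
    <button type="submit" disabled={pending}>
      {pending ? "Generating..." : "Generate Rap"}
    </button>
  );
}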

Note that the frontend uses a React Server Action in the form. This server action is our entire backend. Here’s what the code looks like (you can find the full file on GitHub here):

export async function generateRap(
  currentState: { message: string; lyrics: string[][]; songUrl: string },
  formData: FormData
) {
  const image = formData.get("image");
  const tone = formData.get("tone");

  // Upload the image to Cloudflare images.
  const cffd = new FormData();
  cffd.append("file", image as Blob);
  const imageResponse = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${cfAccountId}/images/v1`,
    {
      method: "POST",
      headers: new Headers({
        Authorization: `Bearer ${cfImagesAPIToken}`,
      }),
      body: cffd,
    }
  );
  const imageUrl = (await imageResponse.json()).result.variants[0];

  // Get the description of the image from the OpenAI API.
  const visionCompletion = await openai.chat.completions.create({
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: `What's in this image? Describe it in a ${tone} tone.`,
          },
          {
            type: "image_url",
            image_url: {
              url: imageUrl,
              detail: "low",
            },
          },
        ],
      },
    ],
    model: "gpt-4-vision-preview",
    max_tokens: 300,
  });
  const imageDescription = visionCompletion.choices[0].message.content;

  // Use the description of the image as input to the Uberduck rap generation API.
  const udHeaders = new Headers({
    Authorization: `Basic ${Buffer.from(
      `${uberduckAPIKey}:${uberduckAPISecret}`
    ).toString("base64")}`,
    "Content-Type": "application/json",
  });
  const lyricsResponse = await fetch(`${uberduckAPI}/tts/lyrics`, {
    method: "POST",
    headers: udHeaders,
    body: JSON.stringify({
      backing_track: uberduckBackingTrackUUID,
      subject: `Write a rap about an image. Make the tone of the rap ${tone}. Here's what the image shows: ${imageDescription}`,
    }),
  });
  const lyrics = (await lyricsResponse.json()).lyrics;

  const rapResponse = await fetch(`${uberduckAPI}/tts/freestyle`, {
    method: "POST",
    headers: udHeaders,
    body: JSON.stringify({
      backing_track: uberduckBackingTrackUUID,
      voicemodel_uuid: uberduckVoiceUUID,
      lyrics: lyrics,
    }),
  });
  const rapData = await rapResponse.json();

  return {
    message: imageDescription as string,
    lyrics: lyrics as string[][],
    songUrl: rapData.mix_url as string,
  };
}
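
With the form and the server action wired up, you can try everything locally with the standard Next.js dev server:

npm run dev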

And that’s it! In just a few steps, we’ve built a multimodal AI application that can generate a rap song from an input image.