Clive S. — Cloud Infrastructure Engineer

DrizzyGPT-6: Building a Drake oracle on Cloudflare’s free tier

People treat ChatGPT like a therapist. They type whatever is on their mind and an LLM reflects something soothing back. I wanted the same shape, except instead of generic LLM output, the user gets a real Drake lyric pulled from his actual voice. Fifteen studio projects, hundreds of songs, a man who has publicly cried on six different albums. The corpus is the punchline. The engineering challenge is retrieval: given an arbitrary conversational question, find the bar that lands.

The pitch: you type “she stopped texting me back,” the system returns four lines from HYFR about an ex texting Drake asking for closure. Played straight. Entertainment only. Lyrics © their respective rights holders.

DrizzyGPT-6 live app, empty state
The live app. Drake’s catalog, one question away.

End-to-end RAG over Drake’s catalog. This started as a homelab side project and pivoted to serverless when GPU passthrough on Proxmox got stupid (currently running the newest build, nvidia drivers are historically 6-8 months behind, any workaround WILL CRASH. Been there, done that). The point was to see how far you can go with few resources. Final bill came in under $9/mo: Workers Paid plan ($5), Vectorize storage for 44k vectors at 768d (~$3), and everything else inside free tiers (Workers AI embeddings, Groq Llama 3.3 70B for ranking, Groq Llama 3.1 8B for query classification, D1 SQLite, Pages bandwidth). Annotation of 44,161 chunks was done with Gemini 2.5 Flash on its free tier plus Claude Sonnet 4.6 via the Claude Max subscription (sunk cost, $0 marginal). There’s a false narrative that AI is incredibly expensive. The numbers at the bottom of this post say otherwise.

The Architecture

Eight steps in sequence, all on Cloudflare’s edge plus Groq for inference. No frameworks beyond Hono for routing. No agents, no chains, no LangChain.

drizzy-ai architecture
Request flow: classify user → HyDE expand → embed → hybrid retrieval (dense + BM25 via RRF) → drake_section filter → polarity filter → diversity dedup → LLM rank → LLM trim.
LayerToolWhat it does
ComputeCloudflare WorkersTypeScript, Hono router, MCP endpoint, ~30ms cold start
Dense retrievalCloudflare Vectorize44,161 vectors at 768d, cosine similarity
Sparse retrievalCloudflare D1 + FTS5BM25 lexical search over the same chunks
Metadata filterCloudflare D1 chunk_metadrake_section + role/sentiment/POV/themes per chunk
EmbeddingsWorkers AI bge-base-en-v1.5Query and chunks to 768d vectors
LLM rankingGroq Llama 3.3 70BHyDE expand, rank, trim (3 calls per query)
Query classifierGroq Llama 3.1 8B InstantTags user input into the same schema as chunks
FrontendCloudflare PagesSingle index.html, vanilla JS, no build step
AnnotationGemini 2.5 Flash + Sonnet 4.644,161 chunks labeled with role/sentiment/POV/themes

Scraping The Catalog

Genius has an API but it returns metadata, not lyrics. To get the actual text you scrape the song’s page. The Python lyricsgenius library handles this politely.

First attempt: I paginated artist_songs and stopped when a page returned fewer items than the page size, assuming end of data. Wrong assumption.

# Wrong: stops at 99 because page 2 returns 49 of 50
if len(batch) < per_page:
    break

# Right: trust the API's next_page token
if resp.get("next_page") is None:
    break
page = resp["next_page"]

Drake’s full Genius catalog goes 43 pages deep. First version found 99 songs. Fixed version found 1,889. Same code path, one boolean flip.

1,702 of those scraped cleanly (187 stub pages and rate-limit failures dropped). But not all 1,702 are Drake songs. That count includes every track he appears on as a guest. Properly classified: 806 are Drake-primary, the other 896 are features on someone else’s track. That distinction matters and I had to fix it in Step 6.

scrape results
1,889 candidates enumerated, 1,702 scraped, 187 skipped.

Chunking, and the oxtail problem

A “chunk” is the unit of retrieval. Get the size wrong and the whole pipeline suffers. The first version used a fixed 6-line sliding window, it worked great for emotional questions. It failed miserably on punchline questions where the bullseye is just 2-3 lines. I asked the live site “what should I eat today?” and got:

Fucking crazy
And then, I go on that wack-ass vacation
The chef is feeding us the same meal every day like I was, like, in jail still or something
I’m like, “Why are we eating jerk chicken and oxtail every single day, can we switch it up?”
No, the oxtail was fire though, but can we switch it up?
I can’t eat that shit every day, what?

Drake, Calling For You · 2023

Right song, right section, but the chunk diluted the punchline. The two-line bullseye (“can we switch it up? I can’t eat that shit every day“) would have been the perfect response.

Fix: multi-granularity chunking. Re-chunk at three window sizes simultaneously (3, 5, 7 lines), stride 2, dedupe globally. Retrieval now picks the right size on cosine similarity.

chunk count by version
5,124 → 19,334 → 83,868 → 44,161 across four chunker iterations. The final cut was after the drake-only filter in Step 6.

Embedding + Dense Retrieval

Each chunk goes through @cf/baai/bge-base-en-v1.5 (BAAI’s General Embeddings, the English RAG workhorse). 768-d vectors, normalized for cosine. Free Workers AI handles ~10k neurons/day, plenty for live queries (~1 neuron per question). The one-time 44k-chunk burst is done via REST.

At query time: env.LYRICS.query(vector, { topK: 25 }) returns the 25 nearest chunks in ~30ms.

BM25 Sparse Retrieval + RRF fusion

Dense vector search smears rare terms and proper nouns. “Toronto” gets embedded as a generic “place” vector. “Oxtail” barely registers. Add a sparse layer.

Cloudflare D1 supports FTS5, SQLite’s full-text-search virtual table with built-in BM25 ranking. tokenize = 'porter unicode61' means “running” matches “ran” via stemming. At query time, sparse runs in parallel with dense:

const [denseMatches, sparseMatches] = await Promise.all([
  env.LYRICS.query(vector, { topK: 25 }),
  env.DB.prepare(
    `SELECT id, text, song, rank FROM chunks
       WHERE chunks MATCH ? ORDER BY rank LIMIT 25`
  ).bind(escapeFts5(question)).all(),
]);

Fused via Reciprocal Rank Fusion, a dead-simple algorithm:

function rrf(dense, sparse, k = 60) {
  const scores = new Map();
  dense.forEach((it, i) => scores.set(it.id, { item: it, score: 1 / (k + i + 1) }));
  sparse.forEach((it, i) => {
    const prev = scores.get(it.id);
    const contrib = 1 / (k + i + 1);
    if (prev) prev.score += contrib;
    else scores.set(it.id, { item: it, score: contrib });
  });
  return [...scores.values()].sort((a, b) => b.score - a.score).map(x => x.item);
}

No weights, no calibration, just k=60 from the original paper. For “i miss toronto” the sparse retriever surfaces lines containing “Toronto” verbatim that dense had buried; RRF combines both and city bars float to the top.

HyDE, Crossing The Style Gap

Vector search assumes query and corpus live in the same stylistic neighborhood. That holds for technical questions over technical docs. It breaks when you’re searching Take Care for “why does she keep coming back.”

HyDE (Hypothetical Document Embeddings, Gao et al. 2022) is an elegant fix. Don’t embed the question. Embed a hypothetical answer to it, written in the documents’ style.

HyDE flow
HyDE: imagine a Drake response, embed THAT, find real chunks in the same neighborhood.

Heres the kicker… the mind-bending part. The LLM’s hypothetical response might be totally hallucinated fake Drake. That’s absolutely fine, you never show it to the user. You only use its embedding to find real chunks that are stylistically nearby.

drake_section filter: fixing the corpus purity problem

This was the biggest single quality lever. Initially I scraped all 1,702 tracks Drake appears on, but when he’s a feature on someone else’s song, my scraper pulled the WHOLE song lyrics: Travis Scott’s verses on SICKO MODE, Lil Wayne’s bars on Tunechi’s Room, 21 Savage’s hook on Issa. Then chunked and indexed all of it tagged as “Drake.”

The result was that the live system was returning 21 Savage lyrics in response to user heartbreak queries because Drake happens to be on the same track. Brutal.

Fix: parse the section headers Genius preserves on every page ([Verse 1: Drake], [Chorus: Travis Scott], [Hook: Drake & PartyNextDoor]) with a regex, and for feature tracks keep only sections attributed to Drake. Drake-primary tracks stay whole.

SEC_RE = re.compile(r"^\s*\[([^\]]+)\]\s*$", re.MULTILINE)
DRAKE_RE = re.compile(r"\b(drake|aubrey|champagne papi)\b", re.IGNORECASE)

def drake_only_text(lyrics):
    matches = list(SEC_RE.finditer(lyrics))
    if not matches: return ""
    parts = []
    for i, m in enumerate(matches):
        header = m.group(1)
        artists = header.split(":", 1)[1].strip() if ":" in header else ""
        if not DRAKE_RE.search(artists): continue
        start = m.end()
        end = matches[i+1].start() if i+1 < len(matches) else len(lyrics)
        body = lyrics[start:end].strip()
        if body: parts.append(body)
    return "\n\n".join(parts)

I also whitelisted just 25 frequent collaborators (Lil Wayne, PartyNextDoor, Future, Kanye, Rick Ross, Nicki Minaj, Travis Scott, Rihanna, Kendrick, French Montana, Chris Brown, Beyoncé, Big Sean, DJ Khaled, 21 Savage, Eminem, J. Cole, A$AP Rocky, 2 Chainz, Meek Mill, Wiz Khalifa, J Hus, Sexyy Red, Lil Baby, Lil Yachty). Other one-off features got dropped entirely.

Result: 1,702 → 1,015 tracks, 83,868 → 44,161 chunks. Each surviving chunk gets drake_section=1 in D1’s chunk_meta table. Worker filters at retrieval time.

per-chunk annotation + polarity filter

Even with corpus purity fixed, one failure mode persisted: polarity inversion. Surface keywords match but Drake is in the opposite emotional role from the user.

Each of the 44,161 chunks gets labeled with:

  • r (speaker role): a actor / s subject / o observational
  • s (sentiment): heartbreak / triumph / loneliness / anger / regret / anxious / confident / boasting / neutral
  • p (POV): 1st / 2nd / 3rd / mixed
  • to (addressed to): lover / self / rival / fan / team / mother / none
  • t (themes, 1-2): love, money, fame, family, city, vices, etc.

At query time a fast Llama 3.1 8B Instant call classifies the user’s message into the same schema (separate Groq quota from the 70B used elsewhere). The worker JOINs chunk_meta and filters by polarity compatibility:

const userMeta = await classifyUserQuery(env.GROQ_API_KEY, question);
// userMeta = { r: 's', s: 'H', t: ['love', 'ghosting'] }

const okRoles = COMPAT_ROLE[userMeta.r];        // user subject → ['s', 'o']
const okSentiments = SENTIMENT_GROUPS[userMeta.s]; // H ~ [H, L, R, X]

const drakeOnly = fused.filter(c => metaMap.get(c.id)?.drake_section === 1);
const eligible = drakeOnly.filter(c => {
  const m = metaMap.get(c.id);
  return okRoles.includes(m.r) && okSentiments.includes(m.s);
});

Cascading fallback: if polarity filter starves the pool, fall back to drake-only. If THAT starves, fall back to unfiltered. Never run dry.

The polarity fix, before and after

Test query: “she stopped texting me back”. User is the ignored party (role = subject, sentiment = heartbreak).

BEFORE (pre-filter, Llama self-judged): Issa (21 Savage feat. Drake)
“Girl, stop textin’ me so much, issa annoying”

Score 1. Polarity inverted, the lyric is the speaker telling someone ELSE to stop texting them. Surface keyword match overruled the emotional reality.

AFTER (drake_section + polarity filter): HYFR (2012)
“And now she’s texting me asking for closure, damn / She say this shit gon’ catch up to me, I keep tissue paper”

Score 4. Drake reflecting on an ex’s behavior, same emotional register as user.

The Issa lyric is now structurally impossible to surface. Its drake_section is 0 (it’s a 21 Savage track, not Drake’s voice), so the filter drops it before the ranker ever sees it.

diversity dedup + LLM rank + LLM trim

Final pipeline tail. Diversity dedup caps at 2 chunks per song in the ranker pool. Without this, a single song dominates and the ranker picks from a homogeneous set. LLM ranker (Llama 70B) picks the bullseye with a sharper specificity-first prompt. Then a final LLM trim extracts the punchiest 2-4 contiguous lines, with a substring check to guard against hallucination.

The frontend

This is where I got to have some fun. Whats more fitting than a Toronto skyline, ChatGPT-style chat UI with a sidebar for history (localStorage) and 👍 👎 buttons on every response (POSTs to /api/feedback, stored in D1 alongside traces).

DrizzyGPT-6 returning a Drake lyric for a real question
Ask it something real, get the exact bar back with the song and a source link.

Try it live: drizzygpt6.daytona-lab.com

Measuring it: the lift, honestly

50-question golden set initially, expanded to 150, scored by an LLM-as-judge. The first iteration used Groq Llama 3.3 70B as the judge, the same model as the ranker, which is a methodological mistake (self-preference bias). Llama gave 41 of 50 perfect fives. Headline number: 4.78. Suspicious.

Swapped to Claude Opus 4.7 as judge. Different model family, no self-correlation. On the same 50 questions: 16 fives (not 41). Mean dropped to 3.37. That’s the honest baseline. The score going DOWN after fixing the methodology is the better engineering story than going up.

baseline eval distribution
Original Llama-self-judged distribution. 41 of 50 perfect fives. The shape is the bias.

Expanded to 150 questions, ran through the v6 pipeline (drake_section + polarity filter), scored as Claude Opus 4.7 again. Final numbers:

Metric              v5 baseline    v6 (current)    Δ
─────────────────────────────────────────────────────────
mean score          3.37 / 5       3.67 / 5        +0.30  (+9%)
precision @1 (≥4)   52%            55%             +3pp
bullseyes (5s)      40             51              +11
disasters (1s)      16             8               −8  (halved)
distribution        16/31/25/38/40  8/18/41/32/51

per-category:
  confidence/swagger      4.33 → 4.67   (Drake's home register)
  loneliness              3.95 → 4.10
  fame-cost               3.73 → 3.85
  romance/heartbreak      2.76 → 3.40   (+0.64, polarity fix biggest win)
  impostor/anxiety        2.80 → 3.25   (+0.45)
  adversarial: off-topic  1.75 → 1.50   (still bad by design, see below)

The biggest visible win: disasters (score-1 responses) cut in half. That’s the structural filter doing its job. Wrong-artist chunks and wrong-polarity chunks can no longer reach the ranker, so the worst failures (returning a 21 Savage lyric, or a polarity-inverted Drake bar) are now structurally impossible.

Where it still misses

The remaining 8 disasters cluster in:

  • Off-topic adversarial (4 of 8). The product is hard-coded to always return a lyric. “Best italian restaurant in toronto” returns an OVO Sound Radio tracklist. Fix: add an intent classifier that refuses non-emotional inputs.
  • Homonym confusion (1). “I miss my dog” returns “I show love to my dogs.” Drake’s “dogs” = crew, not pet. Needs query intent disambiguation.
  • Edge polarity cases (3). “She ran the streets behind my back” → One Dance “Streets not safe / But I never run away”. Surface keyword match, role-inverted. The polarity filter has imperfect coverage on un-annotated chunks (they pass through), so a few slipped.

Observability

Every request writes a trace row to D1: question, HyDE text, user-classification result, top candidates, filter status, chosen chunk, per-step latencies, final trimmed lyric. The frontend’s 👍 👎 buttons POST { traceId, rating } to /api/feedback, joined back to the trace. Downvoted queries become next-round regression tests.

What it costs

# monthly bill at 1,000 user questions/day

  workers paid (base)          $5.00
  vectorize storage            $1.62   # 34M dims above 5M free
  vectorize queries            $0.00   # under 30M/mo free
  workers AI embeds            $0.00   # 10k/day free, ~1k/day used
  groq llama 70b (ranker)      $0.00   # 14k/day free
  groq llama 8b (classifier)   $0.00   # separate 14k/day free
  d1 storage + queries         $0.00   # under 5GB / 5M reads free
  pages bandwidth              $0.00   # unlimited free
                               -----
  total                        $6.62 / mo

One-time annotation cost: Gemini 2.5 Flash on free tier handled most of 83k chunks (1,500 RPD daily quota), Claude Sonnet 4.6 via Max subscription handled the rest. Total marginal cash: $0.

At 10,000 users a day, around $12-15/mo. At 100,000 (genuine virality), around $40-80/mo. Free-tier composition holds well into real traffic.


Try it / read the code

Built with Cloudflare Workers, Vectorize, Workers AI, D1, Groq, Gemini Flash, Claude Sonnet, and a Python venv. Drake unaffiliated. For entertainment only.