How To Build a Pastebin Clone for Free

Today, we will be building a Pastebin clone: a web service that allows users to upload and share text through links known as ‘pastes’. What follows is my journey of creating a Pastebin clone using serverless functions through Cloudflare Worker. If you are not familiar with Pastebin, I’d highly recommend giving it a try before reading on.

“Why Pastebin?” you might ask. Well, sending a block of text (or code) that is more than 50 lines long through a chat app (looking at you, IRC) isn’t exactly the best way to communicate.

TL;DR

  • Building a Pastebin clone using Cloudflare Worker and KV
  • Project requirements and limitations planning
  • Paste URL UUID generation logic with key generation service (KGS)
  • GraphQL API design and implementation
  • Live demo at paste.jerrynsh.com
  • GitHub repository

The design of this Pastebin clone would be very similar to building a TinyURL clone, except we need to store the paste content instead of the original unshortened URL.

Before we begin, this is NOT a tutorial or guide on:

  • How to tackle an actual system design interview
  • Building a commercial grade paste tool like Pastebin or GitHub Gist

Rather, this is a proof of concept (POC) of how to build a simple paste tool using serverless computing with Cloudflare Worker. To follow along with this article, check out Steps 1 to 3 of this Get Started Guide.

Let’s go!


Requirements

Let’s start by clarifying the use cases and constraints of our project.

Functional

  • Whenever a user enters a block of text (or code), our web service should generate a URL with a random key (UUID), e.g. paste.jerrynsh.com/aj7kLmN9
  • Whenever a user visits the generated URL, the user should be redirected to view the original paste content, i.e. the original block of text
  • The link to the paste should expire after 24 hours
  • The UUID should only contain alphanumeric characters (Base62)
  • The length of our UUID should be 8 characters
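
To get a rough sense of scale for the last two requirements, we can compute how many distinct 8-character Base62 keys exist. This is only a quick sketch; the constant names are illustrative and not taken from the project. The result is about 218 trillion possible keys.

```javascript
// Size of the key space for fixed-length Base62 keys.
// ALPHABET_SIZE and KEY_LENGTH are illustrative names for this sketch.
const ALPHABET_SIZE = 62; // 0-9, a-z, A-Z
const KEY_LENGTH = 8;

const keySpace = ALPHABET_SIZE ** KEY_LENGTH;

console.log(keySpace.toLocaleString("en-US"));
```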

Non-Functional

  • Low latency
  • Highly available

Budget, Capacity, & Limitations Planning

Like our previous attempt, the goal here is to host this service for free. With Cloudflare Worker’s pricing and platform limits in mind, our constraints are:

  • 100k requests/day at 1k requests/min
  • CPU runtime not exceeding 10ms

Similar to a URL shortener, our application is expected to have a high read-to-write ratio. With that in mind, we will be using Cloudflare KV (referred to as KV from here on), a low-latency key-value store, for this project.

At the time of writing, the free tier of KV comes with the following limits:

  • 100k reads/day
  • 1k writes/day
  • 1 GB of stored data (maximum key size of 512 bytes; maximum value size of 25 MiB)

How many pastes can we store

In this section, we are going to estimate how many pastes our Pastebin clone can possibly store, given the limitations above. Unlike storing a URL, storing text blocks can consume much more space (relatively speaking). Here are the assumptions that we are going to make:

  • 1 character is 1 byte (using this byte counter)
  • Assuming that, on average, a single paste (file) consists of about 200 lines of code (text) at roughly 50 characters per line, each paste would be about 10 KB
  • With 1 GB of maximum storage size, our Pastebin clone can only store up to 100,000 pastes at any one time
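
The estimate above can be reproduced in a few lines. This is a back-of-the-envelope sketch only: the 50 characters per line figure is an assumption implied by "200 lines is about 10 KB", and 1 GB is treated as a decimal gigabyte.

```javascript
// Back-of-the-envelope storage estimate using the assumptions listed above.
const BYTES_PER_CHAR = 1;
const LINES_PER_PASTE = 200;
const CHARS_PER_LINE = 50; // assumed average line length
const STORAGE_LIMIT_BYTES = 1_000_000_000; // 1 GB (decimal)

const bytesPerPaste = LINES_PER_PASTE * CHARS_PER_LINE * BYTES_PER_CHAR;
const maxPastes = Math.floor(STORAGE_LIMIT_BYTES / bytesPerPaste);

console.log(`${bytesPerPaste} bytes per paste, ${maxPastes} pastes at most`);
```

Since pastes expire after 24 hours and the free tier allows 1k writes/day, the daily write limit is likely to be reached long before the storage limit.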

Do take note that the limits are applied on a per-account basis.


Storage & Database

Cloudflare Worker KV

For this POC, we are going to use KV as our database of choice. Let’s dive a little bit deeper into what it does.

At present, CAP Theorem is often used to model distributed data stores. CAP Theorem states that a distributed system can only provide 2 of the following 3 guarantees (source):

  1. Consistency - is my data the same everywhere?
  2. Availability - is my data always accessible?
  3. Partition tolerance - is my data resilient to regional outages?

In KV’s case, Cloudflare chooses to guarantee Availability and Partition tolerance — which fits our non-functional requirement. Even though this combination screams eventual consistency, that is a tradeoff that we are fine with.

It is also worth mentioning that KV supports exceptionally high read volumes with ultra-low latency, which is perfect for our high read-to-write ratio application.

Now that we understand the tradeoffs, let’s move on!


Implementation

URL Generation Logic

The paste URL UUID generation logic is going to be very similar to a URL shortener. Here’s a quick summary of the possible approaches:

  1. Use a UUID generator to generate a UUID on demand for every new request
  2. Use the hash (MD5) of the paste content as our UUID, then use the first N characters of the hash as part of our URL
  3. Use a combination of hashing + Base62 encoding
  4. Use an auto-incremented integer as our UUID

However, we are going with another solution that is not mentioned above.

Pre-generate UUID Key

For this POC, we will pre-generate a list of UUIDs in a KV namespace using a separate worker. We shall refer to this worker as the key generation service (KGS). Whenever we want to create a new paste, we will assign a pre-generated UUID to it.

So, what are the advantages of doing things in such a way?

With this approach, we will not have to worry about key duplication or hash collisions (e.g. from approach 2 or 3) as our key generator will ensure that the keys inserted in our KV are unique.

Here, we will be using 2 KV namespaces:

  • KEY_DB: used by our KGS to store a pre-generated list of UUIDs
  • PASTE_DB: used by our main app server to store key-value pairs, where the key is the UUID and the value is the content of a paste.

To create a KV, simply run the following commands with Wrangler CLI (source).

# Production namespace:
wrangler kv:namespace create "PASTE_DB"
wrangler kv:namespace create "KEY_DB"

# This namespace is used for `wrangler dev` local testing:
wrangler kv:namespace create "PASTE_DB" --preview
wrangler kv:namespace create "KEY_DB" --preview


After creating these KV namespaces, we will need to update our wrangler.toml files to include the namespace bindings accordingly. To view your KV’s dashboard, visit https://dash.cloudflare.com/<your_cloudflare_account_id>/workers/kv/namespaces.

How to generate UUID

For KGS to generate new UUIDs, we will be using the nanoid package. In case you’re lost, you can always refer to the /kgs folder on the GitHub repository.

How does KGS know if there’s a duplicated key? Whenever KGS generates a key, it should always check whether the UUID already exists in either KEY_DB or PASTE_DB.

In addition, the UUID should be removed from KEY_DB and created in PASTE_DB when a new paste is generated. We will cover the code in the API section.

// /kgs/src/utils/keyGenerator.js
import { customAlphabet } from "nanoid";
import { ALPHABET } from "./constants";

/*
Generate a `uuid` using `nanoid` package.

Keep retrying until a `uuid` is generated that exists in neither KV namespace (`PASTE_DB` nor `KEY_DB`).

KGS guarantees that the pre-generated keys are always unique.
*/
export const generateUUIDKey = async () => {
    const nanoId = customAlphabet(ALPHABET, 8);

    let uuid = nanoId();

    // Retry if the candidate already exists in EITHER namespace.
    while (
        (await KEY_DB.get(uuid)) !== null ||
        (await PASTE_DB.get(uuid)) !== null
    ) {
        uuid = nanoId();
    }

    return uuid;
};
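
To experiment with the retry logic outside of a Worker, here is a self-contained sketch. Please note the swaps: Node’s crypto.randomInt stands in for nanoid, two in-memory Sets stand in for KEY_DB and PASTE_DB, and the ALPHABET value is an assumption since the project’s constants file is not shown here. The randomKey and generateUniqueKey names are hypothetical.

```javascript
// Stand-in sketch: Node's crypto.randomInt replaces nanoid, and two Sets
// replace the KEY_DB and PASTE_DB namespaces. Not the project's actual code.
const { randomInt } = require("crypto");

// Assumed Base62 alphabet; the project's ALPHABET constant is not shown.
const ALPHABET =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
const KEY_LENGTH = 8;

// Build a random key one character at a time.
const randomKey = () => {
    let key = "";
    for (let i = 0; i < KEY_LENGTH; i++) {
        key += ALPHABET[randomInt(ALPHABET.length)];
    }
    return key;
};

// Retry while the candidate is already present in EITHER store.
const generateUniqueKey = (keyStore, pasteStore, nextKey = randomKey) => {
    let key = nextKey();
    while (keyStore.has(key) || pasteStore.has(key)) {
        key = nextKey();
    }
    return key;
};

// Usage: both stores start with one taken key each.
const freshKey = generateUniqueKey(new Set(["aaaaaaaa"]), new Set(["bbbbbbbb"]));
console.log(freshKey);
```

The nextKey parameter exists only so that the generator can be swapped for a predictable one when testing.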

Running out of unique keys to generate

Another potential issue that we might run into is: what should we do when all the UUIDs in our KEY_DB are completely used up?

For this, we will set up a Cron trigger that replenishes our list of UUIDs daily. To respond to a Cron trigger, we must add a "scheduled" event listener to the Workers script, as shown in the code below.

// /kgs/src/index.js
import { MAX_KEYS } from "./utils/constants";
import { generateUUIDKey } from "./utils/keyGenerator";

/*
Pre-generate a list of unique `uuid`s.

Ensures that pre-generated `uuid` KV list always has `MAX_KEYS` number of keys.
*/
const handleRequest = async () => {
    const existingUUIDs = await KEY_DB.list();

    let keysToGenerate = MAX_KEYS - existingUUIDs.keys.length;

    console.log(`Existing # of keys: ${existingUUIDs.keys.length}.`);
    console.log(`Estimated # of keys to generate: ${keysToGenerate}.`);

    // `> 0` avoids an endless loop if KEY_DB already holds more than MAX_KEYS keys.
    while (keysToGenerate > 0) {
        const newKey = await generateUUIDKey();

        await KEY_DB.put(newKey, "");
        console.log(`Generated new key in KEY_DB: ${newKey}.`);

        keysToGenerate--;
    }

    const currentUUIDs = await KEY_DB.list();
    console.log(`Current # of keys: ${currentUUIDs.keys.length}.`);
};

addEventListener("scheduled", (event) => {
    event.waitUntil(handleRequest(event));
});

As our POC can only support up to 1k writes/day, we will set MAX_KEYS to 1000. Feel free to tweak this value according to your account limits.
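
One caveat if you raise MAX_KEYS: a single list() call on a KV namespace returns at most 1,000 keys, together with list_complete and cursor fields for fetching the remaining pages. Below is a minimal sketch of counting every key by following the cursor. The countKeys helper is hypothetical and takes the namespace as a parameter, so it can be tried against a fake namespace as well as KEY_DB.

```javascript
// Hypothetical helper: count all keys in a KV namespace by following the
// pagination cursor returned by `list()`.
const countKeys = async (namespace) => {
    let total = 0;
    let cursor;

    do {
        const page = await namespace.list({ cursor });
        total += page.keys.length;
        // When `list_complete` is true there are no more pages to fetch.
        cursor = page.list_complete ? undefined : page.cursor;
    } while (cursor);

    return total;
};
```

Inside the KGS worker this would be called as `await countKeys(KEY_DB)` in place of reading `existingUUIDs.keys.length` from a single page.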


API

At a high level, we probably need 2 APIs:

  • Creating a URL for paste content
  • Redirecting to the original paste content

For this POC, we will be developing our API in GraphQL using the Apollo GraphQL server. Specifically, we will be using the itty-router worker template alongside workers-graphql-server.

Before we move along, you can directly interact with the GraphQL API of this POC via the GraphQL playground endpoint in case you are not familiar with GraphQL.

When lost, you can always refer to the /server folder.

Routing

To start, the entry point of our API server lies in src/index.js where all the routing logic is handled by itty-router.

// server/src/index.js
const { missing, ThrowableRouter, withParams } = require("itty-router-extras");
const apollo = require("./handlers/apollo");
const index = require("./handlers/index");
const paste = require("./handlers/paste");
const playground = require("./handlers/playground");

const router = ThrowableRouter();

router.get("/", index);

router.all("/graphql", playground);

router.all("/__graphql", apollo);

router.get("/:uuid", withParams, paste);

router.all("*", () => missing("Not found"));

addEventListener("fetch", (event) => {
    event.respondWith(router.handle(event.request));
});

Creating paste

Typically, to create any resource in GraphQL, we need a mutation. In REST terms, a GraphQL mutation that creates a resource is very similar to sending a request to a POST endpoint, e.g. /v1/api/paste. Here’s what our GraphQL mutation would look like:

mutation {
    createPaste(content: "Hello world!") {
        uuid
        content
        createdOn
        expireAt
    }
}

Under the hood, the handler (resolver) calls createPaste, which takes content from the JSON body of the HTTP request. The mutation is expected to return the following:

{
    "data": {
        "createPaste": {
            "uuid": "0pZUDXzd",
            "content": "Hello world!",
            "createdOn": "2022-01-29T04:07:06+00:00",
            "expireAt": "2022-01-30T04:07:06+00:00"
        }
    }
}

You can check out the GraphQL schema here.
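
Outside of the playground, the same mutation can be sent as a plain HTTP POST to the /__graphql route from the routing section. The sketch below only builds the fetch options. It passes the content as a GraphQL variable rather than inlining it into the query string; the String! variable type is an assumption about the schema, and buildCreatePasteRequest is a hypothetical helper name.

```javascript
// Hypothetical helper that builds `fetch` options for the createPaste mutation.
// The `String!` variable type is an assumption about the schema.
const CREATE_PASTE_MUTATION = `
    mutation CreatePaste($content: String!) {
        createPaste(content: $content) {
            uuid
            content
            createdOn
            expireAt
        }
    }
`;

const buildCreatePasteRequest = (content) => ({
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
        query: CREATE_PASTE_MUTATION,
        variables: { content },
    }),
});

// Usage (the host is a placeholder):
// const res = await fetch("https://<your-worker-host>/__graphql",
//     buildCreatePasteRequest("Hello world!"));
// const { data } = await res.json();
```

Using variables means quotes and newlines in the paste content do not need to be escaped by hand.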

Here’s the implementation in code of our resolvers:

// /server/src/resolvers.js
const { ApolloError } = require("apollo-server-cloudflare");

module.exports = {
    Query: {
        getPaste: async (_source, { uuid }, { dataSources }) => {
            return dataSources.pasteAPI.getPaste(uuid);
        },
    },
    Mutation: {
        createPaste: async (_source, { content }, { dataSources }) => {
            if (!content || /^\s*$/.test(content)) {
                throw new ApolloError("Paste content is empty");
            }

            return dataSources.pasteAPI.createPaste(content);
        },
    },
};

To mitigate spam, we also added a small check to prevent the creation of empty pastes.
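
To see which inputs this check rejects, the condition can be pulled out into a small function and tried in isolation. This is just the resolver’s expression under a hypothetical name, isBlank.

```javascript
// Same emptiness test as the resolver, pulled out for experimentation.
// `isBlank` is a hypothetical name used only in this sketch.
const isBlank = (content) => !content || /^\s*$/.test(content);

console.log(isBlank(""));             // true
console.log(isBlank("   \n\t"));      // true
console.log(isBlank("Hello world!")); // false
```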

Paste creation data source

We are keeping the API logic that interacts with our database (KV) within /datasources.

As mentioned previously, we need to remove the key used from our KGS KEY_DB KV to avoid the risk of assigning duplicated keys for new pastes.

Here, we can also set our key to have the expirationTtl of one day upon paste creation:

// /server/src/datasources/paste.js
const { ApolloError } = require('apollo-server-cloudflare')
const moment = require('moment')

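/*
Create a new paste in `PASTE_DB`.

Fetch a new `uuid` key from `KEY_DB`.

UUID is then removed from `KEY_DB` to avoid duplicates.

Note: this excerpt is a method of the paste data source (`dataSources.pasteAPI`);
the surrounding definition is omitted. `ONE_DAY_FROM_NOW` is the paste lifetime
in seconds (one day) and is defined outside this excerpt.
*/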
async createPaste(content) {
    try {
        const { keys } = await KEY_DB.list({ limit: 1 })
        if (!keys.length) {
            throw new ApolloError('Ran out of keys')
        }
        const { name: uuid } = keys[0]

        const createdOn = moment().format()
        const expireAt = moment().add(ONE_DAY_FROM_NOW, 'seconds').format()

        await KEY_DB.delete(uuid) // Remove key from KGS
        await PASTE_DB.put(uuid, content, {
            metadata: { createdOn, expireAt },
            expirationTtl: ONE_DAY_FROM_NOW,
        })

        return {
            uuid,
            content,
            createdOn,
            expireAt,
        }
    } catch (error) {
        throw new ApolloError(`Failed to create paste. ${error.message}`)
    }
}
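
For readers who would rather not pull in moment, the timestamps and TTL can also be computed with the built-in Date. This is a sketch under stated assumptions, not the project’s code: ONE_DAY_IN_SECONDS and buildPasteOptions are illustrative names, and toISOString() produces a "Z" suffix instead of the "+00:00" offset shown in the sample response above.

```javascript
// Sketch: compute paste timestamps with the built-in Date instead of moment.
// ONE_DAY_IN_SECONDS and buildPasteOptions are illustrative names.
const ONE_DAY_IN_SECONDS = 24 * 60 * 60;

const buildPasteOptions = (now = new Date()) => {
    const expiry = new Date(now.getTime() + ONE_DAY_IN_SECONDS * 1000);

    return {
        metadata: {
            createdOn: now.toISOString(),
            expireAt: expiry.toISOString(),
        },
        // KV removes the key this many seconds after the write.
        expirationTtl: ONE_DAY_IN_SECONDS,
    };
};

// Usage inside a Worker: await PASTE_DB.put(uuid, content, buildPasteOptions());
```

Deriving both timestamps from a single `now` value keeps createdOn and expireAt exactly one day apart.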

Similarly, I have also created a getPaste GraphQL query to retrieve the paste content via UUID. We won’t be covering it in this article but feel free to check it out in the source code. To try it out on the playground:

query {
    getPaste(uuid: "0pZUDXzd") {
        uuid
        content
        createdOn
        expireAt
    }
}

In this POC, we won’t be supporting any deletion of the pastes since pastes would expire after 24 hours.

Getting paste

Whenever a user visits a paste URL (GET /:uuid), the original content of the paste should be returned. If an invalid URL is entered, users should get a 404 (Not Found) response. View the full HTML here.

// /server/src/handlers/paste.js
const { missing } = require("itty-router-extras");
const moment = require("moment");
// The `html` template function used below is defined outside this excerpt.

const handler = async ({ uuid }) => {
    const { value: content, metadata } = await PASTE_DB.getWithMetadata(uuid);
    if (!content) {
        return missing("Invalid paste link");
    }

    // Measure the remaining time from now; `.from(metadata.createdOn)` would always read "in a day".
    const expiringIn = moment(metadata.expireAt).fromNow();

    return new Response(html(content, expiringIn), {
        headers: { "Content-Type": "text/html" },
    });
};
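
The countdown shown on the paste page depends on the current time, so it can help to see the calculation without any library. Below is a minimal sketch that derives the remaining minutes from the stored expireAt metadata; minutesUntilExpiry is a hypothetical helper, not part of the project.

```javascript
// Hypothetical helper: whole minutes left before a paste expires,
// measured from `now` rather than from the creation time.
const minutesUntilExpiry = (expireAt, now = new Date()) => {
    const remainingMs = new Date(expireAt).getTime() - now.getTime();
    return Math.max(0, Math.floor(remainingMs / 60000));
};

// Usage with the metadata format stored at paste creation:
const left = minutesUntilExpiry(
    "2022-01-30T04:07:06+00:00",
    new Date("2022-01-30T03:07:06Z")
);
console.log(`${left} minutes left`);
```

The result is clamped at zero so that a paste read just after its expiry time never shows a negative countdown.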

Finally, to start the development API server locally, simply run wrangler dev.


Deployment

Before publishing your code, you will need to edit the wrangler.toml files (within server/ & kgs/) and add your Cloudflare account_id inside. More information about configuring and publishing your code can be found in the official documentation.

Do make sure that the KV namespace bindings are added to your wrangler.toml files as well.

To publish any new changes to your Cloudflare Worker, simply run wrangler publish in the respective service.

To deploy your application to a custom domain, check out this short clip.

CI/CD

In the GitHub repository, I have also set up a CI/CD workflow using GitHub Actions. To use Wrangler actions, add CF_API_TOKEN into your GitHub repository secrets.

You can create your API tokens by using the Edit Cloudflare Workers template.


Closing Remark

I did not expect this POC to take me this long to write and complete; I probably slacked more than I should have.

Like my previous post, I would love to end this with some potential improvements that can be made (or sucked into the backlog blackhole for eternity) in the future:

  • Allowing users to set custom expiry
  • Editing and deleting pastes
  • Syntax highlighting
  • Analytics
  • Private pastes with password protection

Like URL shorteners, paste tools have a certain stigma about them: both make URLs opaque, which spammers love to abuse. Well, the next time you ask “why doesn’t this code work?”, you’ll have your own paste tool to use, at least until you add in syntax highlighting.
