The Cron Job That Ran Three Times: Building Distributed Locks on Cloud Run
Picture this: it's the week before HackPSU and you're checking Discord when you see a message from a teammate: "Hey, a bunch of hackers are saying they got three RSVP cancellation emails?"
You check the logs. Sure enough, the cron that cancels stale RSVPs fired three times. At the exact same second. From three different Cloud Run instances. Every registered hacker got hit with duplicate cancellation emails and reminder blasts. Our sender reputation tanked overnight.
Here's something they don't teach you in the NestJS docs: when you slap a `@Cron()` decorator on a method and deploy it to a platform that auto-scales, every single instance independently fires that cron job. Three instances? Three cancellation emails. Five instances during a traffic spike? Five reminder blasts. You get the idea.
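To make the failure mode concrete, here's a minimal sketch of the naive setup (service and method names are illustrative, not our actual code):

```typescript
import { Injectable } from '@nestjs/common'
import { Cron, CronExpression } from '@nestjs/schedule'

@Injectable()
export class RsvpService {
  // On an auto-scaling platform, EVERY warm instance schedules and fires
  // this job independently - three instances means three executions
  @Cron(CronExpression.EVERY_HOUR)
  async cancelStaleRsvps() {
    // ...cancels RSVPs and emails each affected hacker, once per instance
  }
}
```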
This is a classic distributed systems problem: distributed mutual exclusion, or more casually, "please only let one of you do this at a time." And honestly, it's one of those problems that seems simple until you start thinking about edge cases. What if the instance that grabs the lock crashes halfway through? What if two instances try to claim the lock at the exact same millisecond?
Here's how we solved it with three small components and Firebase Realtime Database.
The Plan
We needed a system where:
- Any number of NestJS instances can be running simultaneously
- When a cron fires, only one instance actually executes the job
- If that instance crashes mid-job, the lock doesn't get stuck forever
- Adding a new distributed cron should be as easy as adding a regular `@Cron()`: just a decorator on a method
We settled on the classic NestJS "decorator + explorer" pattern: a decorator to mark methods, an explorer to discover them at startup and wire up the actual scheduling with distributed locking. And for the lock itself? Firebase RTDB transactions.
Why Firebase? We were already using it for auth, and RTDB gives you atomic transactions out of the box; no need to spin up Redis or manage a ZooKeeper cluster. For a few lock operations per cron tick, the free tier is more than enough.
Component 1: The Decorator
Let's start with the developer-facing API, the thing you actually use in your service classes:
```typescript
import { CronExpression, CronOptions } from '@nestjs/schedule'

export const DISTRIBUTED_CRON_LOCK = 'DISTRIBUTED_CRON_LOCK'
export const DISTRIBUTED_CRON_TIME = 'DISTRIBUTED_CRON_TIME'
export const DISTRIBUTED_CRON_OPTIONS = 'DISTRIBUTED_CRON_OPTIONS'

export type DistributedCronOptions = CronOptions & {
  /** TTL for the distributed lock in ms (default: 5 minutes) */
  lockTtlMs?: number
}

export function DistributedCron(
  lockName: string,
  cronTime: string | CronExpression,
  options?: DistributedCronOptions,
): MethodDecorator {
  return (target, propertyKey) => {
    Reflect.defineMetadata(DISTRIBUTED_CRON_LOCK, lockName, target, propertyKey)
    Reflect.defineMetadata(DISTRIBUTED_CRON_TIME, cronTime, target, propertyKey)
    Reflect.defineMetadata(
      DISTRIBUTED_CRON_OPTIONS,
      options || {},
      target,
      propertyKey,
    )
  }
}
```

Here's the thing about this decorator: it does absolutely nothing at runtime in terms of scheduling. All it does is attach three pieces of metadata to the method:
- `DISTRIBUTED_CRON_LOCK` - a unique string identifier like `"send-daily-digest"`. This becomes the key in Firebase that instances compete to lock.
- `DISTRIBUTED_CRON_TIME` - the cron expression (`"0 9 * * *"` for 9 AM daily, or a `CronExpression` enum).
- `DISTRIBUTED_CRON_OPTIONS` - optional config, including a custom `lockTtlMs` for long-running jobs.
Why not just schedule directly in the decorator? Because separation of concerns makes testing way easier. The decorator is pure metadata: no side effects, no dependencies. The complex wiring lives in the explorer (we'll get there).
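That purity is easy to verify. Here's a sketch of what a unit test could look like (Jest assumed; the fixture class is hypothetical):

```typescript
import 'reflect-metadata'
import { CronExpression } from '@nestjs/schedule'
import {
  DistributedCron,
  DISTRIBUTED_CRON_LOCK,
  DISTRIBUTED_CRON_TIME,
} from './distributed-cron.decorator'

class Fixture {
  @DistributedCron('test-job', CronExpression.EVERY_MINUTE)
  async run() {}
}

it('attaches lock metadata without scheduling anything', () => {
  const proto = Fixture.prototype
  // No scheduler, no Firebase - just metadata sitting on the method
  expect(Reflect.getMetadata(DISTRIBUTED_CRON_LOCK, proto, 'run')).toBe('test-job')
  expect(Reflect.getMetadata(DISTRIBUTED_CRON_TIME, proto, 'run')).toBe(
    CronExpression.EVERY_MINUTE,
  )
})
```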
Using it in a service looks exactly like you'd hope:
@DistributedCron("send-daily-digest", CronExpression.EVERY_DAY_AT_9AM, {
lockTtlMs: 10 * 60 * 1000, // 10 minute lock for a long-running job
})
async sendDailyDigest() {
// Only executes on ONE instance, even if 5 are running
await this.emailService.sendDigestToAllUsers();
}Clean, readable, and a drop-in replacement for @Cron(). Your team doesn't need to understand distributed locks to use it.
Component 2: The Lock Service
This is the core distributed systems piece, and honestly, the part where I spent most of my time staring at edge cases.
```typescript
import { Injectable, Logger } from '@nestjs/common'
import * as admin from 'firebase-admin'

@Injectable()
export class DistributedLockService {
  private readonly logger = new Logger(DistributedLockService.name)

  private get locksRef() {
    return admin.database().ref('cron-locks')
  }

  async acquireLock(lockName: string, ttlMs = 5 * 60 * 1000): Promise<boolean> {
    const lockRef = this.locksRef.child(lockName)
    const now = Date.now()
    try {
      const result = await lockRef.transaction((current) => {
        if (current && current.expiresAt > now) {
          return undefined // Lock is held - abort
        }
        return {
          acquiredAt: now,
          expiresAt: now + ttlMs,
          instanceId: process.env.K_REVISION || 'local',
        }
      })
      return result.committed
    } catch (error) {
      this.logger.error(
        `RTDB unreachable - lock "${lockName}" could not be acquired. Job will NOT run.`,
        error,
      )
      return false
    }
  }

  async releaseLock(lockName: string): Promise<void> {
    try {
      await this.locksRef.child(lockName).remove()
    } catch (error) {
      this.logger.error(`Failed to release lock "${lockName}":`, error)
    }
  }

  async withLock(
    lockName: string,
    fn: () => Promise<void>,
    ttlMs?: number,
  ): Promise<void> {
    const acquired = await this.acquireLock(lockName, ttlMs)
    if (!acquired) {
      this.logger.log(`Lock "${lockName}" held by another instance, skipping`)
      return
    }
    try {
      await fn()
    } finally {
      await this.releaseLock(lockName)
    }
  }
}
```
Let's unpack the interesting bits.
The Transaction: Heart of the System
The `acquireLock()` method uses an RTDB transaction, which is essentially an atomic compare-and-swap. The callback receives the current value at `cron-locks/{lockName}`, and you return what you want to write:
- Lock exists and hasn't expired (`current.expiresAt > now`): return `undefined` to abort. `result.committed` will be `false`.
- Lock doesn't exist or has expired: return a new lock object. RTDB writes it atomically; if another instance sneaked in between your read and write, RTDB re-runs your callback with the updated value, and this time you'll see the other instance's lock and abort.
This gives you mutual exclusion per tick: no matter how many instances fire at the same millisecond, only one transaction commits.
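A quick way to convince yourself: race two acquisitions for the same lock name and check that exactly one wins. A sketch, assuming an initialized test Firebase app (or the emulator) and a `lockService` instance in scope:

```typescript
// Both calls run the same RTDB transaction; the loser's callback sees
// the winner's unexpired lock and aborts
const [a, b] = await Promise.all([
  lockService.acquireLock('race-demo', 60_000),
  lockService.acquireLock('race-demo', 60_000),
])
console.assert(a !== b, 'expected exactly one winner')
await lockService.releaseLock('race-demo') // clean up the test lock
```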
The lock data structure in RTDB looks like this:
```json
{
  "acquiredAt": 1711929600000,
  "expiresAt": 1711929900000,
  "instanceId": "api-v3-00042-abc"
}
```

The `instanceId` (from Cloud Run's `K_REVISION` env var) is purely for observability: peek at the Firebase console and you can see exactly which instance holds each lock. Super handy for debugging at 2 AM.
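If you'd rather not click through the console, a one-off script can dump the same view. A sketch, assuming the same firebase-admin app the lock service uses is already initialized:

```typescript
import * as admin from 'firebase-admin'

// Prints every currently held lock and which instance owns it
async function dumpLocks(): Promise<void> {
  const snapshot = await admin.database().ref('cron-locks').once('value')
  console.table(snapshot.val() ?? {})
}
```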
The TTL: Your Safety Net
Here's where things get interesting. What happens if an instance acquires the lock, starts processing a job, and then... dies? Cloud Run terminated it. OOM kill. Network blip. Whatever.
Without a TTL, that lock is held forever. No instance can ever acquire it again. Your cron job never runs again. You've created a deadlock, which is arguably worse than the original triple-execution problem.
The expiresAt field is the safety net. Even in the worst case, the lock automatically becomes available again after the TTL period (default: 5 minutes).
But here's the tradeoff you need to think about: the TTL needs to be longer than your longest expected job execution time. If your email digest takes 8 minutes but the TTL is 5 minutes, another instance could grab the lock while the first is still sending emails. And now you're back to square one: concurrent execution.
That's exactly why the decorator accepts a custom lockTtlMs. Know your job's runtime, pad it generously, and set it accordingly.
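For example, a job we'd expect to take around 12 minutes at peak might get a TTL padded to roughly double that (job and service names here are hypothetical):

```typescript
@DistributedCron("recalculate-leaderboard", "0 */6 * * *", {
  lockTtlMs: 25 * 60 * 1000, // ~2x the worst-case 12-minute runtime
})
async recalculateLeaderboard() {
  await this.scoreService.recalculateAll();
}
```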
Fail-Closed by Design
```typescript
catch (error) {
  this.logger.error(
    `RTDB unreachable - lock "${lockName}" could not be acquired. Job will NOT run.`,
    error,
  );
  return false;
}
```

If Firebase RTDB is down, `acquireLock` returns `false`. No instance runs the job. This is a deliberate fail-closed design.
You know what's worse than skipping one RSVP reminder? Blasting every hacker with five duplicate emails and tanking your sender reputation because you decided "eh, if the lock service is down, just let everyone run." For most cron jobs, the safer default is "if we can't coordinate, nobody acts."
The withLock Wrapper
```typescript
async withLock(lockName: string, fn: () => Promise<void>, ttlMs?: number): Promise<void> {
  const acquired = await this.acquireLock(lockName, ttlMs);
  if (!acquired) {
    this.logger.log(`Lock "${lockName}" held by another instance, skipping`);
    return;
  }
  try {
    await fn();
  } finally {
    await this.releaseLock(lockName);
  }
}
```
This is a classic RAII pattern adapted for async JS. The `finally` block is critical: without it, a throwing job would leave the lock held until TTL expiry, blocking all instances from picking up the next tick.
After a successful job, `releaseLock()` deletes the RTDB node immediately rather than waiting for the TTL. This means the next cron tick can acquire the lock right away without waiting for the window to close.
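Nothing ties `withLock()` to crons, either. It works just as well for guarding any operation that shouldn't run concurrently across instances. A hypothetical one-off use:

```typescript
// Hypothetical: rebuild a search index from an admin action without
// two instances doing it at once
await this.lockService.withLock(
  'rebuild-search-index',
  async () => {
    await this.searchService.rebuildIndex()
  },
  15 * 60 * 1000, // generous TTL for a slow job
)
```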
Component 3: The Explorer
This is the orchestrator that ties it all together. It runs once at application startup and discovers every method decorated with @DistributedCron:
```typescript
import { Injectable, Logger, OnModuleInit } from '@nestjs/common'
import { DiscoveryService, MetadataScanner } from '@nestjs/core'
import { SchedulerRegistry } from '@nestjs/schedule'
import { CronJob } from 'cron'
import { DistributedLockService } from './distributed-lock.service'
import {
  DISTRIBUTED_CRON_LOCK,
  DISTRIBUTED_CRON_TIME,
  DISTRIBUTED_CRON_OPTIONS,
  DistributedCronOptions,
} from './distributed-cron.decorator'

@Injectable()
export class DistributedCronExplorer implements OnModuleInit {
  private readonly logger = new Logger(DistributedCronExplorer.name)

  constructor(
    private readonly discoveryService: DiscoveryService,
    private readonly metadataScanner: MetadataScanner,
    private readonly schedulerRegistry: SchedulerRegistry,
    private readonly lockService: DistributedLockService,
  ) {}

  onModuleInit() {
    if (process.env.RUNTIME_INSTANCE !== 'production') {
      this.logger.log(
        'Distributed crons disabled in non-production environment',
      )
      return
    }

    const providers = this.discoveryService.getProviders()
    providers
      .filter((wrapper) => wrapper.instance && !wrapper.isAlias)
      .forEach((wrapper) => {
        const { instance } = wrapper
        const prototype = Object.getPrototypeOf(instance)

        this.metadataScanner
          .getAllMethodNames(prototype)
          .forEach((methodName) => {
            const lockName = Reflect.getMetadata(
              DISTRIBUTED_CRON_LOCK,
              prototype,
              methodName,
            )
            const cronTime = Reflect.getMetadata(
              DISTRIBUTED_CRON_TIME,
              prototype,
              methodName,
            )
            const options: DistributedCronOptions =
              Reflect.getMetadata(
                DISTRIBUTED_CRON_OPTIONS,
                prototype,
                methodName,
              ) || {}

            if (!lockName || !cronTime) return
            if (options.disabled) return

            const { lockTtlMs, ...cronOptions } = options
            const boundMethod = instance[methodName].bind(instance)

            const job = CronJob.from({
              cronTime,
              onTick: () => {
                this.lockService
                  .withLock(lockName, () => boundMethod(), lockTtlMs)
                  .catch((error) => {
                    this.logger.error(`Cron job "${lockName}" failed:`, error)
                  })
              },
              start: false,
              ...cronOptions,
            })

            this.schedulerRegistry.addCronJob(lockName, job)
            job.start()
            this.logger.log(
              `Registered distributed cron "${lockName}" [${cronTime}] → ${prototype.constructor.name}.${methodName}`,
            )
          })
      })
  }
}
```

There's a lot happening here, but the flow is actually pretty straightforward once you break it down.
The Production Guard
```typescript
if (process.env.RUNTIME_INSTANCE !== 'production') {
  this.logger.log('Distributed crons disabled in non-production environment')
  return
}
```

First thing the explorer does: bail out if we're not in production. You really don't want your local dev server competing for locks against production instances. And you definitely don't want your test suite accidentally triggering real cron jobs.
Discovery and Wiring
The explorer uses NestJS's `DiscoveryService` to iterate through every provider in the DI container, then `MetadataScanner` to check every method for our decorator's metadata. For each decorated method, it:
- Extracts the lock name, cron time, and options from metadata
- Binds the method to its instance (without `.bind()`, `this` inside the method would be `undefined`, a fun bug to track down)
- Creates a `CronJob` whose `onTick` wraps the method in `withLock()`
- Registers it with NestJS's `SchedulerRegistry` for lifecycle management
The `SchedulerRegistry` registration is more important than it looks. It means NestJS automatically stops all cron jobs during shutdown, and you can inject the registry elsewhere to inspect, stop, or restart specific jobs at runtime, useful for admin endpoints or debugging.
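For instance, a minimal admin endpoint for pausing a job at runtime might look like this (a sketch; the controller name and route are hypothetical, and you'd want auth in front of it):

```typescript
import { Controller, Param, Post } from '@nestjs/common'
import { SchedulerRegistry } from '@nestjs/schedule'

@Controller('admin/crons')
export class CronAdminController {
  constructor(private readonly schedulerRegistry: SchedulerRegistry) {}

  @Post(':name/stop')
  stopJob(@Param('name') name: string) {
    // Throws if no job was registered under this lock name
    this.schedulerRegistry.getCronJob(name).stop()
    return { stopped: name }
  }
}
```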
The .catch() at the End
```typescript
.catch((error) => {
  this.logger.error(`Cron job "${lockName}" failed:`, error);
});
```

This catches any unhandled errors that escape past the `withLock`'s `finally` block. Without it, you'd get an unhandled promise rejection, which in newer versions of Node can crash your process entirely. Not ideal for a production cron system.
The Full Flow at Runtime
Let's trace through what actually happens when a cron fires:
- App boots → NestJS initializes modules → `DistributedCronExplorer.onModuleInit()` scans and registers all jobs
- Cron tick fires (say, 9:00 AM) → `onTick` runs on every instance simultaneously
- Each instance calls `withLock()` → `acquireLock()` → RTDB transaction against `cron-locks/send-daily-digest`
- RTDB serializes the transactions → Only one commits → That instance gets `true`, everyone else gets `false`
- The winner executes the actual job → On completion, `finally` calls `releaseLock()` → RTDB node deleted
- The losers log `"held by another instance, skipping"` and go about their day
- Next tick → Rinse and repeat. A different instance might win; there's no sticky leader
What I love about this design is that there's no leader election, no heartbeats, no coordination protocol. Every tick is an independent race. It's stateless from the application's perspective -all the state lives in RTDB.
Why Not Just Use Cloud Scheduler?
Fair question. Google Cloud Scheduler sends HTTP requests to your endpoints on a schedule; it's built for this.
But here's the thing: with Cloud Scheduler, every cron job needs a dedicated HTTP endpoint. That means route handlers, IAM permissions, possibly authentication middleware. And the cron configuration lives outside your codebase, in Terraform, the GCP console, or some deployment script. Adding a new cron means touching infrastructure.
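For contrast, here's roughly what each Cloud Scheduler job would require on the application side (a sketch; handler and service names are hypothetical, and auth middleware is omitted):

```typescript
import { Controller, Post } from '@nestjs/common'
import { EmailService } from './email.service' // hypothetical

@Controller('tasks')
export class TaskController {
  constructor(private readonly emailService: EmailService) {}

  // Cloud Scheduler POSTs here on its schedule - one endpoint per job,
  // with the schedule itself living in Terraform or the GCP console
  @Post('send-daily-digest')
  async sendDailyDigest() {
    await this.emailService.sendDigestToAllUsers()
    return { ok: true }
  }
}
```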
With our approach, adding a cron is literally one decorator:
@DistributedCron("cleanup-expired-sessions", "0 */6 * * *")
async cleanupSessions() {
await this.sessionService.deleteExpired();
}No infrastructure changes. No Terraform. No console clicks. Just code.
The tradeoff is real though: this approach needs at least one instance running. Cloud Run can scale to zero, and when it does, no crons fire. Cloud Scheduler solves this by waking up an instance with an HTTP request. For an API that always has traffic and at least one warm instance, this isn't an issue. For a service that goes hours without requests? Cloud Scheduler might be the better call.
Lessons Learned
TTLs are a conversation, not a constant. We started with a blanket 5-minute TTL for everything, which worked great until someone added a job that processes 50k records and takes 12 minutes. Now we always ask: "How long can this job realistically take?" and set the TTL to 2x that.
Fail-closed was the right call. We had one incident where RTDB had a brief outage. No cron jobs ran for about 20 minutes. That was annoying but fine -the next tick picked up where things left off. The alternative (fail-open, everyone runs) would have been much, much worse.
The logs from the losers are surprisingly useful. When you see "held by another instance, skipping" appearing on 4 out of 5 instances, you know the system is working correctly. When you see it on all instances, something's wrong with the lock. The instanceId field has saved us multiple times during debugging.
Keep the lock scope tight. Early on, we had a single "daily-jobs" lock for all daily crons. Bad idea -if one job was slow, it blocked all the others. One lock per logical job, always.
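In code, that just means separate decorators with separate lock names rather than one umbrella lock (job names reused from earlier examples):

```typescript
// Each job races for its own lock - a slow digest can't block cleanup
@DistributedCron('send-daily-digest', CronExpression.EVERY_DAY_AT_9AM)
async sendDailyDigest() { /* ... */ }

@DistributedCron('cleanup-expired-sessions', '0 */6 * * *')
async cleanupSessions() { /* ... */ }
```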
At the end of the day, this is about 150 lines of code solving a problem that could've easily spiraled into "let's set up a Redis cluster" or "let's migrate to Cloud Scheduler and rewrite our cron architecture." Sometimes the best solution is the one that fits naturally into what you already have.

