The Cron Job That Ran Three Times: Building Distributed Locks on Cloud Run
Picture this: it's the week before HackPSU and you're checking Discord when you see a message from a teammate: "Hey, a bunch of hackers are saying they got three RSVP cancellation emails?"
You check the logs. Sure enough, the cron that cancels stale RSVPs fired three times. At the exact same second. From three different Cloud Run instances. Every registered hacker got hit with duplicate cancellation emails and reminder blasts. Our sender reputation tanked overnight.
Here's something they don't teach you in the NestJS docs: when you slap a `@Cron()` decorator on a method and deploy it to a platform that auto-scales, every single instance independently fires that cron job. Three instances? Three cancellation emails. Five instances during a traffic spike? Five reminder blasts. You get the idea.
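To make the failure mode concrete, here's a minimal sketch of the naive setup (service and method names are illustrative, not our actual code):

```typescript
import { Injectable } from '@nestjs/common'
import { Cron, CronExpression } from '@nestjs/schedule'

@Injectable()
export class RsvpService {
  // On an auto-scaling platform, EVERY warm instance schedules and fires
  // this job independently - three instances means three executions
  @Cron(CronExpression.EVERY_HOUR)
  async cancelStaleRsvps() {
    // ...cancels RSVPs and emails each affected hacker, once per instance
  }
}
```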
This is a classic distributed systems problem: distributed mutual exclusion, or more casually, "please only let one of you do this at a time." And honestly, it's one of those problems that seems simple until you start thinking about edge cases. What if the instance that grabs the lock crashes halfway through? What if two instances try to claim the lock at the exact same millisecond?
Here's how we solved it with three small components and Firebase Realtime Database.
The Plan
We needed a system where:
- Any number of NestJS instances can be running simultaneously
- When a cron fires, only one instance actually executes the job
- If that instance crashes mid-job, the lock doesn't get stuck forever
- Adding a new distributed cron should be as easy as adding a regular `@Cron()`: just a decorator on a method
We settled on the classic NestJS "decorator + explorer" pattern: a decorator to mark methods, an explorer to discover them at startup and wire up the actual scheduling with distributed locking. And for the lock itself? Firebase RTDB transactions.
Why Firebase? We were already using it for auth, and RTDB gives you atomic transactions out of the box; no need to spin up Redis or manage a ZooKeeper cluster. For a few lock operations per cron tick, the free tier is more than enough.
Component 1: The Decorator
Let's start with the developer-facing API, the thing you actually use in your service classes:
```typescript
import { CronExpression, CronOptions } from '@nestjs/schedule'

export const DISTRIBUTED_CRON_LOCK = 'DISTRIBUTED_CRON_LOCK'
export const DISTRIBUTED_CRON_TIME = 'DISTRIBUTED_CRON_TIME'
export const DISTRIBUTED_CRON_OPTIONS = 'DISTRIBUTED_CRON_OPTIONS'

export type DistributedCronOptions = CronOptions & {
  /** TTL for the distributed lock in ms (default: 5 minutes) */
  lockTtlMs?: number
}

export function DistributedCron(
  lockName: string,
  cronTime: string | CronExpression,
  options?: DistributedCronOptions,
): MethodDecorator {
  return (target, propertyKey) => {
    Reflect.defineMetadata(DISTRIBUTED_CRON_LOCK, lockName, target, propertyKey)
    Reflect.defineMetadata(DISTRIBUTED_CRON_TIME, cronTime, target, propertyKey)
    Reflect.defineMetadata(
      DISTRIBUTED_CRON_OPTIONS,
      options || {},
      target,
      propertyKey,
    )
  }
}
```

Here's the thing about this decorator: it does absolutely nothing at runtime in terms of scheduling. All it does is attach three pieces of metadata to the method:
- `DISTRIBUTED_CRON_LOCK` - a unique string identifier like `"send-daily-digest"`. This becomes the key in Firebase that instances compete to lock.
- `DISTRIBUTED_CRON_TIME` - the cron expression (`"0 9 * * *"` for 9 AM daily, or a `CronExpression` enum).
- `DISTRIBUTED_CRON_OPTIONS` - optional config, including a custom `lockTtlMs` for long-running jobs.
Why not just schedule directly in the decorator? Because separation of concerns makes testing way easier. The decorator is pure metadata: no side effects, no dependencies. The complex wiring lives in the explorer (we'll get there).
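That purity is easy to verify. Here's a sketch of what a unit test could look like (Jest assumed; the fixture class is hypothetical):

```typescript
import 'reflect-metadata'
import { CronExpression } from '@nestjs/schedule'
import {
  DistributedCron,
  DISTRIBUTED_CRON_LOCK,
  DISTRIBUTED_CRON_TIME,
} from './distributed-cron.decorator'

class Fixture {
  @DistributedCron('test-job', CronExpression.EVERY_MINUTE)
  async run() {}
}

it('attaches lock metadata without scheduling anything', () => {
  const proto = Fixture.prototype
  // No scheduler, no Firebase - just metadata sitting on the method
  expect(Reflect.getMetadata(DISTRIBUTED_CRON_LOCK, proto, 'run')).toBe('test-job')
  expect(Reflect.getMetadata(DISTRIBUTED_CRON_TIME, proto, 'run')).toBe(
    CronExpression.EVERY_MINUTE,
  )
})
```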
Using it in a service looks exactly like you'd hope:
@DistributedCron("send-daily-digest", CronExpression.EVERY_DAY_AT_9AM, {
lockTtlMs: 10 * 60 * 1000, // 10 minute lock for a long-running job
})
async sendDailyDigest() {
// Only executes on ONE instance, even if 5 are running
await this.emailService.sendDigestToAllUsers();
}Clean, readable, and a drop-in replacement for @Cron(). Your team doesn't need to understand distributed locks to use it.
Component 2: The Lock Service
This is the core distributed systems piece, and honestly, the part where I spent most of my time staring at edge cases.
```typescript
import { Injectable, Logger } from '@nestjs/common'
import * as admin from 'firebase-admin'

@Injectable()
export class DistributedLockService {
  private readonly logger = new Logger(DistributedLockService.name)

  private get locksRef() {
    return admin.database().ref('cron-locks')
  }

  async acquireLock(lockName: string, ttlMs = 5 * 60 * 1000): Promise<boolean> {
    const lockRef = this.locksRef.child(lockName)
    const now = Date.now()
    try {
      const result = await lockRef.transaction((current) => {
        if (current && current.expiresAt > now) {
          return undefined // Lock is held - abort
        }
        return {
          acquiredAt: now,
          expiresAt: now + ttlMs,
          instanceId: process.env.K_REVISION || 'local',
        }
      })
      return result.committed
    } catch (error) {
      this.logger.error(
        `RTDB unreachable - lock "${lockName}" could not be acquired. Job will NOT run.`,
        error,
      )
      return false
    }
  }

  async releaseLock(lockName: string): Promise<void> {
    try {
      await this.locksRef.child(lockName).remove()
    } catch (error) {
      this.logger.error(`Failed to release lock "${lockName}":`, error)
    }
  }

  async withLock(
    lockName: string,
    fn: () => Promise<void>,
    ttlMs?: number,
  ): Promise<void> {
    const acquired = await this.acquireLock(lockName, ttlMs)
    if (!acquired) {
      this.logger.log(`Lock "${lockName}" held by another instance, skipping`)
      return
    }
    try {
      await fn()
    } finally {
      await this.releaseLock(lockName)
    }
  }
}
```
Let's unpack the interesting bits.
The Transaction: Heart of the System
The `acquireLock()` method uses an RTDB transaction, which is essentially an atomic compare-and-swap. The callback receives the current value at `cron-locks/{lockName}`, and you return what you want to write:
- Lock exists and hasn't expired (`current.expiresAt > now`): return `undefined` to abort. `result.committed` will be `false`.
- Lock doesn't exist or has expired: return a new lock object. RTDB writes it atomically; if another instance sneaked in between your read and write, RTDB re-runs your callback with the updated value, and this time you'll see the other instance's lock and abort.
This gives you mutual exclusion per tick: no matter how many instances fire at the same millisecond, only one transaction commits.
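A quick way to convince yourself: race two acquisitions for the same lock name and check that exactly one wins. A sketch, assuming an initialized test Firebase app (or the emulator) and a `lockService` instance in scope:

```typescript
// Both calls run the same RTDB transaction; the loser's callback sees
// the winner's unexpired lock and aborts
const [a, b] = await Promise.all([
  lockService.acquireLock('race-demo', 60_000),
  lockService.acquireLock('race-demo', 60_000),
])
console.assert(a !== b, 'expected exactly one winner')
await lockService.releaseLock('race-demo') // clean up the test lock
```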
The lock data structure in RTDB looks like this:
```json
{
  "acquiredAt": 1711929600000,
  "expiresAt": 1711929900000,
  "instanceId": "api-v3-00042-abc"
}
```

The `instanceId` (from Cloud Run's `K_REVISION` env var) is purely for observability: peek at the Firebase console and you can see exactly which instance holds each lock. Super handy for debugging at 2 AM.
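If you'd rather not click through the console, a one-off script can dump the same view. A sketch, assuming the same firebase-admin app the lock service uses is already initialized:

```typescript
import * as admin from 'firebase-admin'

// Prints every currently held lock and which instance owns it
async function dumpLocks(): Promise<void> {
  const snapshot = await admin.database().ref('cron-locks').once('value')
  console.table(snapshot.val() ?? {})
}
```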
The TTL: Your Safety Net
Here's where things get interesting. What happens if an instance acquires the lock, starts processing a job, and then... dies? Cloud Run terminated it. OOM kill. Network blip. Whatever.
Without a TTL, that lock is held forever. No instance can ever acquire it again. Your cron job never runs again. You've created a deadlock, which is arguably worse than the original triple-execution problem.
The expiresAt field is the safety net. Even in the worst case, the lock automatically becomes available again after the TTL period (default: 5 minutes).
But here's the tradeoff you need to think about: the TTL needs to be longer than your longest expected job execution time. If your email digest takes 8 minutes but the TTL is 5 minutes, another instance could grab the lock while the first is still sending emails. And now you're back to square one: concurrent execution.
That's exactly why the decorator accepts a custom lockTtlMs. Know your job's runtime, pad it generously, and set it accordingly.
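For example, a job we'd expect to take around 12 minutes at peak might get a TTL padded to roughly double that (job and service names here are hypothetical):

```typescript
@DistributedCron("recalculate-leaderboard", "0 */6 * * *", {
  lockTtlMs: 25 * 60 * 1000, // ~2x the worst-case 12-minute runtime
})
async recalculateLeaderboard() {
  await this.scoreService.recalculateAll();
}
```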
Fail-Closed by Design
```typescript
catch (error) {
  this.logger.error(
    `RTDB unreachable - lock "${lockName}" could not be acquired. Job will NOT run.`,
    error,
  );
  return false;
}
```

If Firebase RTDB is down, `acquireLock` returns `false`. No instance runs the job. This is a deliberate fail-closed design.
You know what's worse than skipping one RSVP reminder? Blasting every hacker with five duplicate emails and tanking your sender reputation because you decided "eh, if the lock service is down, just let everyone run." For most cron jobs, the safer default is "if we can't coordinate, nobody acts."
The withLock Wrapper
```typescript
async withLock(lockName: string, fn: () => Promise<void>, ttlMs?: number): Promise<void> {
  const acquired = await this.acquireLock(lockName, ttlMs);
  if (!acquired) {
    this.logger.log(`Lock "${lockName}" held by another instance, skipping`);
    return;
  }
  try {
    await fn();
  } finally {
    await this.releaseLock(lockName);
  }
}
```
This is a classic RAII pattern adapted for async JS. The `finally` block is critical: without it, a throwing job would leave the lock held until TTL expiry, blocking all instances from picking up the next tick.
After a successful job, `releaseLock()` deletes the RTDB node immediately rather than waiting for the TTL. This means the next cron tick can acquire the lock right away without waiting for the window to close.
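Nothing ties `withLock()` to crons, either. It works just as well for guarding any operation that shouldn't run concurrently across instances. A hypothetical one-off use:

```typescript
// Hypothetical: rebuild a search index from an admin action without
// two instances doing it at once
await this.lockService.withLock(
  'rebuild-search-index',
  async () => {
    await this.searchService.rebuildIndex()
  },
  15 * 60 * 1000, // generous TTL for a slow job
)
```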
Component 3: The Explorer
This is the orchestrator that ties it all together. It runs once at application startup and discovers every method decorated with @DistributedCron:
```typescript
import { Injectable, Logger, OnModuleInit } from '@nestjs/common'
import { DiscoveryService, MetadataScanner } from '@nestjs/core'
import { SchedulerRegistry } from '@nestjs/schedule'
import { CronJob } from 'cron'
import { DistributedLockService } from './distributed-lock.service'
import {
  DISTRIBUTED_CRON_LOCK,
  DISTRIBUTED_CRON_TIME,
  DISTRIBUTED_CRON_OPTIONS,
  DistributedCronOptions,
} from './distributed-cron.decorator'

@Injectable()
export class DistributedCronExplorer implements OnModuleInit {
  private readonly logger = new Logger(DistributedCronExplorer.name)

  constructor(
    private readonly discoveryService: DiscoveryService,
    private readonly metadataScanner: MetadataScanner,
    private readonly schedulerRegistry: SchedulerRegistry,
    private readonly lockService: DistributedLockService,
  ) {}

  onModuleInit() {
    if (process.env.RUNTIME_INSTANCE !== 'production') {
      this.logger.log(
        'Distributed crons disabled in non-production environment',
      )
      return
    }

    const providers = this.discoveryService.getProviders()
    providers
      .filter((wrapper) => wrapper.instance && !wrapper.isAlias)
      .forEach((wrapper) => {
        const { instance } = wrapper
        const prototype = Object.getPrototypeOf(instance)

        this.metadataScanner
          .getAllMethodNames(prototype)
          .forEach((methodName) => {
            const lockName = Reflect.getMetadata(
              DISTRIBUTED_CRON_LOCK,
              prototype,
              methodName,
            )
            const cronTime = Reflect.getMetadata(
              DISTRIBUTED_CRON_TIME,
              prototype,
              methodName,
            )
            const options: DistributedCronOptions =
              Reflect.getMetadata(
                DISTRIBUTED_CRON_OPTIONS,
                prototype,
                methodName,
              ) || {}

            if (!lockName || !cronTime) return
            if (options.disabled) return

            const { lockTtlMs, ...cronOptions } = options
            const boundMethod = instance[methodName].bind(instance)

            const job = CronJob.from({
              cronTime,
              onTick: () => {
                this.lockService
                  .withLock(lockName, () => boundMethod(), lockTtlMs)
                  .catch((error) => {
                    this.logger.error(`Cron job "${lockName}" failed:`, error)
                  })
              },
              start: false,
              ...cronOptions,
            })

            this.schedulerRegistry.addCronJob(lockName, job)
            job.start()
            this.logger.log(
              `Registered distributed cron "${lockName}" [${cronTime}] → ${prototype.constructor.name}.${methodName}`,
            )
          })
      })
  }
}
```

There's a lot happening here, but the flow is actually pretty straightforward once you break it down.
The Production Guard
```typescript
if (process.env.RUNTIME_INSTANCE !== 'production') {
  this.logger.log('Distributed crons disabled in non-production environment')
  return
}
```

First thing the explorer does: bail out if we're not in production. You really don't want your local dev server competing for locks against production instances. And you definitely don't want your test suite accidentally triggering real cron jobs.
Discovery and Wiring
The explorer uses NestJS's `DiscoveryService` to iterate through every provider in the DI container, then `MetadataScanner` to check every method for our decorator's metadata. For each decorated method, it:
- Extracts the lock name, cron time, and options from metadata
- Binds the method to its instance (without `.bind()`, `this` inside the method would be `undefined`, a fun bug to track down)
- Creates a `CronJob` whose `onTick` wraps the method in `withLock()`
- Registers it with NestJS's `SchedulerRegistry` for lifecycle management
The `SchedulerRegistry` registration is more important than it looks. It means NestJS automatically stops all cron jobs during shutdown, and you can inject the registry elsewhere to inspect, stop, or restart specific jobs at runtime, useful for admin endpoints or debugging.
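For instance, a minimal admin endpoint for pausing a job at runtime might look like this (a sketch; the controller name and route are hypothetical, and you'd want auth in front of it):

```typescript
import { Controller, Param, Post } from '@nestjs/common'
import { SchedulerRegistry } from '@nestjs/schedule'

@Controller('admin/crons')
export class CronAdminController {
  constructor(private readonly schedulerRegistry: SchedulerRegistry) {}

  @Post(':name/stop')
  stopJob(@Param('name') name: string) {
    // Throws if no job was registered under this lock name
    this.schedulerRegistry.getCronJob(name).stop()
    return { stopped: name }
  }
}
```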
The .catch() at the End
```typescript
.catch((error) => {
  this.logger.error(`Cron job "${lockName}" failed:`, error);
});
```

This catches any unhandled errors that escape past the `withLock`'s `finally` block. Without it, you'd get an unhandled promise rejection, which in newer versions of Node can crash your process entirely. Not ideal for a production cron system.
The Full Flow at Runtime
Let's trace through what actually happens when a cron fires:
- App boots → NestJS initializes modules → `DistributedCronExplorer.onModuleInit()` scans and registers all jobs
- Cron tick fires (say, 9:00 AM) → `onTick` runs on every instance simultaneously
- Each instance calls `withLock()` → `acquireLock()` → RTDB transaction against `cron-locks/send-daily-digest`
- RTDB serializes the transactions → Only one commits → That instance gets `true`, everyone else gets `false`
- The winner executes the actual job → On completion, `finally` calls `releaseLock()` → RTDB node deleted
- The losers log `"held by another instance, skipping"` and go about their day
- Next tick → Rinse and repeat. A different instance might win; there's no sticky leader
What I love about this design is that there's no leader election, no heartbeats, no coordination protocol. Every tick is an independent race. It's stateless from the application's perspective -all the state lives in RTDB.
Why Not Just Use Cloud Scheduler?
Fair question. Google Cloud Scheduler sends HTTP requests to your endpoints on a schedule; it's built for this.
But here's the thing: with Cloud Scheduler, every cron job needs a dedicated HTTP endpoint. That means route handlers, IAM permissions, possibly authentication middleware. And the cron configuration lives outside your codebase, in Terraform, the GCP console, or some deployment script. Adding a new cron means touching infrastructure.
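For contrast, here's roughly what each Cloud Scheduler job would require on the application side (a sketch; handler and service names are hypothetical, and auth middleware is omitted):

```typescript
import { Controller, Post } from '@nestjs/common'
import { EmailService } from './email.service' // hypothetical

@Controller('tasks')
export class TaskController {
  constructor(private readonly emailService: EmailService) {}

  // Cloud Scheduler POSTs here on its schedule - one endpoint per job,
  // with the schedule itself living in Terraform or the GCP console
  @Post('send-daily-digest')
  async sendDailyDigest() {
    await this.emailService.sendDigestToAllUsers()
    return { ok: true }
  }
}
```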
With our approach, adding a cron is literally one decorator:
@DistributedCron("cleanup-expired-sessions", "0 */6 * * *")
async cleanupSessions() {
await this.sessionService.deleteExpired();
}No infrastructure changes. No Terraform. No console clicks. Just code.
The tradeoff is real though: this approach needs at least one instance running. Cloud Run can scale to zero, and when it does, no crons fire. Cloud Scheduler solves this by waking up an instance with an HTTP request. For an API that always has traffic and at least one warm instance, this isn't an issue. For a service that goes hours without requests? Cloud Scheduler might be the better call.
Lessons Learned
TTLs are a conversation, not a constant. We started with a blanket 5-minute TTL for everything, which worked great until someone added a job that processes 50k records and takes 12 minutes. Now we always ask: "How long can this job realistically take?" and set the TTL to 2x that.
Fail-closed was the right call. We had one incident where RTDB had a brief outage. No cron jobs ran for about 20 minutes. That was annoying but fine -the next tick picked up where things left off. The alternative (fail-open, everyone runs) would have been much, much worse.
The logs from the losers are surprisingly useful. When you see "held by another instance, skipping" appearing on 4 out of 5 instances, you know the system is working correctly. When you see it on all instances, something's wrong with the lock. The instanceId field has saved us multiple times during debugging.
Keep the lock scope tight. Early on, we had a single "daily-jobs" lock for all daily crons. Bad idea -if one job was slow, it blocked all the others. One lock per logical job, always.
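In code, that just means separate decorators with separate lock names rather than one umbrella lock (job names reused from earlier examples):

```typescript
// Each job races for its own lock - a slow digest can't block cleanup
@DistributedCron('send-daily-digest', CronExpression.EVERY_DAY_AT_9AM)
async sendDailyDigest() { /* ... */ }

@DistributedCron('cleanup-expired-sessions', '0 */6 * * *')
async cleanupSessions() { /* ... */ }
```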
At the end of the day, this is about 150 lines of code solving a problem that could've easily spiraled into "let's set up a Redis cluster" or "let's migrate to Cloud Scheduler and rewrite our cron architecture." Sometimes the best solution is the one that fits naturally into what you already have.

