The Job Search Spreadsheet That Drove Me to Automation
Picture this: you're a college student staring at your 47th internship posting of the day, and you realize it's for "Summer 2024" (you need Summer 2025), located in Toronto (you can't work in Canada), and it's the third time this week the same company has reposted the exact same role with a slightly different title.
You know what's fun? Manually tracking internship applications in a Google Sheet. Said no one ever.
But here's the thing: I was doing exactly that, spending 2+ hours every day copying and pasting listings from various job boards into my meticulously organized "Job Application Tracker." And honestly? I was losing my mind.
So I did what any reasonable CS student would do: I built a Python script to do the boring stuff for me. Here's how that went down, including all the weird edge cases I didn't see coming.
The Problem That Started It All
Let me paint you a picture of my daily routine circa October 2024. I'd wake up, grab coffee, and open multiple tabs: various internship boards, my Google Sheet, and whatever streaming service I was using to stay sane during the copy-paste marathon.
The process was mind-numbing:
- Scroll through 200+ new internship postings across different sites
- Click each one that looked promising
- Copy the title, company, location, and application deadline
- Paste it into my sheet
- Add my own notes about whether I actually wanted to apply
- Repeat until my eyes started bleeding
Here's what really got me though. I'd spend all this time curating listings, only to discover later that half of them were completely wrong for me:
- Location issues: "Oh cool, a great SWE internship at Shopify!" (Located in Toronto. I can't work in Canada.)
- Timeline problems: "This looks perfect!" (It's for Summer 2024. I need Summer 2025.)
- Duplicates everywhere: The same company would post the identical internship 3-4 times with slightly different job IDs
I was essentially building a database of internships I couldn't even apply to. That's when I realized: if I'm going to spend hours on this anyway, why not automate the parts that don't require human judgment?
Building Something That Actually Worked
Here's where things got interesting. I'd never worked with the Google Sheets API before, but I figured "how hard could it be?" (Famous last words, right?)
My first attempt was embarrassingly simple:
import gspread
import requests
# This was literally my entire first script
def get_jobs_and_dump_them():
    gc = gspread.service_account()
    sheet = gc.open("Job Application Tracker").sheet1
    response = requests.get("https://api.jobboard.example/v1/positions")
    jobs = response.json()
    for job in jobs:
        sheet.append_row([job['title'], job['company'], job['location']])

You know what happened? It worked! Sort of. I ended up with 847 internship listings in my sheet, including gems like:
- A nursing position (I'm a CS major)
- Internships for "Spring 2023" (time travel not required, apparently)
- 47 duplicate postings from the same company
The script technically did what I asked it to do, but it was like asking someone to bring you "some food" and they show up with everything from the grocery store. Technically correct, completely useless.
Getting Smarter About Filtering
The real challenge was teaching my script to think like me. What makes a job listing worth my time?
I started simple with location filtering:
from typing import Set

EXCLUDED_LOCATIONS: Set[str] = {
    loc.lower()
    for loc in ["canada", "toronto", "montreal", "ontario", "london"]
}

def is_location_excluded(location: str) -> bool:
    return any(loc in location.lower() for loc in EXCLUDED_LOCATIONS)

But here's something they don't teach you in CS classes: job posting data is messy. Like, really messy. I'd get locations like:
- "New York, NY; Toronto, ON; Remote"
- "Multiple Locations (see description)"
- "USA Remote" vs "Remote - Global"
My simple string matching was catching false positives left and right. A "New York, NY; Toronto, ON" listing would get filtered out entirely, even though I could work in New York.
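One fix for those false positives is to split multi-location strings apart and keep a posting if any single location passes, instead of substring-matching the whole string. Here's a sketch of that idea (the `split_locations` and `has_allowed_location` helpers are hypothetical names, not from my actual script):

```python
import re
from typing import List, Set

EXCLUDED_LOCATIONS: Set[str] = {"canada", "toronto", "montreal", "ontario", "london"}

def split_locations(raw: str) -> List[str]:
    # "New York, NY; Toronto, ON; Remote" -> ["New York, NY", "Toronto, ON", "Remote"]
    return [part.strip() for part in re.split(r"[;|/]", raw) if part.strip()]

def has_allowed_location(raw: str) -> bool:
    # Keep the posting if at least one of its locations is NOT excluded,
    # instead of dropping the whole string on a single bad match.
    return any(
        not any(excl in loc.lower() for excl in EXCLUDED_LOCATIONS)
        for loc in split_locations(raw)
    )
```

With this, a "New York, NY; Toronto, ON; Remote" posting survives because New York is fine, while a Toronto-only posting still gets dropped.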
The timeline filtering was even trickier. I needed to catch internships for my graduation timeline:
from typing import List, Set

INCLUDED_TERMS: Set[str] = {
    "Spring 2025", "Summer 2025", "Fall 2025", "Winter 2025",
    "Spring 2026", "Summer 2026", "Fall 2026", "Winter 2026",
    # Also included broader terms as fallbacks
    "Fall", "Summer", "Spring", "Winter",
}

def is_terms_included(terms: List[str]) -> bool:
    return any(term in INCLUDED_TERMS for term in terms)

But what about postings that don't have terms at all? My solution was a fallback based on posting date:
import datetime

FALLBACK_CUTOFF_DATE: str = "2025-03-01"
FALLBACK_CUTOFF_TS: int = int(
    datetime.datetime.fromisoformat(FALLBACK_CUTOFF_DATE).timestamp()
)

# In the filtering logic
if job.terms:
    if not is_terms_included(job.terms):
        continue  # Skip if terms don't match
else:
    if job.date_posted < FALLBACK_CUTOFF_TS:
        continue  # Skip if no terms and too old
The real breakthrough came when I added comprehensive logging to understand what was being filtered out:
def summarize_filters(postings, existing_urls):
    excluded_locations = set()
    excluded_terms = set()
    for job in postings:
        if not job.active:
            continue
        if is_location_excluded(" ".join(job.locations)):
            for loc in job.locations:
                if is_location_excluded(loc):
                    excluded_locations.add(loc)
            continue
        # More filtering logic (also populates excluded_terms and passed_jobs)...
    logger.info(f"Excluded Locations: {excluded_locations}")
    logger.info(f"Excluded Terms: {excluded_terms}")
    logger.info(f"Passed Jobs: {len(passed_jobs)}")

This logging saved my sanity. I could finally see exactly what was being filtered and why.
The Deduplication Nightmare
You'd think deduplication would be simple. Just check if the job URL already exists in your sheet. But oh boy, you'd be wrong.
Companies love to repost the same internship with different URLs:
- careers.company.com/job/123456
- careers.company.com/job/123456?source=indeed
- careers.company.com/job/123456/apply
Same internship, three different URLs. My solution? URL cleaning:
# Clean tracking parameters from URLs
url = job_url.replace("?utm_source=JobBoard&ref=JobBoard", "")
url = url.replace("&utm_source=JobBoard&ref=JobBoard", "")

Then I just check if the cleaned URL already exists:
def get_existing_urls(sheet: Worksheet) -> Set[str]:
    rows = sheet.get_all_values()
    return {row[5] for row in rows if len(row) > 5}  # URL is in column F (index 5)

# Later in filtering
if job.url in existing_urls:
    logger.debug(f"Skipping duplicate URL: {job.url}")
    continue

Simple, but effective. This approach caught about 95% of duplicates without being overly complex.
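That said, hard-coding the exact tracking query string is brittle. If I revisit this, a more general approach (a sketch, not what my script actually does) would be to parse the URL and drop any known tracking parameters:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Common tracking parameter names; extend as new ones show up.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "source"}

def clean_url(url: str) -> str:
    # Strip known tracking parameters but keep everything else intact.
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

This survives reordered parameters and new tracking sources without touching the rest of the URL.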
Going Full Automation Mode
Once I had the filtering logic working, I wanted to eliminate the manual "remember to run the script" step entirely. GitHub Actions seemed like the obvious choice. Free, reliable, and I could keep everything in version control.
Here's where things got interesting though. I discovered I needed to track multiple data sources, so I ended up with a more complex workflow:
name: Job Application Tracker

on:
  schedule:
    - cron: '0 */1 * * *'  # Every hour
  workflow_dispatch:  # Manual trigger

jobs:
  run-tracker-internship:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.x'
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
      - name: Configure Google credentials
        env:
          GOOGLE_CREDENTIALS: ${{ secrets.GOOGLE_CREDENTIALS }}
        run: |
          echo "$GOOGLE_CREDENTIALS" > creds.json
          echo "GOOGLE_APPLICATION_CREDENTIALS_CUSTOM=$(pwd)/creds.json" >> $GITHUB_ENV
      - name: Run tracker for internships
        env:
          SHEET_ID: ${{ secrets.SHEET_ID_INTERNSHIP }}
          JOB_LISTINGS_URL: ${{ secrets.JOB_LISTINGS_URL_INTERNSHIP }}
        run: python job_tracker.py

  run-tracker-internship-secondary:
    needs: run-tracker-internship
    runs-on: ubuntu-latest
    # Similar setup but with different URL source...

You'll notice I'm running this every hour for internships. Why? Because internship postings move fast, especially during recruiting season. I learned this the hard way when I missed a great opportunity at a startup because their posting was only live for 6 hours.
The authentication setup was trickier than I expected. Instead of parsing JSON in Python, I ended up writing the credentials directly to a file in the GitHub Action:
# In my actual script
import os

import gspread
from oauth2client.service_account import ServiceAccountCredentials

def authenticate_gspread() -> gspread.client.Client:
    scope = [
        "https://spreadsheets.google.com/feeds",
        "https://www.googleapis.com/auth/spreadsheets",
        "https://www.googleapis.com/auth/drive.file",
        "https://www.googleapis.com/auth/drive",
    ]
    creds_path = os.environ["GOOGLE_APPLICATION_CREDENTIALS_CUSTOM"]
    creds = ServiceAccountCredentials.from_json_keyfile_name(creds_path, scope)
    return gspread.authorize(creds)

What I discovered was that different job boards have different APIs and data structures. Some use "terms" for graduation timelines, others use "seasons." Some have clean location data, others... don't. My script had to handle all of this messiness.
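One way to contain that messiness is to normalize every source's payload into one internal shape before any filtering runs. Here's a rough sketch of that idea (the field names like "seasons" are illustrative, not the actual board schemas):

```python
from typing import Any, Dict

def normalize_posting(raw: Dict[str, Any]) -> Dict[str, Any]:
    # Different boards use different keys ("terms" vs "seasons",
    # a "location" string vs a "locations" list); map them all to one shape
    # so the filtering logic only ever sees one format.
    terms = raw.get("terms") or raw.get("seasons") or []
    locations = raw.get("locations")
    if locations is None:
        loc = raw.get("location", "")
        locations = [loc] if loc else []
    return {
        "title": raw.get("title", ""),
        "terms": list(terms),
        "locations": list(locations),
    }
```

The payoff is that the filtering and dedup code stays source-agnostic; adding a third job board only means writing one more mapping.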
The Results (And What I Actually Learned)
Three months later, here's what actually happened:
The Good:
- I went from spending 2+ hours daily on internship tracking to maybe 10 minutes reviewing the filtered results
- My sheet went from 800+ irrelevant postings to ~50 high-quality matches per week
- I could actually focus on writing decent cover letters instead of data entry
- The hourly automation means I never miss short-lived postings anymore
The Weird:
- My script occasionally finds internships I never would have discovered manually (algorithm serendipity!)
- I started getting paranoid about missing opportunities because I wasn't manually reviewing everything
- Other students started asking me to set up similar trackers for them (never expected to become the "internship automation guy")
- I now have separate trackers for internships AND full-time roles, running different schedules
The Real Learning: The biggest lesson wasn't technical. It was realizing how much mental energy I was wasting on repetitive tasks. Even if building the automation took 20 hours upfront, it paid for itself within two weeks and gave me back my sanity.
From a technical perspective, this project taught me that real-world data is always messier than you expect. Every API has its quirks, every data source has its blind spots, and every "simple" filtering rule has edge cases that'll surprise you.
But here's the thing: perfect is the enemy of good. My script probably misses some great opportunities and occasionally includes some irrelevant ones. But it catches 90% of what I care about while eliminating 95% of what I don't. That's a trade-off I'll take any day.
You can check out the sheet I use with my tracker at go.codepath.org/jobtracker and the GitHub repo for the script at github.com/kensac/job-scripts.
What's Next?
I'm thinking about adding:
- Notifications for "urgent" postings (application deadlines within 48 hours)
- Better company research integration (automatically pull recent news, funding info, etc.)
- Smart categorization (startup vs big tech vs consulting, etc.)
- Integration with more job boards beyond just the two I'm currently tracking
- AI-powered resume matching (suggest which roles my resume is best suited for based on job descriptions)
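The "urgent postings" idea could start as simple as flagging any deadline inside a 48-hour window. A minimal sketch, assuming deadlines arrive as ISO-formatted strings (which isn't guaranteed by any particular board):

```python
import datetime
from typing import Optional

def is_urgent(deadline_iso: str, now: Optional[datetime.datetime] = None) -> bool:
    # Flag postings whose application deadline is within the next 48 hours
    # (deadlines already in the past are not urgent, just missed).
    now = now or datetime.datetime.now()
    deadline = datetime.datetime.fromisoformat(deadline_iso)
    return datetime.timedelta(0) <= deadline - now <= datetime.timedelta(hours=48)
```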
But honestly? The current version works so well that I sometimes forget it's running. Which is exactly what good automation should feel like. Invisible until you need it, reliable when you do.
If you're doing any kind of repetitive data gathering (internship hunting, apartment searching, monitoring price drops), seriously consider automating it. Your future self will thank you, and you might just learn something useful about APIs, data processing, and the beautiful chaos of real-world software development.
The best part? Every time I open my Job Application Tracker sheet and see fresh, relevant internship postings that I didn't have to manually copy and paste, I get a little reminder that sometimes the best solution to a boring problem is to just... not do the boring part yourself.