Feel free to contact me at kanishksachdev@gmail.com
Ahhh, somehow we are back here again: another weekend, another project that went way out of scope.
Do you know the game where you start on a random Wikipedia article and try to get to another article by clicking links on each page? It's a fun way to procrastinate, and you can end up in some weird corners of the internet.
This is what I was doing at 2 AM on Friday and I thought "Wait, what if I could actually visualize these connections?"
36 hours later, I had downloaded the entire English Wikipedia, parsed 8.1 million articles and their 251 million connections, and built a Neo4j graph database that maps the structure of human knowledge (well, at least the English version).
Because if you're going to procrastinate, you might as well do it at scale.
This was part of my challenge: learn something completely new every weekend. Not just reading about it, but actually building something functional by Sunday night.
My first naive thought? "I'll just download Wikipedia and throw it into Gephi."
Famous last words.
Let me give you some numbers that made my MacBook start thermal throttling (and my sanity start to fray):
Gephi crashed trying to load even a subset of the nodes. Cytoscape gave up. My browser-based graph tools just... stopped responding.
Turns out, you can't just "visualize" 251 million relationships. You need to be a lot more strategic.
After watching Gephi crash on even a subset of my data, I realized this wasn't just a "big data" problem but rather a "my laptop isn't a supercomputer" problem.
The Tech Stack (chosen mostly by elimination):
Why Rust for parsing?
Let me put it this way: my first Python attempt was taking forever to process even a fraction of the data. The Rust version? Much faster. Sometimes performance actually matters. (And I have also been wanting to learn Rust for a while.)
The Multithreading Experiment That Backfired
Here's something that caught me off guard: I initially tried to speed things up with multithreading during the parsing phase. More threads = faster processing, right?
Wrong.
Turns out, when you're dealing with a single massive XML file, multithreading actually made things slower. The bottleneck wasn't the CPU, it was I/O and memory bandwidth. Multiple threads were just fighting over the same file handle and creating memory pressure. I switched back to single-threaded parsing and got better performance.
use std::collections::HashSet;
use std::io::Write;
use regex::Regex;

// The regex that extracts Wikipedia links from article text
let re = Regex::new(r"\[\[([^:\[\]|#]+(?:\/[^:\[\]|#]+)*)(?:#[^\[\]|]*)?(?:\|[^\]]+)?\]\]")?;

// Parse each page and extract all the links
while let Some(page_result) = parser.next() {
    let page = page_result.unwrap();

    // Skip non-main namespace articles (talk pages, user pages, etc.)
    if page.namespace != Namespace::Main {
        continue;
    }

    // Extract all [[Article Name]] links from the text
    let links: Vec<String> = re
        .captures_iter(&page.text)
        .filter_map(|caps| caps.get(1))
        .map(|m| m.as_str().trim().to_string())
        .collect::<HashSet<_>>() // Remove duplicates
        .into_iter()
        .collect();

    // Write to CSV for Neo4j import (embedded quotes are doubled)
    for target in &links {
        writeln!(edges_writer, "\"{}\",\"{}\",\"LINKS_TO\"",
            page_title.replace('"', "\"\""),
            target.replace('"', "\"\""))?;
    }
}
The Character Limit Problem
Here's something they don't mention in the graph theory textbooks: some Wikipedia article titles are ridiculously long. We're talking 200+ character monsters like "Cneoridium dumosum (Nuttall) Hooker F. Collected March 26, 1960, at an Elevation of about 1450 Meters on Cerro Quemazón, 15 Miles South of Bahía de Los Angeles, Baja California, México, Apparently for a Southeastward Range Extension of Some 140 Miles"
Neo4j import doesn't love these, and neither do visualization tools. So I made a judgment call: anything over 120 characters gets skipped. (Sorry, "D-beta-D-heptose 7-phosphate kinase/D-beta-D-heptose 1-phosphate adenylyltransferase"—you didn't make the cut.)
Interesting Note on Compression
The Wikipedia dump is compressed with bzip2, which is great for reducing file size. The way the Wikipedia dump is structured, it compresses down to about 25GB, and you don't have to decompress the entire file to read it. They provide an index file that allows you to seek to specific pages using offsets without loading the whole thing into memory. This is a lifesaver when dealing with such large files.
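To make the index idea a bit more concrete, here's a minimal sketch of reading one line of that index file. It assumes the multistream index layout, where each line looks like `offset:page_id:title` and the offset is the byte position of the bzip2 stream that holds the page. The name `parse_index_line` is my own for illustration; it isn't part of the parser in this post.

```rust
/// Parse one line of a multistream index file: `offset:page_id:title`.
/// Returns None if the line doesn't have that shape.
fn parse_index_line(line: &str) -> Option<(u64, u64, &str)> {
    // Strip only the line terminator so titles keep any trailing spaces.
    let line = line.trim_end_matches(|c: char| c == '\r' || c == '\n');

    // Split into at most three parts: titles can contain ':' themselves.
    let mut parts = line.splitn(3, ':');
    let offset: u64 = parts.next()?.parse().ok()?;
    let page_id: u64 = parts.next()?.parse().ok()?;
    let title = parts.next()?;

    Some((offset, page_id, title))
}
```

With the offset in hand, you'd seek to that byte in the compressed dump and decompress just that one stream. Limiting the split to three parts is the important detail, since a title like "Star Trek: The Next Generation" would otherwise get chopped at its own colon.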
However, to make this challenge a little more interesting, I decided to decompress the entire file into a 108GB XML file. This was a bad idea, but it made my attempts at testing the parser easier.
You know what's fun? Trying to parse 108GB of XML without running out of memory. The Wikipedia dump format is... let's call it "XML in a mood."
<page>
  <title>Quantum mechanics</title>
  <ns>0</ns>
  <id>25433</id>
  <revision>
    <id>1234567</id>
    <timestamp>2024-08-01T10:30:45Z</timestamp>
    <text xml:space="preserve">
      Quantum mechanics is a fundamental theory in physics...
      [[Physics]] is the natural science that studies [[matter]]...
      <!-- And 50,000 more characters of wiki markup -->
    </text>
  </revision>
</page>
The challenge isn't just the size, it's the nested structure. You can't just regex your way through this. You need a proper XML parser that can handle streaming (because loading 108GB into memory is a great way to discover your laptop's limits).
The Filtering Pipeline:
[[Article Name]] markup
// Skip titles that are too long (Neo4j import doesn't love them)
if page_title.len() > 120 {
    continue;
}

// Extract unique links from the article text
let links: Vec<String> = re
    .captures_iter(&page.text)
    .filter_map(|caps| caps.get(1))
    .map(|m| m.as_str().trim().to_string())
    .collect::<HashSet<_>>() // Remove duplicates within the same article
    .into_iter()
    .collect();
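One thing worth keeping in mind with a check like `page_title.len() > 120`: in Rust, `str::len` counts UTF-8 bytes, not characters, so titles with accented letters hit the limit sooner than their character count suggests. If the limit is meant in characters, a sketch like the one below would do it. Both `char_len` and `title_too_long` are hypothetical helpers I made up for illustration, not code from the actual parser.

```rust
/// Number of Unicode scalar values in the title (not bytes).
fn char_len(title: &str) -> usize {
    title.chars().count()
}

/// True if the title has more than `max_chars` characters.
fn title_too_long(title: &str, max_chars: usize) -> bool {
    char_len(title) > max_chars
}
```

For a title like "Bahía", `len()` reports 6 because the "í" takes two bytes, while `char_len` reports 5. Whether bytes or characters is the right unit depends on what the downstream tool actually chokes on, so treat this as an option rather than a correction.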
CSV Generation: The Format That Actually Works
After the parsing nightmare, generating CSV files was almost boring. Almost.
# nodes.csv
title:ID,:LABEL
"Quantum mechanics","Page"
"Physics","Page"
"Mathematics","Page"
# edges.csv
:START_ID,:END_ID,:TYPE
"Quantum mechanics","Physics","LINKS_TO"
"Physics","Mathematics","LINKS_TO"
"Mathematics","Philosophy","LINKS_TO"
The CSV format is unglamorous but it works. Neo4j can import millions of rows in minutes, and pretty much every tool on earth can read CSV files. Sometimes boring is beautiful.
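The one part of CSV that does bite is quoting: titles can contain commas and double quotes, so every field gets wrapped in quotes and any embedded quote is doubled. Here's a small sketch of that rule pulled out into helpers. The names `csv_field` and `edge_row` are mine for illustration; the parser shown earlier does the same escaping inline.

```rust
/// Wrap a value in double quotes, doubling any quotes inside it.
fn csv_field(value: &str) -> String {
    format!("\"{}\"", value.replace('"', "\"\""))
}

/// Build one row of the edges file: source, target, relationship type.
fn edge_row(source: &str, target: &str) -> String {
    format!(
        "{},{},{}",
        csv_field(source),
        csv_field(target),
        csv_field("LINKS_TO")
    )
}
```

So `edge_row("Quantum mechanics", "Physics")` produces the same shape as the sample rows above, and a title containing a quote or a comma stays a single field.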
Here's where things get satisfying. After hours of parsing and processing, the actual import into Neo4j is almost anticlimactic:
neo4j-admin database import full \
  --nodes=/path/to/nodes.csv \
  --relationships=/path/to/edges.csv \
  --delimiter=',' \
  --quote='"' \
  --overwrite-destination
This command ingests 26.6 million nodes and 251 million relationships in about 10 minutes. It's actually pretty satisfying to watch the progress counter tick up.
Post-Import Housekeeping:
Once everything's loaded, we calculate some basic graph metrics to make queries faster:
// Calculate in-degree and out-degree for each page
// count(r) ignores the null row that OPTIONAL MATCH returns for pages
// with no links, so those pages get 0 instead of 1
MATCH (p:Page)
OPTIONAL MATCH (p)-[r:LINKS_TO]->()
WITH p, count(r) as out_degree
SET p.out_degree = out_degree;
MATCH (p:Page)
OPTIONAL MATCH ()-[r:LINKS_TO]->(p)
WITH p, count(r) as in_degree
SET p.in_degree = in_degree;
// Create indexes for faster queries
CREATE INDEX in_degree_index IF NOT EXISTS FOR (p:Page) ON (p.in_degree);
CREATE INDEX out_degree_index IF NOT EXISTS FOR (p:Page) ON (p.out_degree);
The Numbers That Made Me Go "Whoa":
This is where it gets really interesting. Once you have Wikipedia as a graph, you can ask questions that are impossible to answer otherwise.
Remember that Wikipedia game where you try to get from any article to "Philosophy" by clicking the first link? Turns out, it's not just a game, it's a real pattern in the graph.
// Find what percentage of Wikipedia can reach Philosophy
MATCH (phil:Page {title:'Philosophy'})
MATCH (p:Page)-[:LINKS_TO*]->(phil)
WITH count(DISTINCT p) AS connectedCount
MATCH (total:Page)
RETURN
  connectedCount AS numConnected,
  count(total) AS totalPages,
  round(connectedCount * 100.0 / count(total), 2) AS pctConnectedToPhilosophy;
Result: About 66% of Wikipedia articles can reach Philosophy by following chains of links.
Note: This figure is inaccurate because it doesn't account for redirect pages and italicized links. The number should be closer to 90% when all the rules are considered. For the rules of the game, you can read the (you guessed it) Wikipedia page.
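Part of the gap is that the query asks a different question than the game does: the query follows every link on every page, while the game only ever follows the first one. Here's a minimal sketch of the game itself, assuming you already have a map from each page to its first link. The function `hops_via_first_link` and that map are hypothetical; the graph in this post stores all links as a set and doesn't record which one came first.

```rust
use std::collections::{HashMap, HashSet};

/// Follow first links from `start` until `target` is reached.
/// Returns the number of clicks, or None on a loop or a dead end.
fn hops_via_first_link(
    first_link: &HashMap<&str, &str>,
    start: &str,
    target: &str,
) -> Option<usize> {
    let mut seen = HashSet::new();
    let mut current = start;
    let mut hops = 0;

    while current != target {
        // Visiting a page twice means we're stuck in a loop.
        if !seen.insert(current) {
            return None;
        }
        // A page with no recorded first link is a dead end.
        current = *first_link.get(current)?;
        hops += 1;
    }
    Some(hops)
}
```

Each page has exactly one way out here, so a walk either reaches the target, falls into a loop, or runs out of links. That's why the game's percentage and the reachability percentage don't have to agree.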
Philosophy's Position in the Graph:
To visualize Philosophy's neighborhood we ran:
// Philosophy's immediate neighborhood - articles it links to and from
MATCH (phil:Page {title:'Philosophy'})
OPTIONAL MATCH (phil)-[:LINKS_TO]->(outgoing:Page)
OPTIONAL MATCH (incoming:Page)-[:LINKS_TO]->(phil)
WITH phil, collect(DISTINCT outgoing)[0..1000] as out_nodes,
collect(DISTINCT incoming)[0..1000] as in_nodes
UNWIND (out_nodes + in_nodes) as connected
MATCH (phil)-[r:LINKS_TO]-(connected)
RETURN phil, r, connected
LIMIT 5000;

Some articles are just more "central" to human knowledge:
// Find the most referenced articles (highest in-degree)
MATCH ()-[:LINKS_TO]->(p:Page)
RETURN p.title, count(*) as in_degree
ORDER BY in_degree DESC
LIMIT 10;
Top 10 Most Referenced Articles:
Based on my analysis, geographic locations and major historical events dominate the most-referenced articles:
On the flip side, some articles are "rabbit holes"—they link to tons of other articles but aren't linked to much themselves:
// Find articles with high out-degree but low in-degree (rabbit holes)
MATCH (rabbit_hole:Page)-[:LINKS_TO]->()
WITH rabbit_hole, count(*) as out_degree
OPTIONAL MATCH (source)-[:LINKS_TO]->(rabbit_hole)
// count(source) ignores the null row for pages with no incoming links
WITH rabbit_hole, out_degree, count(source) as in_degree
WHERE out_degree > 50 AND in_degree < 5
RETURN rabbit_hole.title, out_degree, in_degree
ORDER BY out_degree DESC
LIMIT 15;
Interesting Discovery: The opposite pattern shows up too. One example we found was "List_of_La_CQ_episodes", with 192 incoming links but only 1 outgoing link. These types of list articles act as knowledge endpoints: they collect information but don't distribute it much further.
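If you want to sanity-check the degree logic without a database, the same idea works on a plain edge list. This is a rough sketch under my own assumptions: edges are `(source, target)` pairs, and `degrees` and `rabbit_holes` are hypothetical helper names, with the thresholds passed in rather than hard-coded to 50 and 5.

```rust
use std::collections::HashMap;

/// Map each page to (out_degree, in_degree) from a list of (source, target) edges.
fn degrees<'a>(edges: &[(&'a str, &'a str)]) -> HashMap<&'a str, (usize, usize)> {
    let mut deg: HashMap<&str, (usize, usize)> = HashMap::new();
    for &(src, dst) in edges {
        deg.entry(src).or_default().0 += 1; // one more outgoing link
        deg.entry(dst).or_default().1 += 1; // one more incoming link
    }
    deg
}

/// Pages with more than `min_out` outgoing and fewer than `max_in` incoming links.
fn rabbit_holes<'a>(
    edges: &[(&'a str, &'a str)],
    min_out: usize,
    max_in: usize,
) -> Vec<&'a str> {
    let mut pages: Vec<&str> = degrees(edges)
        .into_iter()
        .filter(|&(_, (out_d, in_d))| out_d > min_out && in_d < max_in)
        .map(|(page, _)| page)
        .collect();
    pages.sort(); // HashMap order is arbitrary, so sort for stable output
    pages
}
```

Pages that only ever appear as a target still get an entry with an out-degree of 0, which is the case that's easy to get wrong when counting.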
Using Neo4j's community detection algorithms, we found that Wikipedia naturally clusters into knowledge domains. The analysis revealed both large connected components and smaller, specialized clusters.
Community Analysis Results:
First, let's find a smaller, interesting community:
// Find a manageable-sized community to visualize
CALL gds.wcc.stream('neo4j')
YIELD componentId, nodeId
WITH componentId, collect(nodeId) AS members, count(*) AS size
WHERE size > 100 AND size < 2000
ORDER BY size ASC
LIMIT 1
UNWIND members AS memberId
MATCH (p:Page) WHERE id(p) = memberId
OPTIONAL MATCH (p)-[r:LINKS_TO]-(q:Page)
WHERE id(q) IN members
RETURN p, r, q;
Community Detection Insights: The Louvain algorithm revealed distinct knowledge clusters, with some communities being highly specialized (like episodes of specific TV shows or regional topics) while others represent broader academic or cultural domains. Here is one of the communities we visualized:

Want to see the famous "six degrees of separation" in action? Let's find the shortest path between completely unrelated topics:
// Find shortest path from Mathematics to Philosophy
MATCH (start:Page {title:'Mathematics'}), (end:Page {title:'Philosophy'})
MATCH path = shortestPath((start)-[:LINKS_TO*1..10]->(end))
RETURN path;
Actual Path Results: Mathematics → Philosophy connections exist through multiple paths, typically involving intermediate topics like Logic, Science, or abstract concepts. The shortest path we found was 3 hops, demonstrating how closely related these fundamental concepts are in human knowledge.
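For intuition about what a shortest-path search is doing on an unweighted link graph, here's a minimal breadth-first search sketch. It is not how Neo4j implements `shortestPath` internally, just the textbook idea: explore pages one hop at a time and remember how you got to each one. The function name `bfs_shortest_path` and the adjacency-map input are assumptions for illustration.

```rust
use std::collections::{HashMap, VecDeque};

/// Breadth-first search over directed links; returns the pages on one
/// shortest path from `start` to `end`, or None if `end` is unreachable.
fn bfs_shortest_path<'a>(
    links: &HashMap<&'a str, Vec<&'a str>>,
    start: &'a str,
    end: &'a str,
) -> Option<Vec<&'a str>> {
    // parent[page] = the page we arrived from (None for the start page)
    let mut parent: HashMap<&'a str, Option<&'a str>> = HashMap::new();
    parent.insert(start, None);
    let mut queue = VecDeque::from([start]);

    while let Some(page) = queue.pop_front() {
        if page == end {
            // Walk the parent pointers back to the start, then reverse.
            let mut path = vec![page];
            let mut cur = page;
            while let Some(Some(prev)) = parent.get(cur).copied() {
                path.push(prev);
                cur = prev;
            }
            path.reverse();
            return Some(path);
        }
        for &next in links.get(page).into_iter().flatten() {
            if !parent.contains_key(next) {
                parent.insert(next, Some(page));
                queue.push_back(next);
            }
        }
    }
    None
}
```

Because BFS visits pages in order of distance, the first time it reaches the target is guaranteed to be along a path with the fewest hops. The catch at Wikipedia scale is the frontier: a few hops out you're already touching a large share of the graph, which is why the Cypher query caps the path length.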

We also tried to find paths between several interesting topic pairs:
// Show paths between several interesting topic pairs
MATCH (start:Page), (end:Page)
WHERE start.title IN ['Mathematics', 'Quantum mechanics', 'Pizza', 'Basketball']
AND end.title IN ['Philosophy', 'Art', 'History', 'Biology']
AND start <> end
MATCH path = shortestPath((start)-[:LINKS_TO*1..6]->(end))
WITH start, end, path, length(path) as pathLength
ORDER BY pathLength
LIMIT 20
UNWIND nodes(path) as n
UNWIND relationships(path) as r
RETURN n, r;

Here's the part where I learned that wanting to visualize 8.1 million nodes and 251 million edges doesn't make it possible.
Attempt #1: Gephi. "Let me just load all 6 million nodes..." → Gephi has run out of memory
Attempt #2: Cytoscape. "Maybe I'll filter it down to 1 million nodes..." → Cytoscape has stopped responding
The Reality Check: You can't visualize the entire Wikipedia graph. The human eye can't process that much information, and computers struggle to render it. Instead, you need to be strategic about what you visualize.
What Actually Works:
Even then, anything over 1,000 nodes starts looking like digital spaghetti.
Running graph algorithms on 251 million edges teaches you a thing or two about performance.
Query Optimization Lessons:
// BAD: This query will run until the heat death of the universe
MATCH (a:Page)-[:LINKS_TO*]->(b:Page)
WHERE a.title = 'Mathematics'
RETURN count(*);
// GOOD: Limit the path length and use indexes
MATCH (a:Page {title:'Mathematics'})-[:LINKS_TO*1..3]->(b:Page)
RETURN count(DISTINCT b);
Index Everything That Matters:
Batch Processing is Your Friend: Large operations need to be chunked. Calculating in-degree for 8.1 million nodes? Do it in batches of 10,000.
Memory Management: Neo4j loves RAM. Like, really loves RAM. Our final setup uses 32GB and still occasionally asks for more during complex graph algorithms.
"Cool graph, but what's it actually useful for?" This was a real question my mom asked when I told her what I was working on. Fair question. Here are some practical applications:
1. Content Recommendation: "People who read about Quantum Mechanics also read about..." becomes a graph traversal problem.
2. Knowledge Gap Detection: Articles with high out-degree but low in-degree might be undervalued topics that need more attention.
3. Curriculum Design: The shortest paths between topics reveal natural learning progressions.
4. Quality Assessment: Articles with very few connections might be stubs or need better linking.
5. Research Discovery: Find unexpected connections between fields by exploring graph neighborhoods.
The graph structure reveals patterns that aren't obvious when you're just browsing article by article.
Final Numbers:
Most Useful Queries:
// Find articles similar to a given topic (by shared connections)
MATCH (topic:Page {title:'Machine learning'})-[:LINKS_TO]->(shared:Page)<-[:LINKS_TO]-(similar:Page)
WHERE similar <> topic
RETURN similar.title, count(shared) as shared_connections
ORDER BY shared_connections DESC
LIMIT 10;
// Find the most "central" articles using betweenness centrality
CALL gds.betweenness.stream('wikiGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).title AS page, score
ORDER BY score DESC
LIMIT 10;
// Detect connected components (knowledge clusters)
CALL gds.wcc.stream('wikiGraph')
YIELD componentId, nodeId
WITH componentId, count(nodeId) AS size
RETURN size, componentId
ORDER BY size DESC
LIMIT 5;
Technical Lessons:
Domain Lessons:
Project Management Lessons:
Just like the previous weekend project, realistically nothing. While I do have ideas, the weekend is officially over, so here are the ideas if anyone wants to chip away at them:
Immediate Improvements:
Research Questions:
Practical Applications:
The Big Insight: Wikipedia isn't just a collection of articles, it's a map of human knowledge. And that map has structure, patterns, and surprising shortcuts that reveal how we think and learn.
Want to Try This Yourself? The Wikipedia dumps are free, the tools are open source, and the patterns are waiting to be discovered. Just be prepared for your computer to work harder than it ever has before (I considered cooking an egg on my laptop at one point). You can find the code for this project here (it's only the parser): GitHub.
Got your own Wikipedia rabbit hole stories? Or questions about graph databases at scale? I'd love to hear them. Building tools to explore human knowledge is endlessly fascinating, even when your visualization software keeps crashing.