
10,000 GitHub Repos Are Distributing Trojans: I Dug Into the Attack Chain and It's Worse Than You Think


I was scrolling Hacker News yesterday when a post stopped me cold. Someone discovered 10,000 repositories on GitHub distributing Trojan malware. 625 points, 141 comments, and the number keeps climbing.

My first thought was "clickbait." It wasn't. The author wrote a script that scanned GitHub's entire event stream and uncovered a coordinated network of malicious repositories. These aren't forks. Each one has different contributors, different names, but they all share the same attack pattern.

I spent an afternoon running the author's tool and dissecting every step of the attack chain. The deeper I went, the worse it got.

How the Attack Works

Let me walk through the whole attack chain, because understanding the pattern is the only way to defend against it.

What the attackers do isn't complicated. It's clever, though.

Step 1: Clone legitimate repos. They find lesser-known open source projects (not popular ones, which attract too much scrutiny) and copy the entire commit history into a new repository. When you search GitHub, these fake repos look identical to the real ones: same name, same description, same commit history. The original author even shows up as a "contributor."

One detail worth noting: the attackers deliberately pick obscure projects. They don't clone repos with thousands of stars. Why? Because the maintainers of small projects might not check GitHub for months, so they won't notice the clone. And since obscure projects get less traffic, users are less likely to spot the anomaly by comparing repos.

Step 2: Inject a malicious link into the README. The attackers add a single line to the README file — a download link pointing to a zip archive. Inside the zip are four files:

  • Application.cmd or Launcher.cmd (a launcher script)
  • loader.exe or luajit.exe (the actual Trojan, with randomized names)
  • A .cso or .txt file (likely an encrypted payload)
  • lua51.dll (a dynamic library the Trojan depends on)

The file structure looks like a normal Lua script runtime, right? luajit.exe is the LuaJIT interpreter, lua51.dll is the Lua 5.1 runtime. Most people wouldn't blink at this — it looks like a legitimate Lua project packaging its executables.

But luajit.exe isn't real LuaJIT. It's been replaced with a Trojan. Double-click it and you're compromised.
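To make the four-file pattern concrete, here is a minimal Python sketch that lists a zip's members without extracting them and checks for that shape. The function names and the demo helper are mine, not something from the author's tool, and a match only means "look closer": a real Lua project could ship the same layout.

```python
import io
import zipfile
from pathlib import PurePosixPath


def looks_like_lua_loader_bundle(names):
    """Return True if member names match the four-file pattern described above."""
    lowered = [PurePosixPath(name).name.lower() for name in names]
    has_launcher = any(n.endswith(".cmd") for n in lowered)
    has_exe = any(n.endswith(".exe") for n in lowered)
    has_payload = any(n.endswith((".cso", ".txt")) for n in lowered)
    has_lua_dll = "lua51.dll" in lowered
    return has_launcher and has_exe and has_payload and has_lua_dll


def inspect_zip_bytes(data):
    """Read only the archive's file listing; nothing is extracted or executed."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return looks_like_lua_loader_bundle(archive.namelist())


def build_demo_zip(names):
    """Hypothetical helper: build an in-memory zip with placeholder contents."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, b"placeholder")
    return buffer.getvalue()


if __name__ == "__main__":
    suspicious = build_demo_zip(["Launcher.cmd", "luajit.exe", "data.cso", "lua51.dll"])
    print(inspect_zip_bytes(suspicious))
```

Checking the listing instead of unpacking the archive keeps the executable off your disk while you decide whether it deserves a proper sandboxed look.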

Step 3: Force-push the same commit every few hours. This is the most interesting part. The attackers delete the previous commit and push an identical one every few hours. Why?

The author speculates several reasons:

  1. Evading GitHub's security scanning. GitHub might flag repos that go dormant and then suddenly update. If a repo appears to be "actively maintained," it might skip that detection layer.
  2. Making the repo look legitimate. A repo whose latest commit is a few hours old looks more trustworthy than one that hasn't been touched in six months.
  3. Keeping search engine rankings fresh. Search engines prioritize actively updated pages. Repeated commits keep the fake repo high in search results.

Step 4: Wait for search engines to index. Because GitHub has massive SEO authority, these fake repos quickly appear in Google and Bing results. When a user searches for a tool's name, the fake repo might show up first.

Here's a critical detail about VirusTotal: if you submit the zip file's URL for scanning, it reports 0 detections. But if you download the zip file itself and submit that, it detects the Trojan.

Why the discrepancy? URL scanning and file scanning are completely different things. URL scanning checks domain reputation and known-malicious URL databases — it doesn't actually download and analyze the file. The attackers host their malicious files on services whose domains haven't been flagged, so URL scanning passes them right through.
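A file verdict is tied to the bytes, not to where they were hosted. The sketch below computes a SHA-256 digest of a downloaded file, which is the kind of identifier you can use to look a sample up by hash. The function names are my own, and the demo uses placeholder bytes rather than a real sample.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large archives never sit fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path):
    """Convenience wrapper for a file on disk (do this inside an isolated environment)."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle)


# Demo with placeholder bytes standing in for a downloaded archive.
DEMO_DIGEST = sha256_of_stream(io.BytesIO(b"abc"))

if __name__ == "__main__":
    print(DEMO_DIGEST)
```

The same archive re-hosted at a fresh URL produces the same digest, which is why the file scan catches what the URL scan misses.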

Why GitHub Hasn't Caught These Repos

This is the most unsettling part.

The author reported two malicious repos to GitHub Support two months ago. Two weeks passed with no response. Another month went by before GitHub emailed to say they'd been removed.

But that was just two repos. The author's script later found 10,000.

GitHub has 500 million repositories. You'd think they'd have automated detection. But these malicious repos survive for several reasons:

They're not forks. GitHub's fork detection doesn't work here. The attackers create new repos and push history — technically these are completely new repositories with no fork relationship to the originals.

Every repo looks different. Different names, different contributors, different project types. Traditional similarity detection can't find them. You can't flag a repo just because it "looks like" another one — the open source world genuinely has many similar projects.

The attackers actively evade detection. Force-pushing commits, naming every commit "Update README.md," picking obscure projects to clone — these behaviors are specifically designed to bypass automated scanning.

The scale is overwhelming. 10,000 repos. Even if GitHub processed 100 reports a day, that's over three months to clear them all. Meanwhile, attackers create new ones daily.

GitHub's reporting process is slow. The author's experience says it all: nearly two months from report to resolution. For a campaign of this scale, two months is an eternity.

I Ran the Author's Tool

The author open-sourced a Python script called git-malware-finder for scanning GitHub for malicious repos. I ran it. Here's what I found.

How It Works

The core logic is elegant:

  1. Download recent GitHub event data from GH Archive (tens of millions of push events per day)
  2. Filter for repos that update every few hours
  3. For each candidate, hit the GitHub API to check if the latest commit only modified the README
  4. Check if the README contains a zip download link
  5. Exclude forks and bot accounts

The approach works because malicious repos have a unique behavioral signature — "repeatedly force-pushing commits" — that normal repos don't exhibit. By detecting behavior rather than content, it sidesteps the "every repo looks different" problem.
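The first two steps can be sketched roughly like this. It is my own illustration, not the author's code: it assumes GH Archive style input, one JSON event per line with a `type` field and a `repo.name` field, covering a single day, and the thresholds are placeholders.

```python
import json
from collections import Counter


def count_pushes_per_repo(event_lines):
    """Count PushEvent records per repository from JSON-lines event data."""
    counts = Counter()
    for line in event_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") == "PushEvent":
            counts[event["repo"]["name"]] += 1
    return counts


def frequent_pushers(counts, min_pushes, max_pushes):
    """Keep repos whose push count for the window falls inside the given range."""
    return sorted(name for name, n in counts.items() if min_pushes <= n <= max_pushes)


def demo_lines():
    """Hypothetical sample: three repos with very different push rates."""
    events = (
        [("PushEvent", "alice/tool")] * 3
        + [("PushEvent", "bob/lib"), ("WatchEvent", "bob/lib")]
        + [("PushEvent", "carol/app")] * 30
    )
    return [json.dumps({"type": kind, "repo": {"name": repo}}) for kind, repo in events]


if __name__ == "__main__":
    counts = count_pushes_per_repo(demo_lines())
    print(frequent_pushers(counts, min_pushes=2, max_pushes=24))
```

For real data you would feed it lines from the downloaded hourly archives (for example via `gzip.open(path, "rt")`) rather than the demo list.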

The Filter Tuning Process

The author's article describes several rounds of filter refinement:

  • Round 1: Only repos updated "every few hours" → found 14 (too few)
  • The author realized "every few hours" was too strict — some malicious repos update daily, not hourly
  • Round 2: Broadened to "1-24 updates per day" → found 40,000
  • Added other filters (README has zip link, commit only modifies README, not a fork, not a bot) → final count: 10,000

This tuning process itself reveals something important: the attackers' behavior isn't perfectly consistent. Some repos update every few hours, some daily. That inconsistency makes detection harder.
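Here is a rough sketch of how the stacked filters narrow a broad candidate list. The `Candidate` fields and the regex are my assumptions about what each check needs, not the author's implementation, and the zip-link pattern is deliberately loose, so it will produce false positives.

```python
import re
from dataclasses import dataclass

# Loose match for a link ending in .zip; an assumption, not the tool's actual rule.
ZIP_LINK = re.compile(r"https?://\S+?\.zip\b", re.IGNORECASE)


@dataclass
class Candidate:
    name: str
    pushes_per_day: int
    is_fork: bool
    actor_is_bot: bool
    last_commit_files: list
    readme_text: str


def readme_only(candidate):
    """True when the latest commit touched the README and nothing else."""
    return [f.lower() for f in candidate.last_commit_files] == ["readme.md"]


def has_zip_link(candidate):
    """True when the README contains something that looks like a zip download link."""
    return ZIP_LINK.search(candidate.readme_text) is not None


def is_suspect(candidate):
    """Apply the broadened frequency range, then the narrowing filters."""
    return (
        1 <= candidate.pushes_per_day <= 24
        and not candidate.is_fork
        and not candidate.actor_is_bot
        and readme_only(candidate)
        and has_zip_link(candidate)
    )


if __name__ == "__main__":
    fake = Candidate("x/fake", 6, False, False, ["README.md"],
                     "Download: https://files.example/tool.zip")
    print(is_suspect(fake))
```

Ordering matters in practice: the frequency check runs on bulk event data, while the README and commit checks each cost API requests, so the cheap filter goes first.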

What the Results Look Like

I picked a few repos from the output at random. All malicious:

  • The last line of the README is a zip link pointing to some file hosting service
  • Commit history is nothing but "Update README.md" — no actual code changes
  • Repo creation date is recent, but the commit history looks like a years-old project (because it was copied from a legitimate repo)
  • Contributor profiles are nearly empty — no other projects, no activity

Running It Yourself

If you want to check for yourself:

```bash
git clone https://github.com/orchidfiles/git-malware-finder.git
cd git-malware-finder
pip install -r requirements.txt
python3 finder.py
```

A few things to know:

  • The script needs a GitHub Personal Access Token for API access
  • There's a rate limit (5,000 requests per hour), so a full scan takes several hours
  • GH Archive data is large — downloading several days' worth requires significant disk space
  • The script outputs a list of suspected repos — you'll need to manually verify
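To see why the rate limit turns into hours, here is a back-of-the-envelope estimate. The two-requests-per-candidate figure is my assumption (one call for the latest commit, one for the README), not a number from the tool.

```python
def estimated_scan_hours(candidates, requests_per_candidate, hourly_limit=5000):
    """Lower bound on scan time when the API rate limit is the only bottleneck."""
    total_requests = candidates * requests_per_candidate
    return total_requests / hourly_limit


if __name__ == "__main__":
    # 40,000 candidates at an assumed 2 requests each against 5,000 requests/hour.
    print(estimated_scan_hours(40_000, 2))  # 16.0
```

Under that assumption, checking the 40,000 broad candidates takes at least 16 hours, which is a good argument for filtering as much as possible on the bulk event data first.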

Are These Repos Still Up?

As of June 19, 2026, most of the 10,000 repos the author published have been deleted by GitHub. But I have no way to know if the attackers have created replacements.

This is the nature of the game — it's whack-a-mole. Delete one batch, they create another. As long as the attack cost is low, they'll keep going.

What This Means for You

You might think: "I never download zip files from random repos. This doesn't affect me."

It affects you more than you realize.

Scenario 1: Search engine poisoning. You Google an open source tool's name. The first result is a fake repo. You click through, the star count looks reasonable (probably inflated), the README is convincing. You clone it, install dependencies, run it. Your machine now has a backdoor.

Scenario 2: CI/CD pipeline compromise. Your project depends on a lesser-known npm or Python package. That package's GitHub repo got cloned and tampered with. The attacker pushed a new version with malicious code. Dependabot opens an update PR, it gets merged without a close look, and the malicious code is now in your build pipeline.

Scenario 3: AI tool poisoning. More and more AI coding tools (Cursor, Claude Code, Copilot) search GitHub for code references. If the AI tool indexes a fake repo, it might introduce malicious code patterns into your project.

Scenario 4: Colleague compromise. Someone on your team clones a tool from a fake repo and runs it on the company network. The Trojan moves laterally through the internal network. Your entire dev environment is compromised.

These aren't hypothetical. Every one of these scenarios has real-world precedent.

Common GitHub Supply Chain Attack Patterns

Beyond the "clone repo + inject Trojan" pattern, there are several other supply chain attacks you should know about:

1. Typosquatting

Attackers register repo names that look similar to popular projects. Search for requests (the Python HTTP library), and you might find request (missing the 's'). This is more common on npm and PyPI, but it happens on GitHub too.
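A quick way to catch near-miss names is to compare a candidate against the names you actually expect. This is a small sketch using Python's standard `difflib`; the 0.85 threshold and the list of known names are arbitrary choices of mine.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between two names, ignoring case (1.0 means identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def possible_typosquats(name, known_names, threshold=0.85):
    """Return known names that the candidate closely resembles without matching exactly."""
    hits = []
    for known in known_names:
        if name.lower() == known.lower():
            continue  # exact match: this is the real name, not a lookalike
        if similarity(name, known) >= threshold:
            hits.append(known)
    return hits


if __name__ == "__main__":
    print(possible_typosquats("request", ["requests", "flask", "numpy"]))
```

A hit is only a prompt to double-check the spelling and the owner. Plenty of legitimate projects have similar names.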

2. Star Jacking

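The term usually describes a package on a registry such as npm or PyPI that lists a popular, unrelated GitHub repo as its source, so the registry page displays borrowed stars. A slower variant of the same trust abuse: an attacker creates a legitimate-looking open source project, builds up stars and trust over time, then quietly injects malicious code in a later version. Because the project already has "credibility," users don't question it.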

3. Dependency Confusion

Attackers publish a package on a public registry with the same name as your private package. If your package manager is misconfigured, it might pull the malicious public package instead of your private one.

4. Commit Squatting

Attackers submit malicious code through PRs on popular projects. If a maintainer merges without careful review, the malicious code enters the main branch. This requires social engineering skill, but it's been done.

5. GitHub Actions Supply Chain Attacks

GitHub Actions is the backbone of CI/CD for many projects. If an Action's repo is compromised, every project using that Action is affected. There have been several such attacks in the past couple of years, including the March 2025 compromise of tj-actions/changed-files.

This Isn't the First Time, and It Won't Be the Last

GitHub supply chain attacks aren't new. A few notable cases from recent years:

The event-stream incident (2018): The maintainer of a popular npm package handed publish rights to a new contributor. That contributor injected code designed to steal Bitcoin wallets. The package had millions of weekly downloads.

The ua-parser-js incident (2021): Another npm package with millions of weekly downloads was taken over. The attacker added cryptocurrency miners and password stealers to new versions.

The XZ Utils backdoor (2024): The most chilling one. Someone named Jia Tan spent two years slowly infiltrating the xz project's maintainer team, eventually planting a backdoor in the compression library that nearly made it into every Linux distribution. This one made global headlines.

Every time one of these happens, the community discusses "what do we do about supply chain security" for a week, then forgets. But the attackers don't forget. They keep refining their methods.

This 10,000-repo campaign is different from previous supply chain attacks in one key way: previous attacks mostly involved injecting malicious code into existing packages or projects. This one creates entirely new fake repos from scratch. The entry point changed, but the underlying playbook is the same — exploit developers' trust in the open source ecosystem.

How to Protect Yourself: A Detailed Guide

Basic Defense: The Pre-Clone Checklist

Before cloning any repo, spend 30 seconds on these checks:

Check stars and forks. A genuinely active project won't have 0 stars and 0 forks. Stars can be inflated, but 0 stars almost certainly means the project is inactive or suspicious.

Check recent commit content. Click into the commit history and look at what recent commits actually change. If the last several commits are all "Update README.md" with no substantive code changes, that's a red flag.

Check the contributor list. Click through to contributors' profiles. Malicious repo contributors typically have no other projects, recently created profiles, and no profile pictures.

Check Issues and Discussions. Real projects have users filing issues, asking questions, reporting bugs. Fake repos either have empty Issues or self-answered questions from the attackers.

Check Releases and Tags. Normal projects have version release histories. If a project "looks active" but has zero releases, be suspicious.

Check repo creation date. If a repo claims to be a "mature project" but was only created a few months ago, and the commit history looks like years of work, the history was probably copied from elsewhere.
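The checklist can be turned into a rough scoring function. In this sketch the dictionary keys are hypothetical (you would fill them from the GitHub API or the gh CLI), and the one-year gap is an arbitrary threshold I picked. Repos migrated from another host will also show history older than their creation date, so treat the output as hints, not a verdict.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # the UTC format GitHub's API returns


def red_flags(repo):
    """Return human-readable warnings for a repo metadata dict (keys are assumptions)."""
    flags = []
    if repo["stars"] == 0 and repo["forks"] == 0:
        flags.append("no stars or forks")

    messages = repo["recent_commit_messages"]
    if messages and all(m.strip() == "Update README.md" for m in messages):
        flags.append("recent commits are all 'Update README.md'")

    if repo["release_count"] == 0:
        flags.append("no releases")

    created = datetime.strptime(repo["created_at"], TIMESTAMP_FORMAT)
    first_commit = datetime.strptime(repo["first_commit_at"], TIMESTAMP_FORMAT)
    if (created - first_commit).days > 365:
        flags.append("commit history predates repo creation by over a year")
    return flags


if __name__ == "__main__":
    example = {
        "stars": 0,
        "forks": 0,
        "recent_commit_messages": ["Update README.md"] * 5,
        "release_count": 0,
        "created_at": "2026-03-01T00:00:00Z",
        "first_commit_at": "2019-05-01T00:00:00Z",
    }
    for flag in red_flags(example):
        print(flag)
```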

Advanced Defense: Using the gh CLI

GitHub's gh CLI provides some useful inspection commands:

```bash
# View basic repo info
gh repo view owner/repo

# View last 5 commits
gh api repos/owner/repo/commits --jq '.[0:5][] | "\(.commit.message) - \(.commit.author.date)"'

# View contributors
gh api repos/owner/repo/contributors --jq '.[].login'

# View language breakdown (normal projects usually have multiple languages)
gh repo view owner/repo --json languages

# View recent releases
gh api repos/owner/repo/releases --jq '.[0:3][] | "\(.tag_name) - \(.published_at)"'
```

For Repo Maintainers

If you maintain a repository, do this:

Enable GitHub's security features:

  • Dependabot: Automatically detects dependency vulnerabilities and opens PRs to fix them
  • Code Scanning: Uses CodeQL to find security issues in your code
  • Secret Scanning: Detects leaked secrets and tokens in your code
  • Branch Protection: Prevents direct pushes to main, requires PR review

Configure .github/dependabot.yml:

```yaml
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "weekly"
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "weekly"
```

Add a Security Policy: Create a .github/SECURITY.md file telling users how to report security issues. It won't prevent attacks directly, but it makes it easier for security researchers to contact you.

Development Workflow Defenses

Never run scripts from untrusted sources blindly. I see people clone a repo and immediately run python3 script.py or bash setup.sh all the time. That's the most dangerous habit in software development. At least read the script first.

Use lockfiles. If your project uses npm, pip, cargo, or any package manager, always use lockfiles (package-lock.json, requirements.txt with hashes, Cargo.lock). Even if someone tampers with a published package version, the hash check against the lockfile will catch it.

```bash
# npm: use lockfile
npm ci  # instead of npm install

# pip: use hash verification
pip install -r requirements.txt --require-hashes

# pip: generate requirements with hashes
pip-compile --generate-hashes requirements.in
```
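Conceptually, hash pinning boils down to comparing the digest of what you downloaded against the digest you recorded earlier. This is a simplified sketch of that idea, not pip's or npm's actual code. The `sha256:<hex>` pin format mirrors what pip writes in `--hash=` entries, and the function name is mine.

```python
import hashlib
import hmac


def matches_pinned_hash(data, pinned):
    """Check downloaded bytes against a pin of the form 'sha256:<hex digest>'."""
    algorithm, _, expected = pinned.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algorithm!r}")
    actual = hashlib.sha256(data).hexdigest()
    # Constant-time comparison; a habit worth keeping when comparing digests.
    return hmac.compare_digest(actual, expected.lower())


if __name__ == "__main__":
    original = b"package contents"
    pin = "sha256:" + hashlib.sha256(original).hexdigest()
    print(matches_pinned_hash(original, pin))
    print(matches_pinned_hash(b"tampered contents", pin))
```

The pin only protects you if it was recorded from a copy you trusted. It detects later changes, not a package that was malicious from the start.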

Audit dependencies regularly:

```bash
# npm audit
npm audit

# pip audit (requires pip-audit)
pip-audit

# cargo audit (requires cargo-audit)
cargo audit
```

Use ghq or similar tools to manage local repos:

```bash
# ghq keeps all repos under one root directory (~/ghq by default in current versions)
ghq get https://github.com/official-org/project
```

This way you know exactly what repos you have locally, and cleanup is straightforward.

Impact on AI Coding Tools

This topic has direct implications for AI-assisted development, and they're significant.

More and more AI coding tools pull code from GitHub and reference open source projects. If an AI tool references a malicious repo's code patterns, it might introduce malicious code into your project.

The more immediate concern: many AI Agents have gh CLI access or direct GitHub API permissions. If you ask an AI Agent to find and install an open source tool, it might find a fake repo.

This isn't far-fetched. It's entirely plausible.

Here's a concrete example: you tell Claude Code "find me a Python HTTP request library and clone it so I can see how it works." Claude Code searches GitHub, finds a repo with a name similar to requests, clones it, and starts referencing it in your project. If that repo is fake, you've just invited the attacker in.

Worse, AI tools don't check repos the way humans do. They don't look at star counts, commit history, or contributor profiles. They look at search ranking and repo description. If the fake repo has good SEO, the AI tool will recommend it first.

I've encountered something similar myself. I once asked an AI Agent to find a Go logging library, and it recommended a project I'd never heard of. I checked — single-digit stars, recent commits all "Update README.md." Not necessarily malicious, but it showed me that AI tools aren't careful enough when choosing dependencies.

If you use AI coding tools, configure them to use trusted sources. In Claude Code's CLAUDE.md, for example:

```
When installing third-party tools or libraries, only use these trusted sources:
- Official GitHub organizations (microsoft/, facebook/, google/, psf/)
- Official npm registry
- Official PyPI registry
- GitHub repos explicitly linked from official websites
Never clone unknown repos from search results.
```
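Instructions in a prompt file are advisory, so it can help to back them with a hard check in whatever script or hook performs the clone. Here is a minimal sketch of an owner allowlist. The function names are hypothetical and the owner set simply mirrors the example above.

```python
from urllib.parse import urlparse

# Mirrors the example organizations above; adjust to your own trusted list.
TRUSTED_OWNERS = {"microsoft", "facebook", "google", "psf"}


def repo_owner(url):
    """Extract the owner from an https://github.com/<owner>/<repo> URL, else None."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.hostname != "github.com":
        return None
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 2:
        return None
    return parts[0].lower()


def is_trusted_repo(url, trusted=TRUSTED_OWNERS):
    """True only for GitHub repos whose owner is on the allowlist."""
    owner = repo_owner(url)
    return owner is not None and owner in trusted


if __name__ == "__main__":
    print(is_trusted_repo("https://github.com/psf/requests"))
    print(is_trusted_repo("https://github.com/psf-mirror/requests"))
```

Matching on the exact hostname and the exact owner segment matters: lookalike owners and lookalike domains are exactly what this kind of campaign relies on.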

The Bigger Problem: GitHub's Trust Mechanisms Are Failing

This incident raises a larger question: can we actually trust anything on GitHub?

When we judge whether an open source project is trustworthy, we typically look at these signals:

  • Star count — Broken. There are dedicated "star-selling" services. A few hundred bucks buys thousands of stars.
  • Fork count — Also gameable. And forks don't mean anyone actually uses the project.
  • Commit history — This incident proves commit histories can be copied wholesale.
  • Contributors — Can be faked, or use stolen commit identities.
  • Search ranking — GitHub SEO is a thing. Attackers can push fake repos to the top of results.

What's left? Issues and Discussions activity? Attackers can create fake conversations. Download counts? Those can be inflated too.

The signals on GitHub are all depreciating. Star count used to be a quality proxy. Now it might just be a marketing metric.

GitHub alone can't solve this. The open source community needs better trust mechanisms. Some possible directions:

Reproducible Builds: Every step from source code to binary can be verified by third parties. If a release's build process is reproducible, you can confirm the binary was actually compiled from the stated source.

Signed Releases: Maintainers sign releases with GPG keys. You verify the signature came from the actual maintainer. Large projects like the Linux Kernel and Git itself already do this, but small-to-medium projects rarely bother.

Transparent Dependency Chains: Frameworks like SLSA (Supply-chain Levels for Software Artifacts) try to establish a complete trust chain from source code to final artifact. Still in early adoption.

Decentralized Code Hosting: If code doesn't live exclusively on GitHub, attackers need to compromise multiple platforms simultaneously. But GitHub's network effect is so strong that change is slow.

Each of these has limitations, but the direction is right. Until they mature, we're on our own.

Quick FAQ

Q: I've already cloned repos I don't recognize. What do I do?

Check your local directories for unfamiliar projects. If you find any, delete them. More importantly: did you run any code from those repos? If so, check your system for unusual processes.

Q: Can Dependabot protect against this?

Dependabot mainly flags known vulnerabilities and outdated versions in dependencies you already declare. It's useless against fake repos because you're cloning an entirely new repo, not updating a known package. Dependabot is better suited to known CVEs.

Q: Does Docker isolate the risk?

Somewhat, but it's not a silver bullet. Running untrusted code in a Docker container limits the blast radius — if compromised, only the container is affected. But only if your Docker config is secure: no --privileged, no sensitive directory mounts, no host network mode.
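As a sketch of what "secure config" can look like, this helper builds a `docker run` command that avoids the three pitfalls above. It goes further by dropping capabilities and making the filesystem read-only, which is my addition and not something the original discussion requires. The image and command in the demo are placeholders.

```python
def sandboxed_docker_command(image, command):
    """Build a docker run argv with no privileges, no host mounts, and no network."""
    return [
        "docker", "run",
        "--rm",                                  # discard the container afterwards
        "--network", "none",                     # no network at all, so no host networking
        "--read-only",                           # container filesystem is read-only
        "--cap-drop", "ALL",                     # drop every Linux capability
        "--security-opt", "no-new-privileges",   # block privilege escalation via setuid
        image,
        *command,
    ]


if __name__ == "__main__":
    argv = sandboxed_docker_command("python:3.12-slim", ["python", "-c", "print('hi')"])
    print(" ".join(argv))
    # To actually run it: subprocess.run(argv, check=True)
```

Note what is absent: no `-v` mounts and no `--privileged`. If the untrusted code needs input files, copy them into the image or container rather than mounting a host directory.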

Q: How do I know if a zip file is safe?

The safest approach is to download and extract in an isolated environment (VM or Docker container), then scan with multiple engines. VirusTotal's file scanning (not URL scanning) catches most known Trojans. But if it's a brand-new piece of malware, scanners may miss it entirely.

Q: Will GitHub improve its detection?

The author noted that GitHub took two months to process his report. That's clearly too slow for this scale. But with 10,000 malicious repos making HN headlines, GitHub will probably accelerate cleanup and invest more in automated detection. Still, relying on the platform to solve everything is naive. Protect yourself first.

Wrapping Up

The core lessons from this incident:

  1. GitHub is not a safe harbor. With 500 million repos, GitHub can't audit them all. You have to judge for yourself.
  2. Attackers are getting sophisticated. This isn't simple phishing — it's full repo cloning, commit history replication, SEO optimization, VirusTotal evasion. It's a mature attack supply chain.
  3. Trust but verify. 30 seconds of checking before you clone eliminates most of the risk.
  4. Use the security tools that exist. Dependabot, Code Scanning, pip-audit, npm audit — these tools exist for a reason.

The original article is a great read: "I discovered a large-scale malware distribution on GitHub." The scanning tool is git-malware-finder (https://github.com/orchidfiles/git-malware-finder).

Drop a comment if you have questions. I'm planning to write a follow-up about securing AI coding tools — how to use them without opening yourself up to exactly these kinds of attacks.
