
I Found 10,000 GitHub Repositories Distributing Trojan Malware

Michael Sintim-Koree · June 2026

Ten thousand repositories. Not ten. Not a hundred. Ten thousand GitHub repositories, most of them clones of legitimate projects, all modified to include malicious payloads that execute on install or build. Researchers tracking this campaign have been digging through it for several weeks, and the scale of it is genuinely uncomfortable to look at directly.

What separates this campaign from the usual supply chain incidents is the infrastructure. Automated, systematic, and built to survive detection and takedown. Understanding the mechanics is the useful part.

How the campaign is structured

The basic pattern: take a real, modestly popular repository (something in the 50 to 500 star range, plausible enough to pass a casual inspection), fork it under a fresh account, inject a payload into the build pipeline or install script, then make the repository look active. Fake stars, fake forks, commit history populated to look like maintenance. The accounts hosting these repos are themselves farmed at scale; many have been on GitHub for some period before the malicious repository appears on them, though some campaigns have used accounts created just days before publishing malicious content.

Payload delivery varies by target ecosystem. In the Python repositories observed across this campaign, the trojan lives in setup.py or a post-install hook: code that runs when you pip install from source. In Node.js repositories, it's buried in a lifecycle script in package.json (preinstall or postinstall) that executes before the developer sees any output. Some payloads are obfuscated with base64 encoding inside what looks like a legitimate configuration file. Others are plaintext but rely on the file being long enough that the injection point sits too far down for a quick scroll to catch.
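As a rough illustration of the Node.js case, here is a minimal sketch that lists the install-time lifecycle scripts a package.json declares. The function name and the exact hook list are assumptions made for this example (prepare is included because npm runs it on a local install of a cloned repository); it only surfaces hooks for a human to read and does not judge whether they are malicious.

```python
import json

# Hooks npm may run automatically during an install of a cloned repository.
# The exact set checked here is an assumption chosen for illustration.
INSTALL_HOOKS = ("preinstall", "install", "postinstall", "prepare")


def find_install_hooks(package_json_text: str) -> dict[str, str]:
    """Return the install-time lifecycle scripts declared in a package.json."""
    manifest = json.loads(package_json_text)
    scripts = manifest.get("scripts") if isinstance(manifest, dict) else None
    if not isinstance(scripts, dict):
        return {}
    # Keep only the hooks that run without the developer asking for them.
    return {name: cmd for name, cmd in scripts.items() if name in INSTALL_HOOKS}
```

Anything this returns for a freshly cloned repository is code that will run before you have seen a single line of the project's own output, so it is the first thing worth reading.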

The trojan payloads cluster into a few categories: credential harvesters targeting environment variables (AWS keys, GitHub tokens, SSH keys), downloaders that pull a second-stage binary from an external URL after initial execution, and persistent backdoors establishing an outbound connection on a rotating C2 domain list. Some repositories carry more than one.

Why GitHub search surfaces these

GitHub's search and discovery surfaces make this easier than it should be. Search for a library name and you get results ordered partly by stars, partly by relevance matching. A repository with a hundred stars and a README closely mirroring that of the legitimate project shows up credibly in results. If the legitimate project has a gap (no recent releases, archived status, a README mentioning it's looking for maintainers), the trojanized clone fills that search result with something that looks current.

Developer habit makes it worse. Cloning a repository URL from a search result, running the install command from the README, and proceeding: that workflow is deeply ingrained. There's no natural checkpoint where a developer pauses to verify they're on the canonical repository rather than a malicious fork with a similar name and URL pattern. The attack is designed around that gap.
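One way to add the missing checkpoint is to compare the URL you are about to clone against an owner/repository pair you confirmed out of band, for example from the project's official site. The sketch below is a minimal version of that idea; the function names are hypothetical, and it only understands the two common github.com URL shapes.

```python
import re
from typing import Optional

# Matches https://github.com/owner/repo(.git) and git@github.com:owner/repo(.git).
_GITHUB_URL = re.compile(
    r"^(?:https://github\.com/|git@github\.com:)"
    r"(?P<owner>[A-Za-z0-9-]+)/(?P<repo>[A-Za-z0-9._-]+?)(?:\.git)?/?$"
)


def repo_slug(clone_url: str) -> Optional[str]:
    """Return 'owner/repo' in lowercase, or None if this is not a GitHub repo URL."""
    match = _GITHUB_URL.match(clone_url.strip())
    if match is None:
        return None
    # GitHub treats owner and repository names case-insensitively.
    return f"{match['owner']}/{match['repo']}".lower()


def is_canonical(clone_url: str, expected_slug: str) -> bool:
    """True only when the URL points at the owner/repo you verified elsewhere."""
    slug = repo_slug(clone_url)
    return slug is not None and slug == expected_slug.lower()
```

For example, is_canonical("https://github.com/psf-dev/requests", "psf/requests") is False even though the repository name itself matches, which is exactly the near-miss a search result can hide.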

CI/CD pipelines compound the exposure. A developer who manually verifies the repository they install locally may never apply that scrutiny to automated builds. A poisoned dependency pulled at build time in a GitHub Actions workflow or a Docker build runs with whatever permissions that pipeline has, and those permissions are often substantial.

The detection problem

GitHub's automated scanning catches known malware signatures and some categories of secrets committed to repositories. It is not designed to detect a malicious payload embedded in a plausible Python package's setup.py. The payload doesn't match any signature. The account wasn't flagged before the repository was created. The repository looks legitimate on every surface metric GitHub's tooling checks.

Community reporting is the mechanism that actually catches most of these, and it's slow. A developer has to install the package, notice something anomalous, connect that back to the specific repository, and file a report. By the time that chain completes, the repository has been available for days or weeks. The campaign operators clearly know this: reported time-to-detection across analyzed samples has been long enough to accumulate hundreds of installs before any repository was flagged.

A significant portion of the accounts hosting these repositories had enough prior activity to avoid looking obviously fresh. GitHub account age and prior activity feed into automated trust signals, and whoever is running this campaign has clearly thought about how those signals work. The accounts didn't trip any flags on creation, and the amount of prior activity the operators settled on appears to be carefully calibrated.

What the payloads contain

The credential harvesting code in these samples targets a predictable but comprehensive list. AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, GITHUB_TOKEN, NPM_TOKEN, PYPI_TOKEN: cloud credentials with immediate monetizable value. But the harvesters also sweep more broadly, hitting SSH private keys in ~/.ssh/, shell history files, .netrc files, and in some cases browser-stored credentials through a platform-specific extraction path.
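To get a feel for how much of that list your own shell would hand over, a minimal sketch like the following reports which sensitive-looking variable names are present and builds a copy of the environment without them. The exact names come from the list above; the substring markers (TOKEN, SECRET and so on) and both function names are assumptions added for illustration.

```python
from typing import Mapping

# Names taken from the harvester target list described above.
SENSITIVE_NAMES = {
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "GITHUB_TOKEN",
    "NPM_TOKEN",
    "PYPI_TOKEN",
}
# Broader substring markers; this list is an assumption, not from the samples.
SENSITIVE_MARKERS = ("TOKEN", "SECRET", "PASSWORD", "API_KEY")


def _is_sensitive(name: str) -> bool:
    upper = name.upper()
    return upper in SENSITIVE_NAMES or any(m in upper for m in SENSITIVE_MARKERS)


def exposed_secrets(env: Mapping[str, str]) -> list[str]:
    """Return the sorted names of variables an install-time harvester would want."""
    return sorted(name for name in env if _is_sensitive(name))


def scrubbed_env(env: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of env without sensitive variables, for running installs."""
    return {name: value for name, value in env.items() if not _is_sensitive(name)}
```

Passing scrubbed_env(os.environ) as the env argument of subprocess.run when invoking an installer keeps those values out of the child process. It does nothing for files such as SSH keys or .netrc, which need filesystem isolation instead.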

The exfiltration methods are deliberately low-profile. DNS exfiltration appears in some samples: credentials encoded and sent as subdomain queries to attacker-controlled nameservers rather than outbound HTTP requests. DNS queries are more visible in logs than most developers realize, but they're far less likely to be stopped by egress filtering rules that block unknown HTTP endpoints. HTTPS to a URL that looks like a telemetry or analytics endpoint appears in others; the traffic blends with normal development environment noise.
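Because encoded credentials make for long, random-looking subdomain labels, even a crude heuristic over logged query names can surface candidates. The sketch below flags names containing a long, high-entropy label. The length and entropy thresholds are assumptions picked for illustration, not values derived from the campaign samples, and they would need tuning against real traffic.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the characters in text, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(text).values()
    )


def looks_like_exfil(qname: str, min_label_len: int = 30, min_entropy: float = 3.5) -> bool:
    """Flag a DNS query name if any label is both long and high-entropy."""
    labels = qname.rstrip(".").lower().split(".")
    return any(
        len(label) >= min_label_len and shannon_entropy(label) >= min_entropy
        for label in labels
    )
```

Ordinary names like pypi.org pass untouched, while a 32-character base32-style label in front of an unfamiliar domain gets flagged. Some CDNs and telemetry services also use long hashed labels, so expect false positives and treat hits as prompts for a closer look.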

The downloader variants are the more concerning category. A first-stage payload that does nothing visible except pull a second-stage binary from a rotating set of URLs is built to survive initial payload scanning. You scan setup.py, you see the base64-encoded download call, you might flag it. But the second-stage binary lives somewhere else entirely and rotates frequently enough to stay ahead of URL blocklists. Scanning the repository tells you only part of the story.
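The first-stage check described here can be sketched as a scan for long base64 runs whose decoded bytes contain a URL or an execution call. This is a minimal illustration, not a reconstruction of any scanner used on the campaign: the 24-character floor, the marker list, and the function name are all assumptions, and it says nothing about the second stage.

```python
import base64
import binascii
import re

# Runs of 24 or more base64 characters, with optional padding.
# The length floor is an assumption to skip short ordinary identifiers.
_B64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")
# Byte patterns that make a decoded blob worth a human look (illustrative list).
_MARKERS = (b"http://", b"https://", b"exec(", b"eval(", b"subprocess", b"os.system")


def suspicious_base64(source: str) -> list[str]:
    """Return decoded base64 blobs in source that contain a URL or exec-style call."""
    findings = []
    for candidate in _B64_RUN.findall(source):
        try:
            decoded = base64.b64decode(candidate, validate=True)
        except (binascii.Error, ValueError):
            continue  # Not valid base64 after all; ignore it.
        if any(marker in decoded for marker in _MARKERS):
            findings.append(decoded.decode("utf-8", errors="replace"))
    return findings
```

Run it over setup.py and any configuration files the build reads. A payload split across several short strings or encoded twice will slip past, which is one reason a clean result is not proof of a clean repository.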

What developers can actually do

The supply chain problem is structural and GitHub won't solve it through detection alone. That means the burden is partly on individual developers and teams, which is frustrating but true.

Pin dependencies to specific commit hashes, or at minimum to release tags you have verified, not branch names. Branches can be updated under you, and tags can be moved; a full commit hash cannot.
Install from package registries (PyPI, npm, crates.io) over direct repository installs where possible. Registries have their own problems, but they carry more scrutiny than arbitrary GitHub repositories.
Verify the canonical source before cloning anything. Check the official project site, not just search results, and confirm the GitHub organization matches what you expect.
Read install scripts before running them. setup.py, package.json lifecycle scripts, Makefile install targets: three minutes of reading catches the payload class found in the majority of these repositories.
Run dependency installs in isolated environments without production credentials in the shell environment. A developer machine with AWS keys in environment variables is a high-value target for anything that runs at install time.
Monitor outbound network activity during build and install steps. Unexpected DNS queries or outbound connections during a pip install are a detection signal that costs almost nothing to instrument.
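The pinning advice at the top of this list is easy to check mechanically. The sketch below, with hypothetical function names, looks at a pip-style VCS requirement (git+https://host/owner/repo.git@ref) and accepts it only when the ref is a full 40-character hexadecimal commit hash. It assumes SHA-1 object names and deliberately rejects abbreviated hashes, tags, and branch names.

```python
import re
from typing import Optional

_FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def pinned_ref(requirement: str) -> Optional[str]:
    """Extract the ref after '@' from a git+ requirement, or None if absent."""
    _, found, url = requirement.partition("git+")
    if not found:
        return None
    url = url.split("#", 1)[0].strip()       # Drop any #egg=... fragment.
    _, _, after_scheme = url.partition("://")
    _, _, path = after_scheme.partition("/")  # Drop the host and any user@ part.
    if "@" not in path:
        return None
    return path.rsplit("@", 1)[1]


def is_pinned_to_commit(requirement: str) -> bool:
    """True only when the requirement is pinned to a full commit hash."""
    ref = pinned_ref(requirement)
    return ref is not None and _FULL_SHA.match(ref.lower()) is not None
```

Run over each line of a requirements file, this separates git+https://github.com/example/project.git@main (not pinned) from the same URL ending in a full commit hash (pinned). The example/project path is a placeholder.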

The CI/CD surface deserves its own attention

Most of the developer-focused advice above covers what happens on a local machine. The CI/CD exposure is more serious because the blast radius is larger. A GitHub Actions workflow that installs a malicious package with GITHUB_TOKEN in the environment has just handed over a credential that can often push to the repository containing the workflow, and any cloud credentials the job obtains through OIDC-bound roles are exposed to the same code. Note that GITHUB_TOKEN is scoped to the single repository where the workflow runs, not the entire organization, but the token's permissions within that boundary are often wider than anyone deliberately specified, and OIDC-bound cloud credentials in the same pipeline can carry far broader permissions.

Scoping workflow permissions explicitly is documented and straightforward: specify permissions at the workflow or job level, request only what the job needs, default to read-only. Most repositories don't do this because the default works and nobody went back to tighten it. A malicious install in a workflow with write-all permissions is categorically more damaging than the same install on a developer laptop.
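Finding the workflows that never got tightened can be automated. The following is a rough, line-based sketch (no YAML parser, so it is a heuristic rather than a validator) that reports workflow files with no workflow-level permissions key and any use of write-all. The function name and message wording are assumptions for this example.

```python
import re

# Matches a "permissions:" key and captures its indentation and inline value.
_PERMISSIONS_LINE = re.compile(r"^(?P<indent>\s*)permissions:\s*(?P<value>[^#\n]*)")


def audit_workflow_permissions(workflow_text: str) -> list[str]:
    """Return human-readable findings about permissions in a workflow file."""
    problems = []
    has_top_level = False
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        match = _PERMISSIONS_LINE.match(line)
        if match is None:
            continue
        if match["indent"] == "":
            has_top_level = True  # Key at column zero applies to the whole workflow.
        if match["value"].strip() == "write-all":
            problems.append(f"line {number}: permissions set to write-all")
    if not has_top_level:
        problems.insert(0, "no workflow-level permissions block; repository default applies")
    return problems
```

A workflow that sets permissions on every job individually will still be reported as lacking a workflow-level block, which is a false positive worth tolerating: a top-level read-only default is cheap and covers jobs added later.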

GitHub's dependency-review-action runs as a step that compares pull request changes against known vulnerable packages. It won't catch a trojanized repository that isn't in a vulnerability database yet, but it catches known CVEs and costs nothing to add. The step that actually helps here is a custom one: scan for install scripts in any new dependency being added and fail the job if unexpected lifecycle scripts appear in new packages. That's not a packaged tool; it's a few lines of shell. Write it once and reuse it.
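The custom step could look something like the sketch below, written in Python rather than shell. It assumes dependencies were installed with scripts disabled first (npm ci --ignore-scripts), then walks node_modules and reports any package declaring an install hook that is not on a reviewed allowlist. The allowlist file, its one-name-per-line format, and all function names are hypothetical.

```python
import json
from pathlib import Path

INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def packages_with_hooks(node_modules: Path) -> dict[str, list[str]]:
    """Map package name to the install hooks its package.json declares."""
    found = {}
    for manifest_path in sorted(node_modules.glob("**/package.json")):
        try:
            manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # Unreadable or malformed manifest; skip it.
        scripts = manifest.get("scripts") if isinstance(manifest, dict) else None
        if not isinstance(scripts, dict):
            continue
        hooks = [hook for hook in INSTALL_HOOKS if hook in scripts]
        name = manifest.get("name")
        if hooks and isinstance(name, str):
            found[name] = hooks
    return found


def unexpected_hooks(found: dict[str, list[str]], allowlist: set[str]) -> dict[str, list[str]]:
    """Keep only packages with install hooks that nobody has reviewed yet."""
    return {name: hooks for name, hooks in found.items() if name not in allowlist}


def main(node_modules_dir: str, allowlist_file: str) -> int:
    """Print offenders and return a process exit code (1 fails the CI job)."""
    allowlist = set(Path(allowlist_file).read_text(encoding="utf-8").split())
    offenders = unexpected_hooks(packages_with_hooks(Path(node_modules_dir)), allowlist)
    for name, hooks in sorted(offenders.items()):
        print(f"unexpected install hook in {name}: {', '.join(hooks)}")
    return 1 if offenders else 0
```

In a workflow this would run as sys.exit(main("node_modules", "install-hook-allowlist.txt")) after the scripts-disabled install. Adding a package to the allowlist then becomes a reviewed change in its own right.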

GitHub's position in this

GitHub has a genuine detection and enforcement problem here, not a policy one. The platform's policies clearly prohibit distributing malware. The gap is detecting it before enough developers install it to cause real damage. The detection surface is enormous: over 420 million repositories, with a campaign specifically engineered to look like legitimate activity by every coarse-grained signal GitHub's systems check.

The tooling that would actually help is harder to build than it sounds. Behavioral analysis of install scripts at repository creation time, graph-based detection of forking patterns consistent with mass repository farming, velocity anomaly detection on star counts: the fake star pattern is detectable and GitHub has the data. Whether the detection pipeline can run fast enough to matter before a repository accumulates installs is the real operational question, and the honest read of the current state is that the pipeline can't move fast enough.
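To make the star-velocity idea concrete, here is a minimal sketch that flags days whose new-star count is an extreme outlier relative to the repository's own history, using a median-based modified z-score. It says nothing about how GitHub's systems actually work; the function name is hypothetical, and the 3.5 cutoff is a commonly used default for this score rather than a tuned value.

```python
from statistics import median


def star_spikes(daily_stars: list[int], threshold: float = 3.5) -> list[int]:
    """Return indexes of days whose star count is an upward outlier."""
    if len(daily_stars) < 3:
        return []  # Too little history to call anything an outlier.
    center = median(daily_stars)
    mad = median([abs(count - center) for count in daily_stars])
    scale = mad if mad > 0 else 1.0  # Avoid dividing by zero on flat histories.
    return [
        index
        for index, count in enumerate(daily_stars)
        if 0.6745 * (count - center) / scale > threshold
    ]
```

For a history like [2, 0, 1, 3, 1, 2, 140, 1] this returns [6], the day 140 stars arrived. A legitimate project that lands on a popular news site produces the same shape, so a spike is a reason to look at who the stargazers are, not a verdict.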

Reported repositories in this campaign have been taken down. But ten thousand repositories means there are active ones not yet found, hosted by accounts running the same campaign under a different naming pattern. Detection and takedown is whack-a-mole against an automated operation that recreates faster than reports are processed. That's the structural problem, and reporting alone doesn't close it.

Where supply chain attacks are heading

Typosquatting on package registries used to be the dominant supply chain attack pattern: register a package name one character off from a popular one, wait for mistyped installs. Registries got better at detecting that. This campaign is the next evolution: the same goal (code execution in developer environments and CI pipelines), executed through a surface with weaker detection.
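The "one character off" test is simple to express, which is part of why registries could get better at it. A minimal sketch, with hypothetical function names and a caller-supplied list of popular package names:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def possible_typosquats(name: str, popular: list[str]) -> list[str]:
    """Return popular packages that name is exactly one edit away from."""
    lowered = name.lower()
    return [p for p in popular if edit_distance(lowered, p.lower()) == 1]
```

A swap of two adjacent letters counts as two edits under this metric, so a production check would likely also consider transpositions. Nothing comparable exists for a cloned repository, where the name can be identical and only the owner differs.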

If GitHub gets better at catching this pattern too, the next move is probably deeper compromise of the legitimate source itself, including social engineering of legitimate maintainers. SolarWinds and XZ Utils are both examples of that approach: patient compromise of the legitimate source, no clone required. XZ Utils in particular should be the case study every platform security team is running tabletop exercises against right now, because the patience involved (years of relationship-building before the payload went in) is something automated detection handles poorly.

The developer community and the platforms they depend on are in a detection arms race with operators who have automated tooling, time, and no shortage of fresh GitHub accounts. The fundamental asymmetry: attackers only need to succeed once per victim; defenders need to catch every instance. That asymmetry doesn't resolve through any single countermeasure. The install script audit and CI permission scoping recommendations above are the layers most developers are currently skipping, and right now, that's where this campaign is winning.

If you've hit a trojanized repository in the wild, or built detection tooling for install script analysis in CI, I'd genuinely like to hear what you found and how you caught it. Specifically: what tipped you off first, the network activity or the script itself?