TL;DR: You can pull the 3GB clone of the Catalog API, though note that it will unpack to 4.8 million files over 1.6 million folders and weigh in at about 52GB uncompressed.
You can pull the .NET Core console application that produced the clone from GitHub.
NuGet is the .NET package manager, and https://nuget.org hosts almost all publicly-published NuGet packages. A package is essentially a ZIP file with a metadata component and some binaries. The metadata portion details version and author information, descriptions and so on – it also lists the package’s dependencies on other packages in the ecosystem, versioned using SemVer.
Package management data sources like this are interesting for playing around with graphs. They’re big, well-used, well-structured data sets of highly connected data. At the time of writing, there are 167,733 unique packages published, with over 1.8 million versions of those packages listed.
As part of an upcoming series, I wanted to load that information into a Neo4j graph database to see if there were any interesting insights or visualisations we could create when given access to such a big data set. Unfortunately, each call to the API takes between 100ms and 500ms – that doesn’t sound like much, but multiply it by 4.8 million documents and you’re looking at around 23 days of just sequentially pulling files.
You also have to process the files sequentially – each catalog page has a commit timestamp that gives it a strong ordering, and catalog entries are essentially events that have happened to a package version. A single package version can have multiple package metadata documents associated with it, spanning an arbitrary period of time, as the package is listed, de-listed or has its metadata amended.
I wanted Neo4j to load the data via REST API calls rather than a standard file load, as that was the point of the exercise. This meant that not only did I have to clone the dataset, I also had to host it locally so that it looked like the live API.
Catalog API
The Catalog API exposes every version of every package published to NuGet. Publishes and updates to published packages are recorded as separate documents, and the catalog is paged into batches of roughly 500 changes each.
There are nearly 9,000 batches reported by the Catalog API, and a total of around 4.8 million documents. Some of those documents relate to the same version of a package – for example, when a package gets de-listed for some reason there will be an updated document in one of the Catalog API pages detailing the new state of the package.
There’s no rate limit on the Catalog API – it’s just files hosted in Azure blob storage – and each document contains navigable URLs to related information. If we wanted to pull a clone of the Catalog API, we could just start from the root document and crawl the links found.
Cloning the documents
Even though Neo4j needs to process the files sequentially, we don’t need to clone them sequentially. To retrieve 4.8 million documents in any sensible amount of time, we need to go multi-threaded. I bashed together a quick .NET console app to do just that.
The app just does a pretty simple breadth-first search of links found in documents. We start off with the catalog root URL, which lists the URL of each of the 8,800-ish catalog pages.
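Abbreviated, that index document looks something like the following – the field names match NuGet’s documented catalog resource, but the values here are purely illustrative:

{
  "@id": "https://api.nuget.org/v3/catalog0/index.json",
  "commitTimeStamp": "2019-02-01T12:00:00Z",
  "count": 8800,
  "items": [
    {
      "@id": "https://api.nuget.org/v3/catalog0/page0.json",
      "@type": "CatalogPage",
      "commitTimeStamp": "2015-02-01T06:22:45Z",
      "count": 540
    }
  ]
}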
We then search for anything that looks like a URI within the file, check we’ve not already processed that URI, and add it to a ConcurrentQueue. Since the .NET BCL doesn’t have a ConcurrentSet, we use a ConcurrentDictionary and just care about the keys within it.
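A minimal sketch of that dedupe-then-enqueue pattern – the ConcurrentQueue and ConcurrentDictionary are as described above, but the names (pending, seen, Enqueue) are illustrative rather than taken from the real crawler:

using System.Collections.Concurrent;

// Work queue of URIs still to fetch, plus a 'set' of URIs we've already encountered.
// ConcurrentDictionary stands in for the missing ConcurrentSet - only the keys matter.
var pending = new ConcurrentQueue<string>();
var seen = new ConcurrentDictionary<string, byte>();

void Enqueue(string uri)
{
    // TryAdd returns false if the key already exists, so each URI is queued at most once.
    if (seen.TryAdd(uri, 0))
    {
        pending.Enqueue(uri);
    }
}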
The queue ends up containing 4.8 million items at its peak, and the ‘processed files set’ slowly grows as the queue is drained.
We spin up 64 System.Threading.Tasks.Task objects in an array. Each Task takes a single URI from the queue, quickly validates that we actually have work to do, pulls the contents of the file and parses out any new URIs. Each new URI is added to the processing queue, and the contents of the file are written to disk in a folder structure that mirrors the segments of the URI so that we can easily host it later. The Task then polls the queue again and waits until there’s more work to do, or until the queue is drained of work items.
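Each worker is roughly this shape – a sketch building on the snippet above (Enqueue is the helper from that snippet, and the regex-based ExtractUris is a stand-in for however the real crawler finds links):

using System.Collections.Concurrent;
using System.Text.RegularExpressions;

// Naive link extraction - anything in the document that looks like an api.nuget.org URL.
static IEnumerable<string> ExtractUris(string json) =>
    Regex.Matches(json, "https://api\\.nuget\\.org/[^\"]+").Cast<Match>().Select(m => m.Value);

async Task WorkerAsync(HttpClient http, ConcurrentQueue<string> pending, string outputRoot)
{
    var idlePolls = 0;
    while (idlePolls < 30) // give up once the queue has stayed empty for a while
    {
        if (!pending.TryDequeue(out var uri))
        {
            idlePolls++;
            await Task.Delay(TimeSpan.FromSeconds(1));
            continue;
        }

        idlePolls = 0;
        var json = await http.GetStringAsync(uri);

        // Mirror the URI segments on disk, e.g. .../v3/catalog0/page0.json
        // ends up at <outputRoot>/api.nuget.org/v3/catalog0/page0.json.
        var parsed = new Uri(uri);
        var segments = parsed.Segments.Select(s => s.Trim('/')).Where(s => s.Length > 0);
        var path = Path.Combine(new[] { outputRoot, parsed.Host }.Concat(segments).ToArray());
        Directory.CreateDirectory(Path.GetDirectoryName(path)!);
        File.WriteAllText(path, json);

        // Queue up anything new that this document links to.
        foreach (var found in ExtractUris(json))
        {
            Enqueue(found);
        }
    }
}

Starting 64 of these – something like Task.WhenAll over an array of Task.Run calls – is what gets the wall-clock time down from weeks to the 24 hours mentioned below.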
Every minute or so a watchdog Task clones the work queue and the ‘processed’ set and persists them to disk. Fun fact – awaiting a WriteLineAsync is incredibly slow relative to just letting it block, especially when you’re calling it millions of times in a loop.
On startup, the app looks for these checkpoint files and reloads its state from them if found – that way we can pause the processing by just killing the process.
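The checkpoint and restore logic can be as blunt as dumping both collections to text files – again a sketch, with made-up file names and deliberately synchronous writes for the reason above:

// Persist a snapshot of the crawler state so a later run can pick up where this one left off.
// 'queue.txt' and 'seen.txt' are illustrative names, not the real tool's.
void Checkpoint(ConcurrentQueue<string> pending, ConcurrentDictionary<string, byte> seen)
{
    File.WriteAllLines("queue.txt", pending.ToArray());
    File.WriteAllLines("seen.txt", seen.Keys.ToArray());
}

(ConcurrentQueue<string> pending, ConcurrentDictionary<string, byte> seen) Restore()
{
    // Missing checkpoint files just mean a cold start from the catalog root.
    var queued = File.Exists("queue.txt") ? File.ReadAllLines("queue.txt") : Array.Empty<string>();
    var seenUris = File.Exists("seen.txt") ? File.ReadAllLines("seen.txt") : Array.Empty<string>();

    return (new ConcurrentQueue<string>(queued),
            new ConcurrentDictionary<string, byte>(seenUris.Select(u => new KeyValuePair<string, byte>(u, 0))));
}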
At peak the process was using around 2GB of RAM and pulling down files at around 10MB/s.
You can find the source for the application on GitHub, as well as a link to the archive, but the tool itself would need more work before you could, for example, resume from a snapshot tarball such as this one.
24 hours later, what I ended up with was… extensive.
Now what?
Serving a mirror of the Catalog API with Nginx
Now that we’ve got our archive of files, we can spin up a super simple docker-compose setup to host Nginx, serve the files from a local URL, and replace our usages of api.nuget.org with localhost:8192 or whatever.
First off, the docker-compose file – we’re assuming that the file lives in the same folder as the api.nuget.org root folder of documents:
web:
  image: nginx
  volumes:
    - ./api.nuget.org:/usr/share/nginx/html:ro
    - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
  ports:
    - "8192:80"
You’ll have to make sure Docker has access to the drive where the mirrored files are stored to pull this off. We just serve that folder directly as Nginx’s default document base, but then also map in a file to configure Nginx to rewrite any URL that looks like ‘api.nuget.org’ to point to localhost:
server {
    listen 80;
    location / {
        root /usr/share/nginx/html;
        sub_filter 'https://api.nuget.org/' 'http://$http_host/';
        sub_filter_types 'application/json';
        sub_filter_once off;
    }
}
Because the crawler just saved files to a directory structure that matches the URI components, we can trivially serve the files without needing to translate names.
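For example (paths assume the layout described earlier), the document originally fetched from https://api.nuget.org/v3/catalog0/index.json sits on disk at ./api.nuget.org/v3/catalog0/index.json, and Nginx serves it as http://localhost:8192/v3/catalog0/index.json.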
A quick docker-compose up and we’re off to the races.
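To sanity-check the mirror (again assuming the layout above), pull the catalog index through Nginx and confirm the sub_filter has rewritten the embedded links to point at the local host:

using System.Net.Http;

// Fetch the mirrored catalog index and check its links now point at this host
// rather than api.nuget.org. The path assumes the folder layout described above.
using var http = new HttpClient();
var index = await http.GetStringAsync("http://localhost:8192/v3/catalog0/index.json");

Console.WriteLine(index.Contains("http://localhost:8192/")
    ? "sub_filter is rewriting URLs correctly"
    : "links still point at api.nuget.org - check the sub_filter configuration");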
Of course, we could just use the .json files directly without putting Nginx in the way, but since the Neo4j post I did this for was about crawling a REST API, it seemed like a bit of a cheat to just index files on disk instead.
Downloads
You can download the archive – it’ll take… some time to decompress, and you’ll end up with around 52GB of space consumed on the drive once you’ve done so, as well as 4.8 million new files and 1.6 million new folders, so maybe do this on a drive you don’t care that much about.
The source for the crawler is available on GitHub, though it’s super rough-and-ready and provided with absolutely no guarantees – other than that running it is likely to make your day worse.