Artifactory Caching


Quick write-up of a custom solution I've been working on for the past couple of weeks to cache an Artifactory instance.

The problem

Artifactory is a big bulky Java application with a database that stores ‘artifacts’. It can store pretty much everything you throw at it: generic files, RPMs, PyPI packages, NPM packages, Helm charts, Docker images, …

We have a pretty big Artifactory that serves a lot of these things, and it's hosted in the cloud. We also have a bunch of users that want to download these artifacts and they're located outside of the cloud, in the company network.

The link between these 2 places is not fast enough. That's a problem.

The solution, iteration 1

So a quick and dirty solution would be: a cache! Since we had experience with Apache as a cache, we set that up, and it did what it was supposed to do. But it cached responses that it shouldn't cache… so we added some exceptions to the rules for those types of URLs. And it cached some other things for too long, so we added some more exceptions for those types of URLs. And we noticed that some clients had issues with certain other URLs, so we added exceptions for those as well, and in the end, well, we pretty much ended up not caching anything…
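By the end, our Apache config looked roughly like this. This is a sketch, not our actual config, and the repository names are made up:

CacheRoot    /var/cache/apache2/mod_cache_disk
CacheEnable  disk /
# API responses must never come from cache
CacheDisable /artifactory/api/
# this repo's clients choked on stale metadata
CacheDisable /artifactory/some-repo/
# this one got cached for too long
CacheDisable /artifactory/other-repo/
# ...and so on, until there was barely anything left to cache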

The solution, iteration 2

So instead of Apache, we set up a second Artifactory instance, and this instance had a set of ‘remote’ repositories that pointed to our main Artifactory instance. Each time a client requested an artifact, the caching instance retrieved it from the main Artifactory instance, cached it, and then served it to our customers. Yes! Except…

Right now, whenever we add a new repo, we must add it to both Artifactory instances. If we need to update Artifactory, we must do it in 2 places, doubling the amount of downtime required.

And worst of all: we're now dealing with 2 layers of caching. The latest version of a file in a repository on the internet that we mirror could be different from the latest version on our main Artifactory instance, and that one could be different from the latest version on our caching Artifactory instance. So now we had to purge the cache in 2 different places.

The solution, iteration 3

So back to the drawing board. Squid is old, but it's good: it does what it says on the box. It caches things. So we set up Squid in accelerator mode and it did so flawlessly! It cached artifacts, it did not cache API responses, it was glorious! Except for docker repositories. It would not cache those. But those make up a large portion of our traffic, so we'd only solved about 50% of our problem.
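For the curious: accelerator mode boils down to a handful of lines in squid.conf. A sketch with placeholder hostnames and sizes, not our production config (the tls option needs Squid 4 or newer; older versions spell it ssl):

# act as a reverse proxy ("accelerator") in front of Artifactory
http_port 80 accel vhost
# cache misses are fetched from the main Artifactory instance
cache_peer main-artifactory.example.com parent 443 0 no-query originserver tls name=artifactory
cache_dir ufs /var/spool/squid 200000 16 256
acl artifactory_domains dstdomain .artifactory.example.com
http_access allow artifactory_domains
cache_peer_access artifactory allow artifactory_domains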

The solution, iteration 3.5

So a few Google searches later: it's possible to run a docker registry in pull-through cache mode. But there are a few caveats here:

  • One docker registry can only cache one upstream docker repository, and in Artifactory each upstream URL that we mirror is a separate docker repository. We mirror Docker Hub, Google's container registry, and some others, and we have a docker registry for each team that's using our Artifactory, so we need a lot of these registries
  • There's a bug in the registry that renders the cache unusable if you do not clear it every 7 days: https://github.com/docker/distribution/issues/2367
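For reference, pull-through mode itself is trivial to turn on: it's a proxy section in the registry's config file, or the equivalent REGISTRY_PROXY_REMOTEURL environment variable that we end up using further down. A minimal sketch, with Docker Hub as the example upstream:

# /etc/docker/registry/config.yml (sketch)
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  remoteurl: https://registry-1.docker.io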

But that doesn't stop an engineer like me!

What if we run Squid in a docker container? And what if, for every repository we have, we create a docker registry container? Then all we need is something that routes requests for the various URLs to the correct docker container and we're set! And guess what: I've been toying with Traefik for my homelab setup for a while, and that does exactly that!

So we ended up with this:

  • A machine running docker
  • Traefik listening on ports 80 and 443
  • Docker Registry container for $docker_repository_1 with some labels (see below)
  • Docker Registry container for $docker_repository_{2..999} with some labels
  • Squid container with some labels so that all other traffic not destined for any of the docker registries gets handled by Squid
  • CoreDNS container (since the DNS records for $artifactory_url are pointing to this machine, the docker registry containers and squid container need some way of figuring out the IP address of the main artifactory instance). I can spend a whole other blogpost on this, but trust me, CoreDNS was needed, it couldn't have been done with just a simple /etc/hosts file update.
  • Cronjobs that stop the docker registries, purge the cache dir and restart the docker registries once a week to circumvent the bug. This lowers our cache hit rate, but it's better than having no cache…
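That last item is nothing fancy; per registry it's one cron entry along these lines (container name, path and schedule are just an example):

# every Sunday at 03:00: stop the registry, wipe its cache dir, start it again
0 3 * * 0  docker stop registry-dockerhub && rm -rf /srv/artcache/registry-dockerhub/docker && docker start registry-dockerhub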

Of course we automate the hell out of this: we query the Artifactory API for the list of docker repositories, feed that into Ansible's inventory, and use that to create our docker containers.
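The discovery step is roughly the following. This is a sketch: the /api/repositories endpoint is standard Artifactory, but the variable names and the way the result is massaged into the docker_registries list are simplified here:

- name: "query Artifactory for its docker repositories"
  uri:
    url: "https://{{ artcache_artifactory_url }}/artifactory/api/repositories?packageType=docker"
    return_content: true
  register: artifactory_repos

- name: "turn the repository list into the structure the container task expects"
  set_fact:
    docker_registries: "{{ docker_registries | default([]) + [{'key': item.key, 'value': {'urls': [item.key ~ '.' ~ artcache_artifactory_url]}}] }}"
  with_items: "{{ artifactory_repos.json }}"

The task that then creates one docker registry container per repository: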

- name: "docker registry_container {{ item.key }}"
  docker_container:
    name: "{{ item.key }}"
    state: started
    restart_policy: unless-stopped
    image: "{{ artcache_docker_registry_registry }}/{{ artcache_docker_registry_image }}:{{ artcache_docker_registry_tag }}"
    # point container DNS at the docker bridge, where the CoreDNS container mentioned above
    # resolves the Artifactory hostname to the real upstream instance
    dns_servers: "172.17.0.1"
    volumes:
      - "{{ artcache_docker_registry_data_directory }}/{{ item.key }}:/var/lib/registry"
    env:
      SETTINGS_FLAVOR: "local"
      SEARCH_BACKEND: "sqlalchemy"
      # pull-through cache mode: the upstream docker repository this registry mirrors
      REGISTRY_PROXY_REMOTEURL: "https://{{ item.value.urls.0 }}"
      GUNICORN_OPTS: "[--preload]"
      STORAGE_PATH: "/registry"
      REGISTRY_STORAGE_DELETE_ENABLED: "true"
    labels: "{{ lookup('template', 'docker-registry-labels.j2') | to_json }}"
    log_driver: json-file
    log_options:
      max-size: '100m'
      max-file: '3'
  with_items: "{{ docker_registries }}"

And the labels look like this:

{
  "traefik.http.routers.{{ item.key }}-http.rule": "Host(`{{ item.value.urls|join('`,`') }}`) && Method(`GET`, `HEAD`) && Path(`/v2/{blob:(.*)/blobs/(.*)}`)",
  "traefik.http.routers.{{ item.key }}-http.priority": "2",
  "traefik.http.routers.{{ item.key }}-http.entryPoints": "http",
  "traefik.http.routers.{{ item.key }}-tls.rule": "Host(`{{ item.value.urls|join('`,`') }}`) && Method(`GET`, `HEAD`) && Path(`/v2/{blob:(.*)/blobs/(.*)}`)",
  "traefik.http.routers.{{ item.key }}-tls.priority": "2",
  "traefik.http.routers.{{ item.key }}-tls.tls": "true",
  "traefik.http.routers.{{ item.key }}-tls.entryPoints": "https"
}

Explanation:

  • We set the priority to an arbitrary 2, which is higher than the Squid routers' priority of 1, so that these rules get matched first. Any leftover traffic gets routed to the Squid container
  • We only route GET/HEAD requests for ‘blobs’ to the docker registry containers, so that docker logins, token calls and docker pushes don't get served by the docker registry container, but by the upstream Artifactory (through squid first, which doesn't cache these)

As for the squid container:

  # traefik.http.middlewares.https_redirect.redirectscheme.scheme=https  # Uncomment if we want http -> https redirection
  # traefik.http.routers.squid-http.middlewares=https_redirect  # Uncomment if we want http -> https redirection
  traefik.http.services.squid-http.loadbalancer.server.port: "80"
  traefik.http.routers.squid-http.rule: "HostRegexp(`{{ artcache_artifactory_url }}`, `{subdomain:.*}.{{ artcache_artifactory_url }}`, `{{ artcache_artifactory_alt_url }}`, `{subdomain:.*}.{{ artcache_artifactory_alt_url }}`)"
  traefik.http.routers.squid-http.priority: "1"
  traefik.http.routers.squid-http.service: "squid-http"
  traefik.http.routers.squid-http.entryPoints: "http"
  traefik.http.services.squid-https.loadbalancer.server.port: "443"
  traefik.http.services.squid-https.loadbalancer.server.scheme: "https"
  traefik.http.routers.squid-https.rule: "HostRegexp(`{{ artcache_artifactory_url }}`, `{subdomain:.*}.{{ artcache_artifactory_url }}`, `{{ artcache_artifactory_alt_url }}`, `{subdomain:.*}.{{ artcache_artifactory_alt_url }}`)"
  traefik.http.routers.squid-https.priority: "1"
  traefik.http.routers.squid-https.tls: 'true'
  traefik.http.routers.squid-https.entryPoints: "https"
  traefik.http.routers.squid-https.service: "squid-https"

Explanation:

  • Our Artifactory is reachable on 2 URLs (one legacy URL and a proper one, but we still need to serve both), hence the artcache_artifactory_url and artcache_artifactory_alt_url variables
  • Our docker repositories can be accessed using $docker_repository_name.$artcache_artifactory_url (on a subdomain), hence the weird rules.
  • Priority is set to 1, which is lower than the 2 for our docker repositories, so any leftover traffic not yet routed gets sent here.

The solution, iteration 3.6

As mentioned before, Squid is old… And it shows. This solution has been in production for a little over a month now, and over the past couple of days users started complaining that it was slowing down. Our graphs didn't show any unusual load on the system; it seemed to be coping fine.

Except for one cpu that was constantly at 100%.

Squid is single-threaded. There's some work from back in the 90's to make it multithreaded, but it's not really stable (as far as my research shows), so switching to that multithreaded way of working is pretty high risk imho.

But we needed more capacity. And how do you get more capacity? You add more instances! You scale! Which is pretty easy with Traefik!

Add a second squid container, config identical to the first, with a separate cache dir. Put the same labels on the docker container as on the first squid container and voilà! Traefik magically round-robin balances traffic over the 2 squid containers. Our hit ratio took a dive for the first couple of hours, but a couple of days later, with the cache filled up again, it's almost back to where it was.

And still we noticed that those 2 squid containers were running at 100%, so we simply added more. Right now we're running 4 squid containers, and they're handling our load just fine!
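In Ansible this is just one more loop in the same play. A sketch: the image and variable names are made up, and squid-labels.j2 stands in for the label set shown earlier:

- name: "squid container {{ item }}"
  docker_container:
    name: "squid-{{ item }}"
    state: started
    restart_policy: unless-stopped
    image: "{{ artcache_squid_image }}"
    volumes:
      # every instance gets its own cache dir
      - "{{ artcache_squid_cache_directory }}/{{ item }}:/var/spool/squid"
    # identical labels on every instance, so Traefik round robins over all of them
    labels: "{{ lookup('template', 'squid-labels.j2') | to_json }}"
  with_sequence: count={{ artcache_squid_instances }}

Scaling further is then a matter of bumping artcache_squid_instances.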

How much data, you ask? Let me close off with some graphs and images:

[Traefik screenshot]
[Grafana screenshot]