Incrementally Copying (Rsyncing) Files From A Kubernetes Pod
The most obvious choice for moving files in and out of containers is kubectl cp, but it just does a straight copy of all the bytes. If you want to back up just a few changed bytes out of several hundred gigabytes, you will definitely want to do an incremental transfer. This will not only save time but also avoid problems like exhausting EFS Burst Credits.
Keep It Simple
When approaching any problem, a good place to start is from something that works; in our case that's kubectl cp. A little digging revealed that kubectl cp is simply a tar pipe (aka kubectl exec pod tar cf - /path | tar -xf -). After seeing this I recalled that tar can do 'incremental' backups. I won't try to explain how that works here as there are many, many, many posts on the subject. The short version is that tar can keep track of files that have been added, deleted and modified since the last time it ran.
The tar command to create an incremental backup looks like this:
tar -C /precious/files --create --listed-incremental=/path/to/backupIndex -vv --file=backup.tar .
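To see what the index file buys you before involving a pod, the same command can be tried on a scratch directory. This is only a sketch: it assumes GNU tar (BSD tar does not support --listed-incremental), and every path and file name in it is made up for illustration.

```shell
#!/bin/sh
set -eu

# Scratch area; all names below are hypothetical.
work=$(mktemp -d)
mkdir -p "$work/files"
echo "one" > "$work/files/a.txt"
echo "two" > "$work/files/b.txt"
sleep 1  # keep file timestamps clearly older than the first run

# First run: the index does not exist yet, so every file is archived.
tar -C "$work/files" --create --listed-incremental="$work/backupIndex" --file="$work/full.tar" .

# Modify one file and add another, then run again with the same index.
sleep 1
echo "changed" > "$work/files/a.txt"
echo "three" > "$work/files/c.txt"
tar -C "$work/files" --create --listed-incremental="$work/backupIndex" --file="$work/incr.tar" .

# The second archive should hold the directory entry plus the changed and new
# files, but not the untouched b.txt.
listing=$(tar --list --file="$work/incr.tar")
printf '%s\n' "$listing"
```

The sleep calls are only there to keep timestamps unambiguous in a script that runs within a second or two; a real backup taken hours apart would not need them.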
Now if we sprinkle in a little stdout magic and some pipes we can transfer only changed files:
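kubectl exec <pod_id> -- tar -C /precious/files --create --listed-incremental=/path/to/backupIndex -vv --file=- . | tar -C /precious/backups --extract --verbose --listed-incremental=/dev/null --file=-
Note that kubectl exec is run without -t, because a TTY can corrupt the binary tar stream, and that the extracting tar is given --listed-incremental=/dev/null so that it treats its input as an incremental archive.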
The first time this is run it will transfer everything because there is no previous run to compare against (aka the backupIndex is empty), but on every run thereafter only changed files will be copied.
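The pipe can be wrapped in a small helper so it is easy to rehearse locally first. The following is a rough sketch rather than a finished tool: pull_incremental and the RUNNER variable are names invented here, GNU tar is assumed on both ends, and when RUNNER points at a pod the index path refers to a file inside that pod.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: stream changes from SRC into DEST through a tar pipe.
# RUNNER is an assumption of this sketch. Leave it unset to run the producing
# tar locally, or set it to something like "kubectl exec <pod_id> --" so the
# producer runs inside the pod.
pull_incremental() {
    src=$1
    dest=$2
    index=$3
    mkdir -p "$dest"
    # Producer: write files changed since the state recorded in $index to stdout.
    # Consumer: unpack the stream into $dest in incremental mode.
    ${RUNNER:-} tar -C "$src" --create --listed-incremental="$index" --file=- . |
        tar -C "$dest" --extract --listed-incremental=/dev/null --file=-
}

# Local rehearsal on a scratch directory (all paths are made up).
demo=$(mktemp -d)
mkdir -p "$demo/src"
echo "hello" > "$demo/src/a.txt"
sleep 1
pull_incremental "$demo/src" "$demo/dest" "$demo/backupIndex"
ls "$demo/dest"
```

Keeping the runner as a variable means the exact same pipeline is exercised locally and against the cluster, which makes it easier to tell a tar problem from a kubectl problem.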
Caveats & Warnings
- The backupIndex file is how tar knows what the last state was, so it needs to be persistent or copied to the container before each run. If it is missing, everything will be transferred.
- This strategy also copies deletes; it is effectively the same as rsync's --delete argument.
- From the tar manual: "Incremental dumps depend crucially on time stamps". Checksums or hash comparisons are not possible with tar; if you want this then use rsync.
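The delete behaviour is worth trying on throwaway data before pointing the pipe at a real backup directory. The sketch below assumes GNU tar and uses made-up paths; as far as I can tell, deletions are only applied when the receiving tar extracts in incremental mode, which is what --listed-incremental=/dev/null on the right-hand side requests.

```shell
#!/bin/sh
set -eu

# Scratch directories; every name here is hypothetical.
tmp=$(mktemp -d)
mkdir -p "$tmp/src" "$tmp/dest"
echo "keep" > "$tmp/src/keep.txt"
echo "drop" > "$tmp/src/drop.txt"
sleep 1

# One incremental transfer from src to dest, reusing the same index each time.
transfer() {
    tar -C "$tmp/src" --create --listed-incremental="$tmp/backupIndex" --file=- . |
        tar -C "$tmp/dest" --extract --listed-incremental=/dev/null --file=-
}

# Run 1: the index is missing, so both files are copied.
transfer

# Delete a file at the source, then transfer again.
rm "$tmp/src/drop.txt"
sleep 1
transfer

# The destination should now mirror the source, with drop.txt removed.
ls "$tmp/dest"
```

Because the receiving side removes files that no longer exist at the source, the destination should be a directory dedicated to this backup and not one that holds anything else.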