Caching¶
Sometimes the transforms are time-consuming. For example, Zoom from the previous tutorials is implemented via linear interpolation, which is quite expensive, especially for higher-dimensional objects such as 3D images.
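To get a feel for the cost, here's a small standalone sketch - it uses scipy.ndimage.zoom with linear interpolation rather than the tutorial's Zoom layer, so take it as an illustration only:
import numpy as np
from scipy.ndimage import zoom

image = np.random.rand(512, 512)        # a typical 2D image
volume = np.random.rand(128, 128, 128)  # a modest 3D volume

# order=1 selects linear interpolation; the 3D call interpolates along an
# extra axis and touches far more values, so it takes noticeably longer
small_image = zoom(image, 0.25, order=1)
small_volume = zoom(volume, 0.25, order=1)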
Caching to RAM¶
A popular way of dealing with this cost is caching. We'll start with the simplest kind - caching to RAM:
# let's create a dataset
from layers02 import *
from connectome import Chain

source = HeLa(root='DIC-C2DH-HeLa')
key = source.ids[0]

dataset = Chain(
    source,
    Binarize(),
    Zoom(factor=0.25),
    Crop(),
)
%%timeit
x, y = dataset.image(key), dataset.mask(key)
49.6 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is not much, but for larger images the time would be noticeably greater.
Now, let's cache this dataset:
from connectome import CacheToRam
cached = dataset >> CacheToRam()
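By the way, the >> operator chains layers, much like wrapping them in an explicit Chain (which we'll also use below):
from connectome import Chain, CacheToRam

# presumably equivalent to dataset >> CacheToRam()
cached = Chain(dataset, CacheToRam())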
The first call will take around the same time, because the data needs to be cached first:
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 81.9 ms, sys: 65 ms, total: 147 ms Wall time: 48.9 ms
but subsequent calls will be much faster:
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 4.93 ms, sys: 8.52 ms, total: 13.4 ms Wall time: 4.38 ms
%%timeit
x, y = cached.image(key), cached.mask(key)
1.55 ms ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And just like that we sped up our pipeline by a factor of ~30. Now this is fast!
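One thing to keep in mind: the cache is filled per key, so an id the pipeline hasn't seen yet still triggers the full computation once. A quick sketch, assuming the dataset has at least two ids:
# the first access for a new id is slow, repeated accesses come from RAM
other_key = source.ids[1]
x, y = cached.image(other_key), cached.mask(other_key)  # computed, then cached
x, y = cached.image(other_key), cached.mask(other_key)  # served from the cache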
Persistent Caching to Disk¶
Caching to RAM does speed up our pipelines, but we still have a problem - the first calls to image and mask are slow, because the computation needs to happen in the first place. This means that each time you restart your script (or recreate the pipeline, for that matter) you'll have to recompute the cache:
# create the dataset
cached = dataset >> CacheToRam()
%%time
# first call - slow
x, y = cached.image(key), cached.mask(key)
CPU times: user 90.3 ms, sys: 35.6 ms, total: 126 ms Wall time: 54 ms
%%time
# second call - fast
x, y = cached.image(key), cached.mask(key)
CPU times: user 10.9 ms, sys: 0 ns, total: 10.9 ms Wall time: 3.63 ms
# create the dataset again
cached = dataset >> CacheToRam()
%%time
# first call - slow again!
x, y = cached.image(key), cached.mask(key)
CPU times: user 89.1 ms, sys: 44.3 ms, total: 133 ms Wall time: 44.4 ms
What if we could make a persistent cache that keeps living between runs?
Well, we can! This is where caching to disk comes into play:
from connectome import CacheToDisk
cached = dataset >> CacheToDisk.simple('image', 'mask', root='cache')
So, what is happening here? We want to cache image and mask, and the cache will be stored in the cache folder in the current directory. You can change the path if you like.
CacheToDisk is a highly customizable layer; for this tutorial, however, simple is a good starting point - it will choose adequate default parameters for you.
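For example, to keep the cache somewhere else, just pass a different root (the path below is only a placeholder):
# any writable directory can serve as the cache root
cached_elsewhere = dataset >> CacheToDisk.simple('image', 'mask', root='/tmp/hela-cache')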
The first run is slow as always:
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 174 ms, sys: 127 ms, total: 301 ms Wall time: 112 ms
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 19.5 ms, sys: 23.8 ms, total: 43.3 ms Wall time: 15.4 ms
And subsequent calls are faster.
Now let's create the dataset again:
cached = dataset >> CacheToDisk.simple('image', 'mask', root='cache')
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 37.9 ms, sys: 11.4 ms, total: 49.4 ms Wall time: 17.1 ms
Now even the first call is fast! It's not as fast as caching to RAM, but we can combine the two:
cached = Chain(
    dataset,
    CacheToDisk.simple('image', 'mask', root='cache'),
    CacheToRam(),
)
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 15.3 ms, sys: 0 ns, total: 15.3 ms Wall time: 16 ms
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 7.34 ms, sys: 0 ns, total: 7.34 ms Wall time: 7.29 ms
We get the best of both worlds: a RAM miss falls back on the disk cache, and only a miss in both triggers the actual computation. How neat is that!
Cache Invalidation¶
Now our cache is stored in the cache folder and is loaded from disk when needed. But there is a potential problem: what if we change the data preprocessing? Do we need to choose a new folder for the cache?
Luckily, the answer is no, we don't. connectome is smart enough to figure out that the data has changed, and it will always keep the cache consistent with your current data!
Watch this:
small = Chain(
    source,
    Binarize(),
    Zoom(factor=0.25),
    Crop(),
) >> CacheToDisk.simple('image', 'mask', root='cache')

big = Chain(
    source,
    Binarize(),
    Zoom(factor=0.5),
    Crop(),
) >> CacheToDisk.simple('image', 'mask', root='cache')
We have two datasets with different transformations: the first downsamples the images by a factor of 4, the second by a factor of 2 - and both write to the same cache folder.
Let's check the images' shapes:
# fill the cache
small.image(key).shape, big.image(key).shape
((122, 120), (244, 243))
# load from cache
small.image(key).shape, big.image(key).shape
((122, 120), (244, 243))
This is automatic cache invalidation at work!
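Under the hood, the idea is that a cache entry is keyed not only by the sample id but also by a fingerprint of the pipeline that produced it. Here is a toy sketch of the principle - it is not connectome's actual implementation:
import hashlib
import json

def cache_key(sample_id, pipeline_params):
    # fingerprint the parameters that define the pipeline
    fingerprint = hashlib.sha256(
        json.dumps(pipeline_params, sort_keys=True).encode()
    ).hexdigest()
    return f'{fingerprint}/{sample_id}'

# the two pipelines above differ in their zoom factor, so their entries
# never collide even though they live in the same cache folder
print(cache_key(key, {'zoom': 0.25}))
print(cache_key(key, {'zoom': 0.5}))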
That's all for caching. See you in the next tutorials!