Caching¶
Sometimes the transforms are time-consuming. For example, Zoom from the previous tutorials is implemented via linear interpolation, which is quite expensive, especially for higher-dimensional objects such as 3D images.
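To get a feel for the cost, here's a small standalone sketch - it uses scipy.ndimage.zoom with linear interpolation rather than the tutorial's Zoom layer, so take it as an illustration only:
import numpy as np
from scipy.ndimage import zoom

image = np.random.rand(512, 512)        # a typical 2D image
volume = np.random.rand(128, 128, 128)  # a modest 3D volume

# order=1 selects linear interpolation; the 3D call interpolates along an
# extra axis and touches far more values, so it takes noticeably longer
small_image = zoom(image, 0.25, order=1)
small_volume = zoom(volume, 0.25, order=1)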
Caching to RAM¶
A popular way of dealing with this cost is caching. We'll start with the simplest kind - caching to RAM:
# let's create a dataset
from layers02 import *
from connectome import Chain

source = HeLa(root='DIC-C2DH-HeLa')
key = source.ids[0]

dataset = Chain(
    source,
    Binarize(),
    Zoom(factor=0.25),
    Crop(),
)
%%timeit
x, y = dataset.image(key), dataset.mask(key)
49.6 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is not much, but for larger images the time would be noticeably greater.
Now, let's cache this dataset:
from connectome import CacheToRam
cached = dataset >> CacheToRam()
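By the way, the >> operator chains layers, much like wrapping them in an explicit Chain (which we'll also use below):
from connectome import Chain, CacheToRam

# presumably equivalent to dataset >> CacheToRam()
cached = Chain(dataset, CacheToRam())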
The first call will take around the same time, because the data needs to be cached first:
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 81.9 ms, sys: 65 ms, total: 147 ms Wall time: 48.9 ms
but subsequent calls will be much faster:
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 4.93 ms, sys: 8.52 ms, total: 13.4 ms Wall time: 4.38 ms
%%timeit
x, y = cached.image(key), cached.mask(key)
1.55 ms ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And just like that we sped up our pipeline by a factor of ~30. Now this is fast!
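One thing to keep in mind: the cache is filled per key, so an id the pipeline hasn't seen yet still triggers the full computation once. A quick sketch, assuming the dataset has at least two ids:
# the first access for a new id is slow, repeated accesses come from RAM
other_key = source.ids[1]
x, y = cached.image(other_key), cached.mask(other_key)  # computed, then cached
x, y = cached.image(other_key), cached.mask(other_key)  # served from the cache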
Persistent Caching to Disk¶
Caching to RAM does speed up our pipelines, but we still have a problem - the first calls to image and mask are slow, because the computation needs to happen in the first place. This means that each time you restart your script (or recreate the pipeline, for that matter) you'll have to recompute the cache:
# create the dataset
cached = dataset >> CacheToRam()
%%time
# first call - slow
x, y = cached.image(key), cached.mask(key)
CPU times: user 90.3 ms, sys: 35.6 ms, total: 126 ms Wall time: 54 ms
%%time
# second call - fast
x, y = cached.image(key), cached.mask(key)
CPU times: user 10.9 ms, sys: 0 ns, total: 10.9 ms Wall time: 3.63 ms
# create the dataset again
cached = dataset >> CacheToRam()
%%time
# first call - slow again!
x, y = cached.image(key), cached.mask(key)
CPU times: user 89.1 ms, sys: 44.3 ms, total: 133 ms Wall time: 44.4 ms
What if we could make a persistent cache that keeps living between runs?
Well, we can! This is where caching to disk comes into play:
from connectome import CacheToDisk
cached = dataset >> CacheToDisk.simple('image', 'mask', root='cache')
So, what is happening here? We want to cache image and mask, and the cache will be stored in the cache folder in the current directory. You can change the path if you like.
CacheToDisk is a highly customizable layer; for this tutorial, however, simple is a good starting point - it will choose adequate default parameters for you.
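For example, to keep the cache somewhere else, just pass a different root (the path below is only a placeholder):
# any writable directory can serve as the cache root
cached_elsewhere = dataset >> CacheToDisk.simple('image', 'mask', root='/tmp/hela-cache')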
The first run is slow as always:
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 174 ms, sys: 127 ms, total: 301 ms Wall time: 112 ms
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 19.5 ms, sys: 23.8 ms, total: 43.3 ms Wall time: 15.4 ms
And subsequent calls are faster.
Now let's create the dataset again:
cached = dataset >> CacheToDisk.simple('image', 'mask', root='cache')
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 37.9 ms, sys: 11.4 ms, total: 49.4 ms Wall time: 17.1 ms
Now even the first call is fast! It's not as fast as caching to RAM, but we can combine the two:
cached = Chain(
    dataset,
    CacheToDisk.simple('image', 'mask', root='cache'),
    CacheToRam(),
)
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 15.3 ms, sys: 0 ns, total: 15.3 ms Wall time: 16 ms
%%time
x, y = cached.image(key), cached.mask(key)
CPU times: user 7.34 ms, sys: 0 ns, total: 7.34 ms Wall time: 7.29 ms
We get the best of both worlds: a RAM miss falls back on the disk cache, and only a miss in both triggers the actual computation. How neat is that!
Cache Invalidation¶
Now our cache is stored in the cache folder and is loaded from disk when needed. But there is a potential problem: what if we change the data preprocessing? Do we need to choose a new folder for the cache?
Luckily, the answer is no, we don't. connectome is smart enough to figure out that the data has changed, and it will always keep the cache consistent with your current data!
Watch this:
small = Chain(
    source,
    Binarize(),
    Zoom(factor=0.25),
    Crop(),
) >> CacheToDisk.simple('image', 'mask', root='cache')

big = Chain(
    source,
    Binarize(),
    Zoom(factor=0.5),
    Crop(),
) >> CacheToDisk.simple('image', 'mask', root='cache')
We have two datasets with different transformations: the first downsamples the images by a factor of 4, the second by a factor of 2 - and both write to the same cache folder.
Let's check the images' shapes:
# fill the cache
small.image(key).shape, big.image(key).shape
((122, 120), (244, 243))
# load from cache
small.image(key).shape, big.image(key).shape
((122, 120), (244, 243))
This is automatic cache invalidation at work!
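Under the hood, the idea is that a cache entry is keyed not only by the sample id but also by a fingerprint of the pipeline that produced it. Here is a toy sketch of the principle - it is not connectome's actual implementation:
import hashlib
import json

def cache_key(sample_id, pipeline_params):
    # fingerprint the parameters that define the pipeline
    fingerprint = hashlib.sha256(
        json.dumps(pipeline_params, sort_keys=True).encode()
    ).hexdigest()
    return f'{fingerprint}/{sample_id}'

# the two pipelines above differ in their zoom factor, so their entries
# never collide even though they live in the same cache folder
print(cache_key(key, {'zoom': 0.25}))
print(cache_key(key, {'zoom': 0.5}))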
That's all for caching. See you in the next tutorials!