Chapter 11 Working with digital pathology data and python

Histopathology is used to analyse tissues from the body under the microscope. When a sample of tissue is taken during and operation or biopsy, it is put into a preservative, dehydrated and then impregnated with paraffin wax. It may also be frozen. This allows the tissue to be very thinly sliced and mounted onto glass slides, where the tissue can then be stained and studied under the microscope. Different types of stains can be used to look at different things.

These stained slides can be scanned with a slide scanner to allow the image to be digitised using scanners. There are two main types of images. Brightfield - generated by using white light, which requires deconvolution to produce separate colour channels. Fluorescence - generated using a light source to excite fluorescent probes, this typically generates images with 2+ colour channels.

The file formats produced by these scanners varies by manufacturer and scanning technique. Essentially, data is stored as tiled TIFF images within what is possibly an XML structure. There are usually multiple images produced, not only of the tissue but also of the slide, including a macro image (no magnification, sometimes helps scanner locate tissue), and the slide label.

Unfortunately, it’s quite a difficult format to interrogate easily. However, there is a nice program written that can solve this issue, known as OpenSlide.

OpenSlide has C, Python and Java versions. The python version is probably easiest to interact with, either directly in Python, or it can be combined with R using reticulate.

This chapter will not go into depth about issues around environments or how to install python properly. Please familiarise yourself with how to use pip, particularly pip install and set / activate python environments.

You can use RStudio for working in python too and this is what is suggested here. Other formats exist including jupyter notebooks.

For more information, Python for microscopists has excellent resources.

11.1 Installing OpenSlide

First install the linux version of openslide - the python files act as a wrapper to this system library.

apt-get install python3-openslide

To install openslide and the libraries we will need, first create a python environment. Do this inside the folder you are going to be working in.

virtualenv -p python3 openslide_env

You can then either activate the virtual environment to install python libraries there, or install them system/user wide.

python3 -m pip install openslide-python

We will also need some other libraries

python3 -m pip install numpy matplotlib tifffile

Installation is fairly straighforward -if there are issues, do it as a sudo user. Most of the problems will appear if you are trying to use different or conflicting versions of python or if there are incorrect permissions for installation.

11.2 Using OpenSlide

Once openslide is installed, we can now write python scripts. Do this in RStudio or other IDE in the same folder as your virtual environment. You can create a project in RStudio using this existing folder, then add new python scripts. Python scripts have the suffix .py.

First, start by checking you can successfully import the libraries.

from openslide import open_slide
import openslide
from PIL import Image
import numpy as np
from matplotlib import pyplot as plt
import tifffile as tiff

If you get an error- try installing the libraries with pip as above.

11.2.1 Opening a slide and inspecting properties

file_path = '/home/test/slides/myslide.svs' # replace with where your slides are

slide_in = open_slide(file_path)

slide_in_props = slide_in.properties

slide_in_assc_images = slide_in.associated_images.items() # see if there are any associated images

11.3 Whole workflow

Here is the whole workflow for extracting magnified tiles of the image, taken from Python for microscopists.

file_path = '/home/test/slide_ins/myslide_in.svs' # replace with where your slide_ins are

slide_in = open_slide(file_path)

slide_in_props = slide_in.properties

print(slide_in_props)

print("Vendor is:", slide_in_props['openslide.vendor'])
print("Pixel size of X in um is:", slide_in_props['openslide.mpp-x'])
print("Pixel size of Y in um is:", slide_in_props['openslide.mpp-y'])

#Objective used to capture the image
objective = float(slide_in.properties[openslide.PROPERTY_NAME_OBJECTIVE_POWER])
print("The objective power is: ", objective)

# get slide_in dimensions for the level 0 - max resolution level
slide_in_dims = slide_in.dimensions
print(slide_in_dims)

#Get a thumbnail of the image and visualize
slide_in_thumb_600 = slide_in.get_thumbnail(size=(600, 600))
slide_in_thumb_600.show()

#Convert thumbnail to numpy array
slide_in_thumb_600_np = np.array(slide_in_thumb_600)
plt.figure(figsize=(8,8))
plt.imshow(slide_in_thumb_600_np)    


#Get slide_in dims at each level. Remember that whole slide_in images store information
#as pyramid at various levels
dims = slide_in.level_dimensions

num_levels = len(dims)
print("Number of levels in this image are:", num_levels)

print("Dimensions of various levels in this image are:", dims)

#By how much are levels downsampled from the original image?
factors = slide_in.level_downsamples
print("Each level is downsampled by an amount of: ", factors)

#Copy an image from a level
level3_dim = dims[2]
#Give pixel coordinates (top left pixel in the original large image)
#Also give the level number (for level 3 we are providing a valueof 2)
#Size of your output image
#Remember that the output would be a RGBA image (Not, RGB)
level3_img = slide_in.read_region((0,0), 2, level3_dim) #Pillow object, mode=RGBA

#Convert the image to RGB
level3_img_RGB = level3_img.convert('RGB')
level3_img_RGB.show()

#Convert the image into numpy array for processing
level3_img_np = np.array(level3_img_RGB)
plt.imshow(level3_img_np)


#Return the best level for displaying the given downsample.
SCALE_FACTOR = 32
best_level = slide_in.get_best_level_for_downsample(SCALE_FACTOR)
#Here it returns the best level to be 2 (third level)
#If you change the scale factor to 2, it will suggest the best level to be 0 (our 1st level)
#################################

#Generating tiles for deep learning training or other processing purposes
#We can use read_region function and slide_in over the large image to extract tiles
#but an easier approach would be to use DeepZoom based generator.
# https://openslide.org/api/python/

from openslide.deepzoom import DeepZoomGenerator

#Generate object for tiles using the DeepZoomGenerator
tiles = DeepZoomGenerator(slide_in, tile_size=256, overlap=0, limit_bounds=False)
#Here, we have divided our svs into tiles of size 256 with no overlap. 

#The tiles object also contains data at many levels. 
#To check the number of levels
print("The number of levels in the tiles object are: ", tiles.level_count)

print("The dimensions of data in each level are: ", tiles.level_dimensions)

#Total number of tiles in the tiles object
print("Total number of tiles = : ", tiles.tile_count)

#How many tiles at a specific level?
level_num = 11
print("Tiles shape at level ", level_num, " is: ", tiles.level_tiles[level_num])
print("This means there are ", tiles.level_tiles[level_num][0]*tiles.level_tiles[level_num][1], " total tiles in this level")

#Dimensions of the tile (tile size) for a specific tile from a specific layer
tile_dims = tiles.get_tile_dimensions(11, (3,4)) #Provide deep zoom level and address (column, row)


#Tile count at the highest resolution level (level 16 in our tiles)
tile_count_in_large_image = tiles.level_tiles[16] #126 x 151 (32001/256 = 126 with no overlap pixels)
#Check tile size for some random tile
tile_dims = tiles.get_tile_dimensions(16, (120,140))
#Last tiles may not have full 256x256 dimensions as our large image is not exactly divisible by 256
tile_dims = tiles.get_tile_dimensions(16, (125,150))


single_tile = tiles.get_tile(16, (62, 70)) #Provide deep zoom level and address (column, row)
single_tile_RGB = single_tile.convert('RGB')
single_tile_RGB.show()

###### Saving each tile to local directory
cols, rows = tiles.level_tiles[16]

import os
tile_dir = "images/saved_tiles/original_tiles/"
for row in range(rows):
    for col in range(cols):
        tile_name = os.path.join(tile_dir, '%d_%d' % (col, row))
        print("Now saving tile with title: ", tile_name)
        temp_tile = tiles.get_tile(16, (col, row))
        temp_tile_RGB = temp_tile.convert('RGB')
        temp_tile_np = np.array(temp_tile_RGB)
        plt.imsave(tile_name + ".png", temp_tile_np)

11.4 Using python and R together

Using the reticulate package, R can communicate with python. This is slightly glitchy at the time of writing. So writing python objects from R seems to not work (using py$ as mentioned in reticulate documentation), but python calling objects in the R environment seems reliable.

To do this we simply access the R environment in python using r['myobject']. Lets use the file_path example, where you might have used R to save the file path of specific slide file, then you want to use python to extract or manipulate the images.

In R:

file_path = '/home/test/slides/myslide.svs' # replace with where your slides are

In python:

file_path = r['file_path']
slide_in = open_slide(file_path)

11.5 Handling errors in python

Most slides will be scanned well, newer scanners use lasers and a variety of different tools to automatically detect where tissue is and the depth at which to scan. However, some files will not scan properly or be corrupted in transfer (files are at least 200M+ per scan, ususally around 500M).

If there are errors, it is quite annoying, particularly given the size of files being handled if a run fails due to errors in opening. To avoid this we can use the python try commands, where it will attempt the command but continue on fail.

#import pyvips
from openslide import open_slide
import openslide
from PIL import Image
import numpy as np
from matplotlib import pyplot as plt
import tifffile as tiff

#import
file_path = r['file_path']

try:
  slide_in = open_slide(file_path)
  slide_in_props = slide_in.properties
  r['slide_properties'] = slide_in_props
  r['slide_associated_images'] = slide_in.associated_images.items()
except Exception as e:
  r['slide_properties'] = print("Error loading") # doesn't save error
  r['slide_associated_images'] = print("Error loading") # doesn't save error

11.6 Iterating over many many slide files

Combining all the above, we can massively extend functionality and the number of slides we can process, for example to extract hundreds of slides at once and apply the same analysis to them.

In R:

library(tidyverse)
library(reticulate)
Sys.setenv(RETICULATE_PYTHON = "/home/slides/openslide_env/bin/python")

#list files in a dir
files = data.frame(file_path = list.files('/home/slides', 
                                          pattern = '.svs|.scn',
                                          full.names = T, recursive = T))#scn needs different

#make names
files = files %>% 
  mutate(file_name =  basename(file_path),
         label_path = gsub('.svs|.scn', '', file_name),
         out_path = gsub('.svs|.scn', '_label.png', file_path),
         file_info = file.info(file_path))


#first get label properties
slide_properties_list = list()
associate_images_list = list()

#get properties

for(i in 1:length(files$label_path)){
  label_path = files$label_path[i]
  file_path = files$file_path[i]
  out_path = files$out_path[i]
  source_python('label_properties.py', envir = globalenv())
  slide_properties_list[[i]] = as.character(slide_properties)
  associate_images_list[[i]] = as.character(slide_associated_images)
}

openslide_data = data.frame(do.call(cbind, list(files$file_path, associate_images_list, slide_properties_list)))

In python label_properties.py:

#import pyvips
from openslide import open_slide
import openslide
from PIL import Image
import numpy as np
from matplotlib import pyplot as plt
import tifffile as tiff

#import
file_path = r['file_path']

try:
  #slide_in = open_slide('/mnt/data/tdrake/light_microscopy/immune_staining_from_colin/01082019/collection_0000020061_2019-08-01 13_05_34.scn')
  slide_in = open_slide(file_path)
  slide_in_props = slide_in.properties
  r['slide_properties'] = slide_in_props
  r['slide_associated_images'] = slide_in.associated_images.items()
except Exception as e:
  r['slide_properties'] = print("Error loading")
  r['slide_associated_images'] = print("Error loading")