Chapter 11 Working with digital pathology data and Python
Histopathology is the study of tissues from the body under the microscope. When a sample of tissue is taken during an operation or biopsy, it is put into a preservative, dehydrated and then impregnated with paraffin wax. It may also be frozen. This allows the tissue to be very thinly sliced and mounted onto glass slides, where it can then be stained and studied under the microscope. Different types of stain can be used to look at different things.
These stained slides can then be digitised with a slide scanner. There are two main types of image. Brightfield - generated using white light, which requires deconvolution to produce separate colour channels. Fluorescence - generated using a light source to excite fluorescent probes; this typically generates images with two or more colour channels.
The file formats produced by these scanners vary by manufacturer and scanning technique. Essentially, data is stored as tiled TIFF images, possibly within an XML structure. There are usually multiple images produced, not only of the tissue but also of the slide itself, including a macro image (no magnification, sometimes used to help the scanner locate tissue) and the slide label.
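Because many of these formats are TIFF-based, a quick sanity check on an unfamiliar file is to read its first four bytes. A classic TIFF starts with a byte-order mark (II or MM) followed by the number 42, and BigTIFF uses 43 instead. The sketch below uses only the standard library. The function name tiff_flavour is hypothetical, and a positive result does not guarantee that the file is a valid slide or that OpenSlide can read it.

```python
import struct

def tiff_flavour(path):
    """Return 'TIFF', 'BigTIFF' or None based on the first four bytes of a file."""
    with open(path, "rb") as handle:
        header = handle.read(4)
    if len(header) < 4:
        return None
    byte_order = header[:2]
    if byte_order == b"II":      # little-endian byte order
        fmt = "<H"
    elif byte_order == b"MM":    # big-endian byte order
        fmt = ">H"
    else:
        return None              # not a TIFF-style header
    (magic,) = struct.unpack(fmt, header[2:4])
    return {42: "TIFF", 43: "BigTIFF"}.get(magic)

# Example call (the path is a placeholder):
# print(tiff_flavour("/home/test/slides/myslide.svs"))
```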
Unfortunately, these formats are quite difficult to interrogate directly. However, there is a nice library, OpenSlide, that solves this problem.
OpenSlide has C, Python and Java versions. The Python version is probably the easiest to interact with, either directly in Python or combined with R using reticulate.
This chapter will not go into depth about issues around environments or how to install Python properly. Please familiarise yourself with how to use pip, particularly pip install, and how to set up and activate Python environments.
You can also use RStudio for working in Python, and this is what is suggested here. Other options exist, including Jupyter notebooks.
For more information, Python for microscopists has excellent resources.
11.1 Installing OpenSlide
First install OpenSlide itself on Linux (the command below is for Debian or Ubuntu). The Python package acts as a wrapper around this system library.
apt-get install python3-openslide
To install the OpenSlide Python package and the other libraries we will need, first create a Python environment. Do this inside the folder you are going to be working in.
virtualenv -p python3 openslide_env
You can then either activate the virtual environment (source openslide_env/bin/activate) and install the Python libraries there, or install them system-wide or user-wide.
python3 -m pip install openslide-python
We will also need some other libraries:
python3 -m pip install numpy matplotlib tifffile
Installation is fairly straightforward. If there are issues, try running the install with sudo. Most problems appear when you are using different or conflicting versions of Python, or when the permissions for installation are incorrect.
11.2 Using OpenSlide
Once OpenSlide is installed, we can write Python scripts. Do this in RStudio or another IDE, in the same folder as your virtual environment. You can create a project in RStudio using this existing folder, then add new Python scripts. Python scripts have the suffix .py.
First, start by checking you can successfully import the libraries.
from openslide import open_slide
import openslide
from PIL import Image
import numpy as np
from matplotlib import pyplot as plt
import tifffile as tiff
If you get an error, try installing the libraries with pip as above.
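If you would rather see every missing library at once instead of stopping at the first failed import, a small check along these lines may help. It is a minimal sketch using only the standard library. The helper name missing_modules is hypothetical, and finding the openslide module does not prove that the underlying system library is installed.

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level module names for the imports used in this chapter.
required = ["openslide", "PIL", "numpy", "matplotlib", "tifffile"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

Note that the module names differ from some pip package names: openslide comes from openslide-python and PIL from Pillow.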
11.3 Whole workflow
Here is the whole workflow for extracting magnified tiles of the image, taken from Python for microscopists.
file_path = '/home/test/slides/myslide.svs' # replace with where your slides are
slide_in = open_slide(file_path)

slide_in_props = slide_in.properties
print(slide_in_props)

print("Vendor is:", slide_in_props['openslide.vendor'])
print("Pixel size of X in um is:", slide_in_props['openslide.mpp-x'])
print("Pixel size of Y in um is:", slide_in_props['openslide.mpp-y'])

#Objective used to capture the image
objective = float(slide_in.properties[openslide.PROPERTY_NAME_OBJECTIVE_POWER])
print("The objective power is: ", objective)

# get slide dimensions for the level 0 - max resolution level
slide_in_dims = slide_in.dimensions
print(slide_in_dims)

#Get a thumbnail of the image and visualize
slide_in_thumb_600 = slide_in.get_thumbnail(size=(600, 600))
slide_in_thumb_600.show()

#Convert thumbnail to numpy array
slide_in_thumb_600_np = np.array(slide_in_thumb_600)
plt.figure(figsize=(8,8))
plt.imshow(slide_in_thumb_600_np)

#Get slide dims at each level. Remember that whole slide images store information
#as pyramid at various levels
dims = slide_in.level_dimensions

num_levels = len(dims)
print("Number of levels in this image are:", num_levels)
print("Dimensions of various levels in this image are:", dims)

#By how much are levels downsampled from the original image?
factors = slide_in.level_downsamples
print("Each level is downsampled by an amount of: ", factors)

#Copy an image from a level
level3_dim = dims[2]
#Give pixel coordinates (top left pixel in the original large image)
#Also give the level number (for level 3 we are providing a value of 2)
#Size of your output image
#Remember that the output would be a RGBA image (Not, RGB)
level3_img = slide_in.read_region((0,0), 2, level3_dim) #Pillow object, mode=RGBA

#Convert the image to RGB
level3_img_RGB = level3_img.convert('RGB')
level3_img_RGB.show()

#Convert the image into numpy array for processing
level3_img_np = np.array(level3_img_RGB)
plt.imshow(level3_img_np)

#Return the best level for displaying the given downsample.
SCALE_FACTOR = 32
best_level = slide_in.get_best_level_for_downsample(SCALE_FACTOR)
#Here it returns the best level to be 2 (third level)
#If you change the scale factor to 2, it will suggest the best level to be 0 (our 1st level)
#################################
#Generating tiles for deep learning training or other processing purposes
#We can use read_region function and slide over the large image to extract tiles
#but an easier approach would be to use DeepZoom based generator.
# https://openslide.org/api/python/
from openslide.deepzoom import DeepZoomGenerator

#Generate object for tiles using the DeepZoomGenerator
tiles = DeepZoomGenerator(slide_in, tile_size=256, overlap=0, limit_bounds=False)
#Here, we have divided our svs into tiles of size 256 with no overlap.

#The tiles object also contains data at many levels.
#To check the number of levels
print("The number of levels in the tiles object are: ", tiles.level_count)
print("The dimensions of data in each level are: ", tiles.level_dimensions)
#Total number of tiles in the tiles object
print("Total number of tiles = : ", tiles.tile_count)

#How many tiles at a specific level?
level_num = 11
print("Tiles shape at level ", level_num, " is: ", tiles.level_tiles[level_num])
print("This means there are ", tiles.level_tiles[level_num][0]*tiles.level_tiles[level_num][1], " total tiles in this level")

#Dimensions of the tile (tile size) for a specific tile from a specific layer
tile_dims = tiles.get_tile_dimensions(11, (3,4)) #Provide deep zoom level and address (column, row)

#Tile count at the highest resolution level (level 16 in our tiles)
tile_count_in_large_image = tiles.level_tiles[16] #126 x 151 (32001/256 = 126 with no overlap pixels)
#Check tile size for some random tile
tile_dims = tiles.get_tile_dimensions(16, (120,140))
#Last tiles may not have full 256x256 dimensions as our large image is not exactly divisible by 256
tile_dims = tiles.get_tile_dimensions(16, (125,150))

single_tile = tiles.get_tile(16, (62, 70)) #Provide deep zoom level and address (column, row)
single_tile_RGB = single_tile.convert('RGB')
single_tile_RGB.show()

###### Saving each tile to local directory
cols, rows = tiles.level_tiles[16]

import os
tile_dir = "images/saved_tiles/original_tiles/"
os.makedirs(tile_dir, exist_ok=True) #Create the output folder if it does not exist yet
for row in range(rows):
    for col in range(cols):
        tile_name = os.path.join(tile_dir, '%d_%d' % (col, row))
        print("Now saving tile with title: ", tile_name)
        temp_tile = tiles.get_tile(16, (col, row))
        temp_tile_RGB = temp_tile.convert('RGB')
        temp_tile_np = np.array(temp_tile_RGB)
        plt.imsave(tile_name + ".png", temp_tile_np)
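The level and tile numbers in the comments above (17 Deep Zoom levels, a 126 x 151 grid at level 16, smaller tiles at the edges) follow from simple arithmetic. The sketch below reproduces that arithmetic in plain Python, without OpenSlide, on the assumption that each coarser Deep Zoom level halves the image (rounding up) until it reaches 1 x 1 pixel. The function names are hypothetical, and the slide height of 38500 pixels is made up so that it gives 151 rows. Only the 32001 pixel width comes from the comments above.

```python
import math

def deepzoom_level_dimensions(width, height):
    """Return (width, height) per Deep Zoom level, smallest (1x1) level first."""
    dims = [(width, height)]
    while dims[-1][0] > 1 or dims[-1][1] > 1:
        w, h = dims[-1]
        # Each coarser level halves both sides, rounding up, never below 1 pixel.
        dims.append((max(1, math.ceil(w / 2)), max(1, math.ceil(h / 2))))
    return dims[::-1]

def tile_grid(level_dim, tile_size=256):
    """Number of tile (columns, rows) needed to cover one level."""
    w, h = level_dim
    return (math.ceil(w / tile_size), math.ceil(h / tile_size))

def tile_size_at(level_dim, address, tile_size=256):
    """Pixel size of the tile at (column, row), assuming no overlap."""
    w, h = level_dim
    col, row = address
    return (min(tile_size, w - col * tile_size), min(tile_size, h - row * tile_size))

# Assumed slide size: 32001 px wide (from the comment above) and a made-up height.
levels = deepzoom_level_dimensions(32001, 38500)
top = levels[-1]
print(len(levels))                    # 17 levels, numbered 0 to 16
print(tile_grid(top))                 # (126, 151)
print(tile_size_at(top, (120, 140)))  # (256, 256)
print(tile_size_at(top, (125, 150)))  # (1, 100)
```

For a real slide, rely on tiles.level_tiles and tiles.get_tile_dimensions rather than this sketch, since they also account for overlap and the limit_bounds setting.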
11.4 Using python and R together
Using the reticulate package, R can communicate with Python. This is slightly glitchy at the time of writing: writing Python objects from R (using py$, as described in the reticulate documentation) does not seem to work, but Python calling objects in the R environment seems reliable.
To do this we simply access the R environment in Python using r['myobject']. Let's use the file_path example, where you might have used R to save the file path of a specific slide file and then want to use Python to extract or manipulate the images.
In R:
file_path = '/home/test/slides/myslide.svs' # replace with where your slides are
In python:
file_path = r['file_path']
slide_in = open_slide(file_path)
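A script that reads r['file_path'] will fail with a NameError when run outside reticulate, because the r object only exists when Python is started from R. If you want the same script to work in both settings, one option is a small lookup with a fallback. This is a sketch only: the helper get_r_value is hypothetical, and it assumes that reticulate exposes r as a global in the script's namespace.

```python
def get_r_value(name, default=None, namespace=None):
    """Look up `name` in reticulate's `r` object if present, else return `default`."""
    # By default, search this module's globals, where `r` is assumed to be defined by reticulate.
    namespace = globals() if namespace is None else namespace
    r_env = namespace.get("r")
    if r_env is None:
        return default          # running as plain Python: no R session attached
    return r_env[name]

# Falls back to the placeholder path when no R session is attached.
file_path = get_r_value("file_path", default="/home/test/slides/myslide.svs")
print(file_path)
```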
11.5 Handling errors in python
Most slides will be scanned well: newer scanners use lasers and a variety of other tools to automatically detect where the tissue is and the depth at which to scan. However, some slides will not scan properly, and some files will be corrupted in transfer (files are at least 200 MB per scan, usually around 500 MB).
It is quite annoying, particularly given the size of the files being handled, if a whole run fails because one file cannot be opened. To avoid this we can use Python's try and except statements, which attempt a command but continue if it fails.
#import pyvips
from openslide import open_slide
import openslide
from PIL import Image
import numpy as np
from matplotlib import pyplot as plt
import tifffile as tiff
#import
file_path = r['file_path']

try:
    slide_in = open_slide(file_path)
    slide_in_props = slide_in.properties
    r['slide_properties'] = slide_in_props
    r['slide_associated_images'] = slide_in.associated_images.items()
except Exception as e:
    r['slide_properties'] = print("Error loading") # doesn't save error
    r['slide_associated_images'] = print("Error loading") # doesn't save error
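As the comments note, the code above does not save the error: print() returns None, so nothing useful is stored. If you want to keep the reason for each failure, one possible variation is to return the error text alongside the result. The sketch below is not part of the original workflow. The function name load_slide_info and the dictionary keys are hypothetical, and the opener is passed in as an argument so that the logic can be tried without a real slide.

```python
def load_slide_info(path, opener):
    """Try to open a slide; return a dict with either its properties or the error text."""
    try:
        slide = opener(path)
        return {"ok": True, "properties": dict(slide.properties), "error": None}
    except Exception as exc:  # broad on purpose: one bad file should not stop a batch run
        return {"ok": False, "properties": None,
                "error": f"{type(exc).__name__}: {exc}"}

# Intended use with OpenSlide (not run here):
# from openslide import open_slide
# info = load_slide_info(file_path, open_slide)
# if not info["ok"]:
#     print("Could not open", file_path, "because", info["error"])
```

Passing the opener in also makes it easy to swap in a different reader for formats that OpenSlide does not handle.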
11.6 Iterating over many many slide files
Combining all of the above, we can massively extend the functionality and the number of slides we can process, for example to extract data from hundreds of slides at once and apply the same analysis to each of them.
In R:
library(tidyverse)
library(reticulate)
Sys.setenv(RETICULATE_PYTHON = "/home/slides/openslide_env/bin/python")
#list files in a dir
files = data.frame(file_path = list.files('/home/slides',
                                          pattern = '.svs|.scn',
                                          full.names = T, recursive = T))#scn needs different

#make names
files = files %>%
  mutate(file_name = basename(file_path),
         label_path = gsub('.svs|.scn', '', file_name),
         out_path = gsub('.svs|.scn', '_label.png', file_path),
         file_info = file.info(file_path))

#first get label properties
slide_properties_list = list()
associate_images_list = list()

#get properties
for(i in 1:length(files$label_path)){
  label_path = files$label_path[i]
  file_path = files$file_path[i]
  out_path = files$out_path[i]
  source_python('label_properties.py', envir = globalenv())
  slide_properties_list[[i]] = as.character(slide_properties)
  associate_images_list[[i]] = as.character(slide_associated_images)
}

openslide_data = data.frame(do.call(cbind, list(files$file_path, associate_images_list, slide_properties_list)))
In Python (label_properties.py):
#import pyvips
from openslide import open_slide
import openslide
from PIL import Image
import numpy as np
from matplotlib import pyplot as plt
import tifffile as tiff
#import
file_path = r['file_path']

try:
    #slide_in = open_slide('/mnt/data/tdrake/light_microscopy/immune_staining_from_colin/01082019/collection_0000020061_2019-08-01 13_05_34.scn')
    slide_in = open_slide(file_path)
    slide_in_props = slide_in.properties
    r['slide_properties'] = slide_in_props
    r['slide_associated_images'] = slide_in.associated_images.items()
except Exception as e:
    r['slide_properties'] = print("Error loading")
    r['slide_associated_images'] = print("Error loading")
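The file-listing step in the R code above can also be done on the Python side, which may be convenient if you later drop R from the pipeline. The sketch below is a rough equivalent using pathlib. The function names find_slides and label_output_path are hypothetical. It matches on the exact file suffix, which is slightly stricter than the R pattern '.svs|.scn', where the dot matches any character and the match can occur anywhere in the name.

```python
from pathlib import Path

SLIDE_SUFFIXES = {".svs", ".scn"}

def find_slides(root, suffixes=SLIDE_SUFFIXES):
    """Recursively list slide files under `root`, sorted for a reproducible order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )

def label_output_path(slide_path):
    """Mirror out_path in the R code: replace the extension with '_label.png'."""
    slide_path = Path(slide_path)
    return slide_path.with_name(slide_path.stem + "_label.png")

# Example with the placeholder folder used in this chapter:
for slide in find_slides("/home/slides"):
    print(slide, "->", label_output_path(slide))
```

If the folder does not exist, the loop simply prints nothing, so the snippet is safe to run as a first check.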