Oftentimes, you will have an image that you want to extract data from.
It will most likely be something annoyingly minor, but will somehow turn into the most important piece to the puzzle that you *NEED RIGHT NOW!*
Okay… maybe that’s just me.
Sometimes it’s nice to be able to extract data from an image you found in a journal article to replicate the results.

Here, I will go through the steps I took to convert an image into RGB format (using Python - naturally) and convert it into an array of data. From there it gets interesting. A lot of images I come across have been interpolated from relatively sparse data. You don’t need all that data! It’s a waste of space and you’re better off trimming the bits you don’t actually need - especially if you would like to do some spatial post-processing.

Here are the main steps:

- Import the image using the
`scipy.ndimage`

module. - Setup a colour map using
`matplotlib.colors`

- Query the colour map with many numbers along the interval you’ve defined and store that in a KDTree.
- Query the KDTree using the $m \times n \times 3$
`ndimage`

RGB array. The result is an $m \times n$ data array. - Reduce the data by taking the first spatial derivative using
`numpy.diff`

and removing points where $|\nabla| > 0$. - Use
`sklearn.DBScan`

from the scikit-learn module to cluster the data and remove even more unnecessary data. - Have a beer.

At the end I have a Python class that contains all these steps rolled into one nifty code chunk for you to use on your own projects, if you find this useful.

Ready? Here we go…

## Load the image

```
from scipy import ndimage
import numpy as np
import matplotlib.pyplot as plt
im = ndimage.read("filename.jpg") # flatten=True for b&w
plt.imshow(im)
plt.show()
```

That looks nice, `im`

is a $m \times n \times 3$ array - i.e. 3 sets of $m$ rows, $n$ columns, that correspond to the number of pixels, for each RGB channel.
Now let’s create the colour map and chuck everything into a KDTree for nearest-neighbour lookup.
Seriously, KDTrees are awesome. srsly.

```
from scipy.spatial import cKDTree
from matplotlib import colors as mcolors
def scalarmap(cmap, vmin, vmax):
""" Create a matplotlib scalarmap object from colour stretch """
cNorm = mcolors.Normalize(vmin=vmin, vmax=vmax)
return plt.cm.ScalarMappable(norm=cNorm, cmap=cmap)
scalarMap = scalarmap(plt.get_cmap("jet_r"), 0.0, 0.7)
value_array = np.linspace(0,0.7,10000)
color_array = np.asarray(scalarMap.to_rgba(value_array))
color_array *= 255 # rescale
tree = cKDTree(color_array[:,:-1]) # Last column is alpha
d, index = tree.query(im)
data = value_array[index]
```

We scaled the colour map between 0.0 and 0.7, which corresponds to the extent in the image.
Optionally, we could mask pixels that are some distance, $d$, from the colour map.
E.g. `data = np.ma.array(value_array[index], mask=(d > 70))`

.
This would mask out the black text and white background in the image.

*I don’t know a better way to map the RGB colour arrays to data than to query entries in a KDTree.
If you do, please mention in the comments below.*

## Data reduction

Now that we have converted an image into an array we can reduce the data so that only the important bits remain. Firstly, we take the first spatial derivative and find where $\nabla = 0$.

```
dx = np.abs(np.diff(data[:,:-1], axis=0))
dy = np.abs(np.diff(data[:-1,:], axis=1))
dxdy = (dx + dy)/2.0
mask = dxy < 1e-12
```

Here, `mask`

is a $m \times n$ boolean array that contains `True`

values where we have good data.
But we can reduce it even further using `DBScan`

.
This finds spatial regions of high density and expands clusters from them.
I think this is basically a higher order use of KDTrees, but let’s humour `sklearn`

toolkit anyway.

```
from scikit.cluster import DBSCAN
Xcoords = np.arange(data.shape[1]-1)
Ycoords = np.arange(data.shape[0]-1)
xx, yy = np.meshgrid(Xcoords, Ycoords)
db = DBSCAN(eps=0.05, min_samples=1).fit(xx[mask], yy[mask])
labels = db.labels_
num_clusters = len(set(db.labels)) - (1 if -1 in db.labels else 0)
def get_centroid(points):
""" find centroid of a cluster in xy space """
n = points.shape[0]
sum_lon = np.sum(points[:, 0])
sum_lat = np.sum(points[:, 1])
return (sum_lon/n, sum_lat/n)
tree = cKDTree(np.vstack([xx[mask], yy[mask]]).T)
x = np.zeros(num_clusters, dtype=int)
y = np.zeros(num_clusters, dtype=int)
for i in xrange(num_clusters):
xc = xtrim[mask][labels == i]
yc = ytrim[mask][labels == i]
centroid = get_centroid(np.vstack([xc,yc]).T)
d, index = tree.query(centroid)
x[i], y[i] = tree.data[index]
mask2 = np.zeros_like(xx, dtype=bool)
mask2[y,x] = True
```

Above, we found the centroids for each cluster to reduce the substantially reduce the dataset from 16667 points down to 1221. Each colour represents a unique cluster in the image below.

`mask2`

has significantly improved on `mask`

by finding clusters in the data.
Now that it’s done, let’s see if we can reconstruct the image we first started with.
This is a good check to make sure we haven’t removed too much of the original data.

Overall the two look pretty similar!
There are some bits that could be refined by adjusting some of the parameters, e.g. `eps`

in `DBScan`

and $\nabla \approx 0$.
Also, adjusting the distance threshold in the original RGB $\rightarrow$ data conversion would remove artifacts caused by the coastline (i.e. bits that don’t find on the colour map).