Single-particle diffusional fingerprinting: A machine-learning framework for quantitative analysis of heterogeneous diffusion

Significance Single-particle tracking (SPT) analysis of individual biomolecules is an indispensable tool for extracting quantitative information from dynamic biological processes, but often requires some a priori knowledge of the system. Here we present “single-particle diffusional fingerprinting,” a more general approach for extraction of diffusional patterns in SPT independently of the biological system. This method extracts a set of descriptive features for each SPT trajectory, which are ranked upon classification to yield mechanistic insights for the species under comparison. We demonstrate its capacity to yield a dictionary of diffusional traits across multiple systems (e.g., lipases hydrolyzing fat, transcription factors diffusing in cells, and nanoparticles in mucus), supporting its use on multiple biological phenomena (e.g., drug delivery, receptor dynamics, and virology).

Super/Sub diffusion with four HMM states. The parameters for the HMM states are identical to the slow variant: (0.4, 0.25, 0.25, 0.1), with the same transition probability of 0.1. To simulate increments with the statistics of a given α, the fbm Python library with the daviesharte method was employed. For a given α, traces of unit diffusion constant were generated from a series of fractional Gaussian noise increments using the fgn method of the library. Each increment in the generated trace was then scaled by √(2 D_i Δt), where D_i is the diffusion constant of the state at that frame and Δt was set to 0.1 ms, as in the simulation of normal diffusion.
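For concreteness, a minimal numpy sketch of the state-history generation and increment scaling is given below. The interpretation of the 0.1 transition probability (taken here as the total per-frame probability of leaving the current state, split evenly among the other three states) and the function names are our assumptions; the fbm library call that would supply the unit-variance fractional Gaussian noise for a general α is not shown (for α = 1 it reduces to white noise).

```python
import numpy as np

rng = np.random.default_rng(1)

# Four HMM states: diffusion constants from the text and a per-frame switching
# probability of 0.1 (assumed: total probability of leaving the current state).
D_states = np.array([0.4, 0.25, 0.25, 0.1])
p_shift = 0.1
dt = 1e-4  # 0.1 ms, as in the normal-diffusion simulation

def simulate_states(n_frames):
    """Generate a state history from the four-state HMM."""
    states = np.empty(n_frames, dtype=int)
    states[0] = rng.integers(4)
    for i in range(1, n_frames):
        if rng.random() < p_shift:
            # jump to one of the three other states, chosen uniformly
            others = [s for s in range(4) if s != states[i - 1]]
            states[i] = rng.choice(others)
        else:
            states[i] = states[i - 1]
    return states

def scale_increments(unit_increments, states):
    """Scale unit-diffusion increments (e.g. fGn from fbm.fgn) by sqrt(2 D_i dt)."""
    return unit_increments * np.sqrt(2.0 * D_states[states] * dt)
```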

Computation of the features
In order to provide a generalizable yet informative description of single-particle tracking data, we chose to use feature-based machine learning for the prediction of mutant variants. The workflow is presented in Fig. 1 and consists of the following steps: given a set of trajectories (x, y), we compute a set of 17 intuitive features aimed at describing the general trends of random walks in many environments of interest (see Table S1). The features are described in the following. The features trappedness, kurtosis, gaussianity, fractal dimension, efficiency, and MSDratio are adapted from Kowalek et al. (1).

Table S1. Features used for the single-molecule diffusional fingerprint, with a short description of what each seeks to describe. The features T0, T1, T2, T3, <tau>, and meanSL are obtained from the step lengths of the trace and give information on how much the diffusion speed fluctuates over the trajectory, as well as on the average speed of the diffusion. The features Pval, alpha, D, trappedness, meanMSD, and MSDratio give information on how much the particle spreads out during the trajectory. The features kurtosis, gaussianity, fractal dimension, and efficiency allow the fingerprint to capture the directionality of the walk by describing correlations between steps as well as the linearity and asymmetry of the trajectory. Finally, the duration of the trace was included, as it gives key information on binding affinity and bleaching.

A. HMM-based features: T0, T1, T2, T3, and <tau>. Once the four-state hidden Markov model with Gaussian emissions has been fitted to the step lengths, the Viterbi path is computed for each trajectory, and the features T0, T1, T2, and T3 are computed as the fraction of frames spent in each of the four states along the trajectory. Finally, the feature <tau> is computed as the average residence time in states along the trajectory.
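As a sketch of how the occupation fractions and the mean residence time can be obtained once a Viterbi path is available (function name is ours; the HMM fitting itself is not shown):

```python
import numpy as np

def hmm_occupancy_features(viterbi_path, n_states=4):
    """Occupation fractions T0..T3 and mean residence time <tau> (in frames)
    computed from a Viterbi state sequence."""
    path = np.asarray(viterbi_path)
    # fraction of frames spent in each state
    T = np.array([(path == s).mean() for s in range(n_states)])
    # run-length decomposition: indices where the state changes
    change = np.flatnonzero(np.diff(path)) + 1
    run_lengths = np.diff(np.concatenate(([0], change, [len(path)])))
    return T, run_lengths.mean()
```

Multiplying the returned residence time by Δt converts it from frames to physical time.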
We chose Gaussian emissions rather than the step-length distributions of normal diffusion since the purpose of the HMM is not to fit the actual diffusion model; rather, it is meant to capture the rough step-length regimes in the data, and for this the normal distribution is well suited, as it is more localized than the broader step-length distributions of Brownian motion. For motion of both lower and higher complexity than four states, the chosen number of four HMM states was found to give good performance. The transcription factor data used for Fig. 4 D, E, F, which was previously analyzed using a two-state model (2), could be analyzed in terms of our four states to retrieve similar mechanistic interpretations while maintaining a classification score on par with deep neural network classifiers (Fig. S8). The comparably simple four-state HMM features performed well on a complex data set consisting of several state-shifting species combining normal diffusion, directed motion, confined diffusion, and anomalous diffusion. The diffusional fingerprinting feature set yielded a significant improvement in classification as compared to non-HMM features and to the current state-of-the-art convolutional neural network (CNN) classifiers (Table S2, Figs. S11, S12, and S13).

B. MSD-fitting-based features: Pval, alpha, and D. We compute the mean squared displacement (MSD) for each trace and fit the resulting curve to a power law: MSD = D t^α. The diffusion constant D gives information on the general spreading of the motion, whereas α is a measure of the persistence of the walk: α = 1 corresponds to normal diffusion, α > 1 to superdiffusive, persistent motion, and α < 1 to subdiffusive motion that is anti-persistent and often trapped (3). Finally, we include the p-value of a χ² test for the fit as a feature, since it gives a rough estimate of the degree to which the data follow a power-law curve.
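A minimal sketch of the MSD computation and power-law fit, done here as a straight-line fit in log-log space rather than the χ²-based fit described above (function names are ours):

```python
import numpy as np

def msd(x, y, max_lag=None):
    """Time-averaged mean squared displacement for lags 1..max_lag."""
    N = len(x)
    max_lag = max_lag or N // 2
    out = np.empty(max_lag)
    for n in range(1, max_lag + 1):
        out[n - 1] = np.mean((x[n:] - x[:-n]) ** 2 + (y[n:] - y[:-n]) ** 2)
    return out

def fit_alpha_D(msds, dt):
    """Fit MSD = D * t**alpha via a linear fit: log MSD = log D + alpha * log t."""
    lags = dt * np.arange(1, len(msds) + 1)
    alpha, logD = np.polyfit(np.log(lags), np.log(msds), 1)
    return alpha, np.exp(logD)
```

A perfectly directed trace recovers α = 2, and pure Brownian motion α ≈ 1.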
C. Trappedness. The trappedness is based on a computation by Saxton of the probability P that a normal random walk with diffusion constant D (in units of length²/frame) will stay within a circular region of radius R for a time t (4). If this probability deviates significantly from 1, one would start to suspect that the particle is trapped. Therefore, the feature computed as 1 − P becomes a measure of the trappedness of the particle. The diffusion constant is estimated from the difference between the first two points of the MSD curve, and R is taken to be half the maximum separation between any two points in the trajectory. From simulations we found that the trappedness is distributed centrally around 0.5 for normal diffusion.

D. Kurtosis.
Kurtosis is a feature which compares the tail of the spatial probability density of the trajectory to that of the Gaussian (1). For a given trajectory of duration N with coordinates (x_i, y_i), i = 1, 2, ..., N, this is done by first computing the covariance matrix (or gyration tensor)

T = ( ⟨x²⟩ − ⟨x⟩²   ⟨xy⟩ − ⟨x⟩⟨y⟩ ; ⟨xy⟩ − ⟨x⟩⟨y⟩   ⟨y²⟩ − ⟨y⟩² )   [5]

where ⟨·⟩ denotes the average over the trajectory. For normal diffusion, the eigenvectors of this matrix point in the directions of the minor and major axes of an ellipse enclosing values of equal probability. The major axis of this ellipse is in the direction of the eigenvector (r1, r2) with the largest eigenvalue. Given the major axis of the covariance matrix, we can project each coordinate (x_i, y_i) onto this axis to get a set of scalar values, yielding a maximum-spread 1D projection x_i^p of the trajectory. It is the kurtosis of the distribution of x_i^p to which the feature name refers. Kurtosis is the standardized fourth moment of a distribution, and it is estimated for the maximum-spread projection as

K = (1/N) Σ_i (x_i^p − x̄^p)^4 / σ_{x^p}^4

where σ_{x^p} is the estimated standard deviation of the projected trajectory. The kurtosis captures how heavy-tailed the distribution is. As a feature, it compares how far out the edge points of a trajectory get relative to what would be expected for normal Brownian motion. Simulations show that normal diffusion displays a kurtosis feature value around 2. Therefore, a kurtosis larger than 2 suggests that the motion is superdiffusive, and a kurtosis less than 2 suggests subdiffusive motion. However, since one needs to estimate both the mean and the variance before being able to compute the kurtosis, the estimation error becomes quite large for very short trajectories, which is probably why it is not the most descriptive feature for the relatively short trajectories analyzed in single-molecule studies, where trajectories rarely get much longer than 100 frames.
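The projection-and-kurtosis computation can be sketched as follows (function name is ours; a straight, evenly sampled trajectory gives the kurtosis of a uniform distribution, ≈ 1.8):

```python
import numpy as np

def kurtosis_feature(x, y):
    """Kurtosis of the trajectory projected onto the major axis of its gyration tensor."""
    coords = np.column_stack((x, y))
    centered = coords - coords.mean(axis=0)
    gyration = np.cov(centered.T)           # 2x2 covariance (gyration) tensor
    eigvals, eigvecs = np.linalg.eigh(gyration)
    major = eigvecs[:, np.argmax(eigvals)]  # direction of largest spread
    proj = centered @ major                 # maximum-spread 1D projection
    z = (proj - proj.mean()) / proj.std()
    return np.mean(z ** 4)                  # standardized fourth moment
```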
E. Gaussianity. Whereas kurtosis describes the spatial distribution of the whole trajectory, gaussianity seeks to investigate the spatial distribution at the time scale of single frames (5). The gaussianity is based on the ratio between the quartic moment and the squared second moment of the displacements, which gives a number for each frame separation n:

G(n) = ⟨r_n^4⟩ / (2 ⟨r_n^2⟩^2)   [10]

For Brownian diffusion in 2D, the raw ratio ⟨r_n^4⟩/⟨r_n^2⟩^2 equals 2, and we therefore divide by 2 so that the gaussianity becomes 1 for Brownian motion. Values larger than 1 mean that the step distribution is broader (heavier-tailed) than that of Brownian motion, and values smaller than 1 mean that it is narrower. Since the gaussianity is defined for each n as an average over increments n frames apart, we have N such gaussianities corresponding to N different time scales of the diffusion. We chose to use the average gaussianity along the trajectory as our feature and only compute separations up to n = N/2, as the number of data points for the computation of the moments becomes very sparse as n increases, yielding a large variance. The resulting formula for the gaussianity feature is

G = (2/N) Σ_{n=1}^{N/2} ⟨r_n^4⟩ / (2 ⟨r_n^2⟩^2)

F. Fractal dimension. The fractal dimension is a measure of how space-filling the points of the trajectory are. Using the estimator by Katz and George (6), we estimate it as

D_f = log(N) / ( log(N) + log(d/L) )

where L is the total length of the path, N is the number of steps in the trajectory, and d is the largest distance between any two points in the trajectory. The fractal dimension is around 1 for straight trajectories (i.e., ballistic motion), around 2 for normal diffusion, and larger than 2 for subdiffusion (i.e., confinement). For normal diffusion, the distribution has previously been found to be consistent with a lognormal peaked slightly below 2 (6), which we also found in our simulations (see Fig. S14A). We find that as traces become longer, the peak approaches 2, the fractal dimension expected for Brownian motion (Fig. S14B).
We also found that HMMs consisting of purely normal diffusion states with identical diffusion constants but differing equilibrium occupations display slightly different estimated fractal dimensions (Fig. S14C, Fig. S15). This makes sense as the Brownian emissions are weighted differently on average, leading to an altered expectation for the ratio d/L.
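Minimal sketches of the gaussianity and fractal-dimension features as defined above (function names are ours; the O(N²) pairwise-distance search is written for clarity, not speed):

```python
import numpy as np

def gaussianity(x, y):
    """Average of <r_n^4> / (2 <r_n^2>^2) over frame separations n = 1..N/2."""
    N = len(x)
    g = []
    for n in range(1, N // 2 + 1):
        dr2 = (x[n:] - x[:-n]) ** 2 + (y[n:] - y[:-n]) ** 2
        g.append(np.mean(dr2 ** 2) / (2.0 * np.mean(dr2) ** 2))
    return float(np.mean(g))

def fractal_dimension(x, y):
    """Katz-George estimator: log N / (log N + log(d / L))."""
    L = np.hypot(np.diff(x), np.diff(y)).sum()  # total path length
    N = len(x) - 1                              # number of steps
    dx = x[:, None] - x[None, :]
    dy = y[:, None] - y[None, :]
    d = np.sqrt(dx ** 2 + dy ** 2).max()        # largest pairwise distance
    return np.log(N) / (np.log(N) + np.log(d / L))
```

A perfectly straight trace gives a fractal dimension of exactly 1 (d = L) and, since all its n-frame displacements are deterministic, a gaussianity of 0.5.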

G. Efficiency.
The feature efficiency is the ratio of the squared end-to-end distance to the summed squared step lengths of the particle (1):

E = |r_N − r_1|² / ( (N − 1) Σ_{i=2}^{N} |r_i − r_{i−1}|² )   [13]

This ratio has the same flavour as the fractal dimension and yields similar information, but it is very good at picking up directed motion. We found the distribution of γ in 2D Brownian motion to be roughly lognormal in simulations, with a mean of log(γ) around −7. Since the distribution is roughly lognormal and the efficiency ratios are rather low, we chose to use the logarithm of the efficiency as the feature in our fingerprint.

H. MSD ratio. The mean squared displacement ratio is similar to the fit of the MSD curve. It gives a measure of the shape of the MSD curve and is defined as (1)

MSDratio = ⟨r²(n₁)⟩ / ⟨r²(n₂)⟩ − n₁/n₂

with n₁ < n₂. Since the MSD curve often follows a power law, the MSD ratio is an estimate of α in MSD = D n^α for a pair of points along the MSD curve, with the α = 1 (normal diffusion) result n₁/n₂ subtracted off. The MSD ratio is 0 for normal diffusion, negative for superdiffusion, and positive for subdiffusion.
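Both features can be sketched in a few lines (function names are ours; note that with these definitions a perfectly directed trace gives an efficiency of 1 and a negative MSD ratio):

```python
import numpy as np

def efficiency(x, y):
    """Squared end-to-end distance over (N - 1) times the summed squared step lengths."""
    end_to_end2 = (x[-1] - x[0]) ** 2 + (y[-1] - y[0]) ** 2
    steps2 = np.diff(x) ** 2 + np.diff(y) ** 2
    return end_to_end2 / ((len(x) - 1) * steps2.sum())

def msd_ratio(x, y, n1=1, n2=2):
    """MSD(n1) / MSD(n2) - n1 / n2; zero for normal diffusion."""
    def msd(n):
        return np.mean((x[n:] - x[:-n]) ** 2 + (y[n:] - y[:-n]) ** 2)
    return msd(n1) / msd(n2) - n1 / n2
```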
I. Other features: meanSL, meanMSD, and N. On top of the features described above, we include the average step length and the average MSD as features. These are meant to provide a general overview of the tendencies in the trace. The HMM features T0, T1, T2, and T3 would only contain information equivalent to the average step length in the limit of an infinitely long trace; for this reason, the average step length is a rather good feature for describing short traces. The average MSD similarly captures the general spreading of the trace and is meant to supplement the other MSD-derived features, which may be blurred for short traces. Finally, the duration N of the trace is included as a feature since, in some systems, such as surface-based tracking studies with TIRF, the trace duration gives information on the binding kinetics of the system.
J. Machine learning methodology. The machine-learning approach adopted to separate trace data starts by generating the 17 fingerprint features described above and in Table S1. This yields a 2D data set X of shape (N, 17), where N is the number of trajectories in the data set. The traces are then labeled, giving a vector y of shape (N, 1). The data set X is then normalized to zero column mean and unit column variance. Prediction is done using a logistic regression classifier, trained using scikit-learn's LogisticRegression() with max_iter set to 10000 and a trainable offset. The logistic regression classifier was chosen after a comparison with 7 other classifiers (Fig. S1). Logistic regression was found to perform similarly to the other classifiers in binary classification, and best (though comparable to SVM-based classifiers and a gradient-boosted decision tree) in multiclass classification. Logistic regression was chosen as it needs no hyperparameter optimization, unlike SVMs, and trains faster than a gradient-boosted decision tree.
In binary classifications, a histogram of the class probabilities estimated with the logistic regression classifier is plotted for each class, and a ROC curve is then computed for a decision boundary at each histogram bin. The boundary that maximizes the sum of sensitivity and specificity on the training data is chosen and used for prediction.
In order to assess the generalizability of the model, a 5-fold cross validation is performed with 20% of the data set aside for testing in a stratified fashion, keeping the fractions of class labels the same in the training and test set. When an overview of the feature distributions is required, the Linear Discriminant 1D projection is computed using the scikit-learn function: sklearn.discriminant_analysis.LinearDiscriminantAnalysis.
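The pipeline above can be sketched with the scikit-learn components named in the text; the (N, 17) fingerprint matrix and labels are replaced here by random stand-ins, so the cross-validation scores themselves are meaningless:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-ins for the (N, 17) fingerprint matrix X and trace labels y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 17))
y = rng.integers(0, 2, size=200)

# Zero-mean / unit-variance scaling followed by logistic regression, evaluated
# with stratified 5-fold cross validation as described in the text
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5))
```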

Classification benchmarking
A. Data set simulation. To quantify the noise robustness of the methodology and compare against current state-of-the-art classifiers, we generated three stress-test data sets. The simulation consisted of one normal diffusion variant and three binary HMM variants shifting between normal diffusion and either directed motion, confined diffusion, or anomalous diffusion. The HMM model employed is visualized in Fig. S11A, and we chose p_shift = 0.1. For each trace, a state history was generated using the HMM, and the motion types in each state were generated as described elsewhere (1), with the simulation parameters ∆t = 1/30 s, D = 9.02 µm²/s, α = 0.3, R = 9, B = 4. All data sets consisted of 5,000 traces for each variant, giving a total of 20,000 traces, of which 70% was used for training and 30% for validation.
The first data set investigates how the models perform on good data: each trace was chosen to be 100 frames long with a low localization error, corresponding to a ratio Q = √(4D∆t)/σ = 5 for all traces, where σ is the standard deviation of the localization error (Fig. S11). The second data set allows quantification of the effect of localization errors on classification accuracy: all traces are 100 frames long, but five simulations were done with localization-error ratios Q = 16, 4, 1, 0.25, and 0.0625 (Fig. S12). The last data set allows quantification of the effect of the short traces often found in high-density tracking or low signal-to-noise movies: Q was fixed at 5, and 7 simulations were done with trace lengths of 20, 40, 80, 160, 320, 640, and 1280 frames (Fig. S13).

Acquisition of L3 diffusing on Lard vs. trimyristin
The methodology for surface preparation, data acquisition, and analysis is described elsewhere (7). Briefly, the lipid substrate surfaces were prepared on round 0.25 mm microscope cover glass slides by spin coating 1-2 drops of 14 µL lard solution (24.5 g/L lard with 0.05 ppm DOPE-ATTO488 dissolved in toluene) or 12 µL trimyristin-EnzChek solution (20 g/L trimyristin mixed with 0.1 ppm EnzChek505 dissolved in toluene) at 5000 rpm for 2 × 60 s with a 10 s break in between. The glass slides were placed in metal chambers and stored in vacuum for at least 2 hours or overnight before immediate use. Movies were acquired using a total internal reflection fluorescence (TIRF) microscope (IX 83, Olympus). Videos of 2000 frames with a pixel width of 160 nm were recorded using an EMCCD camera (ImageEM X2, Hamamatsu) and an oil immersion 100x objective (UAPON 100XOTIRF, Olympus). Imaging of the TLL enzyme mutant L3, labelled with SeTau-647, was performed with an EM gain of 300, an exposure time of 80 ms, and a penetration depth of 101-114 nm, excited by the 640 nm laser line.

Table S2. Comparison of diffusional fingerprinting to other classifiers across several simulated data sets. Low localization error data set: 20,000 traces (30% validation, 70% training), consisting of 5,000 normal diffusion traces and 5,000 traces for each of three hidden Markov models (HMMs) converting between normal diffusion and either directed, confined, or anomalous diffusion. Simulations were done as described elsewhere (1) with similar parameters (N = 100, ∆t = 1/30 s, D = 9.02 µm²/s, α = 0.3, R = 9, B = 4, Q = 5). The HMM model used is shown in Fig. S11, and p_shift was set to 0.1. High localization error data set: same as for the stress-test data set but with Q = 1. High background noise: same as for the stress-test data set but with N = 40.

All models are from the sklearn library. Parameter grid searches were conducted for the decision tree, all SVM classifiers, and the nearest-neighbor algorithm. For the SVMs, C = (0.1, 1, 10, 100, 1000) was searched; for the decision tree, max_depth = (4, 16, 64, None), min_samples_split = (1, 2, 4, 6), and min_samples_leaf = (2, 8, 16, 32) were searched; and for nearest neighbor, n_neighbors = (2, 8, 32, 128) was searched. Data was scaled to zero mean and unit variance for all classifiers except the decision-tree-based classifiers (decision tree, random forest, and gradient-boosted decision tree). 70% of the data was used for training and 30% for validation.

Fig. S13. Confusion matrix for 5-fold stratified cross validation with a logistic regression classifier trained on 80% of the data and predicting on 20% of the data. The data sets are those from Fig. 4 in the main text: A is the lard-trimyristin data presented in Fig. 4 A-C, B is the transcription factor data presented in Fig. 4 D-F, and C is the nanoparticle data presented in Fig. 4 G-I.

The meanSL histogram displays two peaks for NLS and one peak for Sox2 with a tail. Both Sox2 and NLS have a peak from 0 to 0.25 µm. The NLS peak lies a little below 0.6 µm and is rather broad, and the tail for Sox2 could be reconcilable with a similarly broad distribution centered around 0.3-0.4 µm. Since the data are measured with a frame rate of 134 Hz, the conversion from meanSL to diffusion constant is D = meanSL²/(4/134), leading to D ∈ [0, 2.1] µm²/s for the shared peak, D ∈ [3, 5.4] µm²/s for the Sox2 tail, and D ≈ 12 µm²/s for the NLS peak. Hansen et al. found a free diffusion constant of roughly 3 µm²/s for Sox2 and of 12 µm²/s for NLS. The positions of the peaks therefore match rather well if one assumes the shared peak (0-0.25 µm in meanSL) is due to bound diffusion.
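The meanSL-to-D conversion used above can be checked directly (helper name is ours):

```python
# Conversion from mean step length (µm) to diffusion constant (µm²/s)
# at a frame rate of 134 Hz: D = meanSL² / (4 Δt)
frame_rate_hz = 134.0
dt = 1.0 / frame_rate_hz

def meansl_to_D(mean_sl_um):
    return mean_sl_um ** 2 / (4.0 * dt)
```

For example, the edge of the shared peak at 0.25 µm maps to about 2.1 µm²/s, and the NLS peak near 0.6 µm maps to about 12 µm²/s, as quoted in the text.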