Edinburgh Explorer Reliability of an automatic classifier for brain enlarged perivascular spaces burden and comparison with human performance

In the brain, enlarged perivascular spaces (PVS) relate to cerebral small vessel disease, poor cognition, inﬂammation and hypertension. We propose a fully automatic scheme that uses a support vector machine (SVM) to classify the burden of PVS in the basal ganglia (BG) region as low or high. We assess the performance of three diﬀerent types of descriptors extracted from the BG region in T2-weighted MRI images: 1) statistics obtained from Wavelet transform’s coeﬃcients, 2) local binary patterns and 3) bag of visual words (BoW)-based descriptors characterising local keypoints obtained from a dense grid with the scale-invariant feature transform characteristics. When the latter were used, the SVM classiﬁer achieved the best accuracy (81.16%). The output from the classiﬁer using the BoW descriptors was compared with visual ratings done by an experienced neuroradiologist (Observer 1) and by a trained image analyst (Observer 2). The agreement and cross-correlation between the classiﬁer and Observer 2 ( κ =0.67[0.58 0.76]) were slightly higher than between the classiﬁer and Observer 1 ( κ =0.62[0.53 0.72]) and com-∗ parable between both observers ( κ =0.68[0.61 0.75]). Finally, three logistic regression models using clinical variables as independent variables and each of the PVS ratings as dependent variable were built to assess how clinically meaningful were the predictions of the classiﬁer. The goodness-of-ﬁt of the model for the classiﬁer was good (AUC values 0.93(model1), 0.90(model2) and 0.92(model3)) and slightly better (i.e. AUC values 0.02 units higher) than that of the model for Observer 2. These results suggest that, although it can be improved, an automatic classiﬁer to assess PVS burden from brain MRI can provide clinically meaningful results close to those from a trained observer.


Introduction
Perivascular spaces, also known as Virchow-Robin spaces, are fluid-containing spaces that surround the walls of small vessels and capillaries in the brain as they go through the grey or white matter. Perivascular spaces are microscopic, filled with interstitial fluid and act as drainage pathways for fluid 5 and metabolic waste from the brain and, when enlarged, are visible in structural magnetic resonance imaging (MRI) sequences (Potter et al., 2015b). High number of enlarged perivascular spaces (PVS) has been reported to be associated with worse cognition (MacLullich, 2004), active inflammation in multiple sclerosis plaques (Wuerfel et al., 2008) or ageing (Aribisala et al.,10 2014), depression at older ages (Patankar et al., 2007), Parkinson's disease (Laitinen et al., 2000) and cerebral small vessel disease (Doubal et al., 2010).
The term Small Vessel Disease (SVD) refers to a group of pathological processes that affect the small arteries, veins and capillaries of the brain (Pantoni, 2010). It is the most common cause of vascular dementia and a 15 cause of about a fifth of the strokes worldwide , proven to have significant and strong associations with vascular risk factors (Staals et al., 2014).A moderate to severe burden of PVS in the basal ganglia (BG) is one of the markers of SVD , along with lacunes, cerebral microbleeds and white matter hyperintensities (WMH).

20
PVS can be better identified on T2-weighted (T2w) MRI, where they appear as linear or dot-like structures with intensities close to those of the cerebrospinal fluid (CSF) and less than 3mm diameter in cross section . Therefore, PVS can be potentially quantified. Visual counting and/or manual delineation of PVS can be time consuming, and the develop-25 ment of computational methods to assess them is challenging, partly due to inconsistencies within the literature regarding PVS diameter and overlap in shape, intensity, location and size with these of lacunes (Valdés Hernández et al., 2013). Recently, Wang et al. (2016) and Ramirez et al. (2015) presented computational methods to obtain quantitative measurements of PVS 30 and validated the usefulness of their procedures in clinical research, but both approaches are semi-automatic being, therefore, prone to inter-observer variations and could be time consuming. Cai et al. (2015) also proposed a method for quantifying PVS using high resolution 7T MRI scanners but the use of such field strengths, although providing good spatial resolution and signal-35 to-noise ratio, has limited clinical use. Ballerini et al. (2016) use a Frangi filter whose parameters are optimised by means of the ordered logit model to enhance the differentiation between PVS and the background, but is unsuitable for images with very anisotropic voxels commonly used in clinical settings (e.g. voxel sizes of 0.5 x 0.5 x 6 mm) and still requires the (visual) 40 rating of the PVS.
As an alternative to quantitative measurements, several visual rating scales that provide a qualitative assessment of the burden of PVS have been proposed in recent years. Potter et al. reviewed the ambiguities of these scales and combined their strengths to develop one that proved to be robust 45 (Potter et al., 2015a). However, as with any visual recognition process, it is subject to observer bias. Making the PVS rating automatic (e.g. replicating the visual rating scale using image processing and pattern recognition) could potentially overcome these and also the drawbacks that the current methods of PVS segmentation have. 50 Computer vision and pattern recognition have already been successfully applied for computer-aided diagnosis using MRI (Munsell et al., 2015;Beheshti and Demirel, 2015) and for segmentation of brain structures or lesions Ithapu et al. (2014); Roy et al. (2015); de Brebisson and Montana (2015). It has also been used to assess markers of SVD qualitatively. For exam-55 ple, Chen et al. proposed a framework based on multiple instance learning to distinguish between absent/mild vs. moderate/severe SVD in computed tomography (CT) scans (Chen et al., 2015).
However, to the best of our knowledge, only two papers have addressed the task of assessing automatically the PVS rating in brain MRI using com-60 puter vision and pattern recognition techniques 3 González-Castro et al., 2016). They explored the use of different descriptors for this task, but did not analyse agreement with a human observer other than with the one that provided the ground truth ratings, or whether the predictors of the classification were clinically meaningful. Moreover, each of 65 these two works evaluate different descriptors to characterise the brain region selected for classifying PVS burden and report similar levels of accuracy for the preferred schemes, albeit having validated the schemes differently (i.e. González-Castro et al. (2016) uses cross-validation and compares results on randomly divided train and test subsets).

70
An overall evaluation of the schemes proposed so far for classifying the burden of PVS from brain MRI is lacking.
In this paper we build upon the work presented in , comparing the performance of the descriptors proposed by both studies for classifying automatically the burden 75 of PVS using a Support Vector Machine (SVM) (Vapnik, 1995). We focus on the PVS in the basal ganglia (BG), since moderate to severe PVS in this region (i.e. ratings 2-4) is a marker of cerebral SVD. We evaluate three different types of descriptors: 1) statistics obtained from Wavelet transform's coefficients (Alegre et al., 2012), 2) local binary patterns (Ojala et al., 2002) and 3) 80 bag of visual words (BoW)-based descriptors, using keypoints obtained from a dense grid characterised with the scale-invariant feature transform (SIFT) characteristics. Moreover, we validate the results by comparing the predictions made by the automatic method (i.e. the classifier using the descriptors that achieve the best performance) with the ratings from two observers. Fi-85 nally, we also investigate the applicability of this classifier to clinical studies, to assess if its outcome is clinically meaningful. The paper is organised as follows: In Section 2 the dataset and proposed methods are explained. Section 3 introduces the experimental setup and the results of the experiments, which are discussed in Section 4. Finally, the conclusions and possible future 90 lines of work are presented in Section 5.

Subjects and MRI protocol
We used data from 264 patients who gave written informed consent to participate in a study of lacunar stroke mechanisms (Valdés Hernández et al., The study that provided data for this manuscript (Valdés Hernández et al., 2015) included patients with lacunar stroke and/or minor cortical strokes which were clinically evident, and did not consider diabetes, hypertension and other vascular risk factors as criteria for exclusion. However, 100 it excluded patients with other non-vascular neurological disorders, major medical conditions including renal failure, contraindications to MRI, unable to give consent, and those who had haemorrhagic stroke or whose symptoms resolved within 24 hours (i.e. transient ischaemic attack). It was approved by the Lothian Ethics of Medical Research Committee (REC 09/81101/54) and the NHS Lothian R+D Office (2009/W/NEU/14) and was conducted according to the principles expressed in the Declaration of Helsinki.
Brain MRI was conducted at baseline (i.e. there was a maximum of 8 days between the stroke and the scan) on a 1.5 tesla GE Signa LX clinical scanner (General Electric, Milwaukee, WI), equipped with a self-shielding gradient 110 set and manufacturer supplied eight-channel-phased array heal coil. For our analyses we used the T2w images, acquired with TE 147 milliseconds, TR 9002 milliseconds, field of view 240 × 240 mm, acquisition matrix 256 × 256, slice thickness 5 mm, 1 mm inter-slice gap and voxel size 0.469 × 0.469 × 6 mm. The reconstructed image size (in voxels) is 512 × 512 × 28. For tissue 115 segmentation, diffusion-weighted and structural T1-weighted (T1w), T2w and gradient echo, acquired as specified in (Valdés Hernández et al., 2015) were also used.

PVS visual rating scale
The visual rating scale proposed by Potter et al. was used for assessing 120 the burden of PVS in the sample (Potter et al., 2015a). It rates the PVS separately in three major anatomical brain regions, i.e. midbrain, basal ganglia (BG) and centrum semiovale (CS) -shown in Figure 1 -using T2w MRI. The rating is done separately for left and right hemispheres, but a combined score that represents the average of the PVS burden is given.
All visual ratings were made by two observers: a neuroradiologist (Observer 1) with more than 25 years of experience who participated in the  In this paper, we focus only on the PVS in the BG, since moderate to severe PVS in this region (i.e. ratings 2-4) is a marker of cerebral SVD, 135 which has been associated with cognitive decline (Staals et al., 2014), vascular dementia and stroke (Potter et al., 2015b). An example of each of the ratings for the BG is shown in Figure 2. We dichotomise the BG PVS scores into two classes as per Potter et al. (2015b), scores 0-1 (i.e. none or mild PVS burden) and scores 2-4 (i.e. moderate to severe), to be our classes 0 and 1, 140 6 respectively.

Image preprocessing
The guidelines for the visual rating of PVS according to this scale state that the rating should be done on the slice with the highest number of PVS, so as to minimize inconsistencies and intra-/inter-observer variations due to 145 inter-slice variations in PVS visibility, varying number of PVS on different slices and double counting of linear PVS (Potter et al., 2015a). In the case of the BG region, this slice should be chosen amongst the slices with at least one characteristic BG structure, as indicated by Wang et al. (2016). A pipeline to extract the BG region and find the axial slice (from the BG) with the 150 highest number of PVS for each subject, was developed.
The first step of this pipeline is to automatically segment the intracranial volume and cerebrospinal fluid (CSF) on the T1w images. This was achieved using optiBET (Lutkenhoff et al., 2014) and FSL-FAST (Zhang et al., 2001) respectively. The second step is to, also automatically, extract all subcorti-155 cal structures, which was achieved using other tools from the same FMRIB Software Library (FSL) as is described in Valdés Hernández et al. (2015). Thereafter, from the slices that contained BG structures, we selected those in which the total area of these structures was more than 5 % the area of the intracranial area defined on the slice.

160
On each of the BG slices initially selected, a polygon enclosing the BG, internal and external capsules and thalami was automatically drawn by joining anatomical points in the insular cortex, the closest points to them in the lateral ventricles (frontal and occipital horns) and the intercept of the genu of the corpus callosum with the septum; and subtracting from it the region 165 occupied by the CSF. These steps are illustrated in Figure 3.
From this subset of slices, the slice where our classifier operated was selected after applying contrast-limited adaptive histogram equalisation (CLAHE) (Zuiderveld, 1994) to the polygonal regions, thresholding them to 0.43 times the maximum intensity level (Valdés Hernández et al., 2013;Wang et al., 170 2016) ( Fig. 3(d)), and counting the number of thresholded hyperintense regions on each candidate slice with area between 3 and 15 times the in-plane voxel dimensions (Wang et al., 2016). Although this procedure overestimates the number of PVS in the presence of other features of SVD markers (e.g. small lesions and lacunes) (Valdés Hernández et al., 2013), it provides a good estimate of the number of PVS on each candidate axial slice, so as to select the one with more PVS. for texture description with successful results (Arivazhagan and Ganesan, 2003). Due to its frequency domain localization capability, we have applied the discrete Wavelet transform (DWT) to each selected region to characterise their textures. We have used the Haar family of wavelets, which have already been successfully used in other medical image classification applica-185 tions (Alegre et al., 2012). The DWT extracts the low and high frequency components of a signal so they can be analysed separately. When the transform is applied to an image, four matrices of coefficients are obtained: namely LL i , LH i , HL i and HH i where i stands for the level of decomposition, which represent the approximations and details in the verti-190 cal, horizontal and diagonal directions respectively, They can be seen in the example that Figure 4 illustrates.
The first level of decomposition is applied on the original image, while the next levels i are applied to the matrix of approximations of level i − 1 as Figure 5 shows.

195
One of the descriptors we used is based on the DWT, and it is built using the mean and standard deviations of the histograms of the original image and each one of the matrices of coefficients yielded after three DWT levels (i.e. LL 1 , LH 1 , HL 1 , HH 1 , LL 2 , LH 2 , HL 2 , HH 2 , LL 3 , LH 3 , HL 3 and HH 3 ). Hence we represent each region by a vector of 26 features. This descriptor 200 is known as Wavelet statistical features (WSF) (Arivazhagan and Ganesan,  2003; Alegre et al., 2012).
The other descriptor based on the DWT is built using the features proposed by Haralick et al. (1973) derived from the grey-level co-occurrence matrix (GLCM) of the original image and each of the the coefficient ma-205 trices obtained after the first DWT level (i.e. LL 1 , LH 1 , HL 1 and HH 1 ). The features extracted from each GLCM are concatenated to form the final descriptor. A diagram depicting this process is shown in Figure 6. To achieve some invariance to rotation, we averaged the features extracted from GLCMs computed with orientations 0 • , 45 • , 90 • and 135 • . These de-210 scriptors are called Wavelet Co-occurrence Features (WCF) (Arivazhagan and Ganesan, 2003;Alegre et al., 2012). In this work, we assess two vari-ants of the WCF descriptors, WCF 4 and WCF 13 , depending on whether we extracted 4 or 13 features from the GLCMs, respectively. WCF 4 is built using the Haralick features Contrast, Correlation, Energy and Homogeneity, 215 and WCF 13 is formed using all features proposed by Haralick et al. (1973) except the Maximal Correlation Coefficient. These two descriptors showed good performance in Alegre et al. (2009).

Local binary patterns
Local Binary Patterns (LBP) were introduced by Ojala et al. (2002). In 220 the original version they worked with a 3×3 pixel block, but LBPs were later generalised, so as the size of the neighbourhood and the number of sampling points were parameters of the method. Given a pixel c with coordinates (x c , y c ), a pattern code is calculated by comparing it with the value of its P neighbours separated by a distance R, which in our case is 1, as per Equation
where g c and g p are the grey-level values of pixel c and its p-th neighbour, and function s(g p − g c ) is defined as: Finally, the whole image is described by means of a histogram of the LBP values of all pixels, given by Equation (1). As the position of the first neighbour (i.e. p = 0) is fixed, it being the pixel on the right hand side of c, the LBP R,P operator is not invariant to rotation We remove such effect of 230 rotation using the rotation invariant local binary pattern, LBP ri R,P , defined in Ojala et al. (2002).
As certain local binary patterns represent fundamental properties of texture, providing the vast majority of patterns present in textures (Ojala et al., 2002), while others are known to be less descriptive of the texture, Ojala et al. 235 introduced a measure of 'uniformity' U (LBP R,P ), which counts the number of spatial transitions (i.e. bitwise 0/1 changes) in a binary pattern LBP R,P for LBP R,P less than 2 (i.e. LBP riu2 R,P ) as expressed in Equation (2).
As the BG regions and the PVS are not very big we tried to keep the texture analysis as local as possible, so in this work we have used the values 240 R = 1 and P = 8. The final descriptors we use are the histograms of the accumulated output of LBP 1,8 , LBP ri 1,8 and LBP riu2 1,8 operating in each BG region.

Bag of visual words
The Bag of Visual Words (BoW) model (Sivic and Zisserman, 2003) rep-245 resents each image as a function of the frequency of appearance of certain visual elements, called visual words. The set of visual words is called the dictionary or codebook.
To build the dictionary, a set of keypoints from each image are sampled. Around each keypoint a small square region (i.e. patch) is extracted and 250 characterised by means of descriptors that retrieve information about the distribution of its pixels intensities. After that, the descriptors of the patches are clustered into K groups, each one having a prototype feature vector which is called visual word. This process is depicted in Figure 7.
In this work, we use a dense grid for sampling the keypoints and the 255 k-means clustering method (MacQueen, 1967) for forming the visual words. The process of creating the dictionary is performed in each iteration of the cross validation using the subsets of images used for training. We assessed different numbers of visual words to evaluate their impact on the classification. Once the dictionary is built, each image of the dataset is described by means of a process called image representation. This consists of repeating, for each image, the same process of keypoint selection and characterisation used in the creation of the dictionary, using also the same methods. Then, for each "new" patch, we find the visual word of the dictionary that is most similar to it by means of calculating the Euclidean distance between their descriptors.
The histogram of the visual words representative of all patches in an image is used as its final descriptor. The image representation process is illustrated in Figure 8. In this work, the patches are described using the Scale Invariant Feature Transform (SIFT) (Sivic and Zisserman, 2003). Basically, SIFT descriptors are based on histograms of oriented gradients computed from the intensities of the regions that result from dividing a 16 × 16 pixel squared patch around each keypoint into 16 subregions of 4 × 4 pixels each. More details about 275 SIFT can be found in Sivic and Zisserman (2003). Despite these consisting of two different parts, keypoint detector and patch descriptor, we only use the patch descriptor as we are sampling the keypoints in a dense grid.

Classification
In this work, we use a Support Vector Machine (SVM) classifier, which 280 is a supervised machine-learning approach that adjusts internal "weights" by means of a training process (i.e. an optimization phase), minimising the error between its calculated response and a "ground truth" provided by an expert. This type of classifier has attracted attention in the last few years for analysing MR images (Nam et al., 2015;Tong et al., 2014;Feis et al., 2013).

285
SVM tries to find the optimal hyperplane that maximizes the distances (i.e. margins) to the instances of the positive and negative classes in the training dataset. One of the parameters of SVM is the cost parameter C, which controls the trade-off between classes allowing training errors and forcing rigid margins.

290
SVM is a linear classifier: it tries to separate the data using a linear hyperplane. There are cases where the data is not linearly separable. In those cases, SVM may use the kernel trick : A kernel function K(x , x) may transform the data into a higher dimensional space where it is possible to separate it linearly. After evaluating different kernels (i.e. linear, radial basis function, sigmoid), the best results were achieved with the radial basis function (RBF) kernel: We refer the reader interested in more details about SVM to Schölkopf and Smola (2001).

Validation of the classifier
We validated the classification with a stratified 5-fold cross validation as 305 follows. The whole set, represented by the descriptors explained in Sections 2.4.1, 2.4.2 and 2.4.3, was randomly partitioned into 5 equally sized subsets with the same distribution as the original set. Of the 5 subsets, 4 were used to train the classifier and the remaining one was used as the test set. This process was repeated 5 times using a different subset each time as test set.

310
The 5 results from the 5 folds were averaged to provide the final results. This cross validation process was repeated 10 times, and the 10 results were averaged to avoid possible bias due to a random separation of the folds. Data were normalised so that they had mean 0 and standard deviation 1.
The overall results were validated in terms of accuracy, sensitivity and 315 specificity, using the dichotomised ratings of Observer 1 as ground truth.

Statistical analyses
The descriptors that achieve the best performance would be used in a real automatic visual rating application. Therefore we analysed the agreement of the visual ratings between the automatic classifier based on those descriptors 320 and between each observer. We also analysed the association between the outcome of each PVS rating (i.e. from each observer and from the automatic classifier) and clinical parameters known to be related to PVS burden in the patients that comprise this sample (see Section 2.1). 325 We determined the weighted Kappa coefficient of the PVS ratings in the BG region (scale 0-4) between observers as per http://vassarstats. net/kappa.html (Copyright Richard Lowry 2001. We also performed marginal homogeneity tests of the basal ganglia PVS visual ratings (scale 0-4) using the software application mh.exe ver. 1.2 (2016-03-01) (by John 330 Uebersax).

Inter-observer agreement
After dichotomising the BG PVS visual ratings produced by both observers, we determined the Kappa coefficient between observers and between the automatic classifier and each observer, using the function kappa in MAT-LAB R2015a (Copyright (c) 2007, Giuseppe Cardillo, updated 23 Dec 2009, 335 http://uk.mathworks.com/matlabcentral/fileexchange/15365-cohen-s-kappa/ content/kappa.m). We also conducted the McNemar's test between the ratings produced by the expert (i.e. Observer 1) and the automatic classifier to investigate whether the marginal frequencies between both were or not equal.

Clinical validation
The following clinical and demographic parameters were available for each study participant: age, hypertensive (or not) classification, stroke subtype (lacunar or cortical) classification and scores of white matter hyperintensity (WMH), atrophy and SVD burden. WMH were coded using Fazekas scores, 345 for periventricular (PV) and deep lesions separately in the left and right hemispheres and a combined score for both hemispheres was recorded (Fazekas et al., 1987). Brain atrophy was coded using a validated age-relevant template (Farrell et al., 2008), with superficial and deep atrophy coded separately ranging from none to severe on a scale from 1 to 6 according to the centiles 350 into which the template is divided, being 1(< 25 th ), 2(25-50 th ), 3(50-75 th ), 4(75-95 th ), 5(> 95 th ) and if >>5, 6 is used. Total atrophy was calculated as the average of deep and superficial atrophy scores. SVD was coded as per Staals et al. (2015) (0-4), which confers a point for each of the following conditions: if 1 or more cavitated old lacunar lesions are present, if Fazekas 355 PV score >= 3 and/or Fazekas Deep score >= 2, if BG PVS score is >= 2 as per Potter et al. (i.e. moderate-to-extensive), and if more than 1 brain microbleed is present.
We calculated the non-parametric bootstrapped correlations between BG PVS scores (before and after dichotomisation, from observers and from the 360 automatic classifier) and each clinical variable. We also performed bino-mial multivariable logistic regression to evaluate the clinical usefulness of our machine-learning scheme as per Potter et al. (2015a) and its sensitivity in various models. The latter was evaluated by comparison of correlated receiver operating characteristic (ROC) curves obtained from three models 365 that have as outcome variable the dichotomised PVS rating from A) the automatic classifier, B) Observer 1, and C) Observer 2. The first model (i.e. Model 1) had the following predictors: age, total atrophy, hypertension, Fazekas score, whether the patient had a previous lacunar infarct or not, index stroke subtype and SVD score. The second model (i.e. Model 2, implemented in Potter et al. (2015a)) had the same predictors as Model 1 with the exception of SVD score. The third model (i.e. Model 3) had also the predictors of Model 1 with the exception of Fazekas score and whether the patient had a previous lacunar infarct or not, as these two parameters are contemplated within the SVD score. These analyses were done using 375 MATLAB R2015a. Of note, the PVS outcome variable is also a contributor to the SVD score.

Analysis of the robustness against imaging confounds
All scans of the primary study that provided data for this analysis underwent quality checks. None of the T2-weighted sequences were corrupted 380 by visible movement artefacts that could affect the automatic PVS rating procedure presented. However, there are other confounds that could have influence in the results. We calculated the number of scans misclassified on each of the 10 iterations that contributed to the final result, on the absence and presence of the following imaging confounds visually identified by Ob-385 server 2 in the basal ganglia region blind to the neuroradiological reports: white matter hyperintensities found either bilaterally and scattered throughout the region or as a single cluster possibly indicative of a recent or old subcortical infarct, lacunes (symptomatic or asymptomatic), recent or old cortical strokes that partially affect the region, globus pallidus partially or 390 totally hyperintense, partial volume effects of the cerebrospinal fluid, and a combination of two or more of these factors.
We also counted the number of scans misclassified on each iteration for those people who had a lacunar infarct neuroradiologically determined, regardless of whether it was visible on T2-weighted in the basal ganglia region or not. This analysis would allow us to discuss whether the occurence of a recent lacunar infarct influenced the descriptors used by the classifier.

Results
The PVS ratings made by the experienced neuroradiologist (Observer 1), used to train the classifier, were distributed across the sample as Table 1   400 shows. The dichotomisation of these ratings into none-mild vs. moderatesevere resulted in 133 and 131 datasets for each class, respectively.  The best descriptor in terms of overall accuracy was the descriptor based on the Bag of Visual Words model (81.15%) using a dictionary with 175 visual 410 words, followed by the fusion of WCF 4 and LBP riu2 1,8 (78.84%). Moreover, the former reached a sensitivity just slightly worse than the latter. The highest sensitivity is achieved by LBP riu2 1,8 , but its specificity is much worse than the BoW-based descriptor. It is also remarkable that, whereas WCF 4 does not get a good accuracy on its own, its accuracy improves 7% when it is fused with the LBP riu2 1,8 descriptor. The automatic classifier used in the following sections will be the SVM based on the descriptors that achieved the best overall accuracy (i.e. the dense-SIFT-based Bag of Visual Words model, with the SVM parameters C = 5 and γ = 10 −4 using a dictionary of 175 visual words). Once the visual 420 dictionary is created and the classifier is trained, this method took 0.0477 seconds to describe and classify each image.

430
The maximum possible linear-weighted kappa, given the observed marginal frequencies was 0.8486. Table 3 shows the agreement (i.e. kappa coefficient, standard error, 95% CI and maximum possible linear-weighted kappa, given the observed marginal frequencies) between each observer and the ratings assigned by the 435 SVM classifier that yield the best accuracy (see Table 2). Since the classification experiment was repeated 10 times, the reported agreements are the average of the corresponding 10 agreements. The marginal proportions between the ratings from the expert (i.e. Observer 1) and the automatic classifier were non-significantly different from each other (McNemar's test 440 p=0.1086). See the 2x2 frequency Table 4.  Observer 2 (dichotomised and not dichotomised) and the automatic classifier were equally significantly and positively correlated with age, PVS ratings in centrum semiovale (dichotomised and not), atrophy (deep and superficial), Fazekas (deep and periventricular), hypertension, old lacunar infarcts and SVD score. None of the BG PVS ratings correlated with index stroke subtype 450 (lacunar or cortical), and all were highly and significantly correlated with each other as Table 5 shows. Table 5: Non-parametric bootstrapped cross-correlation matrix for PVS ratings in the basal ganglia region. All correlations shown were significant with p<0.0001.

Parameter
Observer 1 Observer 1 Observer 2 Observer 2 Automatic Scale 0-4 Dichotomised Scale 0-4 Dichotomised Classifier Observer 1 (0-4) 1 0.9317 0.8130 0.6828 0.6588 Observer 1 (0-1) 1 0.7341 0.6901 0.6464 Observer 2 (0-4) 1 0.9057 0.7127 Observer 2 (0-1) 1 0.7030 Table 6 shows the results of the binomial multivariable logistic regression. Age, Fazekas periventricular scores and the presence of old lacunar infarcts 455 were significant and negatively associated with all BG PVS scores (i.e. those done by both observers and by the automatic classifier), as in Potter et al. (2015a). The coefficient estimates tabulated (B) express the effects of each predictor variable on the log odds of being in one class (i.e. 1 or 0) versus the reference class (i.e. 1 or 0 as per Observer 1).  Figure 9 shows the predicted probabilities of the outcome variables for each model. The distribution of the predicted "0"s and "1"s to be 0 and 1 respectively for the classifier and Observer 2 were similar across all models. All outcomes (i.e. PVS ratings from the classifier, Observer 1 and Observer 465 2) were consistently poorer for Model 2, which does not include SVD scores as predictor, than for the other two models. The PVS ratings from Observer 1 were particularly sensitive to the presence -and absence-of the SVD scores as predictor in the model, being exceptionally high when more components of the SVD score (including it) were included (i.e. Model 1).

Sensitivity analysis
(a) (b) Figure 9: Boxplots showing the distributions of the predicted probabilities of the outcome variable "1" (a) and "0" (b) (i.e. PVS ratings from the automatic classifier, from Observer 1 or from Observer 2) for each logistic regression model. Figure 10 shows the correlated ROC curves for each outcome variable (i.e. automatic classifier, Observers 1 and 2) also for each model. The area under the curve (AUC) from the automatic classifier experiences the least variation across the three curves: 0.93,0.90 and 0.92 for models 1,2 and 3 respectively (maximum variation 3%) indicating highest consistency in model accuracy, 475 followed by Observer 2 (maximum variation 5%).

Performance on the presence/absence of imaging confounds
As table 7 shows, only 9.6% to 16.6% of the scans that have a small T2weighted hyperintense lesion such as lacunes, white matter hyperintensities or subcortical new or old infarcts in the basal ganglia region of size comparable 480 with those of the PVS were misclassified, versus 16% of the scans that have two or more of these confounds, and 13.6% of those who had none. These percentages were higher when the T2-weighted hyperintense covered a larger region (i.e. cortical stroke or globus pallidus hyperintense), but the number of scans that had these confounds were very small (7 and 5 respectively out 485 of 264). The number of patients who had a recent lacunar infarct (neuroradiologically determined) and for which the PVS rating was miscalculated was the same as the number of patients that did not have any imaging confound and for which the PVS rating done by the classifier was wrong (compared to the ratings of the neuroradiologist).

Discussion
We developed an automatic framework to classify T2-weighted MRI as having none or few PVS in the basal ganglia region versus having many of them, in response to the need for such tool given the role of PVS in SVD and vascular dementia progression. Our framework uses a conventional SVM 495 classifier based on the information from SIFT descriptors that operate on patches from the basal ganglia region using a dense grid following the "bag of words" model. These descriptors provided the highest classification accuracy (81.16%) from those evaluated. This accuracy is slightly lower than the one reported in  with the same descriptors (82.34%).

500
The reason is the different validation of the classifier used in both works: in  the classification was carried out by randomly splitting the dataset into train (70%) and test sets (30%), whereas in this case we have used 5-fold cross validation. This classifier took an average of 0.0477 seconds to describe and classify each image. The framework proved to 505 be useful in clinical settings and outperformed the visual classification done by a trained observer.
The image processing pipeline that pre-processed the data where the descriptors were extracted was designed following the visual rating guidelines for PVS from Potter et al (Potter et al., 2015a) (http://www.sbirc.ed.ac. 510 uk/documents/epvs-rating-scale-user-guide.pdf), which are based on assessing the PVS from a region of interest on the axial MRI slice with the most visible PVS. All agreements between the automatic classifier, the dichotomised ratings of the experienced neuroradiologist (Observer 1) and those from the trained observer (Observer 2), as shown in Section 3.2 were 515 above 0.6. However, the agreement between the dichotomised ratings from both observers (kappa = 0.6822) was slightly higher than the agreement between the classifier and any of the observers (0.6228 with Observer 1 and 0.6743 with Observer 2). The fact that the classifier had better agreement with Observer 2 than with Observer 1 may be because Observer 2 followed 520 the same guidelines used to design the pipeline for the automatic classifier, whereas Observer 1 may have also applied their individual experience and neuroradiological knowledge when rating the PVS. The cross-correlation between the classifier output and the dichotomised ratings of both observers, shown in Table 5, followed the same pattern: the correlation of the classifier 525 with Observer 2 was higher than with Observer 1 (0.7030 and 0.6464, respectively). This cross-correlation between the output of the classifier and the dichotomised ratings of Observer 2 (0.7030) was comparable and even slightly higher than between the dichotomised ratings of both observers (0.6901).
The statistical model built to evaluate the applicability of the automatic 530 classifier to the clinical research showed excellent and similar goodness-offit irrespective of whether the outcome variable was the automatic classifier (AUC=0.90), Observer 1 (AUC=0.84) or Observer 2 (AUC=0.86). Also, age, the burden of periventricular white matter hyperintensities (i.e. Fazekas PV) and the presence of old lacunar infarcts were associated with the PVS burden 535 irrespective of whether these were rated automatically or visually by any of the observers, proving the usefulness of the automatic framework proposed. A separate sensitivity analysis of this and similar correlated models showed that the automatic classifier was the least susceptible to be influenced by the overall burden of SVD shown in the MRI scan whilst the ratings from the 540 neuroradiologist captured better the full flavour of the SVD features. The degree in which this result was favoured by the single-slice approach adopted by the classifier (Potter et al., 2015a) (Wang et al., 2016) is not known. Further evaluation on the whole extent of the three anatomical regions defined by Potter et al. (2015a), with added scrutiny to exclude lacunes is needed.

545
Nevertheless, given that the accuracy of the classifier on the presence of imaging confounds was not different from it in the absence of them, and that the output was quite robust against the whole SVD burden, we do not foresee any problem for this automatic classification scheme to be applied to longitudinal or multicentre studies, as long as the training and testing datasets 550 have similar acquisition protocols.

22
A possible limitation of this work is the fact that the segmentation of the basal ganglia region is not always accurate (due to, for example, not finding the anatomical points described in Section 2.3), causing a potential misclassification. As we wanted to assess the validity of a fully automatic 555 method, we kept those suboptimal segmentations. Another limitation of the study may be the dichotomisation of the visual ratings used in the automatic classification. Due to limitations in the sample size, we needed to simplify the classification, so we dichotomised the visual rating scale as it was done in previous studies (Potter et al., 2015b): a reliable 5-class classification model is not possible to be trained with such few instances in some classes (e.g. out of 264 subjects there were only 5 with rating 0 or 19 with rating 4). Further analyses using bigger samples and considering the full ratings (i.e. 0-4) need to be done.

565
In this paper we have proposed an automatic framework based on image analysis and machine learning to predict the burden of enlarged perivascular spaces on the basal ganglia as "none or few" or "moderate to severe" based on the PVS visual rating scale Potter et al. (2015a).We compared different descriptors computed from the basal ganglia region. The bag-of-visual-words-570 based descriptors achieved the best accuracy (81.16%) in the classification, carried out using a support vector machine trained using the visual ratings provided by an experienced neuroradiologist (i.e., Observer 1) as ground truth.
We also compared the predictions of the classifier with the visual rat-575 ings done by Observer 1 and also with those done by a trained image analyst (i.e., Observer 2). The inter-observer agreement with the Observer 2 (kappa=0.6743) was higher than with the Observer 1 (kappa=0.6228) and comparable to that between both observers (kappa=0.6822). The crosscorrelation with the Observer 2 (0.7030) is also higher than with the Observer 580 1 (0.6464), and slightly higher that that between both observers (0.6901). Finally, we built three correlated logistic regression models with some clinical variables as independent variables and the ratings predicted by the automatic method and both observers as outcome variables and demonstrated that, although the automatic classifier does not capture the overall SVD severity, it can be used in clinical research as it consistently gives a meaningful output in relation to clinical parameters.
For future work we will try to improve the classification performance by means of extracting the whole basal ganglia region and use the information from all slices where the extracted region appears (i.e. 3D analysis), as 590 it may provide information that we are currently not taking into account. We will also try to use data from patients from other studies to increase our sample size and perform a 5-class classification (i.e. ratings from 0-4). Supervised machine-learning schemes like the one presented here would require the ground truth PVS counts or segmentations from a large number 595 of datasets done by an expert to be able to count and/or segment PVS. Such data are currently unavailable. However, the output from this classifier could be used as input to the fully automatic PVS unsupervised segmentation approach developed by Ballerini et al. (2016), (mentioned in the Introduction Section) which needs the PVS ratings to tune its algorithm and make it fully 600 automatic. Finally, the classifier presented here could be adapted to get the visual rating of the PVS in the centrum semiovale.  ) and (c) respectively). In the model 1 (a) the AUCs of the classifier, Observer 1 and Observer 2 were 0.9265, 0.9813 and 0.9074, respectively. In the model 2 (b) the AUCs of the classifier, Observer 1 and Observer 2 were 0.9041, 0.8395 and 0.8622, respectively. In the model 3 (c) the AUCs of the classifier, Observer 1 and Observer 2 were 0.9152, 0.9411 and 0.8934, respectively