Predicting long-term dynamics of soil salinity and sodicity on a global scale

Significance: Land degradation due to soil salinization has detrimental impacts on vegetation, crops, and human livelihoods, creating a need for a methodologically consistent analysis of the variability of different aspects of salt-affected soils. However, previous studies of soil salinity have been primarily spatial and localized, leaving the large-scale spatiotemporal variations of soil salinity largely unexplored. To address this gap, we present a globally validated analysis quantifying the long-term variations (40 y) of topsoil salinity at high spatial resolution using machine-learning techniques. The results have significant implications for agroecological modelling, land assessment, crop growth simulation, and sustainable water management.


Extended Data
Extended Data Figure: Validation of the predictive capability of the developed two-part models for global estimation of soil ECe and ESP. a and b, The relation between classes of known measurements (True class) and the Predicted class as a result of 10-fold cross-validation (10-CV). Producer's Accuracy shows the percentage of correct classifications relative to all classifications made by the classifier. User's Accuracy indicates to what percentage the predictions of the classifier represent reality. c and d, Binned scatter plots showing the relation between the measured data and the predictions of the regression part for the saline and sodic classes as a result of 10-CV. e and f, Comparison between measured values of ECe and ESP at the soil surface (0-30 cm) and the predicted values obtained using the model developed in the present study (R2 of ECe = 0.83, R2 of ESP = 0.86) and those obtained from the Harmonized World Soil Database (HWSD; R2 of ECe = 0.12, R2 of ESP = 0.26). A total of 9,293 and 30,491 measured data points are used in e and f, respectively.

Figure S2: Catchment-scale average of the soil salinity predicted by the ML-based models developed in the present study versus the dryness index (the ratio of long-term potential evapotranspiration to rainfall) for Australia, Africa, and North America.

Table S1 (fragments): The original DEMs were projected to the World Mercator x-y coordinate system (at 1,000 m resolution) by the cubic convolution method to reduce computational time in SAGA GIS. Terrain Ruggedness Index (TRI); Aspect (degrees); Fertilizer input rate for C3 annual and perennial crops (kg nitrogen ha-1 y-1). C3 is one of the pathways that plants use to fix carbon during photosynthesis; for C3 plants, the first carbon compound produced during photosynthesis contains three carbon atoms.
The original annual .nc layers were converted to geo-tiff rasters, and the per-cell average of the rasters between 1980 and 2018 was calculated using the ArcGIS "cell statistics" tool. The original monthly (combined passive and active sensor type) .nc files were converted to geo-tiff layers, and annual per-cell averages were computed in ArcGIS using the "cell statistics" tool.
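As an illustration of what the "cell statistics" MEAN operation computes over a stack of annual grids (a sketch of ours, not part of the authors' ArcGIS toolchain; the function name is hypothetical):

```python
def cell_mean(stack):
    """Per-cell mean over a stack of equally shaped 2-D grids (lists of rows)."""
    n = len(stack)
    rows, cols = len(stack[0]), len(stack[0][0])
    return [[sum(grid[r][c] for grid in stack) / n for c in range(cols)]
            for r in range(rows)]
```

Each output cell is the mean of the corresponding cells across all input grids, which is what the per-cell averaging steps described above rely on.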

Table S1 (continued): GCS WGS 1984; extent 180W-180E, 90S-90N; 0.25˚ resolution. The original daily .nc files were converted to geo-tiff layers, and annual per-cell averages were computed in ArcGIS using the "cell statistics" tool. The original monthly .nc files were converted to geo-tiffs and reprojected to GCS WGS 1984 to change the extent of the rasters (180W-180E, 90S-90N was the desired extent). Annual per-cell averages were then calculated from the daily raster layers in ArcGIS using the "cell statistics" tool.
Table S1 (continued): Soil skin temperature (K); soil layer one (0-7 cm) temperature (K); soil layer two (7- ...) temperature (K).

Table S4: Total area of the salt-affected soils at the climate, biome, and land cover levels. For the full names of the climate zones, see Figure S12.

Figure S20: Spatial distribution of the known surface measurements classified by the two-part models. a, ECe. b, ESP. Any available ECe or ESP measurement from 1980 onwards with an upper sample depth of zero and a lower sample depth of at most 30 cm was used as a known surface measurement. We categorised the known surface measurements of ECe into five classes: non-saline (0-2 dS m-1), slightly saline (2-4 dS m-1), moderately saline (4-8 dS m-1), highly saline (8-16 dS m-1), and extremely saline (> 16 dS m-1). Similarly, the known surface measurements of ESP were categorised into five classes: non-sodic (0-1%), slightly sodic (1-6%), moderately sodic (6-15%), highly sodic (15-30%), and extremely sodic (> 30%). The above maps are generated by comparison between the measured data classes and the final predictions of the two-part models falling into each class.

... and Australia (NSE = 0.30, number of observations = 34) were still of the lowest certainty, respectively. The low NSE values might be due to the insufficient number of validating surface measurements. Overall, the performance of the models in estimating individual continents' soil surface salinity/sodicity, with an NSE ranging from -0.01 to 0.96 (mean for all continents = 0.63), was better than the mean NSE of -0.14 for the HWSD predictions.
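The five-class ECe scheme described in the Figure S20 caption can be expressed as a small helper; the function name and the convention that each upper bound is exclusive are our assumptions, not the authors' code:

```python
# Hypothetical helper mirroring the ECe classes listed above:
# non-saline (0-2 dS/m), slightly (2-4), moderately (4-8),
# highly (8-16), and extremely saline (> 16); upper bounds exclusive (assumed).
ECE_CLASSES = [
    (2.0, "non-saline"),
    (4.0, "slightly saline"),
    (8.0, "moderately saline"),
    (16.0, "highly saline"),
]

def classify_ece(ece_ds_m):
    for upper, label in ECE_CLASSES:
        if ece_ds_m < upper:
            return label
    return "extremely saline"
```

An analogous lookup applies to the five ESP classes.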

Statistics on salt-affected regions
We classified the soil of a particular location as salt-affected if the predicted annual ECe of that location was ≥ 4 dS m-1 and/or its predicted ESP was ≥ 6% in at least 75% of the years between 1980 and 2018.
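As a minimal sketch of this criterion (the function name and input format are our assumptions, not the authors' code), a location's annual series of predictions can be tested as follows:

```python
def is_salt_affected(ece_by_year, esp_by_year, frac=0.75):
    """True if ECe >= 4 dS/m and/or ESP >= 6% in at least `frac` of the years."""
    years = len(ece_by_year)
    hits = sum(1 for ece, esp in zip(ece_by_year, esp_by_year)
               if ece >= 4.0 or esp >= 6.0)
    return hits >= frac * years
```

For the 39-year series used here, a pixel therefore needs at least 30 qualifying years to be counted as salt-affected.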
Note: All statistical analyses presented here were calculated for the regions between 55°S and 55°N latitude, i.e. the tropics, subtropics, and temperate zones. Statistics for countries located above 55°N (e.g. Canada, Russia, and the United Kingdom) are not reported here.

Computer codes
This section provides the scripts and codes required to regenerate the results. Please note that an ArcGIS Desktop 10.x license is needed to run the ArcPy module. Also, the MATLAB Parallel Computing Toolbox and the Statistics and Machine Learning Toolbox are required to run the MATLAB codes provided here.

Pre-processing the predictors' layers
The scripts provided in this sub-section show how we pre-processed the predictors assembled from different sources and made them ready for data extraction. We refer the reader to Table S1 for the corresponding pre-processing steps for each individual predictor. Static topographic predictors, including slope (degrees), plan and profile curvatures, slope length (m), and the Terrain Ruggedness Index (TRI), were calculated in the SAGA GIS GUI (Graphical User Interface) from the CGIAR-CSI SRTM 90 m Digital Elevation Database v4.1. The original DEM (Digital Elevation Model) data were resampled to 250 m and saved in three separate raster datasets named North East, South East, and West. We downloaded these three layers, mosaicked them in the ArcGIS for Desktop environment (herein we refer to its central application, ArcMap), and exported the generated global layer as a single geo-tiff. To generate the maps of the topographic predictors (slope, slope length, TRI, and plan and profile curvatures), it was necessary to have the original DEM in a projected coordinate system. We therefore first projected the global DEM layer to the World Mercator coordinate system (at 259.511 m spatial resolution) using the ArcMap "project raster" tool. To reduce the computational load and avoid system crashes in SAGA GIS, we produced a separate DEM layer in the World Mercator coordinate system, at 1,000 m spatial resolution, to generate the maps of slope length and TRI. For the plan and profile curvatures, the 10-parameter 3rd-order polynomial method was used. We used a square cell with a radius of three for the calculation of TRI.
Other static predictors were directly pre-processed (including projections and per-cell statistics) through the ArcMap GUI, and the following scripts are not applicable to those predictors. Soil texture raster datasets of clay, silt, and sand content at different depths were averaged using the ArcMap "raster calculator" tool. To obtain an average of soil texture properties between the soil surface and 100 cm depth from the SoilGrids250 values available at the five standard depths of 0, 15, 30, 60, and 100 cm, we applied the trapezoidal rule as follows:

Average(0-100 cm) = [15 × (Rval(0) + Rval(15))/2 + 15 × (Rval(15) + Rval(30))/2 + 30 × (Rval(30) + Rval(60))/2 + 40 × (Rval(60) + Rval(100))/2] / 100

where Rval(depth) was the raster value at the corresponding depth. Some predictors were originally in an .hdf format. HDF files are composed of different layers (or sub-datasets), and we required only one or two layers from those sub-datasets. An example of this kind of predictor was the VIP30 v. 004 dataset, where the NDVI and EVI2 layers were the required sub-datasets. We used the following Python code to automate the process of extracting the desirable sub-dataset layers:

## Extract sub-dataset; we used the PyCharm Python IDE (Integrated Development Environment)
## Usage: Extracting the required sub-datasets from predictors with .hdf format and saving the
## final sub-dataset as a geo-tiff.

import arcpy  # Importing the ArcPy module
import os, fnmatch  # Miscellaneous operating system modules for reading the file names in a directory

# Setting the geo-processing environments
arcpy.env.overwriteOutput = True
arcpy.env.workspace = r"The directory of .hdf files"
arcpy.env.outputCoordinateSystem = arcpy.SpatialReference(4326)  # Setting the output coordinates as WGS 1984

# Reading the .hdf files to be processed from a directory
path = r"The directory of .hdf files"
pattern = "*.hdf"
hdf_files = [ff for ff in os.listdir(path) if fnmatch.fnmatch(ff, pattern)]

for i in range(len(hdf_files)):  # i is the index of the .hdf file
    # Extracting the sub-dataset and saving it as a geo-tiff using ArcPy ExtractSubDataset_management;
    # "S_N" is the sub-dataset number in the .hdf file
    arcpy.ExtractSubDataset_management(hdf_files[i], "Output directory" + str(i) + ".tif", "S_N")

In some cases, the original files were in an .nc format. To convert those netCDF files to raster layers, we used the following Python code:

## Making raster layers from a netCDF file
## Usage: This code first extracts the different temporal layers of the netCDF files and then
## saves each layer as a separate raster file in .tif format.

import arcpy  # Importing the ArcPy module and the spatial analysis functions
from arcpy import env
from arcpy.sa import *
import os, fnmatch  # Miscellaneous operating system modules for reading the file names in a directory

# Setting the ArcPy geo-processing environments
arcpy.env.overwriteOutput = True
arcpy.env.workspace = r"The directory of the .nc files"
arcpy.env.outputCoordinateSystem = arcpy.SpatialReference(4326)  # Setting the output coordinates as WGS 1984

After converting all predictors' datasets to raster layers, we used the ArcPy "cell statistics" geo-processing tool to calculate the per-cell average of the dynamic predictors. The temporal resolution of the predictors differed, so we first generated the annual averages of each predictor. For the predictors with a decadal averaging window, we calculated the average in each year from 1971 to 2018. For the predictors with a 5-year averaging window, we computed annual averages from 1976. For the rest of the predictors, we generated annual averages from 1980. Vegetation indices data, including NDVI, EVI2, LAI, and FAPAR, were not available for 1980; we therefore produced their 1980 layers by calculating an average between 1981 and 1985. For instance, we generated the raster layer of NDVI in 1980 (which was missing in the original VIP30 v. 004 dataset) by calculating the per-cell average of the NDVI raster layers between 1981 and 1985. We then computed running-window averages of the predictors with decadal and five-year averaging windows (Table S1) from 1980 to 2018 using the following Python code. For each particular predictor and each year between 1980 and 2018, a raster layer representing the average of the corresponding predictor over the averaging window was generated. We named (labelled) these raster layers with the predictor's name as a prefix and the number of the year to which the raster layer corresponded.
Final averaged rasters of each predictor were saved in separate directories for extraction of the values and further processing.

## Calculation of the per-cell average for predictors with decadal and five-year averaging windows
## Usage: This code gets a large number of rasters in a directory, calculates the per-cell
## average of the rasters, and generates a final raster layer which is the average of the input rasters.

import arcpy  # Importing the ArcPy module
from arcpy import env
from arcpy.sa import *  # Importing all functions from the ArcPy spatial analyst toolbox
import os, fnmatch  # Miscellaneous operating system modules for reading the file names in a directory

# Setting the geo-processing environments
arcpy.env.workspace = r"The directory of rasters for each particular predictor"
arcpy.env.extent = "MAXOF"
arcpy.env.overwriteOutput = True
arcpy.env.outputCoordinateSystem = arcpy.SpatialReference(4326)  # Setting the output coordinates as WGS 1984

C = "Raster name prefix"
for i in range(1980, 2019):  # i is the index of the year
    # Execution of the cell-statistics tool
    # For predictors with a decadal averaging window:
    outCellStatistics = CellStatistics([C + str(i) + ".tif", C + str(i-1) + ".tif", C + str(i-2) + ".tif", …, C + str(i-9) + ".tif"], "MEAN", "DATA")
    # For predictors with a five-year averaging window:
    outCellStatistics = CellStatistics([C + str(i) + ".tif", C + str(i-1) + ".tif", C + str(i-2) + ".tif", …, C + str(i-4) + ".tif"], "MEAN", "DATA")
    # The output of cell-statistics is temporary (held in memory);
    # the following saves it to disk
    outCellStatistics.save("Output folder/Predictor_name_" + str(i) + ".tif")

Some pixels were missing in the final generated rasters of the predictors, mostly in the layers of remotely sensed soil moisture and the vegetation indices. We filled the spatial gaps (pixels with null values) in the data layers using the mean of the surrounding pixels. A circle with a radius of 4 cells around the gap was used to calculate the mean through application of the following Python code:

## Filling the gaps in rasters
## Usage: This code fills the gaps (null cells) in the generated rasters for extraction of
## the predictors' values.

import arcpy  # Importing the ArcPy module
from arcpy import env
from arcpy.sa import *  # Importing the functions of the spatial analyst toolbox
import os, fnmatch  # Miscellaneous operating system modules for reading the file names in a directory

# Setting the geo-processing environments
arcpy.env.parallelProcessingFactor = "100%"
arcpy.env.workspace = r"The directory of rasters"
arcpy.env.extent = "MAXOF"
arcpy.CheckOutExtension("Spatial")

# Acquiring all rasters within a directory
path = r"The directory of rasters"
pattern = "*.tif"
tif_files = [ff for ff in os.listdir(path) if fnmatch.fnmatch(ff, pattern)]

for i in range(len(tif_files)):  # i is the index of the .tif file
    string = tif_files[i]  # Raster layer name
    string_1 = string[0:(len(string)-4)]  # Raster layer name without the .tif suffix
    # Execution of the filling. This part combines the ArcPy Focal Statistics and Raster Calculator tools.
    # A circle with a radius of 4 cells around the gap is used to calculate the average, and the gap is filled with that value
    Rasterfilled = Con(IsNull(string), FocalStatistics(string, NbrCircle(4, "CELL"), "MEAN", "DATA"), string)
    # The output is temporary (held in memory); the following saves it to disk
    arcpy.CopyRaster_management(Rasterfilled, r"The output directory" + string_1 + ".tif")
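The same gap-filling idea can be illustrated without ArcPy (a pure-Python sketch of ours, not the authors' tool): a null cell takes the mean of the valid neighbours within a circular radius.

```python
def fill_gaps(grid, radius=4):
    """Fill None cells with the mean of valid cells within a circular radius."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is not None:
                continue  # only null cells are filled
            vals = [grid[i][j]
                    for i in range(max(0, r - radius), min(rows, r + radius + 1))
                    for j in range(max(0, c - radius), min(cols, c + radius + 1))
                    if grid[i][j] is not None
                    and (i - r) ** 2 + (j - c) ** 2 <= radius ** 2]
            if vals:
                out[r][c] = sum(vals) / len(vals)
    return out
```

Cells with no valid neighbour within the radius stay null, matching the behaviour of the focal-statistics approach above.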

Extracting the predictors' values to training point feature layers
Merging of the training datasets (for ESP) from the different source datasets and their preprocessing (see Methods, Data) was fully accomplished in the ArcMap GUI and the corresponding toolboxes. The indicator of missing values in the original training datasets was replaced by -9,999, soil layer attributes were joined to their corresponding geo-referenced profile locations, and the valid range for ESP was assumed to be 0 to 100%. We then removed the profiles in the AfSP and WISE datasets that spatially intersected the NCSS profiles and merged these three datasets into a single inventory. The final training datasets were saved as point feature class (.shp format) files. We first projected these layers in the ArcMap environment to the World Mercator coordinate system to extract the values of the static predictors in the World Mercator projection. We projected the point shape files instead of the rasters to avoid data loss due to raster resampling.
After extraction, the two shape files were re-projected to the WGS 1984 coordinate system to extract the values of the other predictors. For all static predictors, the extractions were conducted directly with the "extract multi values to points" tool from the ArcMap "spatial analyst" toolbox. However, for the dynamic predictors, we first applied the ArcPy "select layer by attribute" tool to the point feature classes and divided the points according to the year of acquisition. In detail, each point feature class contained points whose x- and y-coordinates represented the locations where soil ECe and ESP were sampled. The attribute table of these points reported the year of acquisition of the sample, the lower sample depth, the upper sample depth (from the soil surface), and the measured value of ECe or ESP for that sample. We selected the samples with the same year of acquisition and exported them as new point feature classes for further processing and extraction of the dynamic predictors' values. A total of 39 point feature layers, labelled by the year of acquisition of the samples, were thus generated (from 1980 onwards). The following Python script shows the selection process:

## Select by attribute
## Usage: This code selects the points in the original datasets (needed for training) based on the
## year of acquisition in the attribute table. It splits the point feature layers of the training
## datasets into smaller point feature layers, each labelled by the name of the year.

import arcpy  # Importing the ArcPy module

# Setting the geo-processing environments
arcpy.env.workspace = r"The directory of training datasets"
arcpy.env.overwriteOutput = True

# Importing the original dataset feature point layer into memory
arcpy.MakeFeatureLayer_management("ESP/ECe.shp", "lyr")

# CC is the prefix of the generated point feature layers ("ECE_" or "ESP_") for each year
CC = "ECE_"  # or "ESP_"
for i in range(1980, 2019):  # i is the index of the year
    # -9999 is the indicator of missing data
    arcpy.SelectLayerByAttribute_management("lyr", "NEW_SELECTION", "Year >= '"+str(i)+"' AND Year < '"+str(i+1)+"' AND NOT Year = '-9999'")
    arcpy.CopyFeatures_management('lyr', "Output directory/" + CC + str(i))

We then extracted the values of the predictors at each year corresponding to the year of acquisition of the point feature layer using the ArcPy "extract multi values to points" geo-processing tool as follows:

import arcpy  # Importing the ArcPy module
from arcpy import env
from arcpy.sa import *  # Importing all functions in the spatial analyst toolbox

# Setting the geo-processing environments
arcpy.CheckOutExtension("Spatial")  # Checking for the spatial analyst license
arcpy.env.workspace = r"The directory of point feature layers"  # The directory should also include the dynamic predictors' raster layers
C = "the prefix of the point feature layers for each year"

The extracted values of each predictor were added to the attribute table of the individual years' point feature layers. The attribute tables were composed of columns with headers named after the predictors and rows representing the sample observations. After extraction of the predictors' values, the point feature layers were merged with the ArcMap "merge" tool, and the final attribute tables were exported as text files (.txt format). These text files were imported into MATLAB for fitting the models and further analysis.

Model training
The prepared text files were then imported into the MATLAB workspace. We calculated the linear Pearson correlation coefficients between each predictor and the target variables as a univariate criterion to filter out unnecessary predictors, assuming no interaction between the predictors (Table S24). The Pearson correlation coefficients between the predictors and the target variables were non-significant, so we retained all predictors for further modelling. Initially, we used the MATLAB Regression and Classification Learner applications to examine the performance of the different built-in models available in the MATLAB Statistics and Machine Learning Toolbox. We used two-part models for mapping the relation between the predictors and the target variables. We held out 25% of the training sets and fitted the models with their default hyperparameters. Tree-based ensemble models performed best among the models tested for our regression and classification tasks (see Table S6).
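The two-part logic, in which a classifier first decides whether a sample is saline/sodic and the regression part then estimates the magnitude only for the saline/sodic classes, can be sketched as follows. Here `classify` and `regress` stand in for the fitted tree-based ensemble models; they are hypothetical placeholders, not the authors' code:

```python
def two_part_predict(x, classify, regress, non_affected_value=0.0):
    """Combine the classification and regression parts of a two-part model."""
    # Part 1: the classifier decides whether the sample is saline/sodic at all.
    if classify(x) == "non-affected":
        return non_affected_value
    # Part 2: the regressor estimates ECe/ESP for saline/sodic samples only.
    return regress(x)
```

Splitting the task this way lets the regressor be trained on the (highly skewed) saline/sodic subset only, instead of on the mass of near-zero observations.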

Generation of soil mask and spatio-temporal predictions
By applying the trained models to a global soil mask, we created the global maps of surface soil salinity and sodicity (0-30 cm) at 0.008333333˚ spatial resolution. To create that soil mask, we first re-projected the 2014 MODIS land cover map from the sinusoidal coordinate system to WGS 1984 using the ArcMap "project raster" tool. During the re-projection, we also resampled the map (originally at 0.004˚ spatial resolution) to 0.008333333˚ resolution (our desired resolution) with the nearest-neighbour resampling method to minimize data loss during the resampling and re-projection steps. We then masked out the pixels labelled as water bodies, permanent wetlands, urban and built-up lands, and permanent snow and ice (numbers 11, 13, 15, and 17 in the map's IGBP legend) using the "mask function" available in the ArcMap image analysis window. When exporting the generated layer, the lower and upper latitude extents were set to -55° and 55°, respectively. Using the ArcMap "split raster" tool, we split the soil mask into smaller tiles so that the final rasters had at most 3,600 pixels (60 rows and 60 columns). A total of 50,687 raster tiles were generated and converted to point feature classes in the World Mercator coordinate system using the following Python script:

## Raster to point conversion; we used the PyCharm Python IDE (Integrated Development Environment)
## Usage: This code converts raster layers to point feature layers.

import arcpy  # Importing the ArcPy module
import multiprocessing  # Importing the multi-processing module
from multiprocessing import Process
import os, fnmatch  # Miscellaneous operating system modules for reading the file names in a directory

# Setting the geo-processing environments
arcpy.env.overwriteOutput = True
arcpy.env.workspace = r"Directory of the raster tiles created from the splitting job"
arcpy.env.extent = "MAXOF"
# Setting the output coordinate system as World Mercator
arcpy.env.outputCoordinateSystem = arcpy.SpatialReference(54004)

# Reading all raster files in a directory
path = r"Directory of the raster tiles created from the splitting job"
pattern = "*.tif"
Tiffs = [ff for ff in os.listdir(path) if fnmatch.fnmatch(ff, pattern)]

# Defining the function that will be passed to the child processes
# Function arguments: ini, end
def cell(ini, end):
    for j in range(ini, end):  # j is the index of the raster file
        inRaster = Tiffs[j]
        string = Tiffs[j]
        outPoint = "Output directory of point feature layers/" + string[0:4] + "_" + str(j) + ".shp"
        field = "VALUE"
        # Execution of the ArcPy raster-to-point tool
        arcpy.RasterToPoint_conversion(inRaster, outPoint, field)

if __name__ == '__main__':
    count = 0
    processes = []
    for i in range(0, number_of_system_cores):  # number_of_system_cores: set to the number of system cores
        ini = count
        end = count + rasters_per_core  # rasters_per_core: the number of rasters converted by each core
        process = Process(target=cell, args=(ini, end,))
        processes.append(process)
        process.start()
        count = end

The output point feature classes were in the World Mercator coordinate system. The values of the static predictors in the World Mercator coordinate system were then extracted to the points in the generated point feature classes.

Trend analysis
As mentioned earlier, within each of the 39 folders we had 50,687 output tables. We used the predicted values of the target variables from those output tables to create annual time series of ECe and ESP between 1980 and 2018. By fitting a linear model to these time series, we generated different layers, including the trend of soil salinity variation since 1980, the likelihood of soils with ECe ≥ 4 dS m-1 or ESP ≥ 6%, and the change in the likelihood of soils with ECe ≥ 4 dS m-1 or ESP ≥ 6% (see Methods). The calculated coefficients (slopes) for locations with p ≥ 0.05 were set to the no-data value. The trend values and x-y coordinates were then converted to raster datasets and mosaicked to generate global longitude-latitude grids of the variation of ECe and ESP at 30" spatial resolution. Additionally, for the classification step of the two-part models, we produced the 39-year mean of the pixel-level Hs (Figure S23). We also generated other layers, including the average of the ECe/ESP values between 1980 and 2018 and the standard deviation of the predictions between 1980 and 2018. The following MATLAB code shows how we performed the trend analysis and computed the statistical layers:
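The core of the trend step is an ordinary least-squares slope fitted to each pixel's 39-year series; a minimal sketch of that computation (ours, omitting the p-value masking of non-significant slopes described above):

```python
def ols_slope(years, values):
    """Ordinary least-squares slope of `values` against `years`."""
    n = len(years)
    my = sum(years) / n
    mv = sum(values) / n
    num = sum((y - my) * (v - mv) for y, v in zip(years, values))
    den = sum((y - my) ** 2 for y in years)
    return num / den
```

Applied per pixel to the 1980-2018 annual predictions, the slope gives the trend layer; in the full analysis, slopes with p ≥ 0.05 are then set to the no-data value.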

Rasterizing the generated tables
To reduce the time required for rasterizing the output tables, we first merged the 50,687 tables and reduced the number of output tables to fewer than 500. During this merging process, we also multiplied the predictions of the target variables by 100, 10,000, or 100,000 (depending on the required accuracy) and rounded the results to remove the decimal point from the predictions. Using signed integer values substantially reduced the disk space required for saving the final raster layers generated from the output tables. The following MATLAB code shows how we merged the tables, converted the float predictions to integers, and defined the no-data value indicators of the final rasters (generated from the output tables):

clc; clear;
%% Table Merger
%% This MATLAB code merges a large number of tables in a directory, passes the contents of the
%% merged table to a desirable number of tables, and exports those tables into another
%% directory. The table variables (except the X and Y coordinates)
%% are converted from float to integer during the merging process.
% Reading all tables in a directory (in .txt format) and copying them into memory
P = strcat('Directory of the original comma delimited input tables');
S = dir(fullfile(P,'*.txt'));
C = cell(1,numel(S));
parfor k = 1:numel(S)
    F = fullfile(P,S(k).name);
    C{k} = readtable(F,'FileType',...

A fragment of the multiprocessing loop used to rasterize the merged tables (analogous to the raster-to-point script above):

    end = count + The number of tables that should be converted to raster by each core
    process = Process(target=cell, args=(ini, end,))
    processes.append(process)
    process.start()
    count = end
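The float-to-integer packing described above can be sketched as follows (a simple illustration with an assumed scale factor of 100; the study used 100, 10,000, or 100,000 depending on the accuracy needed):

```python
def pack(value, scale=100):
    """Scale a float prediction and round it to a signed integer."""
    return int(round(value * scale))

def unpack(packed, scale=100):
    """Recover the (rounded) prediction from its packed integer form."""
    return packed / scale
```

Storing the packed integers instead of floats is what reduces the disk footprint of the final rasters.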

Zonal statistics
As mentioned earlier in the "Model deployment" section, we classified the predicted values of ECe and ESP into four classes or bins for each variable (8 classes in total). This classification enabled us to perform different analyses of the total area of salt-affected soils between various thresholds. Of those four bins for each variable, only the bins with an ECe ≥ 4 dS m-1 or an ESP ≥ 6% were required for the area analysis. Instead of saving the values of ECe or ESP, the calculated areas of the pixels (based on their x-y coordinates) were saved as the final tables (x, y, and Area) in the corresponding folders. The result for each target variable was 39 folders (one for each year between 1980 and 2018), each with 3 sub-folders containing the generated tables for each bin. A script similar to the one presented in the previous section (Rasterizing the generated tables) was used to rasterize all tables within the sub-folders. Each raster represented the area of the pixels labelled with a different class of soil salinity or sodicity. To make this clearer, we provide an example of the final arrangement of the generated raster files. For ECe as a target variable in the 1999 main folder, for example, we had three sub-folders named 1, 2, and 3, representing the three bins of salinity: 4-8, 8-16, and ≥ 16 dS m-1, respectively. Within each sub-folder, we had a set of raster files containing information on the area of the pixels estimated to have an ECe value falling into the corresponding bin of salinity. The following script was used to mosaic the final generated raster files within each sub-folder:

# Raster mosaicking
# This script gets a set of raster datasets in a directory and mosaics all of them into a new raster.
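As an aside on the per-pixel areas used in these tables, the area of a longitude-latitude grid cell shrinks with the cosine of latitude. A sketch under a spherical-Earth assumption (ours; the authors' exact area calculation is not shown), with the default cell size matching the 30" grid:

```python
import math

def pixel_area_km2(lat_deg, cell_deg=1.0 / 120.0, radius_km=6371.0):
    """Approximate area (km^2) of a cell_deg x cell_deg cell centred at lat_deg."""
    d = math.radians(cell_deg)
    return (radius_km * d) ** 2 * math.cos(math.radians(lat_deg))
```

For example, a 1˚ × 1˚ cell covers roughly 12,360 km² at the equator and about half of that at 60° latitude.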