  • The aim of the (Taxonomic) Data Refinement Workflow is to provide a streamlined workflow environment for preparing observational and specimen data sets for use in scientific analysis on the Taverna platform. The workflow has been designed so that it accepts input data in a recognized format but originating from various sources (e.g. services, local user data sets); includes a number of graphical user interfaces to view and interact with the data; ensures that the output of each part of the workflow is compatible with the input of every other part, so that the user is free to choose a specific sequence of actions; and allows for the use of custom-built as well as third-party applications and tools. The workflow can be accessed through the BioVeL Portal at http://biovelportal.vliz.be/workflows?category_id=1 and can be combined with the Ecological Niche Modelling Workflows (http://marine.lifewatch.eu/ecological-niche-modelling). Developed by: Biodiversity Virtual e-Laboratory (BioVeL) (EU FP7 project). Technology or platform: The workflow has been developed to be run in the Taverna automated workflow environment.

  • Background Monitoring hard-bottom marine biodiversity can be challenging as it often involves non-standardised sampling methods that limit scalability and inter-comparison across different monitoring approaches. Therefore, it is essential to implement standardised techniques when assessing the status of and changes in marine communities, in order to give the correct information to support management policy and decisions, and to ensure the most appropriate level of protection for the biodiversity in each ecosystem. Biomonitoring methods need to comply with a number of criteria including the implementation of broadly accepted standards and protocols and the collection of FAIR data (Findable, Accessible, Interoperable, and Reusable). Introduction Artificial substrates represent a promising tool for monitoring community assemblages of hard-bottom habitats with a standardised methodology. The European ARMS project is a long-term observatory network in which about 20 institutions distributed across 14 European countries collaborate, with sites extending to Greenland and Antarctica. The network consists of Autonomous Reef Monitoring Structures (ARMS) which are deployed in the proximity of marine stations and Long-term Ecological Research sites. ARMS units are passive monitoring systems made of stacked settlement plates that are placed on the sea floor. The three-dimensional structure of the settlement units mimics the complexity of marine substrates and attracts sessile and motile benthic organisms. After a certain period of time these structures are brought up, and visual, photographic, and genetic (DNA metabarcoding) assessments are made of the lifeforms that have colonised them. These data are used to systematically assess the status of, and changes in, the hard-bottom communities of near-coast ecosystems. Aims ARMS data are quality controlled and open access, and they are permanently stored (Marine Data Archive) along with their metadata (IMIS, the catalogue of VLIZ), ensuring that the data are FAIR. Data from ARMS observatories provide a promising early-warning system for marine biological invasions by: i) identifying newly arrived Non-Indigenous Species (NIS) at each ARMS site; ii) tracking the migration of already known NIS in European continental waters; iii) monitoring the composition of hard-bottom communities over longer periods; and iv) identifying the Essential Biodiversity Variables (EBVs) for hard-bottom fauna, including NIS. The ARMS validation case was conceived to achieve these objectives: a data-analysis workflow was developed to process raw genetic data from ARMS; end-users can select ARMS samples from the ever-growing number available in the collection; and raw DNA sequences are analysed using a bioinformatic pipeline (PEMA) embedded in the workflow for taxonomic identification. In the data-analysis workflow, the correct identification of taxa in each specific location is made with reference to WoRMS and WRiMS, web services that are used to check, respectively, the identity of the organisms and whether they are introduced.
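The taxonomic checking step just described relies on the WoRMS and WRiMS web services, which the published workflow calls from its Taverna/PEMA components. Purely as an illustration, the hedged Python sketch below matches a list of taxon names against the public WoRMS REST API; the helper function, the example names, and the error handling are illustrative assumptions, and the corresponding WRiMS lookup (introduced status) is only noted, not coded.

```python
# Hedged sketch: match taxon names against the WoRMS REST API.
# Endpoint per the public WoRMS REST docs; names and error handling are illustrative.
import requests

WORMS_REST = "https://www.marinespecies.org/rest"

def match_worms(name: str):
    """Return the first WoRMS record matching a scientific name, or None."""
    url = f"{WORMS_REST}/AphiaRecordsByName/{name}"
    resp = requests.get(url, params={"like": "false", "marine_only": "true"}, timeout=30)
    if resp.status_code == 204:      # no content -> no match
        return None
    resp.raise_for_status()
    records = resp.json()
    return records[0] if records else None

for taxon in ["Mytilus galloprovincialis", "Ficopomatus enigmaticus"]:  # example names only
    rec = match_worms(taxon)
    if rec:
        print(taxon, "->", rec.get("scientificname"), "AphiaID:", rec.get("AphiaID"))
    else:
        print(taxon, "-> no WoRMS match")
# A WRiMS lookup (is the matched taxon registered as introduced?) would follow the
# same request pattern against the WRiMS register; it is omitted here.
```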

  • This workflow integrates the MEDA Toolbox for Matlab and Octave, focusing on data simulation, Principal Component Analysis (PCA), and result visualization. Key steps include simulating multivariate data, applying PCA for data modeling, and creating interactive visualizations. The MEDA Toolbox combines traditional and advanced methods, such as ANOVA Simultaneous Component Analysis (ASCA). The aim is to integrate the MEDA Toolbox into LifeWatch, providing tools for enhanced data analysis and visualization in research. Background This workflow is a template for the integration of the Multivariate Exploratory Data Analysis Toolbox (MEDA Toolbox, https://github.com/codaslab/MEDA-Toolbox) in LifeWatch. The MEDA Toolbox for Matlab and Octave is a set of multivariate analysis tools for the exploration of data sets. There are several alternative tools on the market for that purpose, both commercial and free; the PLS_Toolbox from Eigenvector Inc. is a well-known example. The MEDA Toolbox is not intended to replace or compete with any of these toolkits. Rather, the MEDA Toolbox is a complementary tool that includes several contributions of the Computational Data Science Laboratory (CoDaS Lab) to the field of data analysis. Thus, traditional exploratory plots based on Principal Component Analysis (PCA) or Partial Least Squares (PLS), such as score, loading, and residual plots, are combined with new methods: MEDA, oMEDA, SVI plots, ADICOV, EKF & CKF cross-validation, CSP, GPCA, etc. A main tool in the MEDA Toolbox that has received a lot of attention lately is ANOVA Simultaneous Component Analysis (ASCA), and the ASCA code in the MEDA Toolbox is among the most advanced implementations available. Introduction The workflow integrates three examples of functionality within the MEDA Toolbox. First, there is a data simulation step, in which a matrix of random data is simulated with a user-defined correlation level. The output is sent to a modeling step, in which Principal Component Analysis (PCA) is computed. The PCA model is then sent to a visualization module. Aims The main goal of this template is the integration of the MEDA Toolbox in LifeWatch, including data simulation, data modeling, and data visualization routines. Scientific Questions This workflow only exemplifies the integration of the MEDA Toolbox; no specific scientific questions are addressed.
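The template's three steps (simulation, PCA modelling, visualisation) are implemented in the workflow with MEDA Toolbox routines for Matlab/Octave. Purely as an illustration of the same pipeline outside that toolbox, the hedged Python sketch below simulates correlated multivariate data, fits a PCA model, and produces a score plot and a scree plot; the data sizes and correlation level are assumed values, and the sketch does not reproduce the MEDA Toolbox API.

```python
# Hedged sketch of the template's three steps (simulate -> PCA -> visualise),
# using NumPy/scikit-learn/matplotlib instead of the MEDA Toolbox itself.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 1) Data simulation: 100 observations x 10 variables with a user-defined correlation level.
n_obs, n_vars, corr = 100, 10, 0.7          # assumed sizes and correlation
cov = np.full((n_vars, n_vars), corr)
np.fill_diagonal(cov, 1.0)
X = rng.multivariate_normal(np.zeros(n_vars), cov, size=n_obs)

# 2) Data modelling: PCA on autoscaled data.
Xs = StandardScaler().fit_transform(X)
pca = PCA()
scores = pca.fit_transform(Xs)

# 3) Visualisation: score plot (PC1 vs PC2) and scree plot.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(scores[:, 0], scores[:, 1])
ax1.set_xlabel("PC1"); ax1.set_ylabel("PC2"); ax1.set_title("Score plot")
ax2.plot(range(1, n_vars + 1), pca.explained_variance_ratio_, marker="o")
ax2.set_xlabel("Component"); ax2.set_ylabel("Explained variance ratio"); ax2.set_title("Scree plot")
plt.tight_layout()
plt.show()
```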

  • This workflow employs a deep learning model for blind spectral unmixing, avoiding the need for expensive hyperspectral data. The model processes 224x224 pixel RGB images and associated environmental data to generate CSV files detailing LULC abundance at two levels of detail (N1 and N2). The aim is to provide an efficient tool for LULC monitoring, answering the question: Can LULC abundance be estimated from RGB images and environmental data? This framework supports environmental monitoring and land cover analysis. Background Land Use and Land Cover (LULC) represents earth surface biophysical properties of natural or human origin, such as forests, water bodies, agricultural fields, or urban areas. Often, different LULC types are mixed together in the same analyzed area. Nowadays, spectral imaging sensors allow us to capture these mixed LULC types (i.e., endmembers) together as different spectral data signals. The identification of LULC types within a spectral mixture (i.e., endmember identification) and the quantitative assessment of their abundance (i.e., endmember abundance estimation) play a key role in understanding earth surface transformations and climate change effects. These two tasks are carried out through spectral unmixing algorithms, by which the measured spectrum of a mixed image is decomposed into a collection of constituents (i.e., spectra, or endmembers) and a set of fractions indicating their abundances. Introduction Early research on spectral unmixing dates back more than three decades. First attempts, referred to as linear unmixing, assumed that the spectral response recorded for an LULC mixture is simply an additive function of the spectral response of each class weighted by its proportional coverage. Notably, some authors used linear regression and similar linear mixture-based techniques in order to relate the spectral response to its class composition. Afterwards, other authors claimed the necessity of overcoming this assumption by proposing non-linear unmixing methods. However, non-linear methods require endmember spectra extraction for each LULC class, which several works have found difficult. Moreover, some studies indicated that it is unlikely that the spectra could be derived directly from the remotely sensed data, since the majority of image pixels may be mixed. To overcome these limitations, several works introduced what is called blind spectral unmixing as an alternative method that avoids the need to derive any endmember spectra or to make any prior assumptions about their mixing nature. However, the majority of works that adopted blind spectral unmixing used deep learning-based models trained with expensive and hard-to-process hyperspectral or multispectral images. Therefore, many researchers during the last decade pointed out that more effort should be dedicated to the use of more affordable remote sensing data with few bands for spectral unmixing. They justified this need by two important factors: (1) In real situations, we might have access to images with only a few bands because of their availability, cost-effectiveness, and acquisition time-efficiency in comparison to imagery gathered with multi-band devices, which requires more processing effort and expense; (2) In some cases, we do not really need a huge number of bands, as a smaller set can serve as a fundamental dataset from which optimal wavebands for a particular application are determined.
In parallel, high-quality research on the application of artificial intelligence to remote sensing imagery, such as computer vision techniques and especially deep learning (DL), continues to achieve breakthroughs that encourage researchers to entrust remote sensing analysis tasks to these models with confidence in their performance. Aims The objective of this work is to present what is, to our knowledge, the first study that explores a multi-task deep learning approach for blind spectral unmixing using only 224x224-pixel RGB images derived from Sentinel-2 and enriched with their corresponding environmental ancillary data (topographic and climatic data), without the need for any expensive and complex hyperspectral or multispectral data. The deep learning model used in this study is trained with a multi-task learning (MTL) approach, which combines information from different tasks to improve the performance of the model on each specific task, motivated by the idea that different tasks can share common feature representations. Thus, the model provided in this workflow was optimized for the endmember abundance estimation task, which aims to quantify the spatial percentage covered by each LULC type within the analyzed RGB image, while also being trained on other spectral unmixing related tasks that improve its accuracy on this main task. For each input (RGB image + ancillary data), the model outputs the endmember abundance values within the covered area, summarized in an output CSV file. The results can be computed for two different levels, N1 and N2. These two levels reflect the two land use/cover level definitions of the SIPNA land use/cover mapping campaign (Sistema de Información sobre el Patrimonio Natural de Andalucía), which aims to build an information system on the natural heritage of Andalusia in Spain (https://www.juntadeandalucia.es/medioambiente/portal/landing-page-%C3%ADndice/-/asset_publisher/zX2ouZa4r1Rf/content/sistema-de-informaci-c3-b3n-sobre-el-patrimonio-natural-de-andaluc-c3-ada-sipna-/20151). The first level "N1" contains four high-level LULC classes, whereas the second level "N2" contains ten finer-level LULC classes. Accordingly, the model was mainly trained and validated on the region of Andalusia in Spain. Scientific Questions Through the development of this workflow, we aim to address the following main scientific question: Can we estimate the abundance of each land use/land cover type inside an RGB satellite image using only the RGB image and the environmental ancillary data corresponding to the area covered by the image?
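To make the model's inputs and outputs concrete, the hedged PyTorch sketch below shows one plausible shape for such a multi-task abundance estimator: a CNN backbone on the 224x224 RGB image, a small branch for the ancillary vector, and two softmax heads whose outputs are non-negative fractions summing to one (abundances at N1 with 4 classes and N2 with 10 classes). The backbone choice, layer sizes, and ancillary dimension are illustrative assumptions, not the published architecture.

```python
# Hedged sketch of a multi-task abundance estimator (not the published architecture).
import torch
import torch.nn as nn
import torchvision

class AbundanceMTL(nn.Module):
    def __init__(self, n_ancillary=8, n1_classes=4, n2_classes=10):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)   # CNN feature extractor
        backbone.fc = nn.Identity()                            # keep 512-d image features
        self.backbone = backbone
        self.ancillary = nn.Sequential(nn.Linear(n_ancillary, 32), nn.ReLU())
        self.head_n1 = nn.Linear(512 + 32, n1_classes)         # coarse level (N1)
        self.head_n2 = nn.Linear(512 + 32, n2_classes)         # fine level (N2)

    def forward(self, rgb, ancillary):
        feats = torch.cat([self.backbone(rgb), self.ancillary(ancillary)], dim=1)
        # Softmax turns each head's logits into fractions summing to 1,
        # i.e. per-image abundance estimates for each LULC class.
        return self.head_n1(feats).softmax(dim=1), self.head_n2(feats).softmax(dim=1)

model = AbundanceMTL()
rgb = torch.randn(2, 3, 224, 224)      # two example RGB tiles
aux = torch.randn(2, 8)                # two example ancillary vectors (assumed 8 variables)
n1, n2 = model(rgb, aux)
print(n1.shape, n2.shape)              # (2, 4) and (2, 10); rows could be written to CSV
```

Training such a model would typically minimise a divergence (e.g. mean squared error or cross-entropy) between the predicted and reference abundance vectors for each head, with the heads sharing the backbone features in the MTL spirit described above.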

  • Land Use and Land Cover (LULC) maps are crucial for environmental monitoring. This workflow uses Remote Sensing (RS) and Artificial Intelligence (AI) to automatically create LULC maps by estimating the relative abundance of LULC classes. Using MODIS data and ancillary geographic information, an AI model was trained and validated in Andalusia, Spain, providing a tool for accurate and efficient LULC mapping. Background Land Use and Land Cover (LULC) maps are of paramount importance to provide precise information for dynamic monitoring, planning, and management of the Earth. Regularly updated global LULC datasets provide the basis for understanding the status, trends, and pressures of human activity on carbon cycles, biodiversity, and other natural and anthropogenic processes. Because of that, being able to automatically create these maps without human labor by using new Remote Sensing (RS) and Artificial Intelligence (AI) technologies is a great avenue to explore. Introduction In the last few decades, LULC maps have been created using RS images following the "raster data model", where the Earth's surface is divided into squares of a certain spatial resolution called pixels. Then, each of these pixels is assigned a "LULC class" (e.g., forest, water, urban...) that represents the underlying type of the Earth surface in each pixel. The number of different classes of a LULC map is referred to as thematic resolution. Frequently, the spatial and thematic resolutions do not match, which leads to the mixed pixel problem, i.e., pixels are not pure but contain several LULC classes. Under a "hard" classification approach, a mixed pixel would be assigned just one LULC class (e.g., the dominant class), while under a "soft" classification approach (also called spectral unmixing or abundance estimation) the relative abundance of each LULC class is provided per pixel. Moreover, ancillary information regarding the geographic, topographic, and climatic characteristics of the studied area could also be useful to classify each pixel into its corresponding LULC class. Concretely, the following ancillary variables are studied: GPS coordinates, altitude, slope, precipitation, potential evapotranspiration, mean temperature, maximum temperature, and minimum temperature. Aims To estimate the relative abundance of LULC classes in Andalusia and develop an AI model to automatically perform the task, a new labeled dataset of MODIS pixels at 460 m resolution covering Andalusia was built. Each pixel is a multi-spectral time series and includes the corresponding ancillary information. Also, each pixel is labeled with its corresponding LULC class abundances inside that pixel. The label is provided at two hierarchical levels, namely N1 (coarser) and N2 (finer). To create these labels, the SIPNA (Sistema de Información sobre el Patrimonio Natural de Andalucía) product was used, which aims to build an information system on the natural heritage of Andalusia. The first level "N1" contains four high-level LULC classes, whereas the second level "N2" contains ten finer LULC classes. Thus, this model was mainly trained and validated in the region of Andalusia in Spain. Once the dataset was created, the AI model was trained using about 80% of the data and then validated with the remaining 20%, following a careful spatial block splitting strategy to avoid spatial autocorrelation (illustrated in the sketch after this entry). The AI model processes the multi-spectral time series from MODIS at 460 m resolution and the ancillary information to predict the LULC abundances in each pixel.
Both the RS dataset with the ancillary data used to create the AI model and the AI model itself are the deliverables of this project. In summary, we provide an automatic tool to estimate the LULC class abundances of MODIS pixels from Andalusia using a soft classification approach, and we set out a methodology that could be applied to other satellites whose better spatial resolution allows the use of finer LULC classes in the future. Also, the AI model could serve as a starting point for researchers interested in applying the model in other locations, i.e., they can fine-tune the existing model with data from the new region of interest, requiring far less training data thanks to the transfer of the patterns our model has already learned. Scientific Questions Through the development of this workflow, we aim to address three main scientific questions: 1. Can we predict LULC abundances in a particular place using remote sensing data, ancillary data, and AI technologies?
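The 80/20 split with spatial blocks mentioned above is the part of the methodology most easily shown in code. The hedged sketch below illustrates one common way to build such a split: pixels are grouped into coarse geographic blocks by their coordinates and whole blocks are assigned to either training or validation, so that nearby (spatially autocorrelated) pixels never end up on both sides. The column names, block size, and example coordinates are assumptions, not the workflow's exact parameters.

```python
# Hedged sketch of a spatial block train/validation split (assumed column names).
import numpy as np
import pandas as pd

def spatial_block_split(df, block_size_deg=0.5, val_fraction=0.2, seed=0):
    """Assign whole lon/lat blocks to train or validation to limit spatial autocorrelation."""
    blocks = (
        (df["lon"] // block_size_deg).astype(int).astype(str)
        + "_"
        + (df["lat"] // block_size_deg).astype(int).astype(str)
    )
    unique_blocks = blocks.unique()
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_blocks)
    n_val = int(len(unique_blocks) * val_fraction)
    val_blocks = set(unique_blocks[:n_val])
    is_val = blocks.isin(val_blocks)
    return df[~is_val], df[is_val]

# Example with synthetic pixel coordinates roughly spanning Andalusia:
pixels = pd.DataFrame({"lon": np.random.uniform(-7.5, -1.6, 1000),
                       "lat": np.random.uniform(36.0, 38.7, 1000)})
train, val = spatial_block_split(pixels)
print(len(train), len(val))
```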

  • Background Freshwater ecosystems have been profoundly affected by habitat loss, degradation, and overexploitation, leaving them especially vulnerable to biological invasions. Whether non-indigenous species (NIS) are the key drivers or mere complementary factors of biodiversity loss is still debated among the scientific community; however, biological invasions, together with other anthropogenic stressors, are driving population declines and the homogenisation of biodiversity in freshwater ecosystems worldwide. For example, it has been demonstrated that river basins with greater numbers of non-indigenous species have higher extinction rates of native fish species. Consequently, the application of effective biomonitoring approaches to support the protection actions of managers, stakeholders and policy-makers is now essential. Introduction Conventional methods of monitoring freshwater fish diversity are based on direct observation of organisms and are therefore costly, labour and resource intensive, require taxonomic expertise, and can be invasive. Obtaining information about species and communities by retrieving DNA from environmental samples can overcome some of these difficulties. The genetic material retrieved from such samples is known as environmental DNA (eDNA). Environmental DNA can be isolated from water, soil, air or faeces, as organisms shed their genetic material into the surroundings through metabolic waste, damaged tissues, sloughed skin cells and decomposition. The analysis of eDNA consists of extracting the genetic material and subjecting it to a Polymerase Chain Reaction (PCR), which amplifies the target DNA. The use of high-throughput sequencing (HTS) allows the simultaneous identification of many species within a certain taxonomic group. This community-wide approach is known as eDNA metabarcoding and involves the use of broad-range primers during PCR that amplify DNA from a set of species. In recent years, the cost of this technology has drastically decreased, making it very attractive in conservation management and scientific research. A number of studies have demonstrated that eDNA metabarcoding is more sensitive than conventional biomonitoring methods for freshwater fish, as it can detect rare or low-abundance taxa. As a result, eDNA metabarcoding can be used as an early-warning tool to detect new NIS at the initial stages of colonisation, when they are not yet abundant in the ecosystem. Aims This validation case concerns eDNA metabarcoding sequences of fish collected from the Douro Basin in Portugal. DNA sequences are processed through a bioinformatic pipeline wrapped in the first part of the analytical workflow, which conducts a quality check and taxonomically assigns the DNA sequences to produce a list of taxa. The analytical workflow can process DNA sequences of different kinds, depending on the genetic markers used for the analysis, and so it can be applied to different taxonomic groups and ecosystems. The taxa identified might include indigenous organisms as well as newly identified taxa within a certain geographical region. For that reason, the national checklists of introduced and invasive species (GRIIS) from GBIF are consulted to check whether the organisms detected are recognised as NIS or whether previously unrecorded NIS have been detected through the eDNA metabarcoding analysis.
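As an illustration of that final checking step, the hedged Python sketch below matches detected taxon names against the GBIF backbone taxonomy using the public GBIF species-match API and then compares them with a locally downloaded GRIIS checklist; the checklist file name, its column layout, and the example species are assumptions, not part of the published workflow.

```python
# Hedged sketch: match taxa to the GBIF backbone, then flag names present in a
# locally downloaded GRIIS national checklist (assumed CSV with a 'scientificName' column).
import requests
import pandas as pd

def gbif_match(name: str) -> dict:
    """Match a scientific name against the GBIF backbone taxonomy."""
    r = requests.get("https://api.gbif.org/v1/species/match", params={"name": name}, timeout=30)
    r.raise_for_status()
    return r.json()

griis = pd.read_csv("griis_portugal_checklist.csv")          # placeholder local export of a GRIIS checklist
griis_names = set(griis["scientificName"].str.strip())

for taxon in ["Lepomis gibbosus", "Squalius carolitertii"]:   # example detected taxa
    match = gbif_match(taxon)
    canonical = match.get("canonicalName", taxon)
    flagged = canonical in griis_names
    print(f"{taxon}: GBIF match = {canonical!r}, listed in GRIIS checklist = {flagged}")
```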

  • Accurately mapping vegetation is crucial for environmental monitoring. Traditional methods for identifying shrubs are labor-intensive and impractical for large areas. This workflow uses remote sensing and deep learning to detect Juniperus shrubs from high-resolution RGB satellite images, making shrub identification more efficient and accessible to non-experts in machine learning. Background In a dynamic climate, accurately mapping vegetation distribution is essential for environmental monitoring, biodiversity conservation, forestry, and urban planning. One important application of vegetation mapping is the identification of shrub individuals. By shrub identification we mean the detection of shrub locations and the segmentation of shrub morphology. Introduction Yet shrub species monitoring is a challenging task. Ecologists have traditionally identified shrubs using classical field surveying methods; however, this process poses a significant challenge since shrubs are often distributed over large areas that are frequently inaccessible. Thus, these methods are labor-intensive, costly, time-consuming, unsustainable, and limited to small spatial and temporal scales, and their data are often not publicly available. Combining remote sensing and deep learning, however, can play a significant role in tackling these challenges, providing a great opportunity to improve plant surveying. First, remote sensing can offer highly detailed spatial resolution, granting exceptional flexibility in data acquisition. These data can then be processed by deep learning models for the automatic identification of shrubs. Aims The objective of this workflow is to help scientists who are not experts in machine learning detect Juniperus shrubs from very-high-resolution RGB satellite images using deep learning and remote sensing tools. Scientific Questions Can we accurately detect high-mountain Juniperus shrubs from very-high-resolution RGB satellite images using deep learning?
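For readers who want to see what "detection and segmentation with deep learning" looks like in practice, the hedged PyTorch sketch below adapts a standard torchvision Mask R-CNN to a two-class problem (background vs. shrub). This mirrors a common instance segmentation recipe, not necessarily the exact model used in the workflow; the class count, tile size, and pretrained-weights choice are assumptions.

```python
# Hedged sketch: adapt torchvision's Mask R-CNN to detect/segment shrubs
# (2 classes: background + shrub). A generic recipe, not necessarily the workflow's model.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2  # background + Juniperus shrub

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box head for our class count.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask head for our class count.
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, num_classes)

model.eval()
with torch.no_grad():
    # One dummy RGB tile; real use would feed very-high-resolution image tiles.
    prediction = model([torch.rand(3, 512, 512)])[0]
print(prediction["boxes"].shape, prediction["masks"].shape)
```

In practice the replaced heads would be fine-tuned on annotated shrub tiles before the model is used for prediction.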

  • This workflow focuses on analyzing diverse soil datasets using PCA to understand their physicochemical properties. It connects to a MongoDB database to retrieve soil samples based on user-defined filters. Key objectives include variable selection, data quality improvement, standardization, and conducting PCA for data variance and pattern analysis. The workflow generates graphical representations, such as covariance and correlation matrices, scree plots, and scatter plots, to enhance data interpretability. This facilitates the identification of significant variables, data structure exploration, and optimal component determination for effective soil analysis. Background - Understanding the intricate relationships and patterns within soil samples is crucial for various environmental and agricultural applications. Principal Component Analysis (PCA) serves as a powerful tool in unraveling the complexity of multivariate soil datasets. Soil datasets often consist of numerous variables representing diverse physicochemical properties, making PCA an invaluable method for: ∙Dimensionality Reduction: Simplifying the analysis without compromising data integrity by reducing the dimensionality of large soil datasets. ∙Identification of Dominant Patterns: Revealing dominant patterns or trends within the data, providing insights into key factors contributing to overall variability. ∙Exploration of Variable Interactions: Enabling the exploration of complex interactions between different soil attributes, enhancing understanding of their relationships. ∙Interpretability of Data Variance: Clarifying how much variance is explained by each principal component, aiding in discerning the significance of different components and variables. ∙Visualization of Data Structure: Facilitating intuitive comprehension of data structure through plots such as scatter plots of principal components, helping identify clusters, trends, and outliers. ∙Decision Support for Subsequent Analyses: Providing a foundation for subsequent analyses by guiding decision-making, whether in identifying influential variables, understanding data patterns, or selecting components for further modeling. Introduction The motivation behind this workflow is rooted in the imperative need to conduct a thorough analysis of a diverse soil dataset, characterized by an array of physicochemical variables. Comprising multiple rows, each representing distinct soil samples, the dataset encompasses variables such as percentage of coarse sands, percentage of organic matter, hydrophobicity, and others. The intricacies of this dataset demand a strategic approach to preprocessing, analysis, and visualization. This workflow introduces a novel approach by connecting to a MongoDB, an agile and scalable NoSQL database, to retrieve soil samples based on user-defined filters. These filters can range from the natural site where the samples were collected to the specific date of collection. Furthermore, the workflow is designed to empower users in the selection of relevant variables, a task facilitated by user-defined parameters. This flexibility allows for a focused and tailored dataset, essential for meaningful analysis. Acknowledging the inherent challenges of missing data, the workflow offers options for data quality improvement, including optional interpolation of missing values or the removal of rows containing such values. Standardizing the dataset and specifying the target variable are crucial, establishing a robust foundation for subsequent statistical analyses. 
Incorporating PCA offers a sophisticated approach, enabling users to explore inherent patterns and structures within the data. The adaptability of PCA allows users to customize the analysis by specifying the number of components or desired variance. The workflow concludes with practical graphical representations, including covariance and correlation matrices, a scree plot, and a scatter plot, offering users valuable visual insights into the complexities of the soil dataset. Aims The primary objectives of this workflow are tailored to address specific challenges and goals inherent in the analysis of diverse soil samples: ∙Connect to MongoDB and retrieve data: Dynamically connect to a MongoDB database, allowing users to download soil samples based on user-defined filters. ∙Variable selection: Empower users to extract relevant variables based on user-defined parameters, facilitating a focused and tailored dataset. ∙Data quality improvement: Provide options for interpolation or removal of missing values to ensure dataset integrity for downstream analyses. ∙Standardization and target specification: Standardize the dataset values and designate the target variable, laying the groundwork for subsequent statistical analyses. ∙PCA: Conduct PCA with flexibility, allowing users to specify the number of components or desired variance for a comprehensive understanding of data variance and patterns. ∙Graphical representations: Generate visual outputs, including covariance and correlation matrices, a scree plot, and a scatter plot, enhancing the interpretability of the soil dataset. Scientific questions - This workflow addresses critical scientific questions related to soil analysis: ∙Facilitate Data Access: To streamline the retrieval of systematically stored soil sample data from the MongoDB database, aiding researchers in accessing organized data previously stored. ∙Variable importance: Identify variables contributing significantly to principal components through the covariance matrix and PCA. ∙Data structure: Explore correlations between variables and gain insights from the correlation matrix. ∙Optimal component number: Determine the optimal number of principal components using the scree plot for effective representation of data variance. ∙Target-related patterns: Analyze how selected principal components correlate with the target variable in the scatter plot, revealing patterns based on target variable values.
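The distinguishing step of this variant is the MongoDB retrieval. The hedged sketch below shows how such a filtered query could feed the rest of the analysis using pymongo, pandas, and scikit-learn; the connection URI, database and collection names, field names, and filter values are all placeholders standing in for the workflow's user-defined parameters.

```python
# Hedged sketch: pull filtered soil samples from MongoDB into a DataFrame and run PCA.
# URI, database/collection names, field names and filter values are placeholders.
import pandas as pd
from pymongo import MongoClient
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

client = MongoClient("mongodb://localhost:27017")             # placeholder connection URI
collection = client["soils"]["samples"]                       # placeholder database / collection

# User-defined filters, e.g. by natural site and collection date.
query = {"site": "Sierra Nevada", "date": {"$gte": "2020-01-01"}}
df = pd.DataFrame(list(collection.find(query)))

variables = ["coarse_sand_pct", "organic_matter_pct", "hydrophobicity"]  # assumed variable names
X = df[variables].interpolate().dropna()                       # optional interpolation, then drop leftovers
X_std = StandardScaler().fit_transform(X)                      # standardisation

pca = PCA(n_components=0.95)                                   # keep components explaining 95% of variance
scores = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)
```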

  • The SoilExcel workflow is a tool designed to optimize soil data analysis. It covers data preparation, statistical analysis methods, and result visualization. SoilExcel integrates various environmental data types and applies advanced techniques to enhance accuracy in soil studies. The results demonstrate its effectiveness in interpreting complex data, aiding decision-making in environmental management projects. Background Understanding the intricate relationships and patterns within soil samples is crucial for various environmental and agricultural applications. Principal Component Analysis (PCA) serves as a powerful tool in unraveling the complexity of multivariate soil datasets. Soil datasets often consist of numerous variables representing diverse physicochemical properties, making PCA an invaluable method for: ∙Dimensionality Reduction: Simplifying the analysis without compromising data integrity by reducing the dimensionality of large soil datasets. ∙Identification of Dominant Patterns: Revealing dominant patterns or trends within the data, providing insights into key factors contributing to overall variability. ∙Exploration of Variable Interactions: Enabling the exploration of complex interactions between different soil attributes, enhancing understanding of their relationships. ∙Interpretability of Data Variance: Clarifying how much variance is explained by each principal component, aiding in discerning the significance of different components and variables. ∙Visualization of Data Structure: Facilitating intuitive comprehension of data structure through plots such as scatter plots of principal components, helping identify clusters, trends, and outliers. ∙Decision Support for Subsequent Analyses: Providing a foundation for subsequent analyses by guiding decision-making, whether in identifying influential variables, understanding data patterns, or selecting components for further modeling. Introduction The motivation behind this workflow is rooted in the imperative need to conduct a thorough analysis of a diverse soil dataset, characterized by an array of physicochemical variables. Comprising multiple rows, each representing distinct soil samples, the dataset encompasses variables such as percentage of coarse sands, percentage of organic matter, hydrophobicity, and others. The intricacies of this dataset demand a strategic approach to preprocessing, analysis, and visualization. To lay the groundwork, the workflow begins with the transformation of an initial Excel file into a CSV format, ensuring improved compatibility and ease of use throughout subsequent analyses. Furthermore, the workflow is designed to empower users in the selection of relevant variables, a task facilitated by user-defined parameters. This flexibility allows for a focused and tailored dataset, essential for meaningful analysis. Acknowledging the inherent challenges of missing data, the workflow offers options for data quality improvement, including optional interpolation of missing values or the removal of rows containing such values. Standardizing the dataset and specifying the target variable are crucial, establishing a robust foundation for subsequent statistical analyses. Incorporating PCA offers a sophisticated approach, enabling users to explore inherent patterns and structures within the data. The adaptability of PCA allows users to customize the analysis by specifying the number of components or desired variance.
The workflow concludes with practical graphical representations, including covariance and correlation matrices, a scree plot, and a scatter plot, offering users valuable visual insights into the complexities of the soil dataset. Aims The primary objectives of this workflow are tailored to address specific challenges and goals inherent in the analysis of diverse soil samples: ∙Data transformation: Efficiently convert the initial Excel file into a CSV format to enhance compatibility and ease of use. ∙Variable selection: Empower users to extract relevant variables based on user-defined parameters, facilitating a focused and tailored dataset. ∙Data quality improvement: Provide options for interpolation or removal of missing values to ensure dataset integrity for downstream analyses. ∙Standardization and target specification: Standardize the dataset values and designate the target variable, laying the groundwork for subsequent statistical analyses. ∙PCA: Conduct PCA with flexibility, allowing users to specify the number of components or desired variance for a comprehensive understanding of data variance and patterns. ∙Graphical representations: Generate visual outputs, including covariance and correlation matrices, a scree plot, and a scatter plot, enhancing the interpretability of the soil dataset. Scientific questions This workflow addresses critical scientific questions related to soil analysis: ∙Variable importance: Identify variables contributing significantly to principal components through the covariance matrix and PCA. ∙Data structure: Explore correlations between variables and gain insights from the correlation matrix. ∙Optimal component number: Determine the optimal number of principal components using the scree plot for effective representation of data variance. ∙Target-related patterns: Analyze how selected principal components correlate with the target variable in the scatter plot, revealing patterns based on target variable values.
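As a compact illustration of the SoilExcel steps (Excel-to-CSV conversion, variable selection, missing-value handling, standardisation, PCA, and plots), the hedged sketch below strings them together with pandas, scikit-learn, and matplotlib; the file name and column names are placeholders standing in for user-defined parameters, and the covariance and correlation matrices are omitted for brevity.

```python
# Hedged sketch of the SoilExcel steps; file and column names are placeholders.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 1) Excel -> CSV for improved downstream compatibility.
df = pd.read_excel("soil_samples.xlsx")                  # placeholder input file
df.to_csv("soil_samples.csv", index=False)

# 2) Variable selection and missing-value handling (interpolate, then drop remaining gaps).
variables = ["coarse_sand_pct", "organic_matter_pct", "hydrophobicity"]  # assumed columns
target = "site"                                                          # assumed target variable
data = df[variables + [target]].copy()
data[variables] = data[variables].interpolate()
data = data.dropna()

# 3) Standardisation and PCA.
X = StandardScaler().fit_transform(data[variables])
pca = PCA()
scores = pca.fit_transform(X)

# 4) Graphical outputs: scree plot and a scatter of the first two components coloured by target.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker="o")
ax1.set_title("Scree plot"); ax1.set_xlabel("Component"); ax1.set_ylabel("Explained variance ratio")
plot_df = pd.DataFrame({"PC1": scores[:, 0], "PC2": scores[:, 1], target: data[target].values})
for label, group in plot_df.groupby(target):
    ax2.scatter(group["PC1"], group["PC2"], label=str(label))
ax2.set_title("Scores (PC1 vs PC2)"); ax2.set_xlabel("PC1"); ax2.set_ylabel("PC2"); ax2.legend()
plt.tight_layout()
plt.show()
```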

  • This workflow aims to analyze diverse soil datasets using PCA to understand physicochemical properties. The process starts with converting SPSS (.sav) files into CSV format for better compatibility. It emphasizes variable selection, data quality improvement, standardization, and conducting PCA for data variance and pattern analysis. The workflow includes generating graphical representations like covariance and correlation matrices, scree plots, and scatter plots. These tools aid in identifying significant variables, exploring data structure, and determining optimal components for effective soil analysis. Background Understanding the intricate relationships and patterns within soil samples is crucial for various environmental and agricultural applications. Principal Component Analysis (PCA) serves as a powerful tool in unraveling the complexity of multivariate soil datasets. Soil datasets often consist of numerous variables representing diverse physicochemical properties, making PCA an invaluable method for: ∙Dimensionality Reduction: Simplifying the analysis without compromising data integrity by reducing the dimensionality of large soil datasets. ∙Identification of Dominant Patterns: Revealing dominant patterns or trends within the data, providing insights into key factors contributing to overall variability. ∙Exploration of Variable Interactions: Enabling the exploration of complex interactions between different soil attributes, enhancing understanding of their relationships. ∙Interpretability of Data Variance: Clarifying how much variance is explained by each principal component, aiding in discerning the significance of different components and variables. ∙Visualization of Data Structure: Facilitating intuitive comprehension of data structure through plots such as scatter plots of principal components, helping identify clusters, trends, and outliers. ∙Decision Support for Subsequent Analyses: Providing a foundation for subsequent analyses by guiding decision-making, whether in identifying influential variables, understanding data patterns, or selecting components for further modeling. Introduction The motivation behind this workflow is rooted in the imperative need to conduct a thorough analysis of a diverse soil dataset, characterized by an array of physicochemical variables. Comprising multiple rows, each representing distinct soil samples, the dataset encompasses variables such as percentage of coarse sands, percentage of organic matter, hydrophobicity, and others. The intricacies of this dataset demand a strategic approach to preprocessing, analysis, and visualization. This workflow centers around the exploration of soil sample variability through PCA, utilizing data formatted in SPSS (.sav) files. These files, specific to the Statistical Package for the Social Sciences (SPSS), are commonly used for data analysis. To lay the groundwork, the workflow begins with the transformation of an initial SPSS file into a CSV format, ensuring improved compatibility and ease of use throughout subsequent analyses. Incorporating PCA offers a sophisticated approach, enabling users to explore inherent patterns and structures within the data. The adaptability of PCA allows users to customize the analysis by specifying the number of components or desired variance. The workflow concludes with practical graphical representations, including covariance and correlation matrices, a scree plot, and a scatter plot, offering users valuable visual insights into the complexities of the soil dataset. 
Aims The primary objectives of this workflow are tailored to address specific challenges and goals inherent in the analysis of diverse soil samples: ∙Data transformation: Efficiently convert the initial SPSS file into a CSV format to enhance compatibility and ease of use. ∙Standardization and target specification: Standardize the dataset and designate the target variable, ensuring consistency and preparing the data for subsequent PCA. ∙PCA: Conduct PCA to explore patterns and variability within the soil dataset, facilitating a deeper understanding of the relationships between variables. ∙Graphical representations: Generate graphical outputs, such as covariance and correlation matrices, aiding users in visually interpreting the complexities of the soil dataset. Scientific questions This workflow addresses critical scientific questions related to soil analysis: ∙Variable importance: Identify variables contributing significantly to principal components through the covariance matrix and PCA. ∙Data structure: Explore correlations between variables and gain insights from the correlation matrix. ∙Optimal component number: Determine the optimal number of principal components using the scree plot for effective representation of data variance. ∙Target-related patterns: Analyze how selected principal components correlate with the target variable in the scatter plot, revealing patterns based on target variable values.
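The only step that differs from the preceding variants is the input format. A hedged sketch of the SPSS-to-CSV conversion using pyreadstat is shown below (the file name is a placeholder); the standardisation, PCA, and plotting steps then proceed exactly as in the SoilExcel sketch above.

```python
# Hedged sketch: convert an SPSS .sav file to CSV before the PCA steps (placeholder file name).
import pyreadstat

df, meta = pyreadstat.read_sav("soil_samples.sav")    # returns a DataFrame plus SPSS metadata
print(meta.column_names)                              # inspect the available variables
df.to_csv("soil_samples.csv", index=False)
```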