
last revised version of paper before zenodo
enricgrau committed Nov 20, 2023
1 parent 66a5534 commit 54142ff
Showing 2 changed files with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions paper/paper.md
@@ -23,9 +23,9 @@ authors:
orcid: 0000-0002-5502-3133
affiliation: "1"
affiliations:
-  - name: Catalonia Institute for Energy Research (IREC), Jardins de les Dones de Negre 1, 08930 Sant Adrià de Besòs, Spain
+  - name: Catalonia Institute for Energy Research (IREC), Jardins de les Dones de Negre 1, 08930 Sant Adrià de Besòs, Spain.
index: 1
-  - name: Departament d'Enginyeria Electrònica i Biomèdica, IN2UB, Universitat de Barcelona, C/ Martí i Franqués 1, 08028 Barcelona, Spain
+  - name: Departament d'Enginyeria Electrònica i Biomèdica, IN2UB, Universitat de Barcelona, C/ Martí i Franqués 1, 08028 Barcelona, Spain.
index: 2
date: 30 June 2023
bibliography: paper.bib
@@ -36,12 +36,12 @@ bibliography: paper.bib

Spectroscopic techniques (e.g. Raman, photoluminescence, reflectance, transmittance, X-ray fluorescence) are an important and widely used resource in different fields of science, such as photovoltaics [@Fonoll-Rubio2022] [@Grau-Luque2021], cancer [@Bellisola2012], superconductors [@Fischer2007], polymers [@Easton2020], corrosion [@Haruna2023], forensics [@Bhatt2023], and environmental sciences [@Estefany2023], to name just a few. This is due to the versatile, non-destructive and fast acquisition nature that provides a wide range of material properties, such as composition, morphology, molecular structure, optical and electronic properties. As such, machine learning (ML) has been used to analyze spectral data for several years, elucidating their vast complexity, and uncovering further potential on the information contained within them [@Goodacre2003] [@Luo2022]. Unfortunately, most of these ML analyses lack further interpretation of the derived results due to the complex nature of such algorithms. In this regard, interpreting the results of ML algorithms has become an increasingly important topic, as concerns about the lack of interpretability of these models have grown [@Burkart2021]. In natural sciences (like materials, physical, chemistry, etc.), as ML becomes more common, this concern has gained especial interest, since results obtained from ML analyses may lack scientific value if they cannot be properly interpreted, which can affect scientific consistency and strongly diminish the significance and confidence in the results, particularly when tackling scientific problems [@Roscher2020].

- Even though there are methods and libraries available for explaining different types of algorithms such as SHAP [@Lundberg2017], LIME [@Ribeiro2016], or GradCAM [@Selvaraju2017], they can be difficult to interpret or understand even for data scientists, leading to problems such as miss-interpretation, miss-use and over-trust [@Kaur2020]. Adding this to other human-related issues [@Krishna12022], researchers with expertise in natural sciences with little or no data science background may face further issues when using such methodologies [@Zhong2022]. Furthermore, these types of libraries normally aim for problems composed of image, text, or tabular data, which cannot be associated in a straightforward way with spectroscopic data. On the other hand, time series (TS) data shares similarities with spectroscopy, and while still having specific needs and differences, TS dedicated tools can be a better approach. Unfortunately, despite the existence of several libraries that aim to explain models for TS with the potential to be applied to spectroscopic data, they are mostly designed for a specialized audience, and many are model-specific [@Rojat2021]. Furthermore, spectral data typically manifests as an array of peaks that are typically overlapped and can be distinguished by their shape, intensity, and position. Minor shifts in these patterns can indicate significant alterations in the fundamental properties of the subject material. Conversely, pronounced variations in the other case might only indicate negligible differences. Therefore, comprehending such alterations and their implications is paramount. This is still true with ML spectroscopic analysis where the spectral variations are still of primary concern. In this context, a tool with an easy and understandable approach that offers spectroscopy-aimed functionalities that allow to aim for specific patterns, areas, and variations, and that is beginner and non-specialist friendly is of high interest.
- This can help the different stakeholders to better understand the ML models that they employ and considerably increase the transparency, comprehensibility, and scientific impact of ML results [@Bhatt2020] [@Belle2021].
+ Even though there are methods and libraries available for explaining different types of algorithms such as SHAP [@Lundberg2017], LIME [@Ribeiro2016], or GradCAM [@Selvaraju2017], they can be difficult to interpret or understand even for data scientists, leading to problems such as miss-interpretation, miss-use and over-trust [@Kaur2020]. Adding this to other human-related issues [@Krishna12022], researchers with expertise in natural sciences with little or no data science background may face further issues when using such methodologies [@Zhong2022]. Furthermore, these types of libraries normally aim for problems composed of image, text, or tabular data, which cannot be associated in a straightforward way with spectroscopic data. On the other hand, time series (TS) data shares similarities with spectroscopy, and while still having specific needs and differences, TS dedicated tools can be a better approach. Unfortunately, despite the existence of several libraries that aim to explain models for TS with the potential to be applied to spectroscopic data, they are mostly designed for a specialized audience, and many are model-specific [@Rojat2021]. Moreover, spectral data normally manifests as an array of peaks that are typically overlapped and can be distinguished by their shape, intensity, and position. Minor shifts in these patterns can indicate significant alterations in the fundamental properties of the subject material. Conversely, pronounced variations in the other case might only indicate negligible differences. Therefore, comprehending such alterations and their implications is paramount. This is still true with ML spectroscopic analysis where the spectral variations are still of primary concern. In this context, a tool with an easy and understandable approach that offers spectroscopy-aimed functionalities that allow to aim for specific patterns, areas, and variations, and that is beginner and non-specialist friendly is of high interest.
+ This can help the different stakeholders to better understand the ML models that they employ and considerably increase the transparency, comprehensibility, and scientific impact of ML results [@Bhatt2020] [@Belle2021].


# Overview

- **pudu** is a Python library that quantifies the effect of changes in spectral features over the predictions of ML models and their effect to the target instances. In other words, it perturbates the features in a predictable and deliberate way and evaluates the features based on how the final prediction changes. For this, four main methods are included and defined. *Importance* quantifies the relevance of the features according to the changes in the prediction. Thus, this is measured in probability or target value difference for classification or regression problems, respectively. *Speed* quantifies how fast a prediction changes according to perturbations in the features. For this, the Importance is calculated at different perturbation levels, and a line is fitted to the obtained values and the slope, or the rate of change of Importance, is extracted as the Speed. *Synergy* indicates how features complement each other in terms of prediction change after perturbations. Finally, *Re-activations* account for the number of unit activations in a Convolutional Neural Network (CNN) that after perturbation, the value goes above the original activation criteria. The latter is only applicable for CNNs, but the rest can be applied to any other ML problem, including CNNs. To read in more detail how these techniques work, please refer to the [definitions](https://pudu-py.github.io/pudu/definitions.html) in the documentation.
+ **pudu** is a Python library that quantifies the effect of changes in spectral features over the predictions of ML models and their effect to the target instances. In other words, it perturbates the features in a predictable and deliberate way and evaluates the features based on how the final prediction changes. For this, four main methods are included and defined. *Importance* quantifies the relevance of the features according to the changes in the prediction. Thus, this is measured in probability or target value difference for classification or regression problems, respectively. *Speed* quantifies how fast a prediction changes according to perturbations in the features. For this, the *importance* is calculated at different perturbation levels, and a line is fitted to the obtained values and the slope, or the rate of change of *importance*, is extracted as the *speed*. *Synergy* indicates how features complement each other in terms of prediction change after perturbations. Finally, *re-activations* account for the number of unit activations in a Convolutional Neural Network (CNN) that after perturbation, the value goes above the original activation criteria. The latter is only applicable for CNNs, but the rest can be applied to any other ML problem, including CNNs. To read in more detail how these techniques work, please refer to the [definitions](https://pudu-py.github.io/pudu/definitions.html) in the documentation.

**pudu** is versatile as it can analyze classification and regression algorithms for both 1- and 2-dimensional problems, offering plenty of flexibility with parameters, and the ability to provide localized explanations by selecting specific areas of interest. To illustrate this, \autoref{fig:figure1} shows two analysis instances using the same *importance* method but with different parameters. Additionally, its other functionalities are shown in examples using scikit-learn [@Pedregosa2011], keras [@chollet2018keras], and localreg [@Marholm2022] are found in the documentation, along with XAI methods including LIME and GradCAM.
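The *importance* and *speed* measures defined in the paragraph above can be sketched from scratch in a few lines. The following is an illustrative sketch of those definitions only, not pudu's actual API: all function names and parameters here are hypothetical, and a plain scikit-learn regressor stands in for the model being explained.

```python
# Sketch of the "importance" and "speed" ideas described above:
# perturb each feature deliberately and record how the prediction changes;
# "speed" is the slope of importance across increasing perturbation levels.
# Names (importance, speed, delta) are illustrative, not pudu's API.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))    # 50 toy "spectra" with 10 features each
y = X[:, 2] * 3.0 + X[:, 7]      # target driven mainly by features 2 and 7
model = LinearRegression().fit(X, y)

def importance(model, x, delta=0.1):
    """Change in the predicted target when each feature is perturbed by `delta`."""
    base = model.predict(x[None, :])[0]
    imps = np.empty(x.size)
    for i in range(x.size):
        xp = x.copy()
        xp[i] += delta                       # deliberate, predictable perturbation
        imps[i] = model.predict(xp[None, :])[0] - base
    return imps

def speed(model, x, deltas=(0.05, 0.1, 0.2, 0.4)):
    """Slope of importance vs. perturbation level, i.e. its rate of change."""
    levels = np.asarray(deltas)
    imps = np.stack([importance(model, x, d) for d in levels])  # (levels, features)
    return np.polyfit(levels, imps, 1)[0]    # least-squares slope per feature

imp = importance(model, X[0])
print(np.argmax(np.abs(imp)))                # feature 2 should dominate
```

For a classifier, the same loop would compare predicted class probabilities instead of target values, matching the classification/regression distinction drawn above.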

Binary file modified paper/paper.pdf

