All the datasets below are currently published on ChemFlow and on the FUN platform as part of our chemometrics MOOCs; some are also available on the INRAE chemometrics dataverse.

The datasets below are grouped according to the MOOC grains in which they are used. As a bonus, one dataset that is not used in any grain of the MOOC is also provided.

Wheat flour

The data were obtained from INRA and made available by D. Bertrand. They consist of 140 spectra of wheat flour, measured between 400 and 2496 nm in steps of 4 nm. The data matrix therefore has dimensions (140 x 525). The class of the wheat from which each flour was milled is given by the second character of the sample name: T for soft wheat, D for hard wheat. The dataset is: x_140farines.tab (0.5MB), y_140farines.tab (1.8kB)
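As a quick check of the description above, the wavelength axis and the wheat class can be derived as follows. This is only a minimal sketch: the sample names are invented for illustration and the helper `wheat_class` is hypothetical, not part of ChemFlow.

```python
import numpy as np

# Wavelength axis: 400 to 2496 nm inclusive, in steps of 4 nm.
wavelengths = np.arange(400, 2496 + 4, 4)
print(wavelengths.size)  # 525, the number of columns of the 140 x 525 matrix


def wheat_class(sample_name: str) -> str:
    """Return 'soft' or 'hard' from the second character of a sample name."""
    code = sample_name[1].upper()
    if code == "T":
        return "soft"
    if code == "D":
        return "hard"
    raise ValueError(f"unexpected class code {code!r} in {sample_name!r}")


# Hypothetical sample names, for illustration only.
names = ["BT001", "BD002"]
classes = [wheat_class(n) for n in names]
```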

Athens 2004 Olympic Games decathlon results

The first column gives the names of the athletes and the first row the names of the events: decathlon_athenes_2004.csv (2.1kB)

Olive oils

The data were produced by the University of Aix-Marseille, team of N. Dupuy. Near-infrared and chemical analyses were carried out on 187 olive oils of known geographical origin. The data consist of:

  • a set of 187 spectra with 2853 wavelengths, from 1000 to 2222 nm: http://web.supagro.inra.fr/partage/bouletjc/pir0.zip ;
  • a set of 187 spectra with 612 wavelengths, from 1000 to 2222 nm in steps of 2 nm, extracted from the previous set: pir.csv (1.0MB) ;
  • a set of analyses of 14 fatty acids and squalene on the 187 samples: ags.csv (15.2kB) ;
  • a set of analyses of 19 triglycerides on the 187 samples: tri.csv (20.2kB) ;
  • a disjunctive coding of the 187 samples according to the 6 geographical origins: AP=Aix en Provence, HP=Haute Provence, NI=Nice, NM=Nimes, NY=Nyons, VB=Vallée des Baux de Provence: ori.csv (3.5kB) .
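A disjunctive coding such as the one in ori.csv has one 0/1 column per class. The sketch below shows one way to turn such a table back into class labels. The small table is made up, and the assumption that the columns follow the order AP, HP, NI, NM, NY, VB should be checked against the actual file.

```python
import numpy as np

# Assumed column order (to be checked against the header of ori.csv).
origins = ["AP", "HP", "NI", "NM", "NY", "VB"]

# Hypothetical disjunctive table: one row per sample, one 0/1 column per origin.
disjunctive = np.array([
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
])

# In a valid disjunctive coding, each row flags exactly one class.
assert (disjunctive.sum(axis=1) == 1).all()

# The position of the 1 in each row gives the class.
labels = [origins[j] for j in disjunctive.argmax(axis=1)]
print(labels)  # ['AP', 'NM', 'VB']
```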

In the ags.csv file, the main fatty acids are linolenic acid (C18:3 ω3), linoleic acid (C18:2 ω6), oleic acid (C18:1 ω9) and palmitic acid (C16:0), found in columns 10, 9, 7 and 1 respectively.

Corn flour

The data were acquired by Cargill, and their distribution was kindly authorized by Mike Blackburn. The full set is available on the Eigenvector Research Incorporated website (August 2016). We have selected the data used in the MOOC and present them in the format of our ChemFlow software. Spectra were acquired on 80 samples of maize, between 1100 and 2498 nm in steps of 2 nm, on three different spectrometers denoted m5, mp5 and mp6. The 80 spectra are numbered from 1 to 80, and the same order is used in all four files. The data include:

  • the 80 spectra acquired on the m5 spectrometer: x_m5.csv (0.5MB) ;
  • the 80 spectra acquired on the mp5 spectrometer: x_mp5.csv (0.5MB) ;
  • the 80 spectra acquired on the mp6 spectrometer: x_mp6.csv (0.5MB) ;
  • the moisture, fat, protein and starch values measured for the 80 samples: y.csv (2.3kB) .

Grain 10 uses the data from the m5 spectrometer. In grain 16 (data for the PDF or video course), the 9 spectra used to build the calibration transfer models are those in rows 1, 5, 7, 10, 12, 13, 28, 33 and 36.
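The row numbers above are 1-based, so they need shifting before use in a 0-based language. The sketch below extracts the 9 transfer samples and the remaining 71 from two instruments. The random arrays are placeholders standing in for the real spectra, and the choice of m5 and mp5 is an assumption made for illustration, since the pair of instruments is not specified here.

```python
import numpy as np

# Wavelength axis: 1100 to 2498 nm inclusive, in steps of 2 nm.
wavelengths = np.arange(1100, 2498 + 2, 2)
n_samples = 80

# Placeholder spectra with the documented shape (80 x 700); replace with the
# contents of x_m5.csv and x_mp5.csv once loaded.
rng = np.random.default_rng(0)
x_m5 = rng.normal(size=(n_samples, wavelengths.size))
x_mp5 = rng.normal(size=(n_samples, wavelengths.size))

# 1-based row numbers of the transfer samples, as listed in the text.
transfer_rows = np.array([1, 5, 7, 10, 12, 13, 28, 33, 36])
transfer_idx = transfer_rows - 1  # convert to 0-based indices

# The same rows are taken on both instruments, since the order is shared.
x_m5_transfer = x_m5[transfer_idx]
x_mp5_transfer = x_mp5[transfer_idx]

# Boolean mask selecting every sample that is not a transfer sample.
rest_mask = np.ones(n_samples, dtype=bool)
rest_mask[transfer_idx] = False
x_m5_rest = x_m5[rest_mask]
```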

Scab on apple leaves

Forty-two near infrared spectra were acquired by IRSTEA-Montpellier on apple tree leaves: 21 spectra on healthy leaves and 21 spectra on leaves affected by scab. The goal is to separate healthy leaves from diseased leaves. The data is:


The objective is to predict the density of a terephthalate polymer using spectroscopy. The data, produced by Erik Swierenga, are available in the pls package for R (GPL license). The original file was divided into:

  • a calibration set of 21 spectra, 268 spectral variables: x.csv (42.3kB) ;
  • the density values for the 21 samples of the calibration set: y.csv (0.2kB) ;
  • a test set of 7 spectra, 268 spectral variables: xtest.csv (15.5kB) ;
  • the density values for the 7 samples of the test set: ytest.csv (0.1kB) .
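The calibration/test split is meant to be used by fitting a model on the 21 calibration samples and measuring the prediction error on the 7 test samples. The sketch below illustrates that workflow with principal component regression, chosen here only because it fits in a few lines of numpy; it is a stand-in, not the method used in the MOOC. The random arrays are placeholders with the documented shapes, and the number of components is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholders with the documented shapes; replace with x.csv, y.csv,
# xtest.csv and ytest.csv once loaded.
x_cal = rng.normal(size=(21, 268))
y_cal = rng.normal(size=21)
x_test = rng.normal(size=(7, 268))
y_test = rng.normal(size=7)


def pcr_fit(x, y, n_components):
    """Fit a principal component regression; return (x_mean, y_mean, coef)."""
    x_mean = x.mean(axis=0)
    y_mean = y.mean()
    u, s, vt = np.linalg.svd(x - x_mean, full_matrices=False)
    u, s, vt = u[:, :n_components], s[:n_components], vt[:n_components]
    # Regression coefficients expressed on the original spectral variables.
    coef = vt.T @ ((u.T @ (y - y_mean)) / s)
    return x_mean, y_mean, coef


def pcr_predict(model, x):
    x_mean, y_mean, coef = model
    return (x - x_mean) @ coef + y_mean


def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


model = pcr_fit(x_cal, y_cal, n_components=3)
error = rmsep(y_test, pcr_predict(model, x_test))
```

With real data, the number of components would normally be chosen by cross-validation on the calibration set only, keeping the test set for the final error estimate.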

Olive oils (2)

The olive oils (2) set was extracted from the olive oils set presented above for grain 4. It is composed of 106 spectra of oils measured between 1000 and 1240 nm. The data were used to illustrate the PDF document attached to grain 9, not the video for grain 9, which is based on another dataset. It includes:


The objective is to classify mayonnaise samples according to the origin of the oil used to make them. The data are available in the pls package for R (GPL license). They include:

  • the spectra of 162 mayonnaise samples acquired between 1100 and 2500 nm (step = 4 nm), i.e. 351 spectral variables: xmayo.csv (0.5MB) ;
  • the origin of the oil used to make the mayonnaise, as a conjunctive file (one integer class code per sample): classes.csv (1.2kB) .
    • 1 = soybean ; 2 = sunflower ; 3 = canola ; 4 = olive ; 5 = corn ; 6 = grapeseed .

The 21 wavelengths selected in grain 12 are variables # 2, 13, 37, 41, 49, 60, 80, 96, 107, 115, 128, 136, 140, 194, 211, 217, 225, 232, 237, 278 and 303.
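The variable numbers above can be mapped back to wavelengths and used to subset the spectra. The sketch below assumes that variables are numbered from 1 and that variable 1 corresponds to 1100 nm; both assumptions should be checked against the header of xmayo.csv. The random matrix is a placeholder for the real spectra.

```python
import numpy as np

# Wavelength axis: 1100 to 2500 nm inclusive, in steps of 4 nm.
wavelengths = np.arange(1100, 2500 + 4, 4)

# Variable numbers listed for grain 12 (assumed 1-based).
selected_vars = np.array([2, 13, 37, 41, 49, 60, 80, 96, 107, 115, 128,
                          136, 140, 194, 211, 217, 225, 232, 237, 278, 303])
selected_idx = selected_vars - 1  # convert to 0-based indices
selected_wavelengths = wavelengths[selected_idx]

# Class codes, as given in the list above.
oil_names = {1: "soybean", 2: "sunflower", 3: "canola",
             4: "olive", 5: "corn", 6: "grapeseed"}

# Placeholder spectra with the documented shape (162 x 351).
rng = np.random.default_rng(0)
xmayo = rng.normal(size=(162, wavelengths.size))
x_selected = xmayo[:, selected_idx]
```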


The acquisitions were carried out by INRA, UMR408, and the results are made available by Sylvie Bureau. Near-infrared (800-2770 nm) and mid-infrared (4000-650 cm-1) spectra were acquired on the same 750 apricots, for which 9 reference analyses were carried out: refractive index in degrees Brix; total acidity in meq per 100 g of fresh matter (FM); glucose, fructose, sucrose and sum of sugars in g per 100 g of FM; malic acid, citric acid and sum of acids in meq per 100 g of FM. Eight varieties were followed at three stages of maturity: very green, ripe and over-ripe. The label of each observation is built from the following elements:

  • (1) the variety: Ravilong (A03759), Ravicille (A03844), Blanc (A04034), Badami (A01267), Bergeron (A00660), Goldrich (A02210), Iranien (A02862) and Moniqui (A00500);
  • (2) the year of measurement (05 -> 2005);
  • (3) the measurement date (e.g. 2206 = June 22);
  • (4) the location of the orchard;
  • (5) the stage of maturity: vv = very green, ma = ripe, sm = over-ripe;
  • (6) the color of the apricot as perceived by the observer: R = red, O = orange, B = white;
  • (7) the fruit number (e.g. F001).

The samples were taken from the green stage to the over-ripe stage. Only one spectral variable out of three has been kept in the NIR and MIR spectra, in order to simplify the calculations. The data have been ordered so that the same apricot is on the same row in the following four files, which contain:

Simulated data

These data were produced at AgroParisTech by Douglas Rutledge. Six different signals of 800 variables each were mixed together with different coefficients, and Gaussian noise was added. A total of 100 simulated spectra were obtained. The data form a (100 x 800) matrix: donnees_simulees.csv (0.6MB) .
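The construction described above is a linear mixture model, X = C S + E, with C the mixing coefficients, S the pure signals and E the noise. The sketch below reproduces the principle only: the shapes of the six signals, the distribution of the coefficients and the noise level are all assumptions, since they are not given here, so the result will not match donnees_simulees.csv.

```python
import numpy as np

rng = np.random.default_rng(42)
n_spectra, n_signals, n_vars = 100, 6, 800

# Six hypothetical pure signals: Gaussian peaks at different positions.
axis = np.arange(n_vars)
centers = np.linspace(100, 700, n_signals)
signals = np.exp(-0.5 * ((axis[None, :] - centers[:, None]) / 30.0) ** 2)

# One row of mixing coefficients per simulated spectrum (assumed uniform).
coefficients = rng.uniform(0.0, 1.0, size=(n_spectra, n_signals))

# Additive Gaussian noise (assumed standard deviation).
noise = rng.normal(scale=0.01, size=(n_spectra, n_vars))

# Each spectrum is a weighted sum of the six signals plus noise.
spectra = coefficients @ signals + noise
print(spectra.shape)  # (100, 800)
```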

Butters and margarines

The data were obtained at INRA Nantes by Benoit Jaillais. Mid-infrared (MIR), near-infrared (NIR) and visible (VIS) spectra were acquired on 21 samples of butters or margarines. The data are:

  • the VIS spectra of the 21 samples, 400-798 nm (step = 2 nm), i.e. 200 spectral variables: VIS21.csv (37.9kB) ;
  • the NIR spectra of the 21 samples, 800-2498 nm (step = 2 nm), i.e. 850 spectral variables: NIR21.csv (0.2MB) ;
  • the MIR spectra of the 21 samples, 3616-916 cm-1, i.e. 1401 spectral variables: MIR21.csv (0.3MB) .


The data were provided by Gérard Mazerolles. Sixty cheeses were produced at INRA Poligny at the rate of 4 cheeses per day for 15 days. These cheeses are identified as follows:

  • one letter: D (analyses before salting) or A (analyses after salting and 30 days of ripening);
  • two digits, from 01 to 15, for the 15 days of cheese production, which fall into 5 groups:
    • days 01 to 03: cooked pressed cheeses;
    • days 04 to 06: semi-cooked pressed cheeses;
    • days 07 to 09: pressed cheeses;
    • days 10 to 12: soft cheeses;
    • days 13 to 15: stabilized soft cheeses.
  • one digit, from 1 to 4, for the cheese number (4 cheeses produced per day); the 4 cheeses made on the same day were made with milks of different chemical compositions;
  • one letter, a, b or c, for one of the three repetitions of the spectral, chemical or rheological analyses.

In total, there are 60 cheeses x 2 dates (D and A) x 3 repetitions (a, b and c), i.e. 360 observations. The data are as follows:
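The identification scheme can be decoded programmatically. The sketch below assumes that the four fields are simply concatenated in the order given, with no separator (for example `D071b`); this layout is an assumption to be checked against the row names of the files, and `parse_cheese_id` is a hypothetical helper.

```python
import re
from itertools import product

# Assumed layout: stage letter, 2-digit day, cheese number, repetition letter.
ID_PATTERN = re.compile(r"^([DA])(\d{2})([1-4])([abc])$")

# Cheese groups, three consecutive production days each.
GROUPS = ["cooked pressed", "semi-cooked pressed", "pressed",
          "soft", "stabilized soft"]


def parse_cheese_id(identifier: str) -> dict:
    """Split an identifier such as 'D071b' into its four fields."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    stage, day, cheese, repetition = match.groups()
    day = int(day)
    if not 1 <= day <= 15:
        raise ValueError(f"day out of range in {identifier!r}")
    return {
        "stage": "before salting" if stage == "D" else "after salting and ripening",
        "day": day,
        "group": GROUPS[(day - 1) // 3],
        "cheese": int(cheese),
        "repetition": repetition,
    }


# Enumerating every combination recovers the number of observations.
all_ids = [f"{s}{d:02d}{c}{r}"
           for s, d, c, r in product("DA", range(1, 16), range(1, 5), "abc")]
print(len(all_ids))  # 360
```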

  • the part of the mid-infrared spectrum centered on fats: MIR_FAT.csv (0.3MB) ; 360 observations, 104 variables from 2998 to 2800 cm-1;
  • the part of the mid-infrared spectrum centered on proteins: MIR_PROT.csv (0.4MB) ; 360 observations, 112 variables from 1700 to 1486 cm-1;
  • the part of the mid-infrared spectrum containing information on both fats and proteins: MIR_MIX.csv (1.1MB) ; 360 observations, 304 variables from 1485 to 900 cm-1;
  • the fluorescence spectrum of vitamin A, excitation 270-350 nm, emission 410 nm: FLUO_VITA.csv (0.2MB) ; 360 observations, 81 variables;
  • the fluorescence spectrum of tryptophan, excitation 290 nm, emission 305 to 400 nm: FLUO_TRYPT.csv (0.6MB) ; 360 observations, 191 variables;
  • the chemistry of the cheeses: pH, moisture, fat, protein, calcium: CHEMISTRY.csv (12.1kB) ; 360 observations, 5 variables;
  • the rheology of the cheeses: deformability, strain at fracture, stress at fracture, energy at fracture: RHEO.csv (11.3kB) ; 360 observations, 4 variables.

Note: the analyses that were not carried out in triplicate were duplicated so that all the tables have the same number of rows and a given row number corresponds to the same sample in every table.
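The effect of this duplication can be pictured as repeating each single measurement three times, once per repetition a, b and c. The sketch below uses made-up values and assumes that the three repetitions of a given cheese occupy consecutive rows, which should be verified on the actual files.

```python
import numpy as np

# Hypothetical analysis measured once per cheese and per date: 60 x 2 = 120 rows.
single = np.arange(120, dtype=float).reshape(120, 1)

# Repeat each row three times so the table lines up with the 360-row spectra.
replicated = np.repeat(single, 3, axis=0)
print(replicated.shape)  # (360, 1)

# Averaging the three repetitions recovers the original values, so the
# duplicated rows carry no additional information.
recovered = replicated.reshape(120, 3, 1).mean(axis=1)
```

Because such duplicated rows are not independent measurements, the three rows of a given cheese are best kept together when splitting the data for validation.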

Grape berries

The data were acquired at the UE Pech-Rouge in Gruissan by Jean-Michel Roger (UMR ITAP) and Jean-Claude Boulet (UMR SPO), who made them available. UV-visible-near-infrared spectra were acquired in transmittance on 250 grape berries between 310 and 1150 nm, with a step of approximately 3.3 nm, giving 256 wavelengths. The degree Brix was measured on each of these 250 berries. The data include:

Note: the corn data also presented in grain 10 has been previously described for grain 5.

Our other publishable datasets of interest can also be found on the dataverse https://data.inrae.fr or in DataInBrief.

Last modified: 18 July 2023 | Created: 28 April 2020 | Written by: ChemHouse