In place modification of Dataset value

Hi,

I would like to modify the value of Dataset.

In a mono-index DataFrame, we can do `df["foo"] = 10.0`.

How can we do that in a GEMSEO Dataset?

Thanks for your help,

Sebastien

Hello Sébastien,

I would try `dataset[group_name, variable_name, component] = data` according to the pandas documentation (MultiIndex / advanced indexing — pandas 2.3.3 documentation). Unfortunately, I do not have my laptop for three days and cannot validate this proposal.
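For reference, a minimal plain-pandas sketch of that indexing pattern (illustrative column names, not a GEMSEO Dataset — behaviour on a Dataset may differ): assigning through the full (group, variable, component) tuple key does update a whole column, while selecting specific rows requires `.loc`.

```python
# Plain-pandas check of tuple-key assignment on 3-level MultiIndex columns.
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("parameters", "x", 0), ("parameters", "x", 1), ("outputs", "y", 0)],
    names=["GROUP", "VARIABLE", "COMPONENT"],
)
df = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], columns=columns)

# Set every row of component 0 of variable "x" in group "parameters".
df["parameters", "x", 0] = 10.0
print(df["parameters", "x", 0].tolist())  # [10.0, 10.0]
```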

I think your solution only allows getting the values, not setting them. What worked for me is the DataFrame `.loc` accessor:

from gemseo import create_benchmark_dataset

iris = create_benchmark_dataset("IrisDataset")
iris.loc[0, ("parameters", ["sepal_length"])] = 300.0
iris

Which gives:

GROUP       parameters                                      labels
VARIABLE  sepal_length sepal_width petal_length petal_width  specy
COMPONENT            0           0            0           0      0
0                300.0         3.5          1.4         0.2      0
1                  4.9         3.0          1.4         0.2      0
2                  4.7         3.2          1.3         0.2      0
3                  4.6         3.1          1.5         0.2      0
4                  5.0         3.6          1.4         0.2      0
..                 ...         ...          ...         ...    ...
145                6.7         3.0          5.2         2.3      2
146                6.3         2.5          5.0         1.9      2
147                6.5         3.0          5.2         2.0      2
148                6.2         3.4          5.4         2.3      2
149                5.9         3.0          5.1         1.8      2

In the tuple, is the list mandatory? Is it possible to modify a multidimensional variable as `dataset.loc[index, (group_name, variable_name)] = numpy_array`?

The following works too, but it raises a performance warning:

from numpy import array

iris.loc[0, ("parameters", "sepal_length")] = array([700.0])

Output:

<input>:1: PerformanceWarning: indexing past lexsort depth may impact performance.
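That warning comes from indexing a MultiIndex whose columns are not lexsorted. A plain-pandas sketch (illustrative names, not a GEMSEO Dataset): sorting the columns once with `sort_index(axis=1)` silences it for subsequent `.loc` assignments with a partial tuple key.

```python
# Sorting MultiIndex columns avoids the "indexing past lexsort depth" warning.
import warnings

import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("b", "y", 0), ("a", "x", 0)], names=["GROUP", "VARIABLE", "COMPONENT"]
)
df = pd.DataFrame([[1.0, 2.0]], columns=columns).sort_index(axis=1)

with warnings.catch_warnings():
    # Turn a PerformanceWarning into an error: none should be raised here.
    warnings.simplefilter("error", pd.errors.PerformanceWarning)
    df.loc[0, ("a", "x")] = 5.0

print(float(df.loc[0, ("a", "x", 0)]))  # 5.0
```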

Thanks for this solution.

Multi-index columns are not so trivial to manipulate, even if you are not a Pandas beginner.

Note that Pandas will use copy-on-write for in-place modification as of Pandas 3.0 (so `.loc` will be mandatory).


Hi @SebastienBocquet ,

You’re right to mention that the use of `.loc` or `.iloc` will be mandatory. This is why, in GEMSEO, we already use that in our code.

You’re also right to point out the difficulty of managing multi-index columns. I also had some issues at the beginning, just like you. We wanted to create something very clear to use with basic methods. However, we might have failed to do so: you’re surely not the only one struggling with the Datasets. If you have any idea for a refactoring that would help people in the future, do not hesitate!

I completely agree with you. I develop an extension of GEMSEO and the multi-index concepts are very difficult for the users. As a developer, it also really slows me down because I struggle to find the appropriate syntax for manipulating the Datasets. While I do not have a definitive solution to this, a pragmatic approach I use successfully is a pair of functions that convert a Dataset to a mono-index DataFrame, using a naming convention for the variables (typically `foo[inputs][0]`), and back to a Dataset.
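A sketch of such a pair of conversion helpers, written here for a plain pandas DataFrame with (GROUP, VARIABLE, COMPONENT) columns; the function names and the `variable[group][component]` convention are illustrative, not GEMSEO API.

```python
# Round-trip between multi-index columns and flat "var[group][comp]" names.
import re

import pandas as pd

def to_monoindex(df: pd.DataFrame) -> pd.DataFrame:
    """Flatten multi-index columns to 'variable[group][component]' names."""
    flat = df.copy()
    flat.columns = [f"{var}[{grp}][{comp}]" for grp, var, comp in df.columns]
    return flat

_NAME = re.compile(r"^(?P<var>.+)\[(?P<grp>[^\]]+)\]\[(?P<comp>\d+)\]$")

def from_monoindex(df: pd.DataFrame) -> pd.DataFrame:
    """Rebuild the (GROUP, VARIABLE, COMPONENT) multi-index from the names."""
    tuples = []
    for name in df.columns:
        m = _NAME.match(name)
        tuples.append((m["grp"], m["var"], int(m["comp"])))
    out = df.copy()
    out.columns = pd.MultiIndex.from_tuples(
        tuples, names=["GROUP", "VARIABLE", "COMPONENT"]
    )
    return out

cols = pd.MultiIndex.from_tuples(
    [("inputs", "foo", 0), ("inputs", "foo", 1)],
    names=["GROUP", "VARIABLE", "COMPONENT"],
)
df = pd.DataFrame([[1.0, 2.0]], columns=cols)
flat = to_monoindex(df)
print(list(flat.columns))  # ['foo[inputs][0]', 'foo[inputs][1]']
```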

Another remark I can make is that the burden of a multi-index is only necessary when names are duplicated across groups, or when the samples contain vectors.

I think that the case of vectors is very common and must be addressed in some way.

Duplicate names across groups occur when sampling IDF or a self-coupled discipline. Another example: in the case of machine learning, I would like to store three groups, namely inputs, outputs and predictions; this will involve duplicate names.

Which actions are complicated when handling a Dataset? We could add methods with a signature similar to Dataset.get_view, e.g. Dataset.transform_data and Dataset.update_data. The latter should meet your initial expectations.

I am convinced of the usefulness of the Dataset for internal use in GEMSEO.

My main point concerns the interaction of the user with a Dataset, either for developing scripts or libraries depending on GEMSEO.

I try to summarize some use cases:

  • Use of a Dataset in the user's favorite plotting library: most plotting libraries and some ML frameworks only deal with mono-index dataframes
  • Concatenation of Datasets (for example from several DOE)
  • Loading CSV data (from experiments, for example) into a `Dataset`: some methods exist in the Dataset class, but I pointed out some weaknesses in issues on the GEMSEO GitLab

Some examples where I struggled with `Dataset`:

  • get_view() does not return a real dataframe: not all methods that can be applied to a dataframe work on it
  • Filtering by value
  • Modifying values in place

This list is not exhaustive.

Some transformations I found useful in the library I developed, to use the Dataset data:

  • convert to a mono-index dataframe with column-name suffixing
  • convert to a dataclass containing these attributes: scalars (with possible splitting between numerical values and strings), arrays (possibly sorted by dimension: 1D, 2D, etc.) and curves

Thanks for your feedback. Some comments below.

Use of a Dataset in the user's favorite plotting library: most plotting libraries and some ML frameworks only deal with mono-index dataframes

We could add a to_monoindex() method with column names of the form variable_name[group_name][component], variable_name[component] or variable_name[group_name], depending on the ambiguity level.
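A minimal sketch of that naming rule, keeping only the disambiguating parts of the name; the function name is hypothetical, not an existing GEMSEO method.

```python
# Build a flat column name depending on the ambiguity level.
def column_name(variable, group, component, *, ambiguous_group, multi_component):
    """Append the group and/or component only when needed to disambiguate."""
    name = variable
    if ambiguous_group:
        name += f"[{group}]"
    if multi_component:
        name += f"[{component}]"
    return name

print(column_name("x", "inputs", 0, ambiguous_group=True, multi_component=True))
# x[inputs][0]
print(column_name("x", "inputs", 0, ambiguous_group=False, multi_component=False))
# x
```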

Concatenation of Datasets (for example from several DOE)

Concatenating columns or rows? Could you add some details about the problem, please?

Loading CSV data (from experiments, for example) into a `Dataset`: some methods exist in the Dataset class, but I pointed out some weaknesses in issues on the GEMSEO GitLab

If these weaknesses are still there, then let us fix these issues ASAP, because using CSV files is a standard.

Modifying values in place

We could add an example about the update_data() method.

Filtering by value

Could you elaborate, please?

get_view() does not return a real dataframe: not all methods that can be applied to a dataframe work on it

Is this limitation multiindex-specific?

Regarding concatenation, it finally works as smoothly as for mono-index dataframes.

Regarding robustness of csv reading, we could look at Reading a dataframe with header infering (#1628) · Issues · gemseo / dev / gemseo · GitLab

Regarding filtering by value, for mono-index dataframes you can use .loc or .iloc. For example, `constant_df = df.loc[:, (df == df.iloc[0]).all()]` returns a dataframe containing only the columns with constant values. For such a treatment, I prefer to first convert a GEMSEO Dataset to a mono-index dataframe, because working directly on multi-index columns is complicated (at least for me). Another alternative I use is to rebuild a new Dataset containing only part of the original Dataset, but I find this process verbose and complicated.
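The constant-column filter quoted above, shown on a small mono-index frame: comparing the whole frame to its first row broadcasts row-wise, and `.all()` keeps only the columns where every row matches.

```python
# Keep only the columns whose values are constant across all rows.
import pandas as pd

df = pd.DataFrame({"a": [1.0, 1.0, 1.0], "b": [1.0, 2.0, 3.0]})
constant_df = df.loc[:, (df == df.iloc[0]).all()]
print(list(constant_df.columns))  # ['a']
```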

In general, I find from_array not trivial to use and verbose, since you need to reason row-wise and compute the total size of all the variables (which can be a mix of scalars and arrays). Compare with what can be done for a mono-index dataframe, which is intuitive: `df = DataFrame([mapping1, mapping2])`, where mapping1 and mapping2 are dictionaries of scalars. I wonder if renaming the variables of a mono-index dataframe with a naming convention like x1[group_name][0], to handle groups and components, would not be simpler than passing a variable_names_to_nb_components and a variable_names_to_group_names.
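The mono-index construction mentioned above, for comparison: a DataFrame built directly from dictionaries of scalars, with no row-size bookkeeping.

```python
# Intuitive mono-index construction: one dictionary of scalars per sample.
import pandas as pd

mapping1 = {"x": 1.0, "y": 2.0}
mapping2 = {"x": 3.0, "y": 4.0}
df = pd.DataFrame([mapping1, mapping2])
print(df.shape)  # (2, 2)
print(df["x"].tolist())  # [1.0, 3.0]
```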

I may be wrong, but to me, get_view is an alternative to .loc. However, get_view does not return a dataframe.

My 2 cents now. A Dataset can contain strings, scalars, 1D arrays and 2D-3D arrays. Since these are generally physical data in the MDO world, the 1D arrays are usually interpreted as vectors or curves (a curve being a pair (array1, array2) of arrays of the same length; in fact we introduced a specific Curve class carrying the variable names and plotting capabilities), and the 2D or 3D arrays are interpreted as fields (similarly, a Field class with services). It is also practical to manipulate the scalars as a dictionary.

Based on this remark, I find it useful to transform the Dataset into an object where these concepts are separated: strings, scalars, curves, fields. Indeed, outside scenarios, there is no need to keep all the data in a formalism as rigid as a dataframe, because these different types of data are processed differently (curves are plotted or manipulated separately and differently from scalars or fields). For the end user, I find this format more user-friendly.
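A sketch of such a "separated concepts" container; the class, field names and the dispatch rules are illustrative, not an existing GEMSEO or extension API (here, curves are assumed to arrive as (abscissa, ordinate) tuples and anything else array-like goes to fields).

```python
# Split a flat sample dictionary into strings, scalars, curves and fields.
from dataclasses import dataclass, field

import numpy as np

@dataclass
class SplitSample:
    strings: dict = field(default_factory=dict)
    scalars: dict = field(default_factory=dict)
    curves: dict = field(default_factory=dict)  # name -> (abscissa, ordinate)
    fields: dict = field(default_factory=dict)  # name -> 2D/3D array

def split(sample: dict) -> SplitSample:
    """Dispatch each value to the matching bucket by type."""
    out = SplitSample()
    for name, value in sample.items():
        if isinstance(value, str):
            out.strings[name] = value
        elif np.isscalar(value):
            out.scalars[name] = float(value)
        elif isinstance(value, tuple):
            out.curves[name] = value
        else:
            out.fields[name] = np.asarray(value)
    return out

s = split({"name": "case1", "mass": 1.2, "profile": (np.arange(3), np.ones(3))})
print(sorted(s.scalars))  # ['mass']
```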