expand_to_match_ds(): A closer look.

In this section we take a closer look at the internals of expand_to_match_ds(). This method transforms the events DataFrame into a DataArray by applying a series of operations to it.

Recall from its signature that the arguments it takes are:

  • dimension_matching_col

  • fill_method

  • fill_value_col

The transformation is essentially carried out by the following code snippet:

return xr.DataArray(
    self.df
    .sort_values(dimension_matching_col)
    .reset_index()
    .rename(columns={'index': fill_value_col}, errors='ignore')
    .set_index(dimension_matching_col, drop=False)
    [fill_value_col]
    .reindex(
        self._ds[self._get_ds_from_df(dimension_matching_col)],
        method=fill_method
    )
)

Continuing with the tutorial, let’s see how the original DataFrame is progressively transformed.

  1. This is the original DataFrame.

import numpy as np
import pandas as pd
import xarray as xr
import xarray_events

ds = xr.Dataset(
    data_vars={
        'ball_trajectory': (
            ['frame', 'cartesian_coords'],
            np.exp(np.linspace((-6, -8), (3, 2), 2450))
        )
    },
    coords={
        'frame': np.arange(1, 2451),
        'cartesian_coords': ['x', 'y'],
        'player_id': [2, 3, 7, 19, 20, 21, 22, 28, 34, 79]
    },
    attrs={'match_id': 12, 'resolution_fps': 25}
)

events = pd.DataFrame(
    {
        'event_type':
            ['pass', 'goal', 'pass', 'pass', 'pass',
             'penalty', 'goal', 'pass', 'pass', 'penalty'],
        'start_frame': [1, 425, 600, 945, 1100, 1280, 1890, 2020, 2300, 2390],
        'end_frame': [424, 599, 944, 1099, 1279, 1889, 2019, 2299, 2389, 2450],
        'player_id': [79, 79, 19, 2, 3, 2, 3, 79, 2, 79]
    }
)
(
    ds
    .events.load(events, {'frame': ('start_frame', 'end_frame')})
    .events.expand_to_match_ds('start_frame')
)
ds.events.df
   event_type  start_frame  end_frame  player_id
0        pass            1        424         79
1        goal          425        599         79
2        pass          600        944         19
3        pass          945       1099          2
4        pass         1100       1279          3
5     penalty         1280       1889          2
6        goal         1890       2019          3
7        pass         2020       2299         79
8        pass         2300       2389          2
9     penalty         2390       2450         79

  2. The DataFrame gets sorted on the column dimension_matching_col, which is start_frame in this case.

    .sort_values(dimension_matching_col)
    

It is already sorted, so nothing changes.
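The sort is a no-op here, but it matters once a fill method is used: pandas can only forward-fill or back-fill during a reindex when the index is monotonic. The following is a minimal pandas-only sketch of that behaviour. The three-row DataFrame and the names unsorted, ordered and filled are made up for illustration; they are not part of xarray-events.

```python
import pandas as pd

# Hypothetical events that are NOT ordered by start_frame.
events = pd.DataFrame(
    {
        'event_type': ['goal', 'pass', 'penalty'],
        'start_frame': [5, 1, 8],
    }
)
frames = pd.RangeIndex(1, 11, name='frame')

# Without sorting, the index [5, 1, 8] is not monotonic, so pandas
# refuses to reindex with a fill method.
unsorted = events.set_index('start_frame')['event_type']
try:
    unsorted.reindex(frames, method='ffill')
    raised = False
except ValueError:
    raised = True

# After sorting, the same reindex works and each frame receives the
# most recent event that started at or before it.
ordered = events.sort_values('start_frame').set_index('start_frame')['event_type']
filled = ordered.reindex(frames, method='ffill')
```

With the default of no fill method the unsorted index would not raise, so sorting first is presumably a safeguard for the filled case.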

  3. The index of the DataFrame gets reset.

    .reset_index()
    
(
    ds.events.df
    .sort_values('start_frame')
    .reset_index()
)
   index  event_type  start_frame  end_frame  player_id
0      0        pass            1        424         79
1      1        goal          425        599         79
2      2        pass          600        944         19
3      3        pass          945       1099          2
4      4        pass         1100       1279          3
5      5     penalty         1280       1889          2
6      6        goal         1890       2019          3
7      7        pass         2020       2299         79
8      8        pass         2300       2389          2
9      9     penalty         2390       2450         79

Now index is a column of its own.
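In this tutorial the new index column looks identical to the new row labels, which hides what the step achieves. A small sketch with a deliberately unsorted, hypothetical DataFrame shows that the index column keeps the labels the rows had before sorting, so each value still points back to its row in the original DataFrame.

```python
import pandas as pd

# Hypothetical events whose rows are not ordered by start_frame.
events = pd.DataFrame(
    {
        'event_type': ['goal', 'pass', 'penalty'],
        'start_frame': [425, 1, 1280],
    }
)

flat = events.sort_values('start_frame').reset_index()

# 'index' holds the pre-sort row labels; the fresh row labels are 0, 1, 2.
original_labels = flat['index'].tolist()

# Looking those labels up in the original DataFrame recovers the same events.
looked_up = events.loc[original_labels, 'event_type'].tolist()
```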

  4. The column index gets renamed to fill_value_col, which is event_index in this case:

    .rename(columns={'index': fill_value_col}, errors='ignore')
    
(
    ds.events.df
    .sort_values('start_frame')
    .reset_index()
    .rename(columns={'index': 'event_index'}, errors='ignore')
)
   event_index  event_type  start_frame  end_frame  player_id
0            0        pass            1        424         79
1            1        goal          425        599         79
2            2        pass          600        944         19
3            3        pass          945       1099          2
4            4        pass         1100       1279          3
5            5     penalty         1280       1889          2
6            6        goal         1890       2019          3
7            7        pass         2020       2299         79
8            8        pass         2300       2389          2
9            9     penalty         2390       2450         79
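The errors='ignore' argument deserves a note. reset_index() only produces a column called index when the DataFrame index has no name; if the index is named, the new column takes that name and the rename silently does nothing. The sketch below illustrates this with plain pandas. The index name event_id is a hypothetical choice for the example.

```python
import pandas as pd

# Hypothetical events DataFrame whose index carries a name.
events = pd.DataFrame(
    {'start_frame': [1, 425, 600]},
    index=pd.Index([0, 1, 2], name='event_id'),
)

# The new column is called 'event_id', not 'index'.
flat = events.reset_index()

# With errors='ignore' the missing 'index' label is skipped without complaint.
renamed = flat.rename(columns={'index': 'event_index'}, errors='ignore')

# With errors='raise' the same call fails because 'index' does not exist.
try:
    flat.rename(columns={'index': 'event_index'}, errors='raise')
    strict_failed = False
except KeyError:
    strict_failed = True
```

In the chain shown earlier, a later selection of fill_value_col would then only succeed if a column with that name already exists, for example because the index itself was named event_index.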
  5. The column dimension_matching_col is set as the new index of the DataFrame:

    .set_index(dimension_matching_col, drop=False)
    
(
    ds.events.df
    .sort_values('start_frame')
    .reset_index()
    .rename(columns={'index': 'event_index'}, errors='ignore')
    .set_index('start_frame', drop=False)
)
             event_index  event_type  start_frame  end_frame  player_id
start_frame
1                      0        pass            1        424         79
425                    1        goal          425        599         79
600                    2        pass          600        944         19
945                    3        pass          945       1099          2
1100                   4        pass         1100       1279          3
1280                   5     penalty         1280       1889          2
1890                   6        goal         1890       2019          3
2020                   7        pass         2020       2299         79
2300                   8        pass         2300       2389          2
2390                   9     penalty         2390       2450         79

  6. All columns of the DataFrame except for fill_value_col, which is event_index in this case, are dropped. The index is kept, so the result is a Series.

    [fill_value_col]
    
(
    ds.events.df
    .sort_values('start_frame')
    .reset_index()
    .rename(columns={'index': 'event_index'}, errors='ignore')
    .set_index('start_frame', drop=False)
    ['event_index']
)
start_frame
1       0
425     1
600     2
945     3
1100    4
1280    5
1890    6
2020    7
2300    8
2390    9
Name: event_index, dtype: int64
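  7. The Series is now reindexed to the Dataset coordinate or dimension that matches dimension_matching_col, which is frame in this case. Notice that no fill method is given here, so every frame that is not the start of an event becomes NaN and the values are upcast to float64.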

    .reindex(
        self._ds[self._get_ds_from_df(dimension_matching_col)],
        method=fill_method
    )
    
(
    ds.events.df
    .sort_values('start_frame')
    .reset_index()
    .rename(columns={'index': 'event_index'}, errors='ignore')
    .set_index('start_frame', drop=False)
    ['event_index']
    .reindex(
        ds.events._ds[ds.events._get_ds_from_df('start_frame')]
    )
)
frame
1       0.0
2       NaN
3       NaN
4       NaN
5       NaN
       ... 
2446    NaN
2447    NaN
2448    NaN
2449    NaN
2450    NaN
Name: event_index, Length: 2450, dtype: float64
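For comparison, here is a pandas-only sketch of the same reindex with and without a fill method, using the start frames from this tutorial. It assumes that fill_method is passed straight through to reindex as the method argument, as in the snippet at the top of this section; 'ffill' is one of the values pandas accepts there. The names event_index, frames, no_fill and forward are local to this example.

```python
import pandas as pd

# Start frames of the ten events used in this tutorial.
start_frames = [1, 425, 600, 945, 1100, 1280, 1890, 2020, 2300, 2390]
event_index = pd.Series(
    range(len(start_frames)),
    index=pd.Index(start_frames, name='start_frame'),
    name='event_index',
)
frames = pd.RangeIndex(1, 2451, name='frame')

# No fill method: only the ten start frames carry a value.
no_fill = event_index.reindex(frames)

# Forward fill: every frame inherits the index of the most recent event.
forward = event_index.reindex(frames, method='ffill')

# Number of frames assigned to each event after forward filling.
frames_per_event = forward.value_counts().sort_index().tolist()
```

With forward filling, frames 1 to 424 map to event 0, frames 425 to 599 map to event 1, and so on, which is what makes the result usable as a grouping key.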
  8. The Series is finally converted into a DataArray.

    return xr.DataArray(
        ...
    )
    
xr.DataArray(
    ds.events.df
    .sort_values('start_frame')
    .reset_index()
    .rename(columns={'index': 'event_index'}, errors='ignore')
    .set_index('start_frame', drop=False)
    ['event_index']
    .reindex(
        ds.events._ds[ds.events._get_ds_from_df('start_frame')]
    )
)
<xarray.DataArray 'event_index' (frame: 2450)>
array([ 0., nan, nan, ..., nan, nan, nan])
Coordinates:
  * frame    (frame) int64 1 2 3 4 5 6 7 ... 2444 2445 2446 2447 2448 2449 2450

This DataArray is useful on its own because it shows which event, if any, each value of the Dataset coordinate or dimension is matched with. It is also used to group the Dataset in groupby_events().
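To give an idea of how such a DataArray can drive a grouping, the sketch below uses plain xarray on a tiny hypothetical Dataset. It is not the implementation of groupby_events(); it only shows that a forward-filled event index aligned with frame can be handed to Dataset.groupby(). The names ds_small, speed and mean_per_event are invented for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical miniature Dataset: six frames and one variable.
ds_small = xr.Dataset(
    {'speed': ('frame', np.array([1.0, 2.0, 3.0, 10.0, 20.0, 30.0]))},
    coords={'frame': np.arange(1, 7)},
)

# Two hypothetical events starting at frames 1 and 4, forward-filled so
# that every frame is labelled with the event it belongs to.
event_index = xr.DataArray(
    pd.Series(
        [0, 1],
        index=pd.Index([1, 4], name='start_frame'),
        name='event_index',
    ).reindex(ds_small['frame'].to_index(), method='ffill')
)

# Group the Dataset by event and average within each group.
mean_per_event = ds_small.groupby(event_index).mean()
```

Here frames 1 to 3 fall under event 0 and frames 4 to 6 under event 1, so the per-event means of speed are 2.0 and 20.0.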