expand_to_match_ds(): A closer look
In this section we take a closer look at the internals of
expand_to_match_ds(). This method transforms a DataFrame into a
DataArray by applying a series of operations to it.
Recall from its signature that it takes the following arguments:
dimension_matching_col, fill_method, fill_value_col
The transformation is essentially carried out by the following code snippet:
return xr.DataArray(
    self.df
    .sort_values(dimension_matching_col)
    .reset_index()
    .rename(columns={'index': fill_value_col}, errors='ignore')
    .set_index(dimension_matching_col, drop=False)
    [fill_value_col]
    .reindex(
        self._ds[self._get_ds_from_df(dimension_matching_col)],
        method=fill_method
    )
)
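The chain above can also be tried outside the accessor. The following is a minimal sketch, not the library's implementation: the function name expand_to_match and its target parameter are hypothetical, and target stands in for self._ds[self._get_ds_from_df(dimension_matching_col)], whose lookup logic is not shown in this section. The final xr.DataArray(...) wrapping is left out so that only pandas is needed, and the toy data is made up for illustration.

```python
import pandas as pd


def expand_to_match(df, target, dimension_matching_col, fill_value_col,
                    fill_method=None):
    """Hypothetical stand-alone version of the chain (returns a Series)."""
    return (
        df
        .sort_values(dimension_matching_col)
        .reset_index()
        .rename(columns={'index': fill_value_col}, errors='ignore')
        .set_index(dimension_matching_col, drop=False)
        [fill_value_col]
        .reindex(target, method=fill_method)
    )


# Made-up toy data: three events on a seven-frame axis.
toy_events = pd.DataFrame({
    'event_type': ['pass', 'goal', 'pass'],
    'start_frame': [1, 3, 6],
})
toy_frames = pd.Index(range(1, 8), name='frame')

# Without a fill method only the exact start frames receive a value.
sparse = expand_to_match(toy_events, toy_frames, 'start_frame', 'event_index')
# With 'ffill' each frame inherits the most recent event.
filled = expand_to_match(toy_events, toy_frames, 'start_frame', 'event_index',
                         fill_method='ffill')

print(sparse.tolist())
print(filled.tolist())
```

Passing the target index explicitly keeps the sketch independent of any Dataset, at the cost of skipping the column-to-dimension lookup that the accessor performs.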
Continuing with the tutorial, let’s see how the original DataFrame is progressively transformed.
This is the original DataFrame:
import numpy as np
import pandas as pd
import xarray as xr
import xarray_events
ds = xr.Dataset(
    data_vars={
        'ball_trajectory': (
            ['frame', 'cartesian_coords'],
            np.exp(np.linspace((-6, -8), (3, 2), 2450))
        )
    },
    coords={
        'frame': np.arange(1, 2451),
        'cartesian_coords': ['x', 'y'],
        'player_id': [2, 3, 7, 19, 20, 21, 22, 28, 34, 79]
    },
    attrs={'match_id': 12, 'resolution_fps': 25}
)
events = pd.DataFrame(
    {
        'event_type':
            ['pass', 'goal', 'pass', 'pass', 'pass',
             'penalty', 'goal', 'pass', 'pass', 'penalty'],
        'start_frame': [1, 425, 600, 945, 1100, 1280, 1890, 2020, 2300, 2390],
        'end_frame': [424, 599, 944, 1099, 1279, 1889, 2019, 2299, 2389, 2450],
        'player_id': [79, 79, 19, 2, 3, 2, 3, 79, 2, 79]
    }
)
(
    ds
    .events.load(events, {'frame': ('start_frame', 'end_frame')})
    .events.expand_to_match_ds('start_frame')
)
ds.events.df
| | event_type | start_frame | end_frame | player_id |
|---|---|---|---|---|
| 0 | pass | 1 | 424 | 79 |
| 1 | goal | 425 | 599 | 79 |
| 2 | pass | 600 | 944 | 19 |
| 3 | pass | 945 | 1099 | 2 |
| 4 | pass | 1100 | 1279 | 3 |
| 5 | penalty | 1280 | 1889 | 2 |
| 6 | goal | 1890 | 2019 | 3 |
| 7 | pass | 2020 | 2299 | 79 |
| 8 | pass | 2300 | 2389 | 2 |
| 9 | penalty | 2390 | 2450 | 79 |
The DataFrame gets sorted on the column dimension_matching_col, which is start_frame in this case:
.sort_values(dimension_matching_col)
It is already sorted, so nothing changes.
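The source does not say why the sort is needed. One plausible reason, offered here as an inference, is that pandas only accepts a fill method in reindex when the index is monotonic. The sketch below uses made-up events whose start frames are out of order; the helper name expand is hypothetical.

```python
import pandas as pd

# Made-up events listed out of chronological order.
events = pd.DataFrame({'event_index': [0, 1, 2], 'start_frame': [6, 1, 3]})
frames = pd.Index(range(1, 8), name='frame')


def expand(df):
    """Hypothetical helper: index by start_frame, then forward-fill."""
    return (
        df
        .set_index('start_frame')['event_index']
        .reindex(frames, method='ffill')
    )


try:
    expand(events)
    raised = False
except ValueError:
    # pandas refuses to forward-fill from a non-monotonic index.
    raised = True

# Sorting on the matching column first makes the same call succeed.
filled = expand(events.sort_values('start_frame'))

print(raised)
print(filled.tolist())
```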
The index of the DataFrame gets reset:
.reset_index()
(
    ds.events.df
    .sort_values('start_frame')
    .reset_index()
)
| | index | event_type | start_frame | end_frame | player_id |
|---|---|---|---|---|---|
| 0 | 0 | pass | 1 | 424 | 79 |
| 1 | 1 | goal | 425 | 599 | 79 |
| 2 | 2 | pass | 600 | 944 | 19 |
| 3 | 3 | pass | 945 | 1099 | 2 |
| 4 | 4 | pass | 1100 | 1279 | 3 |
| 5 | 5 | penalty | 1280 | 1889 | 2 |
| 6 | 6 | goal | 1890 | 2019 | 3 |
| 7 | 7 | pass | 2020 | 2299 | 79 |
| 8 | 8 | pass | 2300 | 2389 | 2 |
| 9 | 9 | penalty | 2390 | 2450 | 79 |
Now index is a column of its own.
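In the tutorial data the index column happens to equal the new row numbers, because the DataFrame was already sorted. The small sketch below, with made-up rows that are out of order, shows that the column actually carries the original row labels, so the values later exposed as event_index point back to rows of the original DataFrame.

```python
import pandas as pd

# Made-up events stored out of chronological order (labels 0, 1, 2).
shuffled = pd.DataFrame(
    {'event_type': ['pass', 'pass', 'goal'], 'start_frame': [600, 1, 425]}
)

step = shuffled.sort_values('start_frame').reset_index()

# 'index' holds the labels the rows had before sorting.
print(step)
```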
The column index gets renamed to fill_value_col, which is event_index in this case:
.rename(columns={'index': fill_value_col}, errors='ignore')
(
    ds.events.df
    .sort_values('start_frame')
    .reset_index()
    .rename(columns={'index': 'event_index'}, errors='ignore')
)
| | event_index | event_type | start_frame | end_frame | player_id |
|---|---|---|---|---|---|
| 0 | 0 | pass | 1 | 424 | 79 |
| 1 | 1 | goal | 425 | 599 | 79 |
| 2 | 2 | pass | 600 | 944 | 19 |
| 3 | 3 | pass | 945 | 1099 | 2 |
| 4 | 4 | pass | 1100 | 1279 | 3 |
| 5 | 5 | penalty | 1280 | 1889 | 2 |
| 6 | 6 | goal | 1890 | 2019 | 3 |
| 7 | 7 | pass | 2020 | 2299 | 79 |
| 8 | 8 | pass | 2300 | 2389 | 2 |
| 9 | 9 | penalty | 2390 | 2450 | 79 |
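The errors='ignore' argument matters when there is no column called index to rename. A sketch with a made-up DataFrame whose index has a name (event_id is a hypothetical name): reset_index() then creates a column with that name instead, and the rename silently does nothing. In that situation fill_value_col would presumably have to name a column that already exists; that consequence is an inference from the snippet, not something the source states.

```python
import pandas as pd

# Made-up events whose index carries a name.
named = pd.DataFrame(
    {'start_frame': [1, 425, 600]},
    index=pd.Index([10, 11, 12], name='event_id'),
)

step = (
    named
    .reset_index()  # the new column is called 'event_id', not 'index'
    .rename(columns={'index': 'event_index'}, errors='ignore')
)
print(step.columns.tolist())

# With errors='raise' the missing 'index' column is reported instead.
try:
    named.reset_index().rename(columns={'index': 'event_index'},
                               errors='raise')
    strict_raised = False
except KeyError:
    strict_raised = True
print(strict_raised)
```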
The column dimension_matching_col is set as the new index of the DataFrame:
.set_index(dimension_matching_col, drop=False)
(
    ds.events.df
    .sort_values('start_frame')
    .reset_index()
    .rename(columns={'index': 'event_index'}, errors='ignore')
    .set_index('start_frame', drop=False)
)
| start_frame (index) | event_index | event_type | start_frame | end_frame | player_id |
|---|---|---|---|---|---|
| 1 | 0 | pass | 1 | 424 | 79 |
| 425 | 1 | goal | 425 | 599 | 79 |
| 600 | 2 | pass | 600 | 944 | 19 |
| 945 | 3 | pass | 945 | 1099 | 2 |
| 1100 | 4 | pass | 1100 | 1279 | 3 |
| 1280 | 5 | penalty | 1280 | 1889 | 2 |
| 1890 | 6 | goal | 1890 | 2019 | 3 |
| 2020 | 7 | pass | 2020 | 2299 | 79 |
| 2300 | 8 | pass | 2300 | 2389 | 2 |
| 2390 | 9 | penalty | 2390 | 2450 | 79 |
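The effect of drop=False can be seen in isolation. In this minimal sketch with made-up rows, the matching column survives as a regular column after becoming the index, whereas the pandas default (drop=True) removes it. A likely benefit, stated here as an assumption, is that the column remains selectable in the next step even if it is also chosen as fill_value_col.

```python
import pandas as pd

events = pd.DataFrame(
    {'event_type': ['pass', 'goal', 'pass'], 'start_frame': [1, 425, 600]}
)

kept = events.set_index('start_frame', drop=False)
dropped = events.set_index('start_frame')  # drop=True is the default

print(kept.columns.tolist())
print(dropped.columns.tolist())

# The column is still selectable when it was kept.
values = kept['start_frame']
print(values.tolist())
```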
All columns of the DataFrame except fill_value_col, which is event_index in this case, are dropped. The index is kept, so the result is a Series:
[fill_value_col]
(
    ds.events.df
    .sort_values('start_frame')
    .reset_index()
    .rename(columns={'index': 'event_index'}, errors='ignore')
    .set_index('start_frame', drop=False)
    ['event_index']
)
start_frame
1 0
425 1
600 2
945 3
1100 4
1280 5
1890 6
2020 7
2300 8
2390 9
Name: event_index, dtype: int64
The Series is now reindexed to the Dataset coordinate or dimension that matches dimension_matching_col, which is frame in this case. Notice that no fill method is passed here, so every frame that is not a start_frame gets NaN and the dtype becomes float64:
.reindex(
    self._ds[self._get_ds_from_df(dimension_matching_col)],
    method=fill_method
)
(
    ds.events.df
    .sort_values('start_frame')
    .reset_index()
    .rename(columns={'index': 'event_index'}, errors='ignore')
    .set_index('start_frame', drop=False)
    ['event_index']
    .reindex(
        ds.events._ds[ds.events._get_ds_from_df('start_frame')]
    )
)
frame
1 0.0
2 NaN
3 NaN
4 NaN
5 NaN
...
2446 NaN
2447 NaN
2448 NaN
2449 NaN
2450 NaN
Name: event_index, Length: 2450, dtype: float64
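To see what a fill method would change, the reindexing step can be reproduced with pandas alone. This sketch rebuilds the event_index Series directly from the tutorial's start_frame values rather than going through the accessor, and it assumes that fill_method is forwarded unchanged to reindex, as the snippet at the top of this section suggests. 'ffill' is one of the standard pandas reindex methods.

```python
import numpy as np
import pandas as pd

start_frames = [1, 425, 600, 945, 1100, 1280, 1890, 2020, 2300, 2390]
event_index = pd.Series(
    range(10),
    index=pd.Index(start_frames, name='start_frame'),
    name='event_index',
)
frames = pd.Index(np.arange(1, 2451), name='frame')

# No fill method: only the ten start frames carry a value.
no_fill = event_index.reindex(frames)
# Forward fill: every frame inherits the most recent event.
forward = event_index.reindex(frames, method='ffill')

print(int(no_fill.notna().sum()))
print(forward.loc[[1, 424, 425, 2450]].tolist())
```

With forward filling, frame 424 still belongs to event 0 and frame 425 is the first frame of event 1, which agrees with the start_frame and end_frame columns of the original DataFrame.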
The Series is finally converted into a DataArray:
return xr.DataArray( ... )
xr.DataArray(
    ds.events.df
    .sort_values('start_frame')
    .reset_index()
    .rename(columns={'index': 'event_index'}, errors='ignore')
    .set_index('start_frame', drop=False)
    ['event_index']
    .reindex(
        ds.events._ds[ds.events._get_ds_from_df('start_frame')]
    )
)
<xarray.DataArray 'event_index' (frame: 2450)>
array([ 0., nan, nan, ..., nan, nan, nan])
Coordinates:
  * frame    (frame) int64 1 2 3 4 5 6 7 ... 2444 2445 2446 2447 2448 2449 2450
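The conversion itself relies on how xr.DataArray treats a pandas Series: the Series name becomes the array name and the named index becomes the dimension and its coordinate. A minimal sketch with a made-up four-frame Series:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up Series shaped like the reindexed result above.
series = pd.Series(
    [0.0, np.nan, 1.0, np.nan],
    index=pd.Index([1, 2, 3, 4], name='frame'),
    name='event_index',
)

da = xr.DataArray(series)

print(da.name, da.dims)
# Label-based selection works on the inherited coordinate.
print(da.sel(frame=3).item())
```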
This DataArray is useful on its own because it allows us to see which
values of the Dataset coordinate or dimension match with unique events.
It is also used to group the Dataset in groupby_events().
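The grouping idea can be illustrated with pandas alone. This is only an analogue under stated assumptions, not the implementation of groupby_events(), which this section does not show. It uses the forward-filled variant of the Series so that every frame is assigned to an event, and it rebuilds the tutorial's ball trajectory as a plain DataFrame.

```python
import numpy as np
import pandas as pd

start_frames = [1, 425, 600, 945, 1100, 1280, 1890, 2020, 2300, 2390]
end_frames = [424, 599, 944, 1099, 1279, 1889, 2019, 2299, 2389, 2450]
frames = pd.Index(np.arange(1, 2451), name='frame')

# Event index per frame, forward-filled from the start frames.
event_of_frame = pd.Series(
    range(10),
    index=pd.Index(start_frames, name='start_frame'),
    name='event_index',
).reindex(frames, method='ffill')

# Same values as ball_trajectory in the tutorial Dataset.
trajectory = pd.DataFrame(
    np.exp(np.linspace((-6, -8), (3, 2), 2450)),
    index=frames,
    columns=['x', 'y'],
)

# Number of frames and mean ball position for each event.
frames_per_event = event_of_frame.groupby(event_of_frame).size()
mean_position = trajectory.groupby(event_of_frame).mean()

print(frames_per_event.tolist())
print(mean_position.shape)
```

The frame counts match end_frame - start_frame + 1 for every event, which is a quick consistency check on the expansion.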