expand_to_match_ds(): A closer look
In this section we shall take a closer look at the internals of expand_to_match_ds(). This method transforms a DataFrame into a DataArray by performing a series of operations on it.
Recall from its signature that the arguments it takes are:
- dimension_matching_col
- fill_method
- fill_value_col
The transformation occurs essentially with the following code snippet:
    return xr.DataArray(
        self.df
        .sort_values(dimension_matching_col)
        .reset_index()
        .rename(columns={'index': fill_value_col}, errors='ignore')
        .set_index(dimension_matching_col, drop=False)
        [fill_value_col]
        .reindex(
            self._ds[self._get_ds_from_df(dimension_matching_col)],
            method=fill_method
        )
    )
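To experiment with this chain outside the accessor, the same steps can be written as a free function. The sketch below is an illustration under stated assumptions, not the library's implementation: expand_events and its coord parameter are hypothetical names, the data is made up, and the coordinate is passed in directly instead of being looked up through self._ds and _get_ds_from_df().

```python
import pandas as pd
import xarray as xr


def expand_events(df, coord, dimension_matching_col,
                  fill_method=None, fill_value_col='event_index'):
    """Hypothetical stand-alone version of the chain shown above.

    `coord` is a pandas Index holding the Dataset coordinate values. The
    accessor resolves these itself; here they are passed in explicitly.
    """
    series = (
        df
        .sort_values(dimension_matching_col)
        .reset_index()
        .rename(columns={'index': fill_value_col}, errors='ignore')
        .set_index(dimension_matching_col, drop=False)
        [fill_value_col]
        .reindex(coord, method=fill_method)
    )
    # The Series index name becomes the dimension name of the DataArray.
    return xr.DataArray(series)


# Made-up events and an eight-frame coordinate.
events = pd.DataFrame({
    'event_type': ['pass', 'goal', 'pass'],
    'start_frame': [1, 4, 6],
})
frames = pd.Index(range(1, 9), name='frame')

expanded = expand_events(events, frames, 'start_frame', fill_method='ffill')
print(expanded.dims, expanded.values)
```

With fill_method='ffill', each frame carries the index of the most recent event that started at or before it.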
Continuing with the tutorial, let’s see how the original DataFrame is progressively transformed.
This is the original DataFrame.
    import numpy as np
    import pandas as pd
    import xarray as xr
    import xarray_events

    ds = xr.Dataset(
        data_vars={
            'ball_trajectory': (
                ['frame', 'cartesian_coords'],
                np.exp(np.linspace((-6, -8), (3, 2), 2450))
            )
        },
        coords={
            'frame': np.arange(1, 2451),
            'cartesian_coords': ['x', 'y'],
            'player_id': [2, 3, 7, 19, 20, 21, 22, 28, 34, 79]
        },
        attrs={'match_id': 12, 'resolution_fps': 25}
    )

    events = pd.DataFrame(
        {
            'event_type':
                ['pass', 'goal', 'pass', 'pass', 'pass',
                 'penalty', 'goal', 'pass', 'pass', 'penalty'],
            'start_frame': [1, 425, 600, 945, 1100, 1280, 1890, 2020, 2300, 2390],
            'end_frame': [424, 599, 944, 1099, 1279, 1889, 2019, 2299, 2389, 2450],
            'player_id': [79, 79, 19, 2, 3, 2, 3, 79, 2, 79]
        }
    )

    (
        ds
        .events.load(events, {'frame': ('start_frame', 'end_frame')})
        .events.expand_to_match_ds('start_frame')
    )

    ds.events.df
| | event_type | start_frame | end_frame | player_id |
|---|---|---|---|---|
| 0 | pass | 1 | 424 | 79 |
| 1 | goal | 425 | 599 | 79 |
| 2 | pass | 600 | 944 | 19 |
| 3 | pass | 945 | 1099 | 2 |
| 4 | pass | 1100 | 1279 | 3 |
| 5 | penalty | 1280 | 1889 | 2 |
| 6 | goal | 1890 | 2019 | 3 |
| 7 | pass | 2020 | 2299 | 79 |
| 8 | pass | 2300 | 2389 | 2 |
| 9 | penalty | 2390 | 2450 | 79 |
The DataFrame gets sorted on the column dimension_matching_col, which is start_frame in this case.

    .sort_values(dimension_matching_col)

It is already sorted, so nothing changes.
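Sorting matters once a fill method is involved, because pandas only accepts a fill method in reindex() when the index is monotonic. The following minimal sketch uses a small, deliberately unsorted DataFrame (made-up data, not the tutorial's) to show what this step protects against.

```python
import pandas as pd

# Made-up events listed out of order.
unsorted = pd.DataFrame({
    'event_type': ['goal', 'pass', 'penalty'],
    'start_frame': [425, 1, 1280],
})

# The rows are reordered but keep their original index labels.
sorted_events = unsorted.sort_values('start_frame')
print(sorted_events)

# Without sorting, the index built from start_frame is not monotonic,
# and reindexing with a fill method is rejected by pandas.
by_frame = unsorted.set_index('start_frame')['event_type']
try:
    by_frame.reindex(range(1, 5), method='ffill')
except ValueError as error:
    print('reindex failed:', error)
```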
The index of the DataFrame gets reset.

    .reset_index()

    (
        ds.events.df
        .sort_values('start_frame')
        .reset_index()
    )
| | index | event_type | start_frame | end_frame | player_id |
|---|---|---|---|---|---|
| 0 | 0 | pass | 1 | 424 | 79 |
| 1 | 1 | goal | 425 | 599 | 79 |
| 2 | 2 | pass | 600 | 944 | 19 |
| 3 | 3 | pass | 945 | 1099 | 2 |
| 4 | 4 | pass | 1100 | 1279 | 3 |
| 5 | 5 | penalty | 1280 | 1889 | 2 |
| 6 | 6 | goal | 1890 | 2019 | 3 |
| 7 | 7 | pass | 2020 | 2299 | 79 |
| 8 | 8 | pass | 2300 | 2389 | 2 |
| 9 | 9 | penalty | 2390 | 2450 | 79 |
Now index is a column of its own.
The column index gets renamed to fill_value_col, which is event_index in this case:

    .rename(columns={'index': fill_value_col}, errors='ignore')

    (
        ds.events.df
        .sort_values('start_frame')
        .reset_index()
        .rename(columns={'index': 'event_index'}, errors='ignore')
    )
| | event_index | event_type | start_frame | end_frame | player_id |
|---|---|---|---|---|---|
| 0 | 0 | pass | 1 | 424 | 79 |
| 1 | 1 | goal | 425 | 599 | 79 |
| 2 | 2 | pass | 600 | 944 | 19 |
| 3 | 3 | pass | 945 | 1099 | 2 |
| 4 | 4 | pass | 1100 | 1279 | 3 |
| 5 | 5 | penalty | 1280 | 1889 | 2 |
| 6 | 6 | goal | 1890 | 2019 | 3 |
| 7 | 7 | pass | 2020 | 2299 | 79 |
| 8 | 8 | pass | 2300 | 2389 | 2 |
| 9 | 9 | penalty | 2390 | 2450 | 79 |
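The errors='ignore' argument makes the rename a no-op when there is no column called index, which happens for instance when the DataFrame's index already has a name. The sketch below uses made-up data and assumes, purely for illustration, that the index was named event_index beforehand.

```python
import pandas as pd

events = pd.DataFrame({'start_frame': [1, 425, 600]})

# Unnamed index: reset_index() creates a column called 'index',
# which is then renamed.
unnamed = (
    events
    .reset_index()
    .rename(columns={'index': 'event_index'}, errors='ignore')
)
print(list(unnamed.columns))

# Named index: reset_index() uses the index name for the new column,
# so there is no 'index' column and the rename changes nothing.
named = (
    events
    .rename_axis('event_index')
    .reset_index()
    .rename(columns={'index': 'event_index'}, errors='ignore')
)
print(list(named.columns))
```

Both variants end up with the same two columns, so the rest of the chain can rely on fill_value_col being present in either case.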
The column dimension_matching_col is set as the new index of the DataFrame:

    .set_index(dimension_matching_col, drop=False)

    (
        ds.events.df
        .sort_values('start_frame')
        .reset_index()
        .rename(columns={'index': 'event_index'}, errors='ignore')
        .set_index('start_frame', drop=False)
    )
| start_frame | event_index | event_type | start_frame | end_frame | player_id |
|---|---|---|---|---|---|
| 1 | 0 | pass | 1 | 424 | 79 |
| 425 | 1 | goal | 425 | 599 | 79 |
| 600 | 2 | pass | 600 | 944 | 19 |
| 945 | 3 | pass | 945 | 1099 | 2 |
| 1100 | 4 | pass | 1100 | 1279 | 3 |
| 1280 | 5 | penalty | 1280 | 1889 | 2 |
| 1890 | 6 | goal | 1890 | 2019 | 3 |
| 2020 | 7 | pass | 2020 | 2299 | 79 |
| 2300 | 8 | pass | 2300 | 2389 | 2 |
| 2390 | 9 | penalty | 2390 | 2450 | 79 |
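By default set_index() removes the column it promotes to the index, whereas drop=False keeps a copy of it among the columns. The short sketch below, on made-up data, contrasts the two.

```python
import pandas as pd

events = pd.DataFrame({
    'event_index': [0, 1, 2],
    'start_frame': [1, 425, 600],
})

# Default behaviour (drop=True): start_frame only survives as the index.
dropped = events.set_index('start_frame')

# drop=False: start_frame is both the index and a regular column.
kept = events.set_index('start_frame', drop=False)

print(list(dropped.columns))
print(list(kept.columns))
print(kept.index.name)
```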
Only the column fill_value_col, which is event_index in this case, is kept; all other columns are dropped. The result is a Series that retains the start_frame index.

    [fill_value_col]

    (
        ds.events.df
        .sort_values('start_frame')
        .reset_index()
        .rename(columns={'index': 'event_index'}, errors='ignore')
        .set_index('start_frame', drop=False)
        ['event_index']
    )
    start_frame
    1       0
    425     1
    600     2
    945     3
    1100    4
    1280    5
    1890    6
    2020    7
    2300    8
    2390    9
    Name: event_index, dtype: int64
The Series is now reindexed to the Dataset coordinate or dimension that matches dimension_matching_col, which is frame in this case.

    .reindex(
        self._ds[self._get_ds_from_df(dimension_matching_col)],
        method=fill_method
    )

Notice that no fill method is given here, so every frame that is not a start_frame gets NaN and the values become floats.

    (
        ds.events.df
        .sort_values('start_frame')
        .reset_index()
        .rename(columns={'index': 'event_index'}, errors='ignore')
        .set_index('start_frame', drop=False)
        ['event_index']
        .reindex(
            ds.events._ds[ds.events._get_ds_from_df('start_frame')]
        )
    )
    frame
    1       0.0
    2       NaN
    3       NaN
    4       NaN
    5       NaN
           ...
    2446    NaN
    2447    NaN
    2448    NaN
    2449    NaN
    2450    NaN
    Name: event_index, Length: 2450, dtype: float64
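The effect of fill_method is easiest to see on a short index. The sketch below uses made-up start frames and a hypothetical eight-frame coordinate, and compares reindexing without a fill method against 'ffill', one of the values pandas accepts for the method argument of reindex().

```python
import pandas as pd

event_index = pd.Series(
    [0, 1, 2],
    index=pd.Index([1, 4, 6], name='start_frame'),
    name='event_index',
)
frames = pd.Index(range(1, 9), name='frame')

# No fill method: only frames that equal a start_frame get a value,
# and the NaN gaps force the dtype to float64.
exact = event_index.reindex(frames)

# Forward fill: every frame takes the index of the latest event that
# started at or before it, so no gaps remain.
filled = event_index.reindex(frames, method='ffill')

print(pd.DataFrame({'exact': exact, 'ffill': filled}))
```

Forward filling is the natural choice when each event is taken to last until the next one starts.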
The Series is finally converted into a DataArray.

    return xr.DataArray( ... )

    xr.DataArray(
        ds.events.df
        .sort_values('start_frame')
        .reset_index()
        .rename(columns={'index': 'event_index'}, errors='ignore')
        .set_index('start_frame', drop=False)
        ['event_index']
        .reindex(
            ds.events._ds[ds.events._get_ds_from_df('start_frame')]
        )
    )
    <xarray.DataArray 'event_index' (frame: 2450)>
    array([ 0., nan, nan, ..., nan, nan, nan])
    Coordinates:
      * frame    (frame) int64 1 2 3 4 5 6 7 ... 2444 2445 2446 2447 2448 2449 2450
This DataArray is useful on its own because it allows us to see which values of the Dataset coordinate or dimension match with unique events. It is also used to group the Dataset in groupby_events().