]
.date[
### 2022-06
]
---
# Spatial data
.pull-left[
```r
stations_sf
```
```
Simple feature collection with 59 features and 6 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 141.2652 ymin: -39.1297 xmax: 153.3633 ymax: -28.9786
Geodetic CRS: GDA94
# A tibble: 59 × 7
id long lat elev name wmo_id
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 ASN00047016 141. -34.0 43 lake victo… 94692
2 ASN00047019 142. -32.4 61 menindee p… 94694
3 ASN00048015 147. -30.0 115 brewarrina… 95512
4 ASN00048027 146. -31.5 260 cobar mo 94711
5 ASN00048031 149. -29.5 145 collareneb… 95520
# … with 54 more rows, and 1 more variable:
# geometry <POINT [°]>
```
]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" />
]
???
* Thanks everyone for coming
* The title of my talk today is
* Here is the link to these slides; it is also available in the chat now.
* Spatial data is a common type of data and here there are 59 weather stations distributed in New South Wales and Victoria in Australia
* The data is organised in an `sf` class and the package `sf` provides various geometric operations for this class
---
# Temporal data
.pull-left-narrow[
```r
ts
```
```
# A tsibble: 1,099,052 x 3 [1D]
# Key: id [59]
id date tmax
<chr> <date> <dbl>
1 ASN00047016 1971-01-01 25
2 ASN00047016 1971-01-02 26.9
3 ASN00047016 1971-01-03 27.5
4 ASN00047016 1971-01-04 30
5 ASN00047016 1971-01-05 34.4
# … with 1,099,047 more rows
```
]
.pull-right-long[
<img src="index_files/figure-html/unnamed-chunk-8-1.png" width="720" style="display: block; margin: auto;" />
]
???
* Temporal data is another common data type
* Here I'm showing you some daily historical temperature data for those 59 stations.
* On the right, a fraction of the data in year 2020 is plotted
* This data is stored in a `tsibble` class, with `id` as the key to define each series and `date` as the index to define the time stamp.
* The `tsibble` class allows you to wrangle temporal data and build temporal models.
---
# Spatio-temporal data
When left joining an `sf` object with a `tsibble` object, the `tsibble` class (`tbl_ts`) gets lost:
```r
out <- stations_sf %>% left_join(ts, by = "id")
class(out)
```
```
[1] "sf" "tbl_df" "tbl" "data.frame"
```
When left joining the other way around, you lose the `sf` class:
```r
out2 <- ts %>% left_join(stations_sf, by = "id")
class(out2)
```
```
[1] "tbl_ts" "tbl_df" "tbl" "data.frame"
```
???
* However, spatial objects and temporal objects do not naturally work well together for spatio-temporal analysis.
* Here let me give you some examples
* If I join an `sf` object with a `tsibble` object, the `tsibble` class gets lost
* If we join the data the other way around, the `sf` class will get lost.
---
# Multivariate spatio-temporal data
You can manually enforce the joined object to have both classes:
```r
out2 <- ts %>% left_join(stations_sf, by = "id")
out3 <- out2 %>% st_as_sf()
class(out3)
```
```
[1] "sf" "tbl_ts" "tbl_df" "tbl" "data.frame"
```
but the `sf` class is lost again after a `tsibble` operation:
```r
out4 <- out3 %>% tsibble::fill_gaps()
class(out4)
```
```
[1] "tbl_ts" "tbl_df" "tbl" "data.frame"
```
???
* We can manually enforce the joined object to have both classes with the function `st_as_sf()`
* But the class label can still get lost during operations.
* Here I use a tsibble function `fill_gaps` and the result doesn't have the `sf` class
* Also, taking a step back, the left join approach on spatial and temporal data is not necessarily the best way to structure spatio-temporal data
* This is because all the feature geometries are repeated multiple times, especially for long daily data, like the temporal data I just showed you.
* This motivates a new data representation
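
The repetition can be seen in a minimal base R sketch. The three stations, their coordinates, and the constant `tmax` below are hypothetical toy values, and plain data frames stand in for the `sf` and `tsibble` objects:

```r
# Hypothetical toy data: 3 stations with five years of daily records each
stations <- data.frame(
  id   = c("a", "b", "c"),
  long = c(141.3, 142.4, 146.9),
  lat  = c(-34.0, -32.4, -30.0)
)
dates <- seq(as.Date("2016-01-01"), as.Date("2020-12-31"), by = "day")
temporal <- data.frame(
  id   = rep(stations$id, each = length(dates)),
  date = rep(dates, times = nrow(stations)),
  tmax = 25
)

# Joining copies every station's coordinates onto each of its daily rows
joined <- merge(temporal, stations, by = "id")

# One row per station-day, but only three distinct locations among them
n_rows      <- nrow(joined)
n_locations <- nrow(unique(joined[c("id", "long", "lat")]))
```

Here `n_rows` grows with the length of the series while `n_locations` stays at the number of stations, which is the redundancy the join introduces.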
---
class: center, inverse, middle
# Cubble
## A new tidy data structure to organise and wrangle spatio-temporal data
???
* Today I will introduce a new data structure, called cubble, to organise spatio-temporal data.
* And we will see how data wrangling with cubble can be fun
---
# Multivariate spatio-temporal cubes
<img src="figures/spatio-temporal-cube.png" width="2532" style="display: block; margin: auto;" />
???
* Conceptually spatio-temporal data can be thought of as a data cube
* In this cube, the three axes are Time, Site, and Variable.
* The axis **Site** defines the location of the entities
* The axis **Variable** is used to represent multivariate information.
* We define our data cube slightly differently from a conventional cube to avoid introducing hypercubes for multivariate information.
* Operations on multivariate spatio-temporal data can be thought of as slicing and dicing on the cube.
* Although the data cube is conceptually convenient, a 3D array structure may not be sufficiently rich for data wrangling, for example, to handle special date-time classes.
---
# Cubble basics
<img src="figures/long-nested-form.png" width="2560" height="575" style="display: block; margin: auto;" />
???
* Now I will demonstrate how cubble organises spatio-temporal data with two forms.
* The nested form organises each site in a row.
* Spatial variables fixed for each site can be directly wrangled.
* Temporal variables varied across time are nested in a list column called `ts`.
* On the other hand, the long form cubble organises each row by a combination of site and date, similar to a `tsibble`.
* Temporal variables can be directly wrangled and
* spatial variables are stored as a data attribute, which I will show you shortly in the code.
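
As a rough illustration of the two forms, here is a base R sketch with plain data frames. The toy stations and values are hypothetical, and this only mimics the layout; it is not how cubble builds its objects:

```r
# Hypothetical toy inputs: two sites, three days each
stations <- data.frame(
  id   = c("a", "b"),
  long = c(141.3, 142.4),
  lat  = c(-34.0, -32.4)
)
temporal <- data.frame(
  id   = rep(c("a", "b"), each = 3),
  date = rep(as.Date("2020-01-01") + 0:2, times = 2),
  tmax = c(25, 26.9, 27.5, 30, 34.4, 31)
)

# Nested form: one row per site, temporal variables in the list column `ts`
nested <- stations
nested$ts <- split(temporal[c("date", "tmax")], temporal$id)[nested$id]

# Long form: one row per site-date pair, spatial variables kept as an attribute
long <- temporal
attr(long, "spatial") <- stations
```

In the nested layout `nested$long` and `nested$lat` are ordinary columns, while in the long layout `long$tmax` is; the other half of the data is still carried along in each case.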
---
# Switching focus between time and space
<img src="figures/cubble-operations.png" width="2560" height="575" style="display: block; margin: auto;" />
???
* In a spatio-temporal analysis, we may want to first subset a few locations and then explore their temporal patterns.
* We may also want to first calculate some temporal features and then investigate their spatial distribution.
* These analyses require switching between the nested form and the long form of a cubble.
* The function `face_temporal()` turns a nested cubble into the long form.
* This lets us first filter the locations in the nested form, then use `face_temporal()` to switch to the long form and make temporal summaries.
* The inverse of `face_temporal()` is `face_spatial()`, which switches the long cubble into a nested one
* With `face_spatial()` we can first make some calculations on the temporal side and switch back to the nested form to view its spatial distribution on the map
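
The round trip can be sketched in base R. The helpers `to_long()` and `to_nested()` are hypothetical stand-ins written for this illustration; they mimic the idea behind `face_temporal()` and `face_spatial()` on plain data frames and are not the cubble implementation:

```r
# Hypothetical helper: nested layout -> long layout
to_long <- function(nested) {
  sizes <- vapply(nested$ts, nrow, integer(1))
  long <- data.frame(
    id = rep(nested$id, times = sizes),
    do.call(rbind, unname(nested$ts)),
    stringsAsFactors = FALSE
  )
  rownames(long) <- NULL
  # Keep the spatial variables alongside the temporal rows
  attr(long, "spatial") <- nested[setdiff(names(nested), "ts")]
  long
}

# Hypothetical helper: long layout -> nested layout
to_nested <- function(long) {
  nested <- attr(long, "spatial")
  temporal_cols <- setdiff(names(long), "id")
  nested$ts <- split(long[temporal_cols], long$id)[nested$id]
  nested
}

# Toy data (hypothetical values)
stations <- data.frame(
  id   = c("a", "b", "c"),
  long = c(141.3, 142.4, 146.9),
  lat  = c(-34.0, -32.4, -30.0)
)
temporal <- data.frame(
  id   = rep(stations$id, each = 3),
  date = rep(as.Date("2020-01-01") + 0:2, times = 3),
  tmax = c(25, 26.9, 27.5, 30, 34.4, 31, 20, 21, 22)
)
nested <- stations
nested$ts <- split(temporal[c("date", "tmax")], temporal$id)[nested$id]

# Spatial step in the nested layout: keep sites west of 145 degrees east
west <- nested[nested$long < 145, ]

# Temporal step in the long layout: mean tmax per remaining site
long  <- to_long(west)
means <- tapply(long$tmax, long$id, mean)

# Back to the nested layout, with the temporal summary attached per site
back <- to_nested(long)
back$mean_tmax <- as.numeric(means[back$id])
```

Because the spatial filter happens before `to_long()`, the long layout only contains rows for the sites that survived it, so no separate id bookkeeping is needed in this sketch.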
---
# Creating a cubble
```r
weather <- as_cubble(list(spatial = stations_sf, temporal = ts),
                     key = id, index = date, coords = c(long, lat))
```
???
* Now I'm going to show you how to create a cubble from the two data sets we have
* Here you specify the two separate objects in a list with the name `spatial` and `temporal`.
* Then you can specify the `key` and the `index` as you would when creating a `tsibble`.
* The `coords` argument needs to be specified in the order of longitude and latitude.
--
```
# cubble: id [59]: nested form [sf]
# bbox: [141.26, -39.13, 153.37, -28.97]
# temporal: date [date], tmax [dbl]
id long lat elev name wmo_id geometry ts
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <POINT [°]> <list>
1 ASN00047016 141. -34.0 43 lake victoria storage 94692 (141.2652 -34.0398) <tbl_ts>
2 ASN00047019 142. -32.4 61 menindee post office 94694 (142.4173 -32.3937) <tbl_ts>
3 ASN00048015 147. -30.0 115 brewarrina hospital 95512 (146.8651 -29.9614) <tbl_ts>
4 ASN00048027 146. -31.5 260 cobar mo 94711 (145.8294 -31.484) <tbl_ts>
5 ASN00048031 149. -29.5 145 collarenebri (albert st) 95520 (148.5818 -29.5407) <tbl_ts>
# … with 54 more rows
```
???
* This creates a cubble in the nested form.
--
- .sec[59] stations, stored in the nested form as a subclass of `sf`
- The available temporal variables are `date` and `tmax`
- Also, each temporal component in the list column is a tsibble (`tbl_ts`)
???
* The header of a cubble tells you that this data has
---
# Cubble summary (1/2)
.pull-left-larger[
<code class ='r hljs remark-code'>weather_long <- weather %>% <span style='background-color:#FFEECF'>face_temporal()</span><br>weather_long</code>
```
# cubble: date, id [59]: long form [tsibble]
# bbox: [141.26, -39.13, 153.37, -28.97]
# spatial: long [dbl], lat [dbl], elev [dbl],
# name [chr], wmo_id [dbl], geometry [POINT
# [°]]
id date tmax
<chr> <date> <dbl>
1 ASN00047016 1971-01-01 25
2 ASN00047016 1971-01-02 26.9
3 ASN00047016 1971-01-03 27.5
4 ASN00047016 1971-01-04 30
5 ASN00047016 1971-01-05 34.4
# … with 1,099,047 more rows
```
- a long form cubble as the subclass of `tsibble`
- the third row now shows the spatial variables
]
???
* We can pivot this object into the long form with `face_temporal()`
* Now the object `weather_long` is a long form cubble and it is a subclass of tsibble
* The third line in the header now shows the available spatial variables
--
.pull-right[
```r
attr(weather_long, "spatial")
```
```
Simple feature collection with 59 features and 6 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 141.2652 ymin: -39.1297 xmax: 153.3633 ymax: -28.9786
Geodetic CRS: GDA94
# A tibble: 59 × 7
# Rowwise: id
id long lat elev name wmo_id
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 ASN00047016 141. -34.0 43 lake victo… 94692
2 ASN00047019 142. -32.4 61 menindee p… 94694
3 ASN00048015 147. -30.0 115 brewarrina… 95512
4 ASN00048027 146. -31.5 260 cobar mo 94711
5 ASN00048031 149. -29.5 145 collareneb… 95520
# … with 54 more rows, and 1 more variable:
# geometry <POINT [°]>
```
]
???
* The spatial variables are stored in the `spatial` attribute, which you can see through this command.
* Here it is stored as an sf object
---
# Cubble summary (2/2)
<code class ='r hljs remark-code'>weather_back <- weather_long %>% <span style='background-color:#FFEECF'>face_spatial()</span><br>weather_back</code>
```
# cubble: id [59]: nested form [sf]
# bbox: [141.26, -39.13, 153.37, -28.97]
# temporal: date [date], tmax [dbl]
id long lat elev name wmo_id geometry ts
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <POINT [°]> <list>
1 ASN00047016 141. -34.0 43 lake victoria storage 94692 (141.2652 -34.0398) <tbl_ts>
2 ASN00047019 142. -32.4 61 menindee post office 94694 (142.4173 -32.3937) <tbl_ts>
3 ASN00048015 147. -30.0 115 brewarrina hospital 95512 (146.8651 -29.9614) <tbl_ts>
4 ASN00048027 146. -31.5 260 cobar mo 94711 (145.8294 -31.484) <tbl_ts>
5 ASN00048031 149. -29.5 145 collarenebri (albert st) 95520 (148.5818 -29.5407) <tbl_ts>
# … with 54 more rows
```
<code class ='r hljs remark-code'>identical(weather_back, weather)</code>
```
[1] TRUE
```
???
* Here is the code example of using the function `face_spatial()` on the long form cubble
* This gives back the nested cubble we had before switching to the long form
---
# Pipeline with cubble
.pull-left-narrow[
<code class ='r hljs remark-code'>cb_obj %>% <br> <span style='background-color:#888090'>{{ Your spatial analysis }}</span> %>% <br> <span style='background-color:#FFEECF'>face_temporal()</span> %>% <br> <span style='background-color:#BE7893'>{{ Your temporal analysis }}</span> %>% <br> <span style='background-color:#FFEECF'>face_spatial()</span> %>% <br> <span style='background-color:#888090'>{{ Your spatial analysis }}</span></code>
]
.pull-right-long[
<code class ='r hljs remark-code'>spatial <- stations_sf %>% <br> <span style='background-color:#888090'>{{ Your spatial analysis }}</span> <br><br>##############################<br># more subsetting step if temporal analysis<br># depends on spatial results<br><span style='background-color:#FFEECF'>sp_id <- spatial %>% pull(id)</span><br><span style='background-color:#FFEECF'>ts_subset <- ts %>% filter(id %in% sp_id)</span><br>##############################<br><br>temporal <- ts_subset %>% <br> <span style='background-color:#BE7893'>{{ Your temporal analysis }}</span> <br><br>##############################<br># more subsetting step if spatial analysis <br># depends on temporal results<br><span style='background-color:#FFEECF'>ts_id <- temporal %>% pull(id)</span><br><span style='background-color:#FFEECF'>sp_subset <- spatial %>% filter(id %in% ts_id)</span><br>##############################<br><br>sp_subset %>% <br> <span style='background-color:#888090'>{{ Your spatial analysis }}</span></code>
]
???
* Here is a syntax comparison with and without cubble
* With cubble, you can do some spatial analysis in the nested form, pivot it to the long form for some temporal analysis, and then pivot it back to the nested form for some additional spatial analysis.
* Sometimes, the spatial analysis includes extracting some interesting sites.
* Without cubble, you will need to first pull out those interesting ids, and then filter the temporal data on these sites.
* Similar steps can also happen in the temporal analysis and the spatial data needs to be updated.
* In cubble, these updates are automatically handled by `face_temporal()` and `face_spatial()` and no manual updates are needed.
* Also the cubble pipeline chains all the operations together with no intermediate objects created in the workflow.
---
class: center, inverse, middle
# Spatio-temporal analysis in cubble
## A glyph map example
???
* Some analyses use both spatial and temporal variables at the same time.
* An example of this is making glyph maps.
* Here I will first show you a toy example before rolling out to the full picture
---
# Transform a dot into a glyph
<img src="figures/glyph-steps1.png" width="3905" style="display: block; margin: auto;" />
???
* A glyph map is a transformation of temporal coordinates into the spatial coordinates, so that temporal information can be visualised on the map.
* Here I have one weather station on the map and its maximum temperature on each day in January 2020
---
# Transform a dot into a glyph
<img src="figures/glyph-steps2.png" width="3905" style="display: block; margin: auto;" />
???
* A glyph map uses linear algebra to make this transformation
* You can see here the line in the bottom right plot does not change but its coordinates have been changed to the spatial coordinates
* In a glyph map, the spatial coordinates are called the major coordinates and the temporal coordinates are the minor coordinates.
* In the world of ggplot, we need four aesthetics to make a glyph map. Here they are longitude, latitude, date, and tmax.
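
One common way to write this transformation is to rescale each minor variable onto [-1, 1] and then shift and shrink it into a small box centred on the major coordinates. The sketch below assumes that formulation; the function names, the box `width` and `height`, and the station location are hypothetical, and `geom_glyph()` may differ in its details:

```r
# Rescale a numeric vector linearly onto [-1, 1]
rescale11 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  2 * (x - rng[1]) / (rng[2] - rng[1]) - 1
}

# Map minor (temporal) coordinates into a width x height box
# centred on the major (spatial) coordinates
glyph_coords <- function(x_major, y_major, x_minor, y_minor,
                         width = 1, height = 1) {
  data.frame(
    x = x_major + rescale11(as.numeric(x_minor)) * width / 2,
    y = y_major + rescale11(y_minor) * height / 2
  )
}

# One station (hypothetical location) with five days of maximum temperature
date <- as.Date("2020-01-01") + 0:4
tmax <- c(25, 26.9, 27.5, 30, 34.4)
glyph <- glyph_coords(141.27, -34.04, date, tmax, width = 1, height = 0.5)
```

The shape of the line is preserved because both steps are linear; only its position and scale change, so the resulting `x` and `y` can be drawn directly in longitude and latitude.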
---
# Transformation to glyphmap
.pull-left[
<img src="figures/unfold.png" width="500" height="500" style="display: block; margin: auto;" />
]
.pull-right-larger[
<code class ='r hljs remark-code'>cb_glyph <- weather_long %>% unfold(long, lat)</code>
```
# cubble: date, id [59]: long form [tsibble]
# bbox: [141.26, -39.13, 153.37, -28.97]
# spatial: long [dbl], lat [dbl], elev [dbl], name [chr],
# wmo_id [dbl], geometry [POINT [°]]
id date tmax long lat
<chr> <date> <dbl> <dbl> <dbl>
1 ASN00047016 1971-01-01 25 141. -34.0
2 ASN00047016 1971-01-02 26.9 141. -34.0
3 ASN00047016 1971-01-03 27.5 141. -34.0
4 ASN00047016 1971-01-04 30 141. -34.0
5 ASN00047016 1971-01-05 34.4 141. -34.0
# … with 1,099,047 more rows
```
```r
cb_glyph %>%
  ggplot(aes(x_major = long, y_major = lat,
             x_minor = date, y_minor = tmax)) +
  geom_glyph()
```
]
???
* To work with ggplot2, all the four variables need to be in the same table.
* In cubble you can use the function `unfold()` to relocate spatial variables into the long form.
* Here I have the diagram, cube, and the code to demonstrate this function.
* This is what the data looks like after the unfold
* After this, the data can be piped into ggplot with the four aesthetics needed for `geom_glyph()` to draw the glyph map.
---
# Example: Australian temperature comparison
<img src="figures/temperature-workflow.png" width="90%" height="530" style="display: block; margin: auto;" />
???
* Now here is a full example that combines everything I have introduced in this talk, to analyse historical temperature data in Australia.
* We have maximum temperature dating back to the 70s, which allows us to compare the maximum temperature between now and then, and also across space.
* The diagram here shows each step needed in this analysis
* The data I have shown you in this talk is a subset from all the weather stations in Australia and there are hundreds of them.
* The first step here is to narrow it down to those in New South Wales and Victoria
* Then we pivot it into the long form to select a historical segment (from 1971 to 1975) and a recent segment (from 2016 to 2020) in step 2.
* In step 3, still in the long form, maximum temperature is summarised into monthly averages in each period
* A quick check on the number of observations reveals that some stations don't have temperature recorded in both periods - look at id 4
* We remove them in the nested form in step 4
* In step 5 and 6 we unfold longitude and latitude with temporal variables and make the glyph map with `geom_glyph()`
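
Steps 2 to 4 can be sketched in base R on a plain data frame. The two stations and their constant temperatures are hypothetical toy data, with station `b` deliberately missing the historical period; the cubble pipeline does the equivalent with `face_temporal()` and `face_spatial()`:

```r
# Hypothetical toy data: "a" has both periods, "b" only the recent one
dates_old <- seq(as.Date("1971-01-01"), as.Date("1975-12-31"), by = "month")
dates_new <- seq(as.Date("2016-01-01"), as.Date("2020-12-31"), by = "month")
temporal <- rbind(
  data.frame(id = "a", date = c(dates_old, as.Date("1990-06-15"), dates_new),
             tmax = 25),
  data.frame(id = "b", date = dates_new, tmax = 30)
)

# Step 2: label the two periods and drop everything outside them
year <- as.integer(format(temporal$date, "%Y"))
temporal$month  <- as.integer(format(temporal$date, "%m"))
temporal$period <- ifelse(year >= 1971 & year <= 1975, "1971-1975",
                   ifelse(year >= 2016 & year <= 2020, "2016-2020",
                          NA_character_))
temporal <- temporal[!is.na(temporal$period), ]

# Step 3: monthly average of tmax within each station and period
monthly <- aggregate(tmax ~ id + period + month, data = temporal, FUN = mean)

# Step 4: keep only the stations that have records in both periods
n_periods <- tapply(monthly$period, monthly$id,
                    function(p) length(unique(p)))
keep <- names(n_periods)[n_periods == 2]
monthly <- monthly[monthly$id %in% keep, ]
```

After the last step only station `a` remains, with one row per month and period.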
---
# Example: Australian temperature comparison
.pull-left-larger[
<br>
<code class ='r hljs remark-code'>tmax <- DATA %>% <br> <span style='background-color:#888090'>{{filter NSW & VIC stations}}</span> %>% <br> <span style='background-color:#FFEECF'>face_temporal()</span> %>% <br> <span style='background-color:#BE7893'>{{group by month and period (71-75, 16-20)}}</span> %>% <br> <span style='background-color:#BE7893'>{{summarise into monthly average}}</span> %>% <br> <span style='background-color:#FFEECF'>face_spatial()</span> %>% <br> <span style='background-color:#888090'>{{filter out sites with no historical record}} </span>%>% <br> <span style='background-color:#FFEECF'>face_temporal()</span> %>% <br> <span style='background-color:#FFEECF'>unfold(long, lat)</span><br><br>tmax %>% <br> ggplot(aes(x_minor = month, y_minor = tmax, <br> x_major = long, y_major = lat)) + <br> <span style='background-color:#FFEECF'>geom_glyph()</span> + <br> ...</code>
]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-32-1.png" width="504" style="display: block; margin: auto;" />
]
???
* This is the code version of the diagram illustration
* Functions highlighted in light yellow are developed in the cubble package
* Spatial operations are highlighted in purple and temporal ones in pink
* On the top left of the plot is a more annotated version of the glyph for one specific station Cobar.
* Australia has a U-shaped temperature curve and if you look carefully, the inland NSW stations have a noticeably higher average maximum temperature in January in recent years.
---
class: inverse, middle
# More you can do with cubble
* Pick up unmatched entries from the spatial and temporal inputs
???
There are more things cubble can do
--
* Merge two data sources by spatial and temporal similarities
--
* Handle (spatial) hierarchical structure of sites
--
* Input data can be of various forms, including a single combined data frame and netCDF
---
background-image: url("figures/3cubes-in-one.png")
background-position: 90% 70%
background-size: 280px, 280px
class: inverse, middle
# Additional Information
Slides created via the R packages [.inverse-code[xaringan]](https://github.com/yihui/xaringan) and [.inverse-code[xaringanthemer]](https://github.com/gadenbuie/xaringanthemer), available at
.inverse-code[https://sherryzhang-user2022.netlify.app]
<br>
.inverse-code[