This document is a solutions guide to the chapter exercises in R for Data Science by Hadley Wickham. I created it while in a study group with Jason Baik. Inspiration for some answers also came from jarnold; thanks to him!
No exercises
No exercises
library(tidyverse)
ggplot()
nrow(mpg)
## [1] 234
ncol(mpg)
## [1] 11
?mpg
ggplot( data = mpg ) +
geom_point( mapping = aes( x = cyl, y = hwy ) )
ggplot( data = mpg ) +
geom_point( mapping = aes( x = class, y = drv ) )
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue" ) )
Because assigning a color within aes() maps the levels of a variable to colors. Here "blue" is not a variable but a constant string, so ggplot2 treats it as a one-level variable and assigns it the default color rather than making the points blue. To make all the points blue, one would set the color outside aes() within geom_point(), which doesn't require a mapping and just passes the property "blue" to all the data points.
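For example, moving color outside aes() colors every point blue:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")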
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31
## 4 audi a4 2.0 2008 4 auto(av) f 21 30
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26
## 7 audi a4 3.1 2008 6 auto(av) f 18 27
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
## 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
Underneath the variable names, a class indicator is given. The categorical variables are denoted by "chr", meaning character: manufacturer, model, trans, drv, fl, and class. The continuous variables are denoted by "int" and "dbl", meaning integer and double, respectively: displ, year, cyl, cty, and hwy. They are continuous because consecutive values are ordered (smaller or larger), while the categorical variables have no such relation.
#Color
# categorical
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class ) )
#continuous
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cyl ) )
#Size
# categorical
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class ) )
#continuous
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = cyl ) )
#Shape
# categorical
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class ) )
#continuous-error
#ggplot(data = mpg) +
# geom_point(mapping = aes(x = displ, y = hwy, shape = cyl ) )
For color aesthetics, categorical variables are given discrete colors mapped to their levels, whereas continuous variables are assigned colors along a gradient, with lighter or darker tones indicating how large or small the value is. For size aesthetics, assigning a categorical variable doesn't make sense because increasing size implies an ordering that unordered categories don't have, so size only makes sense for continuous (or ordered) variables. For shape aesthetics, only up to 6 discrete levels are assigned shapes because more than 6 is hard to interpret, and assigning a continuous variable gives an error because shapes have no property for indicating how similar two values are.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = class, color = class ) )
You're essentially representing the same variable twice, which is redundant.
?geom_point
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class ), shape = 21, stroke = 1 )
The stroke aesthetic controls the width of the shape's border, which works for shapes that have a border (shapes 21-24); here it is passed as a property to geom_point().
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = displ<5 ))
displ < 5 in actuality divides the observations into two groups (TRUE and FALSE), which provides its own mapping to the observations. So you would get two colors, two shapes, etc., based on how the condition divides the data.
ggplot(data = mpg) +
geom_point(mapping = aes(x = cty, y = hwy)) +
facet_wrap(~ displ)
You'll get one panel for each distinct value of the continuous variable, which is usually far too many panels to be useful.
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl)) #+
#facet_grid(drv ~ cyl)
The empty cells in the facet_grid() plot correspond to combinations of drv and cyl that have no observations; they convey the same information as the absence of a point at that (drv, cyl) position in the plot above.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
The above plots show displ vs. hwy, faceted either by the drv classes in the rows or by the cyl classes in the columns.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color=class)) #+
#facet_wrap(~ class, nrow = 2)
The advantage of faceting by class over coloring the points by class is that a single panel showing all the points may be too cluttered, especially with a large dataset and many classes. It may be worth showing each class in a different color in the same panel if the dataset is smaller and there aren't that many classes to distinguish.
?facet_wrap
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow=1)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(~ class)
nrow establishes the number of rows and ncol the number of columns to display when plotting each facet of the variable. Other options to control the layout of the individual panels include scales (frees or fixes the x and y scales across facets), shrink (shrinks scales to fit the output of the statistics, not the raw data), labeller (a function that takes one data frame of labels and returns a list or data frame of character vectors, changing the labelling on the facets; see ?labeller for options), and switch (moves the facet labels from top to bottom or from right to left of the plots). facet_grid() doesn't have nrow and ncol options because its layout is determined by the number of levels of the row and column faceting variables. facet_wrap() allows one to make better use of screen space.
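As a quick sketch of the scales option (reusing the mpg plot from above), freeing the y scale gives each panel its own range:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, scales = "free_y")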
Because if the variable with more levels went in the rows, you'd stack many panels vertically, which is a poor use of a typically wider-than-tall screen. facet_grid() works best when each faceting variable has only a few levels.
bar chart: geom_bar()
histogram: geom_histogram()
line chart: geom_line()
area chart: geom_area()
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
I predict this will show a scatterplot of displ vs. hwy with smooth lines through the points, without the gray standard-error bands around the lines, and with drive type mapped to color. Also there are three different lines, one per drv level (the mapping in ggplot() is global, so it passes on to geom_point() and geom_smooth()).
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
Why do you think I used it earlier in the chapter?
show.legend = FALSE removes the legend key from the side of the plot, which can be done to reduce clutter when the grouping is already clear.
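For example, this suppresses the legend for the smooth layer even though drv is mapped to colour:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, colour = drv), show.legend = FALSE)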
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
No, because both have the same mappings passed on to geom_point() and geom_smooth() (either globally or locally), so they produce identical plots.
ggplot(data = mpg, mapping = aes(x=displ,y = hwy))+
geom_point()+
geom_smooth(se=FALSE)
ggplot(data = mpg, mapping = aes(x=displ,y = hwy))+
geom_point()+
geom_smooth(se=FALSE, mapping = aes(x=displ,y = hwy, group = drv))
ggplot(data = mpg, mapping = aes(x=displ,y = hwy,color = drv))+
geom_point()+
geom_smooth(se=FALSE, mapping = aes(x=displ,y = hwy, group = drv))
ggplot(data = mpg, mapping = aes(x=displ,y = hwy))+
geom_point(aes(x=displ,y = hwy,color = drv))+
geom_smooth(se=FALSE, mapping = aes(x=displ,y = hwy))
ggplot(data = mpg, mapping = aes(x=displ,y = hwy))+
geom_point(aes(x=displ,y = hwy,color = drv))+
geom_smooth(se=FALSE, mapping = aes(x=displ,y = hwy, linetype = drv))
ggplot(data = mpg, mapping = aes(x=displ,y = hwy))+
geom_point(aes(x=displ,y = hwy),shape=21,size=5,fill="white",color="white")+
geom_point(aes(x=displ,y = hwy,fill = drv, color = drv),shape=21,size=2)
The default geom for stat_summary() is geom_pointrange(). For geom_pointrange(), the default stat is "identity", so in order to duplicate the previous plot we need to change the stat to "summary" and set the min, max, and midpoint functions to reflect the same summaries as before.
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.ymin = min,
fun.ymax = max,
fun.y = median
)
ggplot(data = diamonds) +
geom_pointrange(
mapping = aes(x = cut, y = depth),
stat = "summary",
fun.ymin = min,
fun.ymax = max,
fun.y = median
)
geom_bar() makes the height of the bar proportional to the number of cases in each group (or, if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values already in the data, use geom_col() instead.
ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
geom_col()
stat_smooth() calculates:
y: predicted value
ymin: lower value of the confidence interval
ymax: upper value of the confidence interval
se: standard error
There are parameters such as method, which determines the model used to calculate the predictions and confidence interval, and other arguments that are passed on to that model.
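For instance, a minimal sketch switching the fit to a linear model via the method parameter:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(method = "lm")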
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
If group is not set to 1, then all the bars have prop == 1. The function geom_bar assumes that the groups are equal to the x values, since the stat computes the counts within the group.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..,group=1))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..,group=1))
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
A lot of points aren’t shown here because they overlap. Using geom_jitter() allows you to see them all.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(width=1)
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(width=10)
The amount of jitter is controlled by the width argument, which increases the horizontal distance (noise) added to the points.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_count()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter()
It seems geom_count() increases the size of a point when more points overlap at that location, which is akin to estimating the density of points there. geom_jitter() just makes all the points visible and keeps them the same size.
The default is for the boxplots to be non-overlapping, i.e. dodged.
ggplot(data = mpg, mapping = aes(x = drv, y = hwy, color = class)) +
geom_boxplot(position="dodge")
In contrast, we can have them overlapping by using identity.
ggplot(data = mpg, mapping = aes(x = drv, y = hwy, color = class)) +
geom_boxplot(position="identity")
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), width = 1)
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), width=1)+coord_polar(theta = "y")
?labs
labs() lets you add or modify the plot's labels, such as the title, axis labels, and legend titles.
nz<-map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_map()
coord_map() applies a true map projection (Mercator by default), which changes the grid lines and shrinks the map a tiny bit; coord_quickmap() is a faster approximation that preserves straight lines.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
The abline shown with the scatter of cty vs. hwy tells me that highway mpg is consistently higher than city mpg, and the two are positively correlated. geom_abline() draws the y = x line; if a point were on that line, its highway and city mpg would be the same. coord_fixed() fixes the ratio between the physical representation of data units on the axes; the ratio is the number of units on the y-axis equivalent to one unit on the x-axis. It also ensures that the abline is at a 45-degree angle, which makes it easy to compare the highway and city mileage against what they would be if they were exactly the same.
my_variable <- 10
#my_varıable #Error: object 'my_varıable' not found
Because my_variable (with an ordinary i) is different from my_varıable (with a dotless ı), which was never assigned; R throws an error since the latter is not an object in the environment.
library(tidyverse)
ggplot(dota = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
fliter(mpg, cyl = 8)
filter(diamond, carat > 3)
‘dota’ should be ‘data’, ‘fliter’ should be ‘filter’, ‘=’ should be ‘==’, and ‘diamond’ should be ‘diamonds’. It should be:
library(tidyverse)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy) )
filter(mpg, cyl == 8)
## # A tibble: 70 x 11
## manufacturer model displ year cyl trans drv
## <chr> <chr> <dbl> <int> <int> <chr> <chr>
## 1 audi a6 quattro 4.2 2008 8 auto(s6) 4
## 2 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r
## 3 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r
## 4 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r
## 5 chevrolet c1500 suburban 2wd 5.7 1999 8 auto(l4) r
## 6 chevrolet c1500 suburban 2wd 6.0 2008 8 auto(l4) r
## 7 chevrolet corvette 5.7 1999 8 manual(m6) r
## 8 chevrolet corvette 5.7 1999 8 auto(l4) r
## 9 chevrolet corvette 6.2 2008 8 manual(m6) r
## 10 chevrolet corvette 6.2 2008 8 auto(s6) r
## # ... with 60 more rows, and 4 more variables: cty <int>, hwy <int>,
## # fl <chr>, class <chr>
filter(diamonds, carat > 3)
## # A tibble: 32 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 3.01 Premium I I1 62.7 58 8040 9.10 8.97 5.67
## 2 3.11 Fair J I1 65.9 57 9823 9.15 9.02 5.98
## 3 3.01 Premium F I1 62.2 56 9925 9.24 9.13 5.73
## 4 3.05 Premium E I1 60.9 58 10453 9.26 9.25 5.66
## 5 3.02 Fair I I1 65.2 56 10577 9.11 9.02 5.91
## 6 3.01 Fair H I1 56.1 62 10761 9.54 9.38 5.31
## 7 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38
## 8 3.24 Premium H I1 62.1 58 12300 9.44 9.40 5.85
## 9 3.22 Ideal I I1 62.6 55 12545 9.49 9.42 5.92
## 10 3.50 Ideal H I1 62.8 57 12587 9.65 9.59 6.03
## # ... with 22 more rows
library(nycflights13)
library(tidyverse)
#glimpse(flights)
filter(flights, arr_delay > 119 & arr_delay < 121)
## # A tibble: 166 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 2 1806 1629 97 2008
## 2 2013 1 10 1801 1619 102 1923
## 3 2013 1 13 1958 1836 82 2231
## 4 2013 1 13 2145 2005 100 4
## 5 2013 1 14 1652 1445 127 1806
## 6 2013 1 15 1603 1446 77 1957
## 7 2013 1 21 1957 1815 102 2237
## 8 2013 1 22 1550 1420 90 1820
## 9 2013 1 23 1031 825 126 1337
## 10 2013 1 23 1805 1619 106 1926
## # ... with 156 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
filter(flights, dest == 'HOU' | dest == 'IAH')
## # A tibble: 9,313 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 623 627 -4 933
## 4 2013 1 1 728 732 -4 1041
## 5 2013 1 1 739 739 0 1104
## 6 2013 1 1 908 908 0 1228
## 7 2013 1 1 1028 1026 2 1350
## 8 2013 1 1 1044 1045 -1 1352
## 9 2013 1 1 1114 900 134 1447
## 10 2013 1 1 1205 1200 5 1503
## # ... with 9,303 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
filter(flights, carrier == "UA" | carrier == 'AA' | carrier == 'DA')
## # A tibble: 91,394 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 554 558 -4 740
## 5 2013 1 1 558 600 -2 753
## 6 2013 1 1 558 600 -2 924
## 7 2013 1 1 558 600 -2 923
## 8 2013 1 1 559 600 -1 941
## 9 2013 1 1 559 600 -1 854
## 10 2013 1 1 606 610 -4 858
## # ... with 91,384 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
filter(flights, month == 7 | month == 8 | month == 9)
## # A tibble: 86,326 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 1 1 2029 212 236
## 2 2013 7 1 2 2359 3 344
## 3 2013 7 1 29 2245 104 151
## 4 2013 7 1 43 2130 193 322
## 5 2013 7 1 44 2150 174 300
## 6 2013 7 1 46 2051 235 304
## 7 2013 7 1 48 2001 287 308
## 8 2013 7 1 58 2155 183 335
## 9 2013 7 1 100 2146 194 327
## 10 2013 7 1 100 2245 135 337
## # ... with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
filter(flights,arr_delay > 120 & dep_delay <= 0)
## # A tibble: 29 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 27 1419 1420 -1 1754
## 2 2013 10 7 1350 1350 0 1736
## 3 2013 10 7 1357 1359 -2 1858
## 4 2013 10 16 657 700 -3 1258
## 5 2013 11 1 658 700 -2 1329
## 6 2013 3 18 1844 1847 -3 39
## 7 2013 4 17 1635 1640 -5 2049
## 8 2013 4 18 558 600 -2 1149
## 9 2013 4 18 655 700 -5 1213
## 10 2013 5 22 1827 1830 -3 2217
## # ... with 19 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
filter(flights, dep_delay > 60 & air_time > 30)
## # A tibble: 26,185 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 811 630 101 1047
## 2 2013 1 1 826 715 71 1136
## 3 2013 1 1 848 1835 853 1001
## 4 2013 1 1 957 733 144 1056
## 5 2013 1 1 1114 900 134 1447
## 6 2013 1 1 1120 944 96 1331
## 7 2013 1 1 1301 1150 71 1518
## 8 2013 1 1 1337 1220 77 1649
## 9 2013 1 1 1400 1250 70 1645
## 10 2013 1 1 1505 1310 115 1638
## # ... with 26,175 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
filter(flights, hour >= 0 & hour <= 6)
## # A tibble: 27,905 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 27,895 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
between() is a shortcut for an inclusive range, implemented efficiently in C++. So instead of filter(flights, hour >= 0 & hour <= 6), we can use between(flights$hour, 0, 6), which gives TRUE for rows of flights in the range. Or, to simplify 1g, do filter(flights, between(dep_time, 0, 600)). (I don't think it simplifies it much.)
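For reference, a runnable version of both suggestions (the first filters on the scheduled hour, the second on the actual departure time, so the row sets differ slightly):
filter(flights, between(hour, 0, 6))
filter(flights, between(dep_time, 0, 600))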
filter(flights, is.na(dep_time) )
## # A tibble: 8,255 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 NA 1630 NA NA
## 2 2013 1 1 NA 1935 NA NA
## 3 2013 1 1 NA 1500 NA NA
## 4 2013 1 1 NA 600 NA NA
## 5 2013 1 2 NA 1540 NA NA
## 6 2013 1 2 NA 1620 NA NA
## 7 2013 1 2 NA 1355 NA NA
## 8 2013 1 2 NA 1420 NA NA
## 9 2013 1 2 NA 1321 NA NA
## 10 2013 1 2 NA 1545 NA NA
## # ... with 8,245 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Along with departure time, the arrival time, delays, and air time are missing. But the scheduled times are present, so it seems these flights were scheduled but never actually flew, i.e. they were cancelled.
NA^0
## [1] 1
NA | TRUE
## [1] TRUE
FALSE & NA
## [1] FALSE
NA * 0
## [1] NA
Anything to the power 0 is 1, so NA^0 is 1. NA | TRUE is TRUE because the result is TRUE regardless of the missing value, and FALSE & NA is FALSE because anything AND FALSE is always FALSE. By contrast, the value of the missing element matters in NA | FALSE and NA & TRUE, so those stay missing (NA). And the reason NA * 0 is not 0 is that x * 0 is NaN, not 0, when x is Inf or -Inf, so the result depends on the missing value.
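To confirm the two cases where the missing value does matter:
NA | FALSE
## [1] NA
NA & TRUE
## [1] NA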
for example:
arrange(flights, desc(is.na(dep_time)), dep_time)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 NA 1630 NA NA
## 2 2013 1 1 NA 1935 NA NA
## 3 2013 1 1 NA 1500 NA NA
## 4 2013 1 1 NA 600 NA NA
## 5 2013 1 2 NA 1540 NA NA
## 6 2013 1 2 NA 1620 NA NA
## 7 2013 1 2 NA 1355 NA NA
## 8 2013 1 2 NA 1420 NA NA
## 9 2013 1 2 NA 1321 NA NA
## 10 2013 1 2 NA 1545 NA NA
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 6 27 959 1900 899 1236
## 9 2013 7 22 2257 759 898 121
## 10 2013 12 5 756 1700 896 1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
arrange(flights, dep_delay)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 12 7 2040 2123 -43 40
## 2 2013 2 3 2022 2055 -33 2240
## 3 2013 11 10 1408 1440 -32 1549
## 4 2013 1 11 1900 1930 -30 2233
## 5 2013 1 29 1703 1730 -27 1947
## 6 2013 8 9 729 755 -26 1002
## 7 2013 10 23 1907 1932 -25 2143
## 8 2013 3 30 2030 2055 -25 2213
## 9 2013 3 2 1431 1455 -24 1601
## 10 2013 5 5 934 958 -24 1225
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
arrange(flights, air_time)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 16 1355 1315 40 1442
## 2 2013 4 13 537 527 10 622
## 3 2013 12 6 922 851 31 1021
## 4 2013 2 3 2153 2129 24 2247
## 5 2013 2 5 1303 1315 -12 1342
## 6 2013 2 12 2123 2130 -7 2211
## 7 2013 3 2 1450 1500 -10 1547
## 8 2013 3 8 2026 1935 51 2131
## 9 2013 3 18 1456 1329 87 1533
## 10 2013 3 19 2226 2145 41 2305
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
arrange(flights,desc(distance))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 857 900 -3 1516
## 2 2013 1 2 909 900 9 1525
## 3 2013 1 3 914 900 14 1504
## 4 2013 1 4 900 900 0 1516
## 5 2013 1 5 858 900 -2 1519
## 6 2013 1 6 1019 900 79 1558
## 7 2013 1 7 1042 900 102 1620
## 8 2013 1 8 901 900 1 1504
## 9 2013 1 9 641 900 1301 1242
## 10 2013 1 10 859 900 -1 1449
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
arrange(flights,distance)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 27 NA 106 NA NA
## 2 2013 1 3 2127 2129 -2 2222
## 3 2013 1 4 1240 1200 40 1333
## 4 2013 1 4 1829 1615 134 1937
## 5 2013 1 4 2128 2129 -1 2218
## 6 2013 1 5 1155 1200 -5 1241
## 7 2013 1 6 2125 2129 -4 2224
## 8 2013 1 7 2124 2129 -5 2212
## 9 2013 1 8 2127 2130 -3 2304
## 10 2013 1 9 2126 2129 -3 2217
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
I could reorder the variables, select a subset of them, or remove variables.
select(flights, dep_time, dep_time)
## # A tibble: 336,776 x 1
## dep_time
## <int>
## 1 517
## 2 533
## 3 542
## 4 544
## 5 554
## 6 554
## 7 555
## 8 557
## 9 557
## 10 558
## # ... with 336,766 more rows
It ignores the duplicate: select() includes a variable only once, no matter how many times it is listed.
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars) )
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ... with 336,766 more rows
one_of(), when used in select(), selects the variables named in the character vector vars that are found in flights.
select(flights, contains("TIME"))
## # A tibble: 336,776 x 6
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <int> <int> <int> <int> <dbl>
## 1 517 515 830 819 227
## 2 533 529 850 830 227
## 3 542 540 923 850 160
## 4 544 545 1004 1022 183
## 5 554 600 812 837 116
## 6 554 558 740 728 150
## 7 555 600 913 854 158
## 8 557 600 709 723 53
## 9 557 600 838 846 140
## 10 558 600 753 745 138
## # ... with 336,766 more rows, and 1 more variables: time_hour <dttm>
This command works: any variable in flights whose name contains "time" is selected into the new table, because matching is case-insensitive by default.
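A quick check of that default: contains() has an ignore.case argument, and requiring an exact-case match on "TIME" selects no columns:
select(flights, contains("TIME", ignore.case = FALSE))
## # A tibble: 336,776 x 0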
transmute(flights,
dep_time_since_midnight = (dep_time %% 100) + ((dep_time %/% 100) * 60),
sched_dep_time_since_midnight = (sched_dep_time %% 100) + ((sched_dep_time %/% 100) * 60)
)
## # A tibble: 336,776 x 2
## dep_time_since_midnight sched_dep_time_since_midnight
## <dbl> <dbl>
## 1 317 315
## 2 333 329
## 3 342 340
## 4 344 345
## 5 354 360
## 6 354 358
## 7 355 360
## 8 357 360
## 9 357 360
## 10 358 360
## # ... with 336,766 more rows
I expect air_time to equal arr_time - dep_time.
transmute(flights,
air_time,
tmp = arr_time - dep_time
)
## # A tibble: 336,776 x 2
## air_time tmp
## <dbl> <int>
## 1 227 313
## 2 227 317
## 3 160 381
## 4 183 460
## 5 116 258
## 6 150 186
## 7 158 358
## 8 53 152
## 9 140 281
## 10 138 195
## # ... with 336,766 more rows
Actually, arr_time and dep_time are stored in HHMM format, different units than air_time (minutes), so let me convert.
transmute(flights,
air_time,
arr_minutes = (arr_time %% 100) + ((arr_time %/% 100) * 60),
dep_minutes = (dep_time %% 100) + ((dep_time %/% 100) * 60),
arr_dep_minutes_diff = arr_minutes - dep_minutes
)
## # A tibble: 336,776 x 4
## air_time arr_minutes dep_minutes arr_dep_minutes_diff
## <dbl> <dbl> <dbl> <dbl>
## 1 227 510 317 193
## 2 227 530 333 197
## 3 160 563 342 221
## 4 183 604 344 260
## 5 116 492 354 138
## 6 150 460 354 106
## 7 158 553 355 198
## 8 53 429 357 72
## 9 140 518 357 161
## 10 138 473 358 115
## # ... with 336,766 more rows
Interesting: arr_dep_minutes_diff should match air_time but it's a bit inflated. There might be a disconnect between actual time in the air and the logging of departure and arrival times (e.g. taxi time, or arrival times recorded in local time zones).
mutate(flights,
dep_delay_rank = min_rank(-dep_delay)) %>%
arrange(dep_delay_rank)
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 6 27 959 1900 899 1236
## 9 2013 7 22 2257 759 898 121
## 10 2013 12 5 756 1700 896 1058
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, dep_delay_rank <int>
1:3
## [1] 1 2 3
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
1:3 + 1:10
## [1] 2 4 6 5 7 9 8 10 12 11
Because 1:3 is shorter than 1:10, 1:3 is recycled and the sums go 1+1, 2+2, 3+3, 1+4, 2+5, 3+6, etc. Since 10 is not a multiple of 3, R also warns that the longer object length is not a multiple of the shorter object length.
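Recycling is silent when the longer length is a multiple of the shorter, for example:
1:2 + 1:6
## [1] 2 4 4 6 6 8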
R provides:
cos(x) sin(x) tan(x)
acos(x) asin(x) atan(x) atan2(y, x)
cospi(x) sinpi(x) tanpi(x)
From ?Trig: they respectively compute the cosine, sine, tangent, arc-cosine, arc-sine, arc-tangent, and the two-argument arc-tangent.
I would say arrival delay is more important: if I depart late but arrive on time, that's OK, since what I usually care about is adhering to when my arrival is supposed to be.
So flights can have an arrival delay that is modulated by the departure delay, and a delay can also propagate to consecutive flights' arrivals and departures. The delay characteristics can vary (e.g. a mix of large and small delays) or be consistent (e.g. a typical amount of delay). The scenarios above range from small to large amounts of delay, but either have small variation or are skewed like the last scenario. I would also look at whether the delay of a flight correlates with a delay in a consecutive flight, which airlines account for the most delays, and whether some airports have more or fewer delays, especially if one is choosing to fly into, say, LGA, JFK, or Newark in the NYC area; that could advise people on choosing flights. Also, is the arrival delay correlated with the departure delay?
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>% count(dest)
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
not_cancelled %>% count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ... with 4,027 more rows
I can group by the airport (dest) and sum the number of non-NA values instead of using count().
not_cancelled %>%
group_by(dest) %>%
summarise(n = sum(!is.na(dest)))
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
Since count() is a shorthand for group_by() and tally(), I can just break up the command.
not_cancelled %>%
group_by(tailnum) %>%
tally(wt = distance)
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ... with 4,027 more rows
There could be flights that depart but never arrive, which could mean a diverted or cancelled flight; however, a flight that never departs will never arrive. I would say arr_delay is the more important indicator: if a flight arrived it must have departed, but not the other way around. For making an optimal cancelled-flight table, I would just do
flights %>%
filter(is.na(arr_delay))
## # A tibble: 9,430 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 1525 1530 -5 1934
## 2 2013 1 1 1528 1459 29 2002
## 3 2013 1 1 1740 1745 -5 2158
## 4 2013 1 1 1807 1738 29 2251
## 5 2013 1 1 1939 1840 59 29
## 6 2013 1 1 1952 1930 22 2358
## 7 2013 1 1 2016 1930 46 NA
## 8 2013 1 1 NA 1630 NA NA
## 9 2013 1 1 NA 1935 NA NA
## 10 2013 1 1 NA 1500 NA NA
## # ... with 9,420 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
The extra is.na(dep_delay) isn’t necessary.
So we want to group by carrier. And since we want to take into account the airport (dest), we can group by dest and carrier.
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
tmp <- not_cancelled %>% group_by(carrier) %>% summarise( "mean_delay" = mean(arr_delay))
ggplot( data = tmp , mapping = aes(x = carrier, y = mean_delay) ) +
geom_bar(stat="identity")
tmp <- not_cancelled %>% group_by(carrier,dest) %>% summarise( "mean_delay" = mean(arr_delay))
ggplot( data = tmp , mapping = aes(x = factor(1), y = mean_delay, fill = dest) ) +
geom_bar(stat="identity") +
facet_wrap(~carrier) +
theme(
legend.position = "none"
)
The bar plot shows the mean delay time for each carrier, clearly distinguishing those with larger average delays from those that tend toward early arrivals.
We see that some carriers fly to many more airports. We should take that into account.
tmp <- not_cancelled %>%
group_by(carrier) %>%
summarise( "mean_delay" = mean(arr_delay),
"num_flights" = n(),
"norm" = mean_delay / num_flights)
ggplot( data = tmp , mapping = aes(x = carrier, y = norm) ) +
geom_bar(stat="identity")
There are multiple and better ways to disentangle worse airport/carrier delays. See http://fivethirtyeight.com/features/the-best-and-worst-airlines-airports-and-flights-summer-2015-update/
not_cancelled %>% count(dest)
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
not_cancelled %>% count(dest,sort=T)
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ATL 16837
## 2 ORD 16566
## 3 LAX 16026
## 4 BOS 15022
## 5 MCO 13967
## 6 CLT 13674
## 7 SFO 13173
## 8 FLL 11897
## 9 MIA 11593
## 10 DCA 9111
## # ... with 94 more rows
The sort argument of count() sorts the result from largest to smallest count. One might use this to quickly see the most frequent values without a separate arrange() call.
When you combine the mutate and filtering functions with groupings, you operate on the grouped data as opposed to the entire data frame.
not_cancelled %>%
group_by(tailnum) %>%
summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
ungroup() %>%
filter(rank(desc(arr_delay)) <= 1)
## # A tibble: 1 x 2
## tailnum arr_delay
## <chr> <dbl>
## 1 N844MH 320
not_cancelled %>%
group_by(hour) %>%
summarise( mean_arr_delay = mean(arr_delay, na.rm=T) ) %>%
ungroup() %>%
arrange(mean_arr_delay)
## # A tibble: 19 x 2
## hour mean_arr_delay
## <dbl> <dbl>
## 1 7 -5.3044716
## 2 5 -4.7969072
## 3 6 -3.3844854
## 4 9 -1.4514074
## 5 8 -1.1132266
## 6 10 0.9539401
## 7 11 1.4819300
## 8 12 3.4890104
## 9 13 6.5447397
## 10 14 9.1976501
## 11 23 11.7552783
## 12 15 12.3241920
## 13 16 12.5976412
## 14 18 14.7887244
## 15 22 15.9671618
## 16 17 16.0402670
## 17 19 16.6558736
## 18 20 16.6761098
## 19 21 18.3869371
You want to leave as early as possible.
not_cancelled %>%
group_by(dest) %>%
summarise( sum_arr_delay = sum(arr_delay) )
## # A tibble: 104 x 2
## dest sum_arr_delay
## <chr> <dbl>
## 1 ABQ 1113
## 2 ACK 1281
## 3 ALB 6018
## 4 ANC -20
## 5 ATL 190260
## 6 AUS 14514
## 7 AVL 2089
## 8 BDL 2904
## 9 BGR 2874
## 10 BHM 4540
## # ... with 94 more rows
not_cancelled %>%
group_by(flight) %>%
transmute( prop_arr_delay = arr_delay / sum(arr_delay) )
## # A tibble: 327,346 x 2
## # Groups: flight [3,835]
## flight prop_arr_delay
## <int> <dbl>
## 1 1545 0.186440678
## 2 1714 0.024183797
## 3 1141 0.036789298
## 4 725 -0.193548387
## 5 461 0.010521886
## 6 1696 -0.122448980
## 7 507 0.005950517
## 8 5708 -0.121739130
## 9 79 0.021390374
## 10 301 -0.002056555
## # ... with 327,336 more rows
I also computed an observed vs. expected ratio using the median air time. I'm sure there are better ways to find outliers.
flights %>%
filter(!is.na(air_time)) %>%
group_by(dest) %>%
mutate( med_air_time = median(air_time),
o_vs_e = (air_time - med_air_time) / med_air_time,
air_time_diff = air_time - min(air_time) ) %>%
arrange(desc(air_time_diff)) %>%
select(air_time, o_vs_e, air_time_diff, dep_time, sched_dep_time, arr_time, sched_arr_time) %>%
head(15)
## # A tibble: 15 x 8
## # Groups: dest [8]
## dest air_time o_vs_e air_time_diff dep_time sched_dep_time arr_time
## <chr> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 SFO 490 0.4202899 195 1727 1730 2242
## 2 LAX 440 0.3414634 165 1812 1815 2302
## 3 EGE 382 0.5098814 163 1806 1700 2253
## 4 DEN 331 0.4711111 149 1513 1507 1914
## 5 LAX 422 0.2865854 147 1814 1815 2240
## 6 LAS 399 0.3255814 143 2142 1729 143
## 7 SFO 438 0.2695652 143 1727 1730 2206
## 8 SAN 413 0.2707692 134 1646 1620 2107
## 9 HNL 695 0.1282468 133 1337 1335 1937
## 10 SFO 426 0.2347826 131 1746 1745 2225
## 11 SNA 405 0.2310030 131 1820 1819 2242
## 12 LAX 405 0.2347561 130 1814 1815 2240
## 13 LAX 404 0.2317073 129 1655 1655 2110
## 14 HNL 691 0.1217532 129 853 900 1542
## 15 LAX 403 0.2286585 128 1113 1112 1515
## # ... with 1 more variables: sched_arr_time <int>
not_cancelled %>%
group_by(dest, carrier) %>%
count(carrier) %>%
filter(n >= 2) %>%
group_by(carrier) %>%
count(sort = TRUE)
## # A tibble: 16 x 2
## # Groups: carrier [16]
## carrier nn
## <chr> <int>
## 1 EV 60
## 2 9E 45
## 3 UA 43
## 4 B6 42
## 5 DL 36
## 6 MQ 20
## 7 AA 19
## 8 WN 11
## 9 US 5
## 10 VX 4
## 11 FL 3
## 12 OO 3
## 13 YV 3
## 14 AS 1
## 15 F9 1
## 16 HA 1
not_cancelled %>%
group_by(tailnum) %>%
mutate(delay_gt1hr = dep_delay > 60) %>%
mutate(before_delay = cumsum(delay_gt1hr)) %>%
filter(before_delay < 1) %>%
count(sort = TRUE)
## # A tibble: 3,818 x 2
## # Groups: tailnum [3,818]
## tailnum n
## <chr> <int>
## 1 N952UW 214
## 2 N315NB 160
## 3 N705TW 159
## 4 N706TW 149
## 5 N961UW 139
## 6 N346NB 127
## 7 N713TW 127
## 8 N765US 122
## 9 N721TW 119
## 10 N744P 118
## # ... with 3,808 more rows
This is a good one from https://twitter.com/vuorre/status/869545189056053248: Cmd+Option+O collapses all sections in your R Markdown document; pretty useful!
Very useful doc; a must-read!
library(tidyverse)
ggplot( data = diamonds ) +
geom_freqpoly(binwidth=0.1,aes(x = x ), color = "red") +
geom_freqpoly(binwidth=0.1,aes(x = y ), color = "blue") +
geom_freqpoly(binwidth=0.1,aes(x = z ), color ="green")
Looks like x and y have similar distributions, while z takes smaller values, which makes sense if z represents the depth of the diamond. I really like the graph representation from https://jrnold.github.io/e4qf/exploratory-data-analysis.html so I'm plotting it here for reference.
diamonds %>%
mutate(id = row_number()) %>%
select(x, y, z, id) %>%
gather(variable, value, -id) %>%
ggplot(aes(x = value)) +
geom_density() +
geom_rug() +
facet_grid(variable ~ .)
ggplot( data = diamonds , aes( x = price ) ) +
geom_histogram()
There is a skew towards 0 then it tails off to very high prices. But why are diamonds priced so low?
ggplot( data = diamonds , aes( x = price ) ) +
geom_histogram(binwidth=10) +
coord_cartesian(xlim=c(0,2000))
Actually the minimum price is between 250 and 300. Then there are no diamonds priced around 1500. Wow, with more domain expertise and interesting questions we can learn a lot more about this data!
diamonds %>% group_by(carat) %>% count() %>%
ggplot() +
geom_histogram( aes( x = carat ) ) +
coord_cartesian(xlim=c(0,2))
filter(diamonds,carat==0.99) %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 23
filter(diamonds,carat==1) %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 1558
There are so many more at exactly 1 carat; that makes sense, since nobody asks for a 0.99-carat diamond. I'll ask for a nice round 1!
diamonds %>%
ggplot() +
geom_histogram( aes( x = carat ) ) +
coord_cartesian(xlim=c(0,5))
diamonds %>%
ggplot() +
geom_histogram( aes( x = carat ) ) +
xlim( c(0,5) )
diamonds %>%
ggplot() +
geom_histogram( aes( x = carat ) ) +
coord_cartesian(ylim=c(0,1000))
diamonds %>%
ggplot() +
geom_histogram( aes( x = carat ) ) +
ylim( c(0,1000) )
That's weird. Using ylim instead of coord_cartesian drops the data outside the limits before binning, so any bar taller than the limit is removed entirely, whereas coord_cartesian merely zooms. For the x axis it's virtually the same, though the one row removed by xlim() is presumably the single diamond above 5 carats.
nycflights13::flights %>%
count(is.na(dep_time)==TRUE)
## # A tibble: 2 x 2
## `is.na(dep_time) == TRUE` n
## <lgl> <int>
## 1 FALSE 328521
## 2 TRUE 8255
nycflights13::flights %>%
ggplot() +
geom_bar( aes(x = dep_time ) )
nycflights13::flights %>%
ggplot() +
geom_histogram( aes(x = dep_time ) )
nycflights13::flights %>%
ggplot() +
geom_histogram( binwidth=1,aes(x = dep_time ) )
I think there's a difference between the bar plot and the histogram because the binwidth matters for revealing the gaps in the data. Trying out different bin widths can help when exploring missingness in your data.
nycflights13::flights %>%
group_by(dep_time) %>%
count()
## # A tibble: 1,319 x 2
## # Groups: dep_time [1,319]
## dep_time n
## <int> <int>
## 1 1 25
## 2 2 35
## 3 3 26
## 4 4 26
## 5 5 21
## 6 6 22
## 7 7 22
## 8 8 23
## 9 9 28
## 10 10 22
## # ... with 1,309 more rows
nycflights13::flights %>%
group_by(dep_time) %>%
count() %>%
sum()
## [1] NA
nycflights13::flights %>%
group_by(dep_time) %>%
count() %>%
sum(na.rm=TRUE)
## [1] 1995334
It seems adding na.rm = TRUE to sum() (and mean()) removes the NA values first, so the sum and mean can be computed by R.
Here’s the original
cancellation <- nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
)
ggplot( data = cancellation, mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
So we see the cancelled flights are much less frequent than non-cancelled flights, making it hard to compare the distributions. We can instead view them as boxplots.
ggplot(data = cancellation , mapping = aes( x = cancelled , y = sched_dep_time) ) +
geom_boxplot()
The distributions are more comparable now.
require(ggstance)
ggplot( data = cancellation ) +
geom_boxplot( mapping = aes( x = cancelled , y = sched_dep_time ) ) +
coord_flip()
ggplot( data = cancellation ) +
geom_boxploth( mapping = aes( x = cancelled , y = sched_dep_time ) )
ggplot( data = cancellation ) +
geom_boxploth( mapping = aes( y = cancelled , x = sched_dep_time ) )
ggstance is a new package to me; calling geom_boxploth() with the same aesthetics, it complains about overlapping y axes, and I apparently need to switch the x and y assignments to show the same thing. It's hard to see the advantage of ggstance here, except that you save one line of code (no coord_flip()).
ggplot(diamonds,aes(x=cut,y=price))+
geom_boxplot()+coord_flip()
require(lvplot)
ggplot(diamonds,aes(x=cut,y=price))+
geom_lv(aes(fill=..LV..))+coord_flip()
Very cool! Try ?geom_lv for an explanation of these distributions. The letter values correspond to quantiles of the data, so we get a better representation of the tails than with regular boxplots. This is useful when we have lots of data.
ggplot(diamonds,aes(x=cut,y=price))+
geom_violin()
ggplot(diamonds,aes(x=price))+
geom_histogram()+facet_wrap(~cut)
ggplot(diamonds,aes(color=cut,x=price))+
geom_freqpoly()
Hmm. It's harder to compare the price distributions of different cuts with geom_histogram(). geom_violin() gives a better indication of the density of values within each distribution, but I think geom_freqpoly() provides the best way to compare and contrast the price distributions across cuts.
require(ggbeeswarm)
geom_beeswarm: points jittered using the beeswarm package
geom_quasirandom: points jittered using the vipor package
ggbeeswarm: extends ggplot2 with violin point/beeswarm plots
position_beeswarm: violin point-style plots to show overlapping points; x must be discrete
position_quasirandom: violin point-style plots to show overlapping points; x must be discrete
d <- diamonds %>%
count(color, cut)
d
## # A tibble: 35 x 3
## color cut n
## <ord> <ord> <int>
## 1 D Fair 163
## 2 D Good 662
## 3 D Very Good 1513
## 4 D Premium 1603
## 5 D Ideal 2834
## 6 E Fair 224
## 7 E Good 933
## 8 E Very Good 2400
## 9 E Premium 2337
## 10 E Ideal 3903
## # ... with 25 more rows
d %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
Per jarnold’s answer you can do this by
require(viridis)
#cut within color
diamonds %>%
count(color, cut) %>%
group_by(color) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop)) +
scale_fill_viridis(limits = c(0, 1))
#color within cut
diamonds %>%
count(color, cut) %>%
group_by(cut) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop)) +
scale_fill_viridis(limits = c(0, 1))
nycflights13::flights %>%
group_by(month,dest) %>%
summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x=factor(month),y=dest,fill=dep_delay))+geom_tile()+
labs(x = "Month", y = "Destination", fill = "Departure Delay")
So many of the tiles couldn't be drawn. Again, per jarnold's answer, we filter for destinations with flights in all 12 months, reorder the destination factor by dep_delay so the higher-delay destinations appear toward the top of the plot, and change the color scheme.
require(forcats)
nycflights13::flights %>%
group_by(month,dest) %>%
summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
group_by(dest) %>%
filter(n() == 12) %>%
ungroup() %>%
mutate(dest = fct_reorder(dest, dep_delay)) %>%
ggplot(aes(x=factor(month),y=dest,fill=dep_delay))+geom_tile()+
labs(x = "Month", y = "Destination", fill = "Departure Delay")+
scale_fill_viridis()
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = cut, y = color)) +
geom_tile(mapping = aes(fill = n))
The cut names are longer, so it's better to put cut on the y axis; the tiles are also less wide that way, so the plot looks less stretched.
Per jarnold:
“When using cut_width the number in each bin may be unequal. The distribution of carat is right skewed so there are few diamonds in those bins.”
ggplot(data = diamonds,
mapping = aes(x = price,
colour = cut_width(carat, 0.3))) +
geom_freqpoly()
“Plotting the density instead of counts will make the distributions comparable, although the bins with few observations will still be hard to interpret.”
#cut width
ggplot(data = diamonds,
mapping = aes(x = price,
y = ..density..,
colour = cut_width(carat, 0.3))) +
geom_freqpoly()
#cut number
ggplot(data = diamonds,
mapping = aes(x = price,
colour = cut_number(carat, 10))) +
geom_freqpoly()
“Since there are equal numbers in each bin, the plot looks the same if density is used for the y aesthetic (although the values are on a different scale).”
ggplot(data = diamonds,
mapping = aes(x = price,
y = ..density..,
colour = cut_number(carat, 10))) +
geom_freqpoly()
ggplot(diamonds, aes(x = cut_number(price, 10), y = carat)) +
geom_boxplot() +
coord_flip() +
xlab("Price")
ggplot(diamonds, aes(x = cut_number(carat, 10), y = price)) +
geom_boxplot() +
coord_flip() +
xlab("Carat")
I would expect larger carat diamonds to be more expensive, but there’s substantial overlap with that, depending on other factors.
require(hexbin)
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_hex()+
facet_wrap(~ cut, ncol=3) +
scale_fill_viridis()
ggplot(diamonds, aes(x = cut_number(carat, 5), y = price, color = cut)) +
geom_boxplot()
ggplot(diamonds, aes(color = cut_number(carat, 5), y = price, x = cut)) +
geom_boxplot()
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
ggplot(data = diamonds) +
geom_bar(stat="identity",mapping = aes(x = cut_number(x,5), y = y))
geom_point gives far greater resolution for picking out outliers in the distributions.
No exercises
No exercises
You can tell an object is a tibble, as opposed to a data frame, because the word tibble is printed in the header, each column's class is printed below its variable name, and trailing variables are not printed but only listed. Additionally, tibbles have classes "tbl_df" and "tbl" in addition to "data.frame".
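A quick check of those classes (using as.tibble() as below):
class(as.tibble(mtcars))
## [1] "tbl_df"     "tbl"        "data.frame"
class(mtcars)
## [1] "data.frame"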
as.tibble(mtcars)
## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## # ... with 22 more rows
df <- data.frame(abc = 1, xyz = "a")
df$x
## [1] a
## Levels: a
df[, "xyz"]
## [1] a
## Levels: a
df[, c("abc", "xyz")]
## abc xyz
## 1 1 a
On a tibble,
tbl <- as.tibble(df)
#tbl %>% .$x #Doesn't work because "x" is not a variable name
tbl %>% .[["xyz"]]
## [1] a
## Levels: a
tbl[,c("abc","xyz")]
## # A tibble: 1 x 2
## abc xyz
## <dbl> <fctr>
## 1 1 a
df$x is completed to df$xyz by partial matching, which could be convenient but could also call the wrong variable and is therefore troublesome.
From jarnold: "With data.frames, with [ the type of object that is returned differs on the number of columns. If it is one column, it won't return a data.frame, but instead will return a vector. With more than one column, then it will return a data.frame. This is fine if you know what you are passing in, but suppose you did df[ , vars] where vars was a variable. Then what that code does depends on length(vars) and you'd have to write code to account for those situations or risk bugs."
tbl <- as.tibble(mtcars)
var <- "mpg"
tbl[[var]]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
#or
pull(tbl,var)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
Extracting the variable called 1.
Plotting a scatterplot of 1 vs 2.
Creating a new column called 3 which is 2 divided by 1.
Renaming the columns to one, two and three.
annoying <- tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)
pull(annoying,`1`)
## [1] 1 2 3 4 5 6 7 8 9 10
ggplot(annoying,aes(x= `1`,y = `2`)) + geom_point()
annoying <- annoying %>% mutate(`3` = `2` / `1`)
annoying
## # A tibble: 10 x 3
## `1` `2` `3`
## <int> <dbl> <dbl>
## 1 1 0.9907587 0.9907587
## 2 2 2.2437326 1.1218663
## 3 3 5.0863258 1.6954419
## 4 4 7.0942603 1.7735651
## 5 5 10.9757141 2.1951428
## 6 6 11.9408270 1.9901378
## 7 7 15.0688189 2.1526884
## 8 8 16.8350357 2.1043795
## 9 9 16.7230989 1.8581221
## 10 10 19.5716049 1.9571605
annoying <- rename(annoying, one = `1`, two = `2`, three = `3`)
annoying
## # A tibble: 10 x 3
## one two three
## <int> <dbl> <dbl>
## 1 1 0.9907587 0.9907587
## 2 2 2.2437326 1.1218663
## 3 3 5.0863258 1.6954419
## 4 4 7.0942603 1.7735651
## 5 5 10.9757141 2.1951428
## 6 6 11.9408270 1.9901378
## 7 7 15.0688189 2.1526884
## 8 8 16.8350357 2.1043795
## 9 9 16.7230989 1.8581221
## 10 10 19.5716049 1.9571605
enframe() converts a named vector to a data frame with a column of names and a column of values.
enframe(c(a = 1, b = 2, c = 3))
## # A tibble: 3 x 2
## name value
## <chr> <dbl>
## 1 a 1
## 2 b 2
## 3 c 3
I would use read_delim() and indicate the separator with delim = "|".
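A minimal sketch with inline text (the toy string is my own example):
read_delim("a|b\n1|2", delim = "|")
## # A tibble: 1 x 2
##       a     b
##   <int> <int>
## 1     1     2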
Look up the arguments at ?read_tsv or ?read_csv. As the union below shows, the two functions share all their arguments, including col_names, col_types, na, quote, comment, skip, and n_max.
library(readr)
union(names(formals(read_csv)), names(formals(read_tsv)))
## [1] "file" "col_names" "col_types" "locale" "na"
## [6] "quoted_na" "quote" "comment" "trim_ws" "skip"
## [11] "n_max" "guess_max" "progress"
The most important arguments to read_fwf(), which reads a fixed-width file, are the file path and col_positions, which tells the function where data columns begin and end. See ?read_fwf for more details.
x <- "x,y\n1,'a,b'"
read_delim(x, ",", quote = "'")
## # A tibble: 1 x 2
## x y
## <int> <chr>
## 1 1 a,b
read_csv("a,b\n1,2,3\n4,5,6") #implying different number of rows so chops of the 3rd element in rows 2 and 3.
## # A tibble: 2 x 2
## a b
## <int> <int>
## 1 1 2
## 2 4 5
read_csv("a,b,c\n1,2\n1,2,3,4") #adds NA to 2nd column and chops of 4th element in 4th column
## # A tibble: 2 x 3
## a b c
## <int> <int> <int>
## 1 1 2 NA
## 2 1 2 3
read_csv("a,b\n\"1") #2nd element in 1st row is missing
## # A tibble: 1 x 2
## a b
## <int> <chr>
## 1 1 <NA>
read_csv("a,b\n1,2\na,b") #nothing that I see
## # A tibble: 2 x 2
## a b
## <chr> <chr>
## 1 1 2
## 2 a b
read_csv("a;b\n1;3") #need to indicate sep=";"
## # A tibble: 1 x 1
## `a;b`
## <chr>
## 1 1;3
Arguably, the most important arguments are decimal_mark, which gives the decimal separator, and grouping_mark, which gives the digit-grouping separator. See ?locale.
parse_number("1,000.00",locale=locale(grouping_mark=",",decimal_mark="."))
## [1] 1000
#parse_number("1,000.00",locale=locale(grouping_mark=",",decimal_mark=",")) # Error: `decimal_mark` and `grouping_mark` must be different
parse_number("1.6,500,000.60",locale=locale(decimal_mark=",")) #grouping_mark default is also ","
## [1] 16.5
parse_number("1.666,500,000.60",locale=locale(grouping_mark=".")) # the "." is ignored
## [1] 1666.5
In locale(), the date_format and time_format options give the default formats used for parsing a date and a time, respectively.
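A small sketch: when no explicit format is supplied, parse_date() falls back to the locale's date_format (the date string here is my own example):
parse_date("02/01/2006", locale = locale(date_format = "%m/%d/%Y"))
## [1] "2006-02-01"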
Outside the US there is no single dominant standard; different languages and regions have historically used different encodings.
read_csv() uses "," as the field separator, while read_csv2() uses ";" (common in locales where "," is the decimal mark).
Most frequent encodings overall on the web: https://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language
UTF-8 (89.2%) ISO-8859-1 (5.0%) Windows-1251 (1.6%) Shift JIS (0.9%) Windows-1252 (0.8%) GB2312 (0.7%) EUC-KR (0.4%) EUC-JP (0.3%)
Also read http://unicodebook.readthedocs.io/encodings.html and http://kunststube.net/encoding/
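A short sketch of specifying an encoding when parsing; this Latin1-encoded string is the example from the chapter:
x1 <- "El Ni\xf1o was particularly bad this year"
parse_character(x1, locale = locale(encoding = "Latin1"))
## [1] "El Niño was particularly bad this year"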
d1 <- "January 1, 2010"
parse_date(d1, format="%B %d, %Y")
## [1] "2010-01-01"
d2 <- "2015-Mar-07"
parse_date(d2, format="%Y-%b-%d")
## [1] "2015-03-07"
d3 <- "06-Jun-2017"
parse_date(d3, format="%d-%b-%Y")
## [1] "2017-06-06"
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, format="%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, format="%m/%d/%y")
## [1] "2014-12-30"
t1 <- "1705"
parse_time(t1, format="%H%M")
## 17:05:00
t2 <- "11:15:10.12 PM"
parse_time(t2, format="%I:%M:%OS %p")
## 23:15:10.12
In table1, each observation is a (country, year) and the variables are cases and population.
In table2, each observation is a (country, year, type) and the variable is count.
In table3, each observation is a (country, year) and the variable is rate (cases/population).
In the table4s, each observation is a country and the variables are the years; table4a holds cases and table4b holds population.
library(tidyverse)
table2 %>%
filter(type=="cases") %>%
group_by(country) %>%
summarize(TB = sum(count))
## # A tibble: 3 x 2
## country TB
## <chr> <int>
## 1 Afghanistan 3411
## 2 Brazil 118225
## 3 China 426024
table2 %>% filter(type == "population")
## # A tibble: 6 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 population 19987071
## 2 Afghanistan 2000 population 20595360
## 3 Brazil 1999 population 172006362
## 4 Brazil 2000 population 174504898
## 5 China 1999 population 1272915272
## 6 China 2000 population 1280428583
tb2_cases <- filter(table2, type == "cases")[["count"]]
tb2_country <- filter(table2, type == "cases")[["country"]]
tb2_year <- filter(table2, type == "cases")[["year"]]
tb2_population <- filter(table2, type == "population")[["count"]]
options(scipen=-99)
table2_clean <- tibble(country = tb2_country,
year = tb2_year,
rate = tb2_cases / tb2_population)
table2_clean
## # A tibble: 6 x 3
## country year rate
## <chr> <int> <dbl>
## 1 Afghanistan 1999 3.727410e-05
## 2 Afghanistan 2000 1.294466e-04
## 3 Brazil 1999 2.193930e-04
## 4 Brazil 2000 4.612363e-04
## 5 China 1999 1.667495e-04
## 6 China 2000 1.669488e-04
It’s stored in table2_clean.
It's easier to work with table2 than table4a/table4b because all the data is in one table; in table4a and table4b it is split across two tables, which makes operations more complicated.
data <- table2 %>%
filter(type == "cases")
options(scipen=99)
ggplot(data,aes(year, count)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country)) +
theme(legend.box.just = "centre")
stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>%
spread(year, return) %>%
gather("year", "return", `2015`:`2016`)
## # A tibble: 4 x 3
## half year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 1 2016 0.92
## 4 2 2016 0.17
From jarnold:
" The functions spread and gather are not perfectly symmetrical because column type information is not transferred between them. In the original table the column year was numeric, but after the spread-gather cycle it is character, because with gather, variable names are always converted to a character vector.
The convert argument tries to convert character vectors to the appropriate type. In the background this uses the type.convert function. "
stocks %>%
spread(year, return) %>%
gather("year", "return", `2015`:`2016`, convert = TRUE)
## # A tibble: 4 x 3
## half year return
## <dbl> <int> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 1 2016 0.92
## 4 2 2016 0.17
table4a %>% gather(1999, 2000, key = "year", value = "cases")
Error in combine_vars(vars, ind_list) : Position must be between 0 and n
The code fails because the column names 1999 and 2000 are not standard and thus need to be wrapped in backticks. Without them, the tidyverse functions interpret 1999 and 2000 as the 1999th and 2000th columns of the data frame.
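With backticks the call works as intended:
table4a %>% gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766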
people <- tribble(
~name, ~key, ~value,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
glimpse(people)
## Observations: 5
## Variables: 3
## $ name <chr> "Phillip Woods", "Phillip Woods", "Phillip Woods", "Jess...
## $ key <chr> "age", "height", "age", "age", "height"
## $ value <dbl> 45, 186, 50, 37, 156
spread(people, key, value)
Error: Duplicate identifiers for rows (1, 3)
Spreading the data frame fails because there are two rows with "age" for "Phillip Woods". We would need to add another column indicating which observation it is:
people <- tribble(
~name, ~key, ~value, ~obs,
#-----------------|--------|------|------
"Phillip Woods", "age", 45, 1,
"Phillip Woods", "height", 186, 1,
"Phillip Woods", "age", 50, 2,
"Jessica Cordero", "age", 37, 1,
"Jessica Cordero", "height", 156, 1
)
spread(people, key, value)
## # A tibble: 3 x 4
## name obs age height
## * <chr> <dbl> <dbl> <dbl>
## 1 Jessica Cordero 1 37 156
## 2 Phillip Woods 1 45 186
## 3 Phillip Woods 2 50 NA
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
)
You need to gather male and female into a sex column with the counts as values; then pregnant and sex can be converted into logicals (pregnant == "yes", sex == "female"):
gather(preg, sex, count, male, female) %>%
mutate(pregnant = pregnant == "yes",
female = sex == "female") %>%
select(-sex)
## # A tibble: 4 x 3
## pregnant count female
## <lgl> <dbl> <lgl>
## 1 TRUE NA FALSE
## 2 FALSE 20 FALSE
## 3 TRUE 10 TRUE
## 4 FALSE 12 TRUE
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j"))
## # A tibble: 3 x 1
## x
## <chr>
## 1 a,b,c
## 2 d,e,f,g
## 3 h,i,j
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"),extra="merge")
## # A tibble: 3 x 3
## one two three
## * <chr> <chr> <chr>
## 1 a b c
## 2 d e f,g
## 3 h i j
tibble(x = c("a,b,c", "d,e", "f,g,i"))
## # A tibble: 3 x 1
## x
## <chr>
## 1 a,b,c
## 2 d,e
## 3 f,g,i
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"),fill="left")
## # A tibble: 3 x 3
## one two three
## * <chr> <chr> <chr>
## 1 a b c
## 2 <NA> d e
## 3 f g i
fill and extra dictate how extra or missing values in rows are dealt with when separating a column.
If a row has an extra value (when a minority of rows have more pieces than columns), extra says whether to drop it with a warning, drop it silently, or merge the extra pieces into the last column. If a row has too few values, fill says whether to pad with NA from the left or the right.
From jarnold, “You would set it to FALSE if you want to create a new variable, but keep the old one.”
From jarnold, “The function extract uses a regular expression to find groups and split into columns. In unite it is unambiguous since it is many columns to one, and once the columns are specified, there is only one way to do it, the only choice is the sep. In separate, it is one to many, and there are multiple ways to split the character string.”
From jarnold, “In spread, the fill argument explicitly sets the value to replace NAs. In complete, the fill argument also sets a value to replace NAs but it is named list, allowing for different values for different variables. Also, both cases replace both implicit and explicit missing values.”
From jarnold, “With fill, it determines whether NA values should be replaced by the previous non-missing value ("down") or the next non-missing value ("up").”
With help from jarnold’s solutions,
who1 <- who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
who1 %>%
filter(cases == 0) %>%
nrow()
## [1] 11080
Starting the tidy process gives us the number of cases of a type of TB for a country. We see above there are 11,080 rows with 0 cases, i.e. explicit zeros rather than missing data. Setting na.rm = TRUE or FALSE doesn't matter here, since WHO appears to leave entries empty when nothing was measured for a country (as far as we can tell from the data alone).
Neglecting the mutate step can be OK if we know that all cases are new and we just parse the case type after the 3rd character. But we may not know that, so it's better to mutate.
From jarnold,
select(who, country, iso2, iso3) %>%
distinct() %>%
group_by(country) %>%
filter(n() > 1)
## # A tibble: 0 x 3
## # Groups: country [0]
## # ... with 3 variables: country <chr>, iso2 <chr>, iso3 <chr>
If iso2 or iso3 carried extra information, then after keeping only distinct rows some country would appear more than once, paired with a different iso2 or iso3. Since none do, iso2 and iso3 are redundant with country.
Because of the 0s for cases of TB types, it's better to just sum cases of all types. Also, because there are no or few cases before 1995, we'll filter for years after that. And to distinguish cases by country and sex while keeping the graph less cluttered, we make a new grouping variable. This is adapted from jarnold.
who_new <- who %>%
gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
separate(code, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
who_new %>%
group_by(country, year, sex) %>%
filter(year > 1995) %>%
summarise(cases = sum(value)) %>%
unite(country_sex, country, sex, remove = FALSE) %>%
ggplot(aes(x = year, y = cases, group = country_sex, colour = sex)) +
geom_line()
To get the route, I need the origin and dest variables in the flights table, and I would connect flights to the planes table via tailnum to account for each plane. The route would be drawn from the latitude and longitude variables in the airports table.
The weather and airports tables are connected through the NYC airports: faa in airports matches origin in weather.
year, month, day, hour, origin in weather would be matched to year, month, day, hour, dest in flights.
If I wanted to make a table of "special" days, its primary key would be the year, month, and day variables, which would connect to the same variables in the other tables.
library(tidyverse)
library(nycflights13)
flights %>%
arrange(year,month,day,sched_dep_time,carrier,flight) %>%
mutate(flight_id = row_number()) %>%
count(flight_id) %>%
filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: flight_id <int>, n <int>
Without a surrogate key like the one above, we'd need to combine these 5 different variables to form a primary key.
(You might need to install some packages and read some documentation.)
Lahman::Batting %>%
group_by(playerID, yearID, stint) %>%
filter(n() > 1) %>%
nrow()
## [1] 0
babynames::babynames %>%
group_by(year, sex, name) %>%
filter(n() > 1) %>%
nrow()
## [1] 0
nasaweather::atmos %>%
group_by(lat, long, year, month) %>%
filter(n() > 1) %>%
nrow()
## [1] 0
fueleconomy::vehicles %>%
group_by(id) %>%
filter(n() > 1) %>%
nrow()
## [1] 0
There is no primary key for ggplot2::diamonds. Using all variables in the data frame, the number of distinct rows is less than the total number of rows, meaning no combination of variables uniquely identifies the observations.
How would you characterise the relationship between the Batting, Pitching, and Fielding tables?
colnames(Lahman::Batting)
## [1] "playerID" "yearID" "stint" "teamID" "lgID" "G"
## [7] "AB" "R" "H" "X2B" "X3B" "HR"
## [13] "RBI" "SB" "CS" "BB" "SO" "IBB"
## [19] "HBP" "SH" "SF" "GIDP"
colnames(Lahman::Master)
## [1] "playerID" "birthYear" "birthMonth" "birthDay"
## [5] "birthCountry" "birthState" "birthCity" "deathYear"
## [9] "deathMonth" "deathDay" "deathCountry" "deathState"
## [13] "deathCity" "nameFirst" "nameLast" "nameGiven"
## [17] "weight" "height" "bats" "throws"
## [21] "debut" "finalGame" "retroID" "bbrefID"
## [25] "deathDate" "birthDate"
colnames(Lahman::Salaries)
## [1] "yearID" "teamID" "lgID" "playerID" "salary"
From jarnold (I’m too lazy to give this a good college try):
Batting: primary key playerID, yearID, stint; foreign key playerID -> Master.playerID
Master: primary key playerID
Salaries: primary key yearID, teamID, playerID; foreign key playerID -> Master.playerID
Managers: primary key yearID, playerID, teamID, inseason; foreign key playerID -> Master.playerID
AwardsManagers: primary key playerID, awardID, yearID (since there are ties, and while tie distinguishes those awards it has NA values); foreign keys playerID -> Master.playerID and playerID, yearID, lgID -> Managers.playerID, yearID, lgID
lgID and teamID appear in multiple tables, but should be primary keys for league and team tables.
airports %>%
semi_join(flights, c("faa" = "dest")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point() +
coord_quickmap()
avg_dest_delays <-
flights %>%
group_by(dest) %>%
# arrival delay NA's are cancelled flights
summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
inner_join(airports, by = c(dest = "faa"))
avg_dest_delays %>%
ggplot(aes(lon, lat, colour = delay)) +
borders("state") +
geom_point() +
coord_quickmap()
flights %>%
left_join(airports, c(origin = "faa")) %>%
left_join(airports, c(dest = "faa"))
## # A tibble: 336,776 x 33
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 26 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, name.x <chr>, lat.x <dbl>,
## # lon.x <dbl>, alt.x <int>, tz.x <dbl>, dst.x <chr>, tzone.x <chr>,
## # name.y <chr>, lat.y <dbl>, lon.y <dbl>, alt.y <int>, tz.y <dbl>,
## # dst.y <chr>, tzone.y <chr>
from jarnold,
Surprisingly, no. If anything, (departure) delay seems to decrease slightly with age (perhaps because of selection):
plane_ages <-
planes %>%
mutate(age = 2013 - year) %>%
select(tailnum, age)
flights %>%
inner_join(plane_ages, by = "tailnum") %>%
group_by(age) %>%
filter(!is.na(dep_delay)) %>%
summarise(delay = mean(dep_delay)) %>%
ggplot(aes(x = age, y = delay)) +
geom_point() +
geom_line()
flight_weather <- flights %>%
inner_join(weather,by=c("origin" = "origin",
"year" = "year",
"month" = "month",
"day" = "day",
"hour" = "hour")
)
flight_weather %>%
group_by(precip) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = precip, y = delay)) +
geom_line() + geom_point()
flight_weather %>%
group_by(wind_gust) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = wind_gust, y = delay)) +
geom_line() + geom_point()
flight_weather %>%
group_by(visib) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = visib, y = delay)) +
geom_line() + geom_point()
library(viridis)
flights %>%
filter(year == 2013, month == 6, day == 13) %>%
group_by(dest) %>%
summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
inner_join(airports, by = c("dest" = "faa")) %>%
ggplot(aes(y = lat, x = lon, size = delay, colour = delay)) +
borders("state") +
geom_point() +
coord_quickmap() +
scale_color_viridis()
Looks like flights with NA tailnums never departed, though they were scheduled to.
American (AA) and Envoy (MQ) flights account for most tail numbers that don't appear in planes.
na_flights <- filter(flights, is.na(tailnum))
flights %>%
anti_join(planes, by = "tailnum") %>%
count(carrier, sort = TRUE)
## # A tibble: 10 x 2
## carrier n
## <chr> <int>
## 1 MQ 25397
## 2 AA 22558
## 3 UA 1693
## 4 9E 1044
## 5 B6 830
## 6 US 699
## 7 FL 187
## 8 DL 110
## 9 F9 50
## 10 WN 38
flights %>%
group_by(tailnum) %>%
filter(n() > 100)
## # A tibble: 229,202 x 19
## # Groups: tailnum [1,201]
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 544 545 -1 1004
## 4 2013 1 1 554 558 -4 740
## 5 2013 1 1 555 600 -5 913
## 6 2013 1 1 557 600 -3 709
## 7 2013 1 1 557 600 -3 838
## 8 2013 1 1 558 600 -2 849
## 9 2013 1 1 558 600 -2 853
## 10 2013 1 1 558 600 -2 923
## # ... with 229,192 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
com_models <- fueleconomy::common %>%
semi_join(fueleconomy::vehicles)
Find the 48-hour (2-day) window with the worst delays; my attempt:
flight_weather %>%
mutate(time = day*24 + hour + (minute/60) ) %>%
ggplot(aes(time)) +
geom_histogram()
Thought I had something…
anti_join(flights, airports, by = c("dest" = "faa")) returns flights that go to an airport that is not in the FAA list of destinations, likely foreign airports.
anti_join(airports, flights, by = c("faa" = "dest")) returns US airports that don't have a flight in the data, meaning that there were no flights to that airport from New York in 2013.
Is there a plane flown by more than 1 airline? The check below returns no rows, but note that distinct(tailnum) drops the carrier column before counting, so it cannot actually find such planes; a sounder test is distinct(tailnum, carrier) %>% count(tailnum) %>% filter(n > 1).
flights %>%
group_by(tailnum, carrier) %>%
distinct() %>%
distinct(tailnum) %>%
group_by(tailnum) %>%
count(carrier) %>%
filter(n>1) %>%
arrange(desc(n))
## # A tibble: 0 x 3
## # Groups: tailnum [0]
## # ... with 3 variables: tailnum <chr>, carrier <chr>, n <int>
The paste0() separator is nothing ("") while paste()'s default separator is a space (" "). The closest stringr equivalent is str_c(); stringi also provides stri_paste(), stri_join(), and stri_flatten(). In paste, missing values (NA) are treated as the string "NA".
The collapse argument joins the elements within a single character vector into one string, while the sep argument joins corresponding elements from multiple character vectors.
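A quick check of the difference:
str_c(c("a", "b"), c("x", "y"), sep = "-") # sep joins across vectors: "a-x" "b-y"
str_c(c("a", "b", "c"), collapse = ", ") # collapse joins within one vector: "a, b, c"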
library(stringr)
x <- "Apple"
med <- ifelse(
str_length(x) %% 2 == 0,
str_length(x) / 2, # even length: take the first of the two middle characters
ceiling(str_length(x) / 2) # odd length: the true middle character
)
str_sub(x,med,med)
## [1] "p"
str_wrap() is useful for reformatting text so that it doesn't run off the visible page, or when you want to indent paragraphs.
str_trim() trims whitespace from the beginning or end of a string. The opposite, str_pad(), adds whitespace.
func <- function(x) {
if (length(x) < 2) {
warning("need a vector with at least two elements")
} else {
str_c(x, collapse = ", ")
}
}
x0 <- c()
x1 <- c("a")
x2 <- c("a","b")
func(x2)
## [1] "a, b"
“" doesn’t match ‘' because that is a special character. “\” doesn’t match’' because even though it escaped the ‘' in the regexp, we wrap it in a string first, and a string requires more escaping outside the special behavior. “\" doesn’t match a literal ’' because in strings ’' is also used as an escape so we need”\\" to match a’'.
x <- "\"\'\\"
x
## [1] "\"'\\"
str_view(x,"\\\"\\'\\\\")
Escaping " three times, once for escaping the speical behavior then because we’re in a string and then because we’re escaping special behavior in a string.
Escaping ’ twice, twice for escaping special behavior in the regexp and then in the string.
Escaping for the reason stated in the question above.
The regular expression \..\..\.. matches a dot followed by any character, three times in a row (e.g. ".a.b.c"). As a string it is represented as:
x <- "\\..\\..\\.."
x
## [1] "\\..\\..\\.."
The doubled backslashes are only a printing artifact: print() shows string escapes, while the string itself contains single backslashes.
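writeLines() shows the raw string:
writeLines(x)
## \..\..\..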
x <- '"$^$"'
str_view(x,'"$^$"')
words <- stringr::words
str_view(words,"^y",match=T)
str_view(words,"x$",match=T)
str_view(words,"\\b[[:alpha:]]{3}\\b",match = T)
str_match(words,"\\b[[:alpha:]]{7,}\\b")
## [,1]
## [1,] NA
## [2,] NA
## [3,] NA
## [4,] "absolute"
## [5,] NA
## [6,] "account"
## [7,] "achieve"
## ... (remaining rows omitted: one row per word, NA for words shorter than
## seven letters, with matches such as "address", "advertise", ..., "yesterday")
Go to ?regex in the base package to see more examples and details about regular expressions.
Start with a vowel.
That only contain consonants. (Hint: thinking about matching “not”-vowels.)
End with ed, but not with eed.
End with ing or ise.
sub_words <- sample(words,20)
str_view(sub_words,"^[aeiou]",match=T)
str_view(sub_words,"^[aeiou]",match=T)
str_view(sub_words,"[^e]ed$",match=T)
str_view(sub_words,"ing$|ise$",match=T)
str_view(words,"cei",match=T)
str_view(words,"cie",match=T)
str_view(words,"iec",match=T)
str_view(words,"eic",match=T)
I think it’s only “iec” not “eic”.
str_view(words,"uq|qu",match=T)
yes
colour
str_view(words,"colour",match=T)
("[0-9]{3}-[0-9]{3}-[0-9]{4}")
## [1] "[0-9]{3}-[0-9]{3}-[0-9]{4}"
? is equivalent to {0,1}
+ is equivalent to {1,}
* is equivalent to {0,}
^.*$ # matches any string, including the empty one (zero or more of any character, anchored at both ends)
"\\{.+\\}" # a regexp written as a string: matches one or more characters enclosed in curly braces
\d{4}-\d{2}-\d{2} # matches four digits, a dash, two digits, a dash, then two digits
"\\\\{4}" # a regexp written as a string: matches four backslashes
str_view(words,"^[^aeiou]{3}.*",match=T)
str_view(words,"[aeiou]{3,}",match=T)
str_view(words,"([aeiou][^aeiou]){2,}.*",match=T)
No thanks I’ll stick with this.
(.)\1\1 # the same character appearing three times in a row, e.g. "aaa"
"(.)(.)\\2\\1" # a pair of characters followed by the same pair in reverse order, e.g. "abba"
(..)\1 # any two characters repeated, e.g. "abab"
"(.).\\1.\\1" # a character, followed by any character, the original character, any other character, and the original character again, e.g. "abaca"
"(.)(.)(.).*\\3\\2\\1" # three characters, zero or more of anything, then the same three characters in reverse order
Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
str_view(words,"^(.).*\\1$",match=T) #couldn't figure out how top feasibly expand to any character.
str_view(words,"(..).*\\1",match = T)
str_view(words,"(.).*\\1.*\\1",match = T)
# one regex
words[str_detect(words, "^x|x$")]
## [1] "box" "sex" "six" "tax"
# split regex into parts
start_with_x <- str_detect(words, "^x")
end_with_x <- str_detect(words, "x$")
words[start_with_x | end_with_x]
## [1] "box" "sex" "six" "tax"
str_subset(words, "^[aeiou].*[^aeiou]$") %>% head()
## [1] "about" "accept" "account" "across" "act" "actual"
start_with_vowel <- str_detect(words, "^[aeiou]")
end_with_consonant <- str_detect(words, "[^aeiou]$")
words[start_with_vowel & end_with_consonant] %>% head()
## [1] "about" "accept" "account" "across" "act" "actual"
library(purrr)
pattern <-
cross(rerun(5, c("a", "e", "i", "o", "u")),
.filter = function(...) {
x <- as.character(unlist(list(...)))
length(x) != length(unique(x))
}) %>%
map_chr(~ str_c(unlist(.x), collapse = ".*")) %>%
str_c(collapse = "|")
str_subset(words, pattern)
## character(0)
words[str_detect(words, "a") &
str_detect(words, "e") &
str_detect(words, "i") &
str_detect(words, "o") &
str_detect(words, "u")]
## character(0)
all from jarnold
ind <- which.max(str_count(words,"[aeiou]"))
words[ind]
## [1] "appropriate"
Add the \b word boundaries before and after the pattern:
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
colour_match2 <- str_c("\\b(", str_c(colours, collapse = "|"), ")\\b")
more2 <- sentences[str_count(sentences, colour_match2) > 1]
str_view_all(more2, colour_match2, match = TRUE)
str_extract(sentences, "[a-zA-Z]+") %>% head()
## [1] "The" "Glue" "It" "These" "Rice" "The"
pattern <- "\\b[A-Za-z]+ing\\b"
sentences_with_ing <- str_detect(sentences, pattern)
unique(unlist(str_extract_all(sentences[sentences_with_ing], pattern))) %>%
head()
## [1] "spring" "evening" "morning" "winding" "living" "king"
unique(unlist(str_extract_all(sentences, "\\b[A-Za-z]{3,}s\\b"))) %>%
head()
## [1] "planks" "days" "bowls" "lemons" "makes" "hogs"
from jarnold
numword <- "(one|two|three|four|five|six|seven|eight|nine|ten) +(\\S+)"
sentences[str_detect(sentences, numword)] %>%
str_extract(numword)
## [1] "ten served" "one over" "seven books" "two met"
## [5] "two factors" "one and" "three lists" "seven is"
## [9] "two when" "one floor." "ten inches." "one with"
## [13] "one war" "one button" "six minutes." "ten years"
## [17] "one in" "ten chased" "one like" "two shares"
## [21] "two distinct" "one costs" "ten two" "five robins."
## [25] "four kinds" "one rang" "ten him." "three story"
## [29] "ten by" "one wall." "three inches" "ten your"
## [33] "six comes" "one before" "three batches" "two leaves."
contraction <- "([A-Za-z]+)'([A-Za-z]+)"
sentences %>%
`[`(str_detect(sentences, contraction)) %>%
str_extract(contraction)
## [1] "It's" "man's" "don't" "store's" "workmen's"
## [6] "Let's" "sun's" "child's" "king's" "It's"
## [11] "don't" "queen's" "don't" "pirate's" "neighbor's"
x <- c("apples, pears, and bananas")
str_split(x, ", +(and +)?")[[1]]
## [1] "apples" "pears" "bananas"
str_replace_all() handles this, replacing all forward slashes with backslashes.
A simple str_to_lower() can be built with str_replace_all() and a named vector of replacements.
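A sketch of both (lower_map and simple_to_lower are names I made up for illustration):
str_replace_all("a/b/c", "/", "\\\\") # replace forward slashes with backslashes
## [1] "a\\b\\c"
lower_map <- setNames(letters, LETTERS) # c(A = "a", B = "b", ...)
simple_to_lower <- function(x) str_replace_all(x, lower_map)
simple_to_lower("Apples and Bananas")
## [1] "apples and bananas"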
x <- c("apples, pears, and bananas")
str_split(x, ", +(and +)?")[[1]]
## [1] "apples" "pears" "bananas"
Splitting by boundary(“word”) splits on punctuation and not just whitespace.
str_split("ab. cd|agt", "")[[1]]
## [1] "a" "b" "." " " "c" "d" "|" "a" "g" "t"
from jarnold.
str_subset(c("a\\b", "ab"), "\\\\")
## [1] "a\\b"
str_subset(c("a\\b", "ab"), fixed("\\"))
## [1] "a\\b"
library(dplyr)
str_extract_all(sentences, boundary("word")) %>%
unlist() %>%
str_to_lower() %>%
tibble() %>%
set_names("word") %>%
group_by(word) %>%
count(sort = TRUE) %>%
head(5)
## # A tibble: 5 x 2
## # Groups: word [5]
## word n
## <chr> <int>
## 1 the 751
## 2 a 202
## 3 of 132
## 4 to 123
## 5 and 118
from jarnold.
stri_count_words() counts the words in a string; stri_duplicated() finds duplicated strings.
There are several functions beginning with stri_rand_. stri_rand_lipsum generates lorem ipsum text, stri_rand_strings generates random strings, stri_rand_shuffle randomly shuffles the code points in the text.
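For example (the output is random, so it's not shown):
library(stringi)
stri_rand_strings(2, 5) # two random 5-character alphanumeric strings
stri_rand_lipsum(1) # one paragraph of lorem ipsum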
Set the locale in stri_opts_collator(), passed via the opts_collator argument.
# 15. Factors
library(forcats)
library(tidyverse)
library(ggplot2)
ggplot(gss_cat,aes(rincome)) +
geom_bar() +
coord_flip()
This is OK; it would be better if we sorted the levels in decreasing order or in another meaningful way.
gss_cat %>% group_by(relig) %>% count() %>% arrange(desc(n))
## # A tibble: 15 x 2
## # Groups: relig [15]
## relig n
## <fctr> <int>
## 1 Protestant 10846
## 2 Catholic 5124
## 3 None 3523
## 4 Christian 689
## 5 Jewish 388
## 6 Other 224
## 7 Buddhism 147
## 8 Inter-nondenominational 109
## 9 Moslem/islam 104
## 10 Orthodox-christian 95
## 11 No answer 93
## 12 Hinduism 71
## 13 Other eastern 32
## 14 Native american 23
## 15 Don't know 15
gss_cat %>% group_by(partyid) %>% count() %>% arrange(desc(n))
## # A tibble: 10 x 2
## # Groups: partyid [10]
## partyid n
## <fctr> <int>
## 1 Independent 4119
## 2 Not str democrat 3690
## 3 Strong democrat 3490
## 4 Not str republican 3032
## 5 Ind,near dem 2499
## 6 Strong republican 2314
## 7 Ind,near rep 1791
## 8 Other party 393
## 9 No answer 154
## 10 Don't know 1
gss_cat %>% group_by(denom) %>% count()
## # A tibble: 30 x 2
## # Groups: denom [30]
## denom n
## <fctr> <int>
## 1 No answer 117
## 2 Don't know 52
## 3 No denomination 1683
## 4 Other 2534
## 5 Episcopal 397
## 6 Presbyterian-dk wh 244
## 7 Presbyterian, merged 67
## 8 Other presbyterian 47
## 9 United pres ch in us 110
## 10 Presbyterian c in us 104
## # ... with 20 more rows
My educated guess is different sects of the religions but let’s try to use our programming skillset.
gss_cat %>%
ggplot() +
geom_bar(aes(denom)) +
scale_x_discrete(drop=T) +
facet_wrap(~relig,scales = "free") +
ggtitle("All religions",subtitle = "and all denominations catalogued within")
coord_flip()
## <ggproto object: Class CoordFlip, CoordCartesian, Coord>
## aspect: function
## distance: function
## expand: TRUE
## is_linear: function
## labels: function
## limits: list
## range: function
## render_axis_h: function
## render_axis_v: function
## render_bg: function
## render_fg: function
## train: function
## transform: function
## super: <ggproto object: Class CoordFlip, CoordCartesian, Coord>
denoms <- levels(gss_cat$denom)
tmp <- filter(gss_cat,relig=="Protestant") %>% select(denom) %>% unique()
protestant_denoms <- pull(tmp,denom)
setdiff(denoms,protestant_denoms)
## [1] "Not applicable"
filter(gss_cat,relig=="Protestant") %>%
ggplot() +
geom_bar(aes(denom)) +
ggtitle("Protestant religion",subtitle = "and all denominations catalogued within") +
coord_flip()
filter(gss_cat,denom=="Not applicable") %>%
ggplot() +
geom_bar(aes(relig)) +
ggtitle("Not applicable denomination",subtitle = "within all religions") +
coord_flip()
gss_cat %>%
ggplot() +
geom_density(aes(tvhours))
The distribution of tvhours is a little wonky, with a lot of peaks and valleys. Depending on your question, the mean (or median) would be what you want to use, or you could view the histogram for different groups.
gss_cat
## # A tibble: 21,483 x 9
## year marital age race rincome partyid
## <int> <fctr> <int> <fctr> <fctr> <fctr>
## 1 2000 Never married 26 White $8000 to 9999 Ind,near rep
## 2 2000 Divorced 48 White $8000 to 9999 Not str republican
## 3 2000 Widowed 67 White Not applicable Independent
## 4 2000 Never married 39 White Not applicable Ind,near rep
## 5 2000 Divorced 25 White Not applicable Not str democrat
## 6 2000 Married 25 White $20000 - 24999 Strong democrat
## 7 2000 Never married 36 White $25000 or more Not str republican
## 8 2000 Divorced 44 White $7000 to 7999 Ind,near dem
## 9 2000 Married 44 White $25000 or more Not str democrat
## 10 2000 Married 47 White $25000 or more Strong republican
## # ... with 21,473 more rows, and 3 more variables: relig <fctr>,
## # denom <fctr>, tvhours <int>
We want to look at marital, race, rincome, partyid, relig, and denom.
levs <- c("marital", "race", "rincome", "partyid", "relig", "denom")
for(i in levs){
gg <- gss_cat %>%
ggplot() + geom_bar(aes_string(x = i)) + coord_flip()
print(gg)
}
Only race is in descending order, the rest are more or less arbitrary.
From jarnold, “Because that gives the level “Not applicable” an integer value of 1.”
gss_cat %>%
group_by(year,partyid) %>%
count(partyid) %>%
ggplot() +
geom_line(aes(year,n,group=partyid,color=partyid))
We need to do some factor recoding…
col_gss_cat <- gss_cat %>%
mutate(partyid = fct_collapse(partyid,
Other = c("No answer", "Don't know", "Other party"),
Republican = c("Strong republican", "Not str republican"),
Independent = c("Ind,near rep", "Independent", "Ind,near dem"),
Democrat = c("Not str democrat", "Strong democrat")
))
col_gss_cat %>%
group_by(year,partyid) %>%
count(partyid) %>%
ggplot() +
ylab("Number") +
geom_line(aes(year,n,group=partyid,color=partyid))
From jarnold (just being lazy myself :-) ):
“Group all the non-responses into one category, and then group other categories into a smaller number. Since there is a clear ordering, we wouldn’t want to use something like fct_lump.”
library("stringr")
gss_cat %>%
mutate(rincome =
fct_collapse(
rincome,
`Unknown` = c("No answer", "Don't know", "Refused", "Not applicable"),
`Lt $5000` = c("Lt $1000", str_c("$", c("1000", "3000", "4000"),
" to ", c("2999", "3999", "4999"))),
`$5000 to 10000` = str_c("$", c("5000", "6000", "7000", "8000"),
" to ", c("5999", "6999", "7999", "9999"))
)) %>%
ggplot(aes(x = rincome)) +
geom_bar() +
coord_flip()
# 16. Dates and times
library(tidyverse)
library(lubridate)
library(nycflights13)
ymd(c("2010-10-10", "bananas"))
## [1] "2010-10-10" NA
Time zone is very important because the time changes e.g. between NYC and London, and you want to have the right time.
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
mdy(d1)
## [1] "2010-01-01"
ymd(d2)
## [1] "2015-03-07"
dmy(d3)
## [1] "2017-06-06"
mdy(d4)
## [1] "2015-08-19" "2015-07-01"
mdy(d5)
## [1] "2014-12-30"
Making date-time variables using jarnold's code:
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
) %>%
select(origin, dest, ends_with("delay"), ends_with("time"))
We can plot the distribution of departure times within the day, separated by month:
flights_dt %>%
mutate(time = hour(dep_time) * 100 + minute(dep_time),
mon = as.factor(month
(dep_time))) %>%
ggplot(aes(x = time, y = ..density.., group = mon, color = mon)) +
geom_freqpoly(binwidth = 100)
The distribution looks the same from month to month.
If they are consistent, then dep_time = sched_dep_time + dep_delay
flights_dt %>%
mutate(dep_time_ = sched_dep_time + dep_delay * 60) %>%
filter(dep_time_ != dep_time) %>%
select(dep_time_, dep_time, sched_dep_time, dep_delay)
## # A tibble: 1,205 x 4
## dep_time_ dep_time sched_dep_time dep_delay
## <dttm> <dttm> <dttm> <dbl>
## 1 2013-01-02 08:48:00 2013-01-01 08:48:00 2013-01-01 18:35:00 853
## 2 2013-01-03 00:42:00 2013-01-02 00:42:00 2013-01-02 23:59:00 43
## 3 2013-01-03 01:26:00 2013-01-02 01:26:00 2013-01-02 22:50:00 156
## 4 2013-01-04 00:32:00 2013-01-03 00:32:00 2013-01-03 23:59:00 33
## 5 2013-01-04 00:50:00 2013-01-03 00:50:00 2013-01-03 21:45:00 185
## 6 2013-01-04 02:35:00 2013-01-03 02:35:00 2013-01-03 23:59:00 156
## 7 2013-01-05 00:25:00 2013-01-04 00:25:00 2013-01-04 23:59:00 26
## 8 2013-01-05 01:06:00 2013-01-04 01:06:00 2013-01-04 22:45:00 141
## 9 2013-01-06 00:14:00 2013-01-05 00:14:00 2013-01-05 23:59:00 15
## 10 2013-01-06 00:37:00 2013-01-05 00:37:00 2013-01-05 22:30:00 127
## # ... with 1,195 more rows
From jarnold:
“There exist discrepancies. It looks like there are mistakes in the dates. These are flights in which the actual departure time is on the next day relative to the scheduled departure time. We forgot to account for this when creating the date-times. The code would have had to check if the departure time is less than the scheduled departure time. Alternatively, simply adding the delay time is more robust because it will automatically account for crossing into the next day.”
flights_dt %>%
mutate(flight_duration = as.numeric(arr_time - dep_time),
air_time_mins = air_time,
diff = flight_duration - air_time_mins) %>%
select(origin, dest, flight_duration, air_time_mins, diff)
## # A tibble: 328,063 x 5
## origin dest flight_duration air_time_mins diff
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 EWR IAH 193 227 -34
## 2 LGA IAH 197 227 -30
## 3 JFK MIA 221 160 61
## 4 JFK BQN 260 183 77
## 5 LGA ATL 138 116 22
## 6 EWR ORD 106 150 -44
## 7 EWR FLL 198 158 40
## 8 LGA IAD 72 53 19
## 9 JFK MCO 161 140 21
## 10 LGA ORD 115 138 -23
## # ... with 328,053 more rows
There seems to be a discrepancy: air time should always be equal to or less than flight duration, right? The differences likely come from taxiing time and from arr_time/dep_time being recorded in local time zones.
sched_dep_time is what’s most interesting for passengers. From jarnold,
flights_dt %>%
mutate(sched_dep_hour = hour(sched_dep_time)) %>%
group_by(sched_dep_hour) %>%
summarise(dep_delay = mean(dep_delay)) %>%
ggplot(aes(y = dep_delay, x = sched_dep_hour)) +
geom_point() +
geom_smooth()
from jarnold,
flights_dt %>%
mutate(dow = wday(sched_dep_time)) %>%
group_by(dow) %>%
summarise(dep_delay = mean(dep_delay),
arr_delay = mean(arr_delay, na.rm = TRUE),
tot_delay = dep_delay + arr_delay)
## # A tibble: 7 x 4
## dow dep_delay arr_delay tot_delay
## <dbl> <dbl> <dbl> <dbl>
## 1 1 11.495974 4.820024 16.31600
## 2 2 14.734220 9.653739 24.38796
## 3 3 10.591532 5.388526 15.98006
## 4 4 11.699198 7.051119 18.75032
## 5 5 16.064837 11.740819 27.80566
## 6 6 14.660643 9.070120 23.73076
## 7 7 7.620118 -1.448828 6.17129
Saturday (dow 7, which has the lowest total delay; wday() numbers days starting with Sunday = 1).
diamonds %>%
ggplot(aes(carat)) +
geom_histogram()
flights_dt %>%
ggplot(aes(sched_dep_time)) +
geom_histogram()
Let’s look at more granularity,
diamonds %>%
ggplot(aes(carat)) +
geom_histogram(bins=100) +
xlim(0,1)
flights_dt %>%
ggplot(aes(minute(sched_dep_time))) +
geom_histogram()
Both distributions are multi-modal, reflecting human preference for round numbers: diamond carats cluster at fractions like 1/3, 1/2, 2/3, and whole carats, and flights are scheduled close to every 5 minutes of the hour.
flights_dt %>%
mutate( wn2030or5060 = ifelse(
(minute(sched_dep_time)>=20 & minute(sched_dep_time)<=30) |
(minute(sched_dep_time)>=50 & minute(sched_dep_time)<=60),
1,0)) %>%
select(sched_dep_time,wn2030or5060,dep_delay) %>%
ggplot() +
geom_violin(aes(factor(wn2030or5060),dep_delay)) +
ylim(-25,25)
If the hypothesis held, I would expect the 20-30/50-60 minute group to show a lower-delay distribution. It doesn't seem to.
From jarnold,
“There is no direct unambiguous value of months in seconds:
31 days: Jan, Mar, May, Jul, Aug, Oct, Dec; 30 days: Apr, Jun, Sep, Nov; 28 or 29 days: Feb.
Though in the past, in the pre-computer era, for arithmetic convenience, bankers adopted a 360 day year with 30 day months."
In R we can easily manipulate dates and times with helper variables. This is helpful when we have messy time data, such as times for overnight flights. We can clean up the discrepancies by accounting for whether a flight was overnight: the variable overnight is either TRUE or FALSE, so adding days(overnight * 1) gives every overnight flight an extra day and leaves the others unchanged.
from jarnold,
ymd("2015-01-01") + months(0:11)
## [1] "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01"
## [6] "2015-06-01" "2015-07-01" "2015-08-01" "2015-09-01" "2015-10-01"
## [11] "2015-11-01" "2015-12-01"
floor_date(today(), unit = "year") + months(0:11)
## [1] "2017-01-01" "2017-02-01" "2017-03-01" "2017-04-01" "2017-05-01"
## [6] "2017-06-01" "2017-07-01" "2017-08-01" "2017-09-01" "2017-10-01"
## [11] "2017-11-01" "2017-12-01"
func <- function(date){
todays_yr <- year(today())
birth_yr <- year(ymd(date))
cat("You are ",todays_yr - birth_yr, " years old!")
}
func("1992-07-15")
## You are 25 years old!
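The function above only subtracts calendar years, so it over-counts before the birthday has passed each year. A more precise sketch divides the elapsed interval by whole years:
age <- function(birthday) {
(ymd(birthday) %--% today()) %/% years(1) # counts only fully elapsed years
}
age("1992-07-15")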
It doesn’t throw an error, but the functions provide different time classes.
(today() %--% (today() + years(1))) / months(1)
## [1] 12
No Exercises
No Exercises
x <- c(0,.5,1, NA)
range(x,na.rm = T)
## [1] 0 1
range(x,na.rm = F)
## [1] NA NA
With na.rm = FALSE, a single NA in x makes both ends of the range NA; with na.rm = TRUE the NAs are dropped first, so you get the range of the observed values.
from jarnold (I can’t beat his work…)
x <- c(1:10, Inf, -Inf)
#second variant
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE, finite = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
## [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
## [8] 0.7777778 0.8888889 1.0000000 Inf -Inf
rescale01_alt <- function(x) {
rng <- range(x, na.rm = TRUE, finite = TRUE)
y <- (x - rng[1]) / (rng[2] - rng[1])
y[y == Inf] <- 1
y[y == -Inf] <- 0
y # return the result; ending on an assignment would return it invisibly
}
rescale01_alt(x)
## [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
## [8] 0.7777778 0.8888889 1.0000000 1.0000000 0.0000000
mean(is.na(x))
## [1] 0
x / sum(x, na.rm = TRUE)
## [1] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
## [1] NaN
mean() of a logical vector gives the proportion of TRUEs, so mean(is.na(x)) is the proportion of missing values in x.
The second one, x / sum(x, na.rm = TRUE), standardizes the vector to sum to 1; writing it as a function gives it a transparent name and reduces duplication a little.
The third one is similar; it computes the coefficient of variation, so coef_var would be a better name.
Looks easy enough, but I’ll move on.
jarnold has a better answer, of course, but here’s mine.
both_na <- function(x,y){
if(length(x) != length(y)){
stop("You're two vectors need to be the same length")
}
x_ints <- as.integer( is.na(x) )
y_ints <- as.integer( is.na(y) )
sum((x_ints + y_ints)==2)
}
x <- c(NA,2,NA)
y <- c(NA,1,2)
both_na(x,y)
## [1] 1
is_directory <- function(x) file.info(x)$isdir
is_readable <- function(x) file.access(x, 4) == 0
is_directory checks whether x is a directory, and is_readable checks whether you have read permission for the file x. They're still useful even though they're short, because the function names require less thinking than the underlying calls.
Personally, I'd not write functions for these unless there's a lot of repetition, because I'd still take the time to see what the functions are doing and then read the help pages for the called functions.
I like jarnold's answer, though it doesn't actually run (the functions are fictional); it still gets the point across of writing functions to reduce duplication and increase interpretability.
library(dplyr)
threat <- function(chances) {
give_chances(from = Good_Fairy,
to = foo_foo,
number = chances,
condition = "Don't behave",
consequence = turn_into_goon)
}
lyric <- function() {
foo_foo %>%
hop(through = forest) %>%
scoop(up = field_mouse) %>%
bop(on = head)
down_came(Good_Fairy)
said(Good_Fairy,
c("Little bunny Foo Foo",
"I don't want to see you",
"Scooping up the field mice",
"And bopping them on the head."))
}
lyric()
threat(3)
lyric()
threat(2)
lyric()
threat(1)
lyric()
turn_into_goon(Good_Fairy, foo_foo)
Quick note: in an R script use Cmd/Ctrl + Shift + R to get a cool comment line!
f1 <- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
}
f2 <- function(x) {
if (length(x) <= 1) return(NULL)
x[-length(x)]
}
f3 <- function(x, y) {
rep(y, length.out = length(x))
}
from jarnold,
f1 returns whether a string starts with a given prefix; has_prefix() would be a good name.
The function f2 drops the last element. A better name for f2 is drop_last().
The function f3 repeats y once for each element of x. This is a harder one to name. I would say something like recycle (R's name for this behavior) or expand.
Haven’t written a function in a while, so I’ll pass.
from jarnold,
You can ignore.
rnorm samples from the univariate normal distribution, while MASS::mvrnorm samples from the multivariate normal distribution. The main arguments in rnorm are n, mean, sd; the main arguments in MASS::mvrnorm are n, mu, Sigma. To be consistent they should have the same names. However, this is difficult. In general, it is better to be consistent with more widely used functions, e.g. rmvnorm should follow the conventions of rnorm. However, while mean is correct in the multivariate case, sd does not make sense there. Both functions are internally consistent though; it would be bad to have mu and sd, or mean and Sigma.
from jarnold,
If named norm_r and norm_d, it groups the family of functions related to the normal distribution. If named rnorm, and dnorm, functions related to are grouped into families by the action they perform. r* functions always sample from distributions: rnorm, rbinom, runif, rexp. d* functions calculate the probability density or mass of a distribution: dnorm, dbinom, dunif, dexp.
An if statement runs a branch only when its single condition is TRUE, while ifelse() is vectorized and always returns a value for each element, choosing between the yes and no arguments.
x <- FALSE
if(x) "Hi"
ifelse(x,"Hi","FALSE condition")
## [1] "FALSE condition"
library(lubridate)
greet <- function(x){
hr <- hour(x)
if(hr<12){
"Good morning"
} else if(hr>17){
"Good evening"
} else{
"Good afternoon"
}
}
greet( now() )
## [1] "Good evening"
fizzbuzz <- function(x){
stopifnot( length(x)==1 && is.numeric(x) )
if(x %% 15 == 0){ # check divisibility by 15 first; after the 3 check it would be unreachable
print("fizzbuzz")
} else if(x %% 3 == 0){
print("fizz")
} else if(x %% 5 == 0){
print("buzz")
} else{
print(x)
}
}
fizzbuzz(9)
## [1] "fizz"
I like jarnold’s explanation that cut works with vectors while if works only with single values.
temp <- seq(-10, 50, by = 5)
# this
if (temp <= 0) {
"freezing"
} else if (temp <= 10) {
"cold"
} else if (temp <= 20) {
"cool"
} else if (temp <= 30) {
"warm"
} else {
"hot"
}
## [1] "freezing"
#is converted to this
cut(temp, c(-Inf, 0, 10, 20, 30, Inf), right = TRUE,
labels = c("freezing", "cold", "cool", "warm", "hot"))
## [1] freezing freezing freezing cold cold cool cool
## [8] warm warm hot hot hot hot
## Levels: freezing cold cool warm hot
This is cool: the intervals can be equally spaced by giving an integer for breaks, or you can specify the breakpoints explicitly as above.
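For example, with the temp vector above:
cut(temp, breaks = 3) # three equal-width intervals spanning the range of temp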
from jarnold,
With a numeric first argument, switch() selects that number argument from …; for example, switch(2, "one", "two", "three") returns "two".
x <- "e"
switch(x,
a = ,
b = "ab",
c = ,
d = "cd"
)
x <- "a"
switch(x,
a = ,
b = "ab",
c = ,
d = "cd"
)
## [1] "ab"
Nothing happens for "e" because it isn't a named argument in the switch(). For "a", the matching argument is empty, so switch() falls through to the next non-empty value, returning "ab".
Here letters is set to c("a", "b", "c"), and
library(stringr)
letters = c("a", "b", "c")
commas <- function(...) stringr::str_c(..., collapse = ", ")
#commas(letters, collapse = "-")
commas(letters, collapse = "-") is supposed to concatenate the letters joined by "-". But it throws an error: the collapse argument is passed to str_c() as part of ..., so it effectively tries to run str_c(letters, collapse = "-", collapse = ", "), and collapse is matched twice.
from jarnold,
rule <- function(..., pad = "-") {
title <- paste0(...)
width <- getOption("width") - nchar(title) - 5
cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output")
## Important output ------------------------------------------------------
rule("Important output", pad = "-+")
## Important output -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
It does not work because it duplicates pad by the width minus the length of the title string. This implicitly assumes that pad is only one character. I could adjust the code to account for the length of pad. The trickiest part is handling what to do if width is not a multiple of the number of characters of pad.
rule <- function(..., pad = "-") {
title <- paste0(...)
width <- getOption("width") - nchar(title) - 5
padchar <- nchar(pad)
cat(title, " ",
stringr::str_dup(pad, width %/% padchar),
# if not multiple, fill in the remaining characters
stringr::str_sub(pad, 1, width %% padchar),
"\n", sep = "")
}
rule("Important output")
## Important output ------------------------------------------------------
rule("Important output", pad = "-+")
## Important output -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
rule("Important output", pad = "-+-")
## Important output -+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+-
The trim argument of mean() drops that fraction of observations from each end of the sorted vector before averaging. Why do this? It makes the mean robust to outliers, or to faulty data at the extremes.
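A small sketch of trim as an outlier guard (hypothetical x):
x <- c(1, 2, 3, 4, 100)
mean(x) # 22, dragged up by the outlier
mean(x, trim = 0.2) # 3: drops the lowest and highest 20% before averaging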
Those are the allowed options; if any other value is given, the function stops with an error. The first option is used by default.
From the docs,
is.finite returns a vector of the same length as x the jth element of which is TRUE if x[j] is finite (i.e., it is not one of the values NA, NaN, Inf or -Inf) and FALSE otherwise.
is.infinite returns a vector of the same length as x the jth element of which is TRUE if x[j] is infinite (i.e., equal to one of Inf or -Inf) and FALSE otherwise.
So !is.infinite() is TRUE for everything except Inf and -Inf. The key difference is NA and NaN: is.finite() is FALSE for them, while !is.infinite() is TRUE.
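A quick check of the difference (hypothetical x):
x <- c(0, NA, NaN, Inf)
is.finite(x) # TRUE FALSE FALSE FALSE
!is.infinite(x) # TRUE TRUE TRUE FALSE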
It determines whether the difference between the two numbers is less than a tolerance threshold, which defaults to 1.490116e-08, encoded as .Machine$double.eps^0.5. This is a more robust way to test equality of two floating point numbers than ==, and you can change the threshold to your liking.
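A minimal sketch of why this matters:
sqrt(2) ^ 2 == 2 # FALSE: floating point representation error
dplyr::near(sqrt(2) ^ 2, 2) # TRUE: the difference is below the tolerance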
from the docs of ?is.integer,
An integer vector can actually contain only values in the range of about +/-2*10^9. Doubles can hold much larger integers exactly.
from jarnold,
Broadly, one could convert a double to an integer by truncating or by rounding to the nearest integer. For truncating, or for handling ties (doubles ending in 0.5), there are multiple methods for determining which integer value to go to.
For rounding, R and many programming languages use the IEEE standard: "round to nearest, ties to even". This is not the same rule most people learn in school (round ties away from zero). See the value of .Machine$double.rounding and its documentation.
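For example, R's ties-to-even rule in action:
round(0.5) # 0: the tie goes to the even neighbour
round(1.5) # 2
round(2.5) # 2, not 3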
He provides a list of cool links for rounding errors.
The readr parse_* functions for those types: parse_logical(), parse_integer(), and parse_double().
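For example:
readr::parse_logical(c("TRUE", "FALSE"))
readr::parse_integer(c("1", "2", "3"))
readr::parse_double(c("1.5", "2.7"))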
The former tells you the proportion of NA values in vector x: the number of NAs divided by the total length of x. The latter returns the number of elements in x that are NA, NaN, Inf or -Inf.
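A quick illustration (hypothetical x):
x <- c(1, NA, NaN, Inf, 5)
mean(is.na(x)) # 0.4: NaN also counts as NA, so 2 of 5 elements
sum(!is.finite(x)) # 3: NA, NaN, and Inf are all non-finite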
from the help page,
is.vector returns TRUE if x is a vector of the specified mode having no attributes other than names. It returns FALSE otherwise.
is.atomic returns TRUE if x is of an atomic type (or NULL) and FALSE otherwise.
it is common to call the atomic types ‘atomic vectors’, but note that is.vector imposes further restrictions: an object can be atomic but not a vector (in that sense).
is.atomic(list()) gives FALSE and is.vector(list()) gives TRUE
setNames
## function (object = nm, nm)
## {
## names(object) <- nm
## object
## }
## <bytecode: 0x7fc562eca4a8>
## <environment: namespace:stats>
purrr::set_names
## function (x, nm = x, ...)
## {
## if (!is_vector(x)) {
## abort("`x` must be a vector")
## }
## if (is_function(nm) || is_formula(nm)) {
## nm <- as_function(nm)
## nm <- nm(names2(x), ...)
## }
## else if (!is_null(nm)) {
## nm <- as.character(nm)
## nm <- chr(nm, ...)
## }
## if (!is_null(nm) && !is_character(nm, length(x))) {
## abort("`nm` must be `NULL` or a character vector the same length as `x`")
## }
## names(x) <- nm
## x
## }
## <environment: namespace:rlang>
set_names adds a few sanity checks: x has to be a vector, and the lengths of the object and the names have to be the same.
Create functions that take a vector as input and return:
The last value. Should you use [ or [[?
The elements at even numbered positions.
Every element except the last value.
Only even numbers (and no missing values).
see jarnold
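Still, here is a minimal sketch of the four functions (the names are my own):
last_value <- function(x) x[[length(x)]] # [[ so we get the element itself, not a length-1 sub-vector
even_positions <- function(x) x[seq_along(x) %% 2 == 0]
drop_last <- function(x) x[-length(x)]
even_numbers <- function(x) x[!is.na(x) & x %% 2 == 0]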
from jarnold,
x <- c(-5:5, Inf, -Inf, NaN, NA)
x[-which(x > 0)]
## [1] -5 -4 -3 -2 -1 0 -Inf NaN NA
-which(x > 0)
## [1] -7 -8 -9 -10 -11 -12
x[x <= 0]
## [1] -5 -4 -3 -2 -1 0 -Inf NA NA
x <= 0
## [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE TRUE NA NA
-which(x > 0) works because which() calculates the indexes of the values that are TRUE and ignores NA; thus it keeps NA and NaN, since the comparison is not TRUE for them. x <= 0 works slightly differently: where the comparison is TRUE or FALSE it behaves the same way, but where it generates an NA, subsetting keeps that entry and sets it to NA. This is why the last two values of x[x <= 0] are NA rather than c(NaN, NA).
From jarnold,
When you subset with positive integers that are larger than the length of the vector, NA values are returned for those integers larger than the length of the vector.
When a vector is subset with [[ using a name that doesn't exist, an error is generated (with [, you get an NA instead).
list(a, b, list(c, d), list(e, f))
list(list(list(list(list(list(a))))))
Sorry, not drawing-see the chapter for drawing representations.
From jarnold,
Subsetting a tibble works the same way as a list; a data frame can be thought of as a list of columns. The key difference between a list and a tibble is that a tibble (data frame) has the restriction that all its elements (columns) must have the same length.
x <- hms::hms(3600)
class(x)
## [1] "hms" "difftime"
str(x)
## Classes 'hms', 'difftime' atomic [1:1] 3600
## ..- attr(*, "units")= chr "secs"
typeof(x)
## [1] "double"
attributes(x)
## $units
## [1] "secs"
##
## $class
## [1] "hms" "difftime"
library(dplyr)
tibble(x = 1:2, y = 1:3)
#Error: Column `x` must be length 1 or 3, not 2
Only if its length is 1 or equal to the lengths of the other column vectors/lists.
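For example, a list column is fine when the lengths line up (a hypothetical tibble):
tibble(x = 1:3, y = list(1, "a", TRUE))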
Think about the output, sequence, and body before you start writing the loop.
# 1
col_means <- vector("double",length = ncol(mtcars))
for(i in seq_along(mtcars)){
col_means[[i]] <- mean(mtcars[[i]])
}
col_means
## [1] 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
## [7] 17.848750 0.437500 0.406250 3.687500 2.812500
#2
types <- vector("list",length = ncol(nycflights13::flights)) #class can have multiple values
names(types) <- names(nycflights13::flights)
for(i in seq_along(nycflights13::flights)){
types[[i]] <- class(nycflights13::flights[[i]])
}
types
## $year
## [1] "integer"
##
## $month
## [1] "integer"
##
## $day
## [1] "integer"
##
## $dep_time
## [1] "integer"
##
## $sched_dep_time
## [1] "integer"
##
## $dep_delay
## [1] "numeric"
##
## $arr_time
## [1] "integer"
##
## $sched_arr_time
## [1] "integer"
##
## $arr_delay
## [1] "numeric"
##
## $carrier
## [1] "character"
##
## $flight
## [1] "integer"
##
## $tailnum
## [1] "character"
##
## $origin
## [1] "character"
##
## $dest
## [1] "character"
##
## $air_time
## [1] "numeric"
##
## $distance
## [1] "numeric"
##
## $hour
## [1] "numeric"
##
## $minute
## [1] "numeric"
##
## $time_hour
## [1] "POSIXct" "POSIXt"
#3
uniq_in_col <- vector("integer",ncol(iris))
names(uniq_in_col) <- names(iris)
for(i in seq_along(iris)){
uniq_in_col[[i]] <- length(unique(iris[[i]]))
}
uniq_in_col
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 35 23 43 22 3
#4
means <- c(-10,0,10,100)
ran_norms <- vector("list",length=length(means))
n <- 10
names(ran_norms) <- means
for(i in seq_along(ran_norms)){
ran_norms[[i]] <- rnorm(n = n, mean=means[i])
}
ran_norms
## $`-10`
## [1] -9.782210 -9.783519 -8.713145 -9.699514 -11.355749 -11.505992
## [7] -10.646917 -9.439703 -10.322458 -9.335176
##
## $`0`
## [1] 1.0560113 0.4293358 0.5480951 -1.1116816 0.1633620 -0.8094876
## [7] -0.2377104 0.3179641 -1.8586886 -1.0749297
##
## $`10`
## [1] 11.321203 10.686546 9.782033 11.400680 10.083576 10.727077 8.952497
## [8] 9.502575 10.794428 8.909179
##
## $`100`
## [1] 100.32602 99.88585 98.13351 100.06831 100.38857 100.23556 99.71547
## [8] 98.84670 100.73822 99.40213
#jarnold notes we can just make a matrix because rnorm recycles the means (vectorized operation)
matrix(rnorm(n * length(means), mean = means), ncol = n)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -10.0523338 -10.5643940 -10.20917153 -11.518092 -10.0557345
## [2,] 0.9700397 0.4048292 -0.04495856 2.218319 0.1312883
## [3,] 10.2047852 9.4786549 9.45667593 9.086414 10.6150856
## [4,] 100.5452153 100.0354454 99.84771066 100.305060 99.2797483
## [,6] [,7] [,8] [,9] [,10]
## [1,] -10.522843 -8.7190236 -10.42173 -11.05073285 -6.856809
## [2,] -1.764183 -0.7201603 2.36971 -0.03026167 -1.713422
## [3,] 8.383111 9.0472588 10.33865 9.12435959 8.243709
## [4,] 100.751077 99.7857383 99.52560 99.10466249 99.273852
#1
out <- ""
for (x in letters) {
out <- stringr::str_c(out, x)
}
#2
x <- sample(100)
sd <- 0
for (i in seq_along(x)) {
sd <- sd + (x[i] - mean(x)) ^ 2
}
sd <- sqrt(sd / (length(x) - 1))
#3
x <- runif(100)
out <- vector("numeric", length(x))
out[1] <- x[1]
for (i in 2:length(x)) {
out[i] <- out[i - 1] + x[i]
}
from jarnold: this takes advantage of the fact that many of these functions are vectorized (known to work through each element in a vector).
#1
stringr::str_c(letters, collapse = "")
## [1] "abc"
#2
x <- sample(100)
sd(x) #or
## [1] 29.01149
sqrt(sum((x - mean(x)) ^ 2) / (length(x) - 1))
## [1] 29.01149
#3-this is a cumulative summation so we just use the cumsum function
x <- runif(100)
cumsum(x)
## [1] 0.1585237 0.4955077 0.8746246 1.0127657 1.2579870 1.4299225
## [7] 1.9900154 2.7079516 3.0328263 3.4117178 4.2827850 5.2428236
## [13] 5.7240713 6.5145766 7.2684402 7.9535453 8.5283808 8.7555752
## [19] 9.2210613 9.3179354 9.4374757 9.6260160 9.9355400 10.2116301
## [25] 11.0322993 11.6172869 12.5815333 13.0098218 13.2954146 13.9561334
## [31] 14.7877616 14.8420736 15.0975497 15.5626921 16.5214859 16.8348926
## [37] 16.8597550 17.8347179 17.8502785 18.3934595 18.5537550 18.8951895
## [43] 19.1981427 19.4788238 19.5107065 19.7670737 20.7511489 20.9087983
## [49] 21.4272117 22.2743143 22.9163670 23.8264348 24.2381663 24.2630072
## [55] 25.0678384 25.2808299 25.9409406 26.1956704 27.1185219 27.6052099
## [61] 28.1732145 28.5461514 29.0448202 29.0931784 29.1818545 29.4814858
## [67] 29.8924843 30.3276582 30.6004047 31.2561873 32.1578170 32.7051952
## [73] 33.2658342 33.5007759 34.1604669 34.7843218 35.3188907 35.8099561
## [79] 35.9350050 36.4674255 36.5813578 37.0869016 37.6270703 37.8291504
## [85] 38.5884003 39.1523516 39.5579204 40.2984624 40.7025057 40.8446495
## [91] 41.0116712 41.4858749 41.7671875 41.8597004 42.1691953 42.6564239
## [97] 42.9437491 43.7527671 44.5986954 45.4420465
Write a for loop that prints() the lyrics to the children's song "Alice the camel".
Convert the nursery rhyme “ten in the bed” to a function. Generalise it to any number of people in any sleeping structure.
Convert the song “99 bottles of beer on the wall” to a function. Generalise to any number of any vessel containing any liquid on any surface.
just going to use jarnold’s solutions
library(stringr)
#1
humps <- c("five", "four", "three", "two", "one", "no")
for (i in humps) {
cat(str_c("Alice the camel has ", rep(i, 3), " humps.",
collapse = "\n"), "\n")
if (i == "no") {
cat("Now Alice is a horse.\n")
} else {
cat("So go, Alice, go.\n")
}
cat("\n")
}
## Alice the camel has five humps.
## Alice the camel has five humps.
## Alice the camel has five humps.
## So go, Alice, go.
##
## Alice the camel has four humps.
## Alice the camel has four humps.
## Alice the camel has four humps.
## So go, Alice, go.
##
## Alice the camel has three humps.
## Alice the camel has three humps.
## Alice the camel has three humps.
## So go, Alice, go.
##
## Alice the camel has two humps.
## Alice the camel has two humps.
## Alice the camel has two humps.
## So go, Alice, go.
##
## Alice the camel has one humps.
## Alice the camel has one humps.
## Alice the camel has one humps.
## So go, Alice, go.
##
## Alice the camel has no humps.
## Alice the camel has no humps.
## Alice the camel has no humps.
## Now Alice is a horse.
#2
numbers <- c("ten", "nine", "eight", "seven", "six", "five",
"four", "three", "two", "one")
for (i in numbers) {
cat(str_c("There were ", i, " in the bed\n"))
cat("and the little one said\n")
if (i == "one") {
cat("I'm lonely...")
} else {
cat("Roll over, roll over\n")
cat("So they all rolled over and one fell out.\n")
}
cat("\n")
}
## There were ten in the bed
## and the little one said
## Roll over, roll over
## So they all rolled over and one fell out.
##
## There were nine in the bed
## and the little one said
## Roll over, roll over
## So they all rolled over and one fell out.
##
## There were eight in the bed
## and the little one said
## Roll over, roll over
## So they all rolled over and one fell out.
##
## There were seven in the bed
## and the little one said
## Roll over, roll over
## So they all rolled over and one fell out.
##
## There were six in the bed
## and the little one said
## Roll over, roll over
## So they all rolled over and one fell out.
##
## There were five in the bed
## and the little one said
## Roll over, roll over
## So they all rolled over and one fell out.
##
## There were four in the bed
## and the little one said
## Roll over, roll over
## So they all rolled over and one fell out.
##
## There were three in the bed
## and the little one said
## Roll over, roll over
## So they all rolled over and one fell out.
##
## There were two in the bed
## and the little one said
## Roll over, roll over
## So they all rolled over and one fell out.
##
## There were one in the bed
## and the little one said
## I'm lonely...
#3
bottles <- function(i) {
if (i > 2) {
bottles <- str_c(i - 1, " bottles")
} else if (i == 2) {
bottles <- "1 bottle"
} else {
bottles <- "no more bottles"
}
bottles
}
beer_bottles <- function(n) {
# should test whether n >= 1.
for (i in seq(n, 1)) {
cat(str_c(bottles(i), " of beer on the wall, ", bottles(i), " of beer.\n"))
cat(str_c("Take one down and pass it around, ", bottles(i - 1),
" of beer on the wall.\n\n"))
}
cat("No more bottles of beer on the wall, no more bottles of beer.\n")
cat(str_c("Go to the store and buy some more, ", bottles(n), " of beer on the wall.\n"))
}
beer_bottles(3)
## 2 bottles of beer on the wall, 2 bottles of beer.
## Take one down and pass it around, 1 bottle of beer on the wall.
##
## 1 bottle of beer on the wall, 1 bottle of beer.
## Take one down and pass it around, no more bottles of beer on the wall.
##
## no more bottles of beer on the wall, no more bottles of beer.
## Take one down and pass it around, no more bottles of beer on the wall.
##
## No more bottles of beer on the wall, no more bottles of beer.
## Go to the store and buy some more, 2 bottles of beer on the wall.
(Note that bottles(i) returns i - 1 bottles, so the verses are off by one: beer_bottles(3) starts at 2 bottles and buys back n - 1 at the end. Making bottles(i) report i bottles would fix the lyrics.)
output <- vector("integer", 0)
for (i in seq_along(x)) {
output <- c(output, lengths(x[[i]]))
}
output
How does this affect performance? Design and execute an experiment.
How I would do this:
output <- vector("integer", 0)
system.time( for (i in seq_along(x)) {
output <- c(output, lengths(x[[i]]))
} )
## user system elapsed
## 0.004 0.000 0.004
output
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
output <- vector("integer", length(length(x)))
system.time( for (i in seq_along(x)) {
output <- c(output, lengths(x[[i]]))
} )
## user system elapsed
## 0.005 0.001 0.006
output
## [1] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Note that my second attempt doesn't really pre-allocate: length(length(x)) is 1, so output starts as a single 0, and the loop still grows it with c() anyway. How jarnold does this properly:
#not allocated
library(microbenchmark)
add_to_vector <- function(n) {
output <- vector("integer", 0)
for (i in seq_len(n)) {
output <- c(output, i)
}
output
}
microbenchmark(add_to_vector(10000), times = 3,unit = "ms")
## Unit: milliseconds
## expr min lq mean median uq max
## add_to_vector(10000) 161.2413 177.7291 199.1119 194.217 218.0472 241.8774
## neval
## 3
#pre-allocated
add_to_vector_2 <- function(n) {
output <- vector("integer", n)
for (i in seq_len(n)) {
output[[i]] <- i
}
output
}
microbenchmark(add_to_vector_2(10000), times = 3,unit = "ms")
## Unit: milliseconds
## expr min lq mean median uq
## add_to_vector_2(10000) 0.604179 0.6377025 2.011321 0.671226 2.714892
## max neval
## 4.758558 3
Dang, what a difference! I better pre-allocate…
from jarnold
# files is assumed to be a character vector of CSV file paths
df <- vector("list", length(files))
for (i in seq_along(files)) {
df[[i]] <- read_csv(files[[i]])
}
df <- bind_rows(df)
If names(x) is NULL, nothing happens: the loop has nothing to iterate over and quietly ends. If only some elements are named, the unnamed ones have the name "", and indexing with that throws an error. Duplicate names work, but name lookup always returns the first element with that name.
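A quick sketch of the duplicate-name case:
x <- c(a = 1, b = 2, a = 3)
for (nm in names(x)) print(x[[nm]]) # prints 1, 2, 1: lookup by name always finds the first match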
show_mean <- function(df, digits = 2) {
# Get max length of any variable in the dataset
maxstr <- max(str_length(names(df)))
for (nm in names(df)) {
if (is.numeric(df[[nm]])) {
cat(str_c(str_pad(str_c(nm, ":"), maxstr + 1L, side = "right"),
format(mean(df[[nm]]), digits = digits, nsmall = digits),
sep = " "),
"\n")
}
}
}
show_mean(iris)
## Sepal.Length: 5.84
## Sepal.Width: 3.06
## Petal.Length: 3.76
## Petal.Width: 1.20
trans <- list(
disp = function(x) x * 0.0163871,
am = function(x) {
factor(x, labels = c("auto", "manual"))
}
)
for (var in names(trans)) {
mtcars[[var]] <- trans[[var]](mtcars[[var]])
}
trans is a named list of functions, and the for loop applies each function to the mtcars column of the same name, replacing that column.
apply() generalises such for loops over a matrix: with MARGIN = 1 the function is applied to each row, and with MARGIN = 2 to each column.
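A minimal sketch of apply() on a matrix:
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum) # MARGIN = 1: one sum per row -> 9 12
apply(m, 2, sum) # MARGIN = 2: one sum per column -> 3 7 11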
from jarnold:
col_summary2 <- function(df, fun) {
# test whether each colum is numeric
numeric_cols <- vector("logical", length(df))
for (i in seq_along(df)) {
numeric_cols[[i]] <- is.numeric(df[[i]])
}
# indexes of numeric columns
idxs <- seq_along(df)[numeric_cols]
# number of numeric columns
n <- sum(numeric_cols)
out <- vector("double", n)
for (i in idxs) {
out[i] <- fun(df[[i]])
}
out
}
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = letters[1:10],
d = rnorm(10)
)
col_summary2(df, mean)
## [1] 0.03865459 0.07936705 0.00000000 -0.87282580
Note the stray 0.00000000 in position 3: out[i] indexes by the original column position, so the skipped character column c leaves a zero behind. Filling out with a separate counter, or using out[[which(idxs == i)]], would avoid this.
from jarnold
#1
purrr::map_dbl(mtcars, mean)
## mpg cyl disp hp drat wt
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
## qsec vs am gear carb
## 17.848750 0.437500 0.406250 3.687500 2.812500
#2
purrr::map(nycflights13::flights, class)
## $year
## [1] "integer"
##
## $month
## [1] "integer"
##
## $day
## [1] "integer"
##
## $dep_time
## [1] "integer"
##
## $sched_dep_time
## [1] "integer"
##
## $dep_delay
## [1] "numeric"
##
## $arr_time
## [1] "integer"
##
## $sched_arr_time
## [1] "integer"
##
## $arr_delay
## [1] "numeric"
##
## $carrier
## [1] "character"
##
## $flight
## [1] "integer"
##
## $tailnum
## [1] "character"
##
## $origin
## [1] "character"
##
## $dest
## [1] "character"
##
## $air_time
## [1] "numeric"
##
## $distance
## [1] "numeric"
##
## $hour
## [1] "numeric"
##
## $minute
## [1] "numeric"
##
## $time_hour
## [1] "POSIXct" "POSIXt"
purrr::map_chr(nycflights13::flights, typeof)
## year month day dep_time sched_dep_time
## "integer" "integer" "integer" "integer" "integer"
## dep_delay arr_time sched_arr_time arr_delay carrier
## "double" "integer" "integer" "double" "character"
## flight tailnum origin dest air_time
## "integer" "character" "character" "character" "double"
## distance hour minute time_hour
## "double" "double" "double" "double"
#3
purrr::map_int(iris, ~ length(unique(.)))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 35 23 43 22 3
#4
purrr::map(c(-10, 0, 10, 100), rnorm, n = 10)
## [[1]]
## [1] -9.714045 -9.337733 -8.292421 -9.387098 -9.041342 -9.443563
## [7] -10.318711 -8.969552 -9.422471 -10.758078
##
## [[2]]
## [1] 1.2765188 -0.8132919 -0.5986963 0.4913592 0.9580939 1.5092699
## [7] -0.3148165 1.5760822 -0.1095791 0.9415330
##
## [[3]]
## [1] 10.275683 9.787535 10.223332 10.330234 8.878974 9.231961 9.299927
## [8] 8.834318 9.746394 8.848766
##
## [[4]]
## [1] 99.03635 99.74021 101.82065 100.12630 99.49620 101.29579 99.98857
## [8] 99.59608 101.27721 101.52840
from jarnold
map_lgl(mtcars, is.factor)
## mpg cyl disp hp drat wt qsec vs am gear carb
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
purrr::map(1:5,runif)
## [[1]]
## [1] 0.1057802
##
## [[2]]
## [1] 0.7530251 0.3675610
##
## [[3]]
## [1] 0.2397398 0.7586907 0.4187146
##
## [[4]]
## [1] 0.7071993 0.5419450 0.6088576 0.5425923
##
## [[5]]
## [1] 0.6672391 0.9163422 0.2454936 0.2145345 0.4969023
This applies runif(n) once per element of 1:5, i.e. runif(1), runif(2), ..., runif(5).
#1
purrr::map(-2:2, rnorm, n = 5)
## [[1]]
## [1] -3.0809401 -1.8159872 -1.0138979 -1.1328492 0.4046854
##
## [[2]]
## [1] -0.5246244 -1.7789792 -0.9934616 -0.5246491 1.5434798
##
## [[3]]
## [1] -0.3253424 0.8292342 -0.8311809 -0.6667747 -1.7118878
##
## [[4]]
## [1] 0.9946182 0.2268039 0.7243463 0.4898557 -0.3413743
##
## [[5]]
## [1] 0.9239205 2.7275130 1.7610069 2.5645018 1.2867388
#2
#purrr::map_dbl(-2:2, rnorm, n = 5)
#1 uses each value in -2:2 as the mean and generates 5 rnorm draws per mean, returning a list of vectors. #2 throws an error because map_dbl expects each call to return a single double, not a length-5 vector. To get one flat double vector, we can wrap the map() call in flatten_dbl().
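For example, the fix looks like:
purrr::flatten_dbl(purrr::map(-2:2, rnorm, n = 5)) # one double vector of length 25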
Put mtcars into a list, and the map functions can then iterate over it.
purrr::map( list(mtcars), ~lm(mpg ~ wt, data = .) )
## [[1]]
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
from jarnold
# Use ... to pass arguments to the function
every2 <- function(.x, .p, ...) {
for (i in .x) {
if (!.p(i, ...)) {
# If any is FALSE we know not all of them were TRUE
return(FALSE)
}
}
# if nothing was FALSE, then it is TRUE
TRUE
}
every2(1:3, function(x) {x > 1})
## [1] FALSE
every2(1:3, function(x) {x > 0})
## [1] TRUE
The function purrr::every does fancy things with .p, like taking a logical vector instead of a function, or being able to test part of a string if the elements of .x are lists.
from jarnold
#Note this question has a typo. It is referring to col_summary.
#I will use map to apply the function to all the columns, and keep to only select numeric columns.
col_sum2 <- function(df, f, ...) {
purrr::map(keep(df, is.numeric), f, ...)
}
col_sum2(iris, mean)
## $Sepal.Length
## [1] 5.843333
##
## $Sepal.Width
## [1] 3.057333
##
## $Petal.Length
## [1] 3.758
##
## $Petal.Width
## [1] 1.199333
col_sum3 <- function(df, f) {
is_num <- sapply(df, is.numeric)
df_num <- df[, is_num]
sapply(df_num, f)
}
But it has a number of bugs as illustrated with the following inputs:
df <- dplyr::tibble(
x = 1:3,
y = 3:1,
z = c("a", "b", "c")
)
# OK
col_sum3(df, mean)
# Has problems: don't always return numeric vector
#col_sum3(df[1:2], mean)
#col_sum3(df[1], mean)
#col_sum3(df[0], mean)
What causes the bugs?
from jarnold
The problem is that sapply doesn’t always return numeric vectors. If no columns are selected, instead of gracefully exiting, it returns an empty list. This causes an error since we can’t use a list with [.
sapply(df[0], is.numeric)
## named list()
sapply(df[1], is.numeric)
## x
## TRUE
sapply(df[1:2], is.numeric)
## x y
## TRUE TRUE
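A safer variant (my own sketch, not jarnold's) uses vapply(), which guarantees the return type and degrades gracefully:
col_sum3b <- function(df, f) {
is_num <- vapply(df, is.numeric, logical(1)) # always a logical vector, even for 0 columns
vapply(df[is_num], f, numeric(1))
}
col_sum3b(df[0], mean) # returns numeric(0) instead of erroring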
No Exercises
library(dplyr)
library(modelr)
sim1a <- tibble(
x = rep(1:10, each = 3),
y = x * 1.5 + 6 + rt(length(x), df = 2)
)
Let’s view this with a model using ggplot
library(ggplot2)
ggplot(sim1a, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm",se = F)
To see what happens when we rerun and generate new sim1a datasets, I'll use the solution from jarnold, which simulates this repeatedly.
simt <- function(i) {
tibble(
x = rep(1:10, each = 3),
y = x * 1.5 + 6 + rt(length(x), df = 2),
.id = i
)
}
lm_df <- function(.data) {
mod <- lm(y ~ x, data = .data)
beta <- coef(mod)
tibble(intercept = beta[1], slope = beta[2])
}
sims <- purrr::map(1:100, simt) %>%
purrr::map_df(lm_df)
ggplot(sims, aes(x = intercept, y = slope)) +
geom_point()
This shows that the slope decreases as the intercept increases, which in turn is governed by where the data lie. The rt noise with df = 2 is heavy-tailed, so outliers are common and the fitted slopes and intercepts vary a lot from simulation to simulation.
measure_distance <- function(mod, data) {
diff <- data$y - make_prediction(mod, data)
mean(abs(diff))
}
Use optim() to fit this model to the simulated data above and compare it to the linear model.
First, there is no make_prediction function, so I won't be using that measure as written. However, I'll still use optim() to minimize a distance.
sim1a <- tibble(
x = rep(1:10, each = 3),
y = x * 1.5 + 6 + rt(length(x), df = 2)
)
model1 <- function(a, data) {
a[1] + data$x * a[2]
}
measure_distance <- function(mod, data) {
diff <- data$y - model1(mod, data)
sqrt(mean(diff ^ 2))
}
#finds minimum distance from the initial values
best <- optim(par=c(0, 0), fn=measure_distance, data = sim1a)
#best model (intercept, slope) values
best$par
## [1] 8.513129 1.224304
model1 <- function(a, data) {
a[1] + data$x * a[2] + a[3]
}
Let’s see
sim1a <- tibble(
x = rep(1:10, each = 3),
y = x * 1.5 + 6 + rt(length(x), df = 2)
)
measure_distance <- function(mod, data) {
diff <- data$y - model1(mod, data)
sqrt(mean(diff ^ 2))
}
#finds minimum distance from the initial values
best <- optim(par=c(0, 0, 0), fn=measure_distance, data = sim1a)
#best model (intercept, slope) values
best$par
## [1] -5.2391455 0.9267704 15.3017119
It runs. However, the model can't disambiguate between a[1] and a[3]: only their sum is identified, so infinitely many (a[1], a[3]) pairs give exactly the same fit, and optim can return any of them. That's a problem (and it undercuts the term "optimum", which implies a single solution).
I’ll use the solution from jarnold, but I’ll detail his process
sim1
## # A tibble: 30 x 2
## x y
## <int> <dbl>
## 1 1 4.199913
## 2 1 7.510634
## 3 1 2.125473
## 4 2 8.988857
## 5 2 10.243105
## 6 2 11.296823
## 7 3 7.356365
## 8 3 10.505349
## 9 3 10.511601
## 10 4 12.434589
## # ... with 20 more rows
sim1_loess <- loess(y ~ x, data = sim1)
sim1_loess
## Call:
## loess(formula = y ~ x, data = sim1)
##
## Number of Observations: 30
## Equivalent Number of Parameters: 4.19
## Residual Standard Error: 2.262
grid_loess <- sim1 %>%
add_predictions(sim1_loess)
grid_loess
## # A tibble: 30 x 3
## x y pred
## <int> <dbl> <dbl>
## 1 1 4.199913 5.338000
## 2 1 7.510634 5.338000
## 3 1 2.125473 5.338000
## 4 2 8.988857 8.274913
## 5 2 10.243105 8.274913
## 6 2 11.296823 8.274913
## 7 3 7.356365 10.809582
## 8 3 10.505349 10.809582
## 9 3 10.511601 10.809582
## 10 4 12.434589 12.779762
## # ... with 20 more rows
sim1 <- sim1 %>%
add_residuals(sim1_loess, var = "resid_loess") %>%
add_predictions(sim1_loess, var = "pred_loess")
sim1
## # A tibble: 30 x 4
## x y resid_loess pred_loess
## <int> <dbl> <dbl> <dbl>
## 1 1 4.199913 -1.1380871 5.338000
## 2 1 7.510634 2.1726340 5.338000
## 3 1 2.125473 -3.2125273 5.338000
## 4 2 8.988857 0.7139447 8.274913
## 5 2 10.243105 1.9681928 8.274913
## 6 2 11.296823 3.0219105 8.274913
## 7 3 7.356365 -3.4532176 10.809582
## 8 3 10.505349 -0.3042329 10.809582
## 9 3 10.511601 -0.2979815 10.809582
## 10 4 12.434589 -0.3451730 12.779762
## # ... with 20 more rows
plot_sim1_loess <-
ggplot(sim1, aes(x = x, y = y)) +
geom_point() +
geom_line(aes(x = x, y = pred), data = grid_loess, colour = "red")
plot_sim1_loess
You could also use method = "loess" within geom_smooth().
From the help of add_predictions:
add_predictions adds a single new column, .pred, to the input data. spread_predictions adds one column for each model. gather_predictions adds two columns .model and .pred, and repeats the input rows for each model.
So gather_predictions adds a model column as well, with a single column for the predictions; spread_predictions doesn't, and instead uses each model's name as a prediction column name.
geom_ref_line() comes from modelr (not ggplot2), and it adds a reference line (horizontal or vertical) for easy comparison with the plotted data. It's good to use when showing residuals, because you can see whether your model is appropriate for your data (roughly random residuals around zero) or not (larger residuals in certain places). This can help you choose a useful model.
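For instance, a hypothetical residual plot with a reference line at zero, using the resid_loess column computed above:
ggplot(sim1, aes(x, resid_loess)) +
modelr::geom_ref_line(h = 0) +
geom_point()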
Like I mentioned in Q3, it would be useful to see the distribution of the residuals. geom_freqpoly() helps with that, so any patterns pop out at you visually.
mod2 <- lm(y ~ x, data = sim2)
grid <- sim2 %>%
data_grid(x) %>%
add_predictions(mod2)
ggplot(sim2, aes(x)) +
geom_point(aes(y = y)) +
geom_point(data = grid, aes(y = pred), colour = "red", size = 4) +
ggtitle("With intercept")
mod2 <- lm(y ~ x - 1, data = sim2)
grid <- sim2 %>%
data_grid(x) %>%
add_predictions(mod2)
ggplot(sim2, aes(x)) +
geom_point(aes(y = y)) +
geom_point(data = grid, aes(y = pred), colour = "red", size = 4) +
ggtitle("Without intercept")
Nothing happens to the predictions. Removing the intercept just drops the intercept column from the model matrix; in its place the previously omitted level xa gets its own column, because without an intercept every level of the factor must be represented explicitly.
model_matrix(sim2, y ~ x)
## # A tibble: 40 x 4
## `(Intercept)` xb xc xd
## <dbl> <dbl> <dbl> <dbl>
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 1 0 0 0
## 5 1 0 0 0
## 6 1 0 0 0
## 7 1 0 0 0
## 8 1 0 0 0
## 9 1 0 0 0
## 10 1 0 0 0
## # ... with 30 more rows
model_matrix(sim2, y ~ x - 1)
## # A tibble: 40 x 4
## xa xb xc xd
## <dbl> <dbl> <dbl> <dbl>
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 1 0 0 0
## 5 1 0 0 0
## 6 1 0 0 0
## 7 1 0 0 0
## 8 1 0 0 0
## 9 1 0 0 0
## 10 1 0 0 0
## # ... with 30 more rows
model_matrix(sim3, y ~ x1 + x2) %>% head()
## # A tibble: 6 x 5
## `(Intercept)` x1 x2b x2c x2d
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0 0 0
## 2 1 1 0 0 0
## 3 1 1 0 0 0
## 4 1 1 1 0 0
## 5 1 1 1 0 0
## 6 1 1 1 0 0
model_matrix(sim3, y ~ x1 * x2) %>% head()
## # A tibble: 6 x 8
## `(Intercept)` x1 x2b x2c x2d `x1:x2b` `x1:x2c` `x1:x2d`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0 0 0 0 0 0
## 2 1 1 0 0 0 0 0 0
## 3 1 1 0 0 0 0 0 0
## 4 1 1 1 0 0 1 0 0
## 5 1 1 1 0 0 1 0 0
## 6 1 1 1 0 0 1 0 0
model_matrix(sim4, y ~ x1 + x2) %>% head()
## # A tibble: 6 x 3
## `(Intercept)` x1 x2
## <dbl> <dbl> <dbl>
## 1 1 -1 -1.0000000
## 2 1 -1 -1.0000000
## 3 1 -1 -1.0000000
## 4 1 -1 -0.7777778
## 5 1 -1 -0.7777778
## 6 1 -1 -0.7777778
model_matrix(sim4, y ~ x1 * x2) %>% head()
## # A tibble: 6 x 4
## `(Intercept)` x1 x2 `x1:x2`
## <dbl> <dbl> <dbl> <dbl>
## 1 1 -1 -1.0000000 1.0000000
## 2 1 -1 -1.0000000 1.0000000
## 3 1 -1 -1.0000000 1.0000000
## 4 1 -1 -0.7777778 0.7777778
## 5 1 -1 -0.7777778 0.7777778
## 6 1 -1 -0.7777778 0.7777778
It's a good shorthand because I don't have to manually specify the main effects and every interaction term separately.
#a_0 + a_1 * x1 + a_2 * x2
mod1 <- lm(y ~ x1 + x2, data = sim3)
#a_0 + a_1 * x1 + a_2 * x2 + a_12 * x1:x2
mod2 <- lm(y ~ x1 * x2, data = sim3)
I'm not writing out the functions, but basically: for each continuous predictor you make one column holding its values. Categorical predictors are spread into separate 0/1 indicator columns, one per level (minus the reference level when there's an intercept). For an interaction between two categorical predictors, the column is 1 only if both indicators are 1, and 0 otherwise; when a continuous predictor is involved, the interaction column is the product of the two columns.
To better see the pattern, instead of points I'd use a loess smooth in ggplot and look at how the line deviates from 0; residuals that are systematically larger in particular areas indicate a bad model.
mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
sim3 <- sim3 %>%
gather_residuals(mod1, mod2)
ggplot(sim3, aes(x1, resid, colour = x2)) +
geom_smooth(method="loess") +
facet_grid(model ~ x2)
The bright vertical stripes in the lprice vs. lcarat plot represent areas with very high counts (very dense) of diamonds. These stripes sit at round, human-friendly carat values.
from jarnold,
A 1% increase in carat is associated with an a_1 % increase in price: since lprice = a_0 + a_1 * lcarat is a log-log model, the slope a_1 is an elasticity.
library(dplyr)
library(ggplot2)
library(modelr)
diamonds2 <- diamonds %>%
filter(carat <= 2.5) %>%
mutate(lprice = log2(price), lcarat = log2(carat))
mod_diamond <- lm(lprice ~ lcarat, data = diamonds2)
diamonds2 <- diamonds2 %>%
add_residuals(mod_diamond, "lresid")
resid_quants <- quantile(diamonds2$lresid)
filtered <- diamonds2 %>%
filter(
# keep the middle 50% of residuals; note | rather than &, since no
# value can be below the 25th and above the 75th percentile at once
!((lresid < resid_quants[["25%"]]) | (lresid > resid_quants[["75%"]]))
)
ggplot(filtered, aes(cut, price)) + geom_boxplot()
ggplot(filtered, aes(color, price)) + geom_boxplot()
ggplot(filtered, aes(clarity, price)) + geom_boxplot()
From the plots, they don’t seem to be pricing errors since there aren’t any strong trends of cut, color or clarity with price after filtering for extreme values of residuals in the predictions.
mod_diamond2 <- lm(lprice ~ lcarat + color + cut + clarity, data = diamonds2)
diamonds2 <- diamonds2 %>%
add_residuals(mod_diamond2, "lresid2")
diamonds2 %>% ggplot() +
geom_histogram(aes(lresid),binwidth=.05)
from jarnold,
diamonds2 %>%
add_predictions(mod_diamond2) %>%
add_residuals(mod_diamond2) %>%
summarise(sq_err = sqrt(mean(resid^2)),
abs_err = mean(abs(resid)),
p975_err = quantile(resid, 0.975),
p025_err = quantile(resid, 0.025))
## # A tibble: 1 x 4
## sq_err abs_err p975_err p025_err
## <dbl> <dbl> <dbl> <dbl>
## 1 0.191524 0.1491158 0.3844299 -0.3692446
The average squared and absolute errors are 2^0.19 ≈ 1.14 and 2^0.15 ≈ 1.11 (the residuals are on the log2 scale), so on average the error is about +10-15%. And the 95% range of residuals is about 2^0.38 ≈ 1.3, so within ±30%. This doesn't seem terrible to me.
from jarnold,
These are the Sundays before the Monday holidays: Martin Luther King Day, Memorial Day, and Labor Day.
library("nycflights13")
daily <- flights %>%
mutate(date = make_date(year, month, day)) %>%
group_by(date) %>%
summarise(n = n())
daily <- daily %>%
mutate(wday = wday(date, label = TRUE))
term <- function(date) {
cut(date,
breaks = ymd(20130101, 20130605, 20130825, 20140101),
labels = c("spring", "summer", "fall")
)
}
daily <- daily %>%
mutate(term = term(date))
mod3 <- MASS::rlm(n ~ wday * term, data = daily)
daily <- daily %>%
add_residuals(mod3, "resid")
daily %>%
top_n(3, resid)
## # A tibble: 3 x 5
## date n wday term resid
## <date> <int> <ord> <fctr> <dbl>
## 1 2013-11-30 857 Sat fall 160.1007
## 2 2013-12-21 811 Sat fall 114.1007
## 3 2013-12-28 814 Sat fall 117.1007
These three dates are Saturdays before or after major holidays (Thanksgiving and Christmas). Travel away from home on those days is probably particularly high.
from jarnold,
daily <- daily %>%
mutate(wday2 =
case_when(.$wday == "Sat" & .$term == "summer" ~ "Sat-summer",
.$wday == "Sat" & .$ term == "fall" ~ "Sat-fall",
.$wday == "Sat" & .$term == "spring" ~ "Sat-spring",
TRUE ~ as.character(.$wday)))
mod2 <- lm(n ~ wday * term, data = daily)
mod4 <- lm(n ~ wday2, data = daily)
daily %>%
gather_residuals(sat_term = mod4, all_interact = mod2) %>%
ggplot(aes(date, resid, colour = model)) +
geom_line(alpha = 0.75)
I think the overlapping plot is hard to understand. If we are interested in the differences, it is better to plot the differences directly. In this code I use spread_residuals to add one column per model, rather than gather_residuals which creates a new row for each model.
daily %>%
spread_residuals(sat_term = mod4, all_interact = mod2) %>%
mutate(resid_diff = sat_term - all_interact) %>%
ggplot(aes(date, resid_diff)) +
geom_line(alpha = 0.75)
The model with terms x Saturday has higher residuals in the fall, and lower residuals in the spring than the model with all interactions.
Using overall model comparison terms, mod4 has a lower R2 and a higher regression standard error (sigma), despite using fewer variables. More importantly for prediction purposes, it has a higher AIC, which is an estimate of the out-of-sample error.
library(broom)
glance(mod4) %>% select(r.squared, sigma, AIC, df)
## r.squared sigma AIC df
## 1 0.7356615 47.35969 3862.885 9
glance(mod2) %>% select(r.squared, sigma, AIC, df)
## r.squared sigma AIC df
## 1 0.757289 46.16568 3855.73 21
from jarnold,
The question is unclear about how to handle the public holidays. We could include a dummy variable for all public holidays, or for the Sundays before public holidays.
Including a level for the public holidays themselves is insufficient because (1) public holidays' effects on travel vary dramatically, (2) the effect can occur on the day itself or on the days before and after, and (3) around Thanksgiving and Christmas there are increases in travel as well.
daily <- daily %>%
mutate(wday3 =
case_when(
.$date %in% lubridate::ymd(c(20130101, # new years
20130121, # mlk
20130218, # presidents
20130527, # memorial
20130704, # independence
20130902, # labor
20131028, # columbus
20131111, # veterans
20131128, # thanksgiving
20131225)) ~
"holiday",
.$wday == "Sat" & .$term == "summer" ~ "Sat-summer",
.$wday == "Sat" & .$ term == "fall" ~ "Sat-fall",
.$wday == "Sat" & .$term == "spring" ~ "Sat-spring",
TRUE ~ as.character(.$wday)))
mod5 <- lm(n ~ wday3, data = daily)
daily %>%
spread_residuals(mod5) %>%
arrange(desc(abs(resid))) %>%
slice(1:20) %>% select(date, wday, resid)
## # A tibble: 20 x 3
## date wday resid
## <date> <ord> <dbl>
## 1 2013-11-28 Thurs -347.53717
## 2 2013-11-29 Fri -321.43256
## 3 2013-07-04 Thurs -257.03213
## 4 2013-12-25 Wed -248.56250
## 5 2013-12-24 Tues -200.14703
## 6 2013-12-31 Tues -185.14703
## 7 2013-09-01 Sun -182.08451
## 8 2013-07-05 Fri -170.03213
## 9 2013-05-26 Sun -164.49563
## 10 2013-11-30 Sat 160.10068
## 11 2013-12-28 Sat 117.10068
## 12 2013-12-21 Sat 114.10068
## 13 2013-01-01 Tues -110.83875
## 14 2013-01-20 Sun -107.49563
## 15 2013-12-01 Sun 86.91549
## 16 2013-02-03 Sun -79.49563
## 17 2013-01-19 Sat -72.81905
## 18 2013-01-27 Sun -70.49563
## 19 2013-01-26 Sat -66.81905
## 20 2013-01-13 Sun -65.49563
With only 4-5 observations of each weekday per month, there is little data per parameter, so the predictive power is small.
Also, there isn't really a weekday trend that varies smoothly across months; the real deviations are around holidays and other miscellaneous events, so a wday * month model wouldn't capture much.
from jarnold,
flights %>%
mutate(date = make_date(year, month, day),
wday = wday(date, label = TRUE)) %>%
group_by(wday) %>%
summarise(dist_mean = mean(distance),
dist_median = median(distance)) %>%
ggplot(aes(y = dist_mean, x = wday)) +
geom_point()
flights %>%
mutate(date = make_date(year, month, day),
wday = wday(date, label = TRUE)) %>%
group_by(wday, hour) %>%
summarise(dist_mean = mean(distance),
dist_median = median(distance)) %>%
ggplot(aes(y = dist_mean, x = hour, colour = wday)) +
geom_point() +
geom_line()
Maybe someone can look at the Sunday evening/Monday scheduled arrivals?
monday_first <- function(x) {
forcats::fct_relevel(x, levels(x)[-1])
}
daily <- daily %>%
mutate(wday = wday(date, label = TRUE))
ggplot(daily, aes(monday_first(wday), n)) +
geom_boxplot() +
labs(x = "Day of Week", y = "Number of flights")
library(modelr)
library(tidyverse)
if(!require(gapminder)){install.packages("gapminder")}
library(gapminder)
by_country <- gapminder %>%
group_by(country, continent) %>%
nest()
by_country <- by_country %>%
mutate(
one_mod = purrr::map(data, ~ lm(lifeExp ~ poly(year,1),data = .)),
two_mod = purrr::map(data, ~ lm(lifeExp ~ poly(year,2),data = .)),
three_mod = purrr::map(data, ~ lm(lifeExp ~ poly(year,3),data = .))
)
by_country <- by_country %>%
mutate(
one_resids = purrr::map2(data, one_mod, add_residuals),
two_resids = purrr::map2(data, two_mod, add_residuals),
three_resids = purrr::map2(data, three_mod, add_residuals)
)
one_resids <- unnest(by_country, one_resids)
two_resids <- unnest(by_country, two_resids)
three_resids <- unnest(by_country, three_resids)
plot <- function(df,title){
df %>%
ggplot(aes(year, resid)) +
geom_line(aes(group = country), alpha = 1 / 3) +
geom_smooth(se = FALSE) +
ggtitle(title)
}
plot(one_resids,"Linear model")
plot(two_resids,"Degree 2 polynomial model")
plot(three_resids,"Degree 3 polynomial model")
glance <- by_country %>%
mutate(one_glance = purrr::map(one_mod, broom::glance),
two_glance = purrr::map(two_mod, broom::glance),
three_glance = purrr::map(three_mod, broom::glance)) %>%
unnest(one_glance,two_glance,three_glance, .drop = TRUE)
glance %>%
ggplot() +
geom_point(aes("model_degree1",median(r.squared)),color="blue") +
geom_point(aes("model_degree2",median(r.squared1)),color="red") +
geom_point(aes("model_degree3",median(r.squared2)),color="green")
I think I violated the "no copy/paste more than twice" rule, but that's learning for another time. The point is that the fit improves as the polynomial degree of the model increases. How can the coefficients be interpreted? They allow extra flexibility in fitting the data, so the quadratic and cubic terms account for deviations from a linear trend.
from jarnold,
library(gapminder)
country_model <- function(df) {
lm(lifeExp ~ poly(year - median(year), 2), data = df)
}
by_country <- gapminder %>%
group_by(country, continent) %>%
nest()
by_country <- by_country %>%
mutate(model = purrr::map(data, country_model))
library("ggbeeswarm")
by_country %>%
mutate(glance = purrr::map(model, broom::glance)) %>%
unnest(glance, .drop = TRUE) %>%
ggplot(aes(continent, r.squared)) +
geom_beeswarm()
glance <- by_country %>%
mutate(glance = purrr::map(model, broom::glance)) %>%
unnest(glance)
glance
## # A tibble: 142 x 15
## country continent data model r.squared
## <fctr> <fctr> <list> <list> <dbl>
## 1 Afghanistan Asia <tibble [12 x 4]> <S3: lm> 0.9892058
## 2 Albania Europe <tibble [12 x 4]> <S3: lm> 0.9615523
## 3 Algeria Africa <tibble [12 x 4]> <S3: lm> 0.9920480
## 4 Angola Africa <tibble [12 x 4]> <S3: lm> 0.9786354
## 5 Argentina Americas <tibble [12 x 4]> <S3: lm> 0.9955863
## 6 Australia Oceania <tibble [12 x 4]> <S3: lm> 0.9927021
## 7 Austria Europe <tibble [12 x 4]> <S3: lm> 0.9946153
## 8 Bahrain Asia <tibble [12 x 4]> <S3: lm> 0.9973157
## 9 Bangladesh Asia <tibble [12 x 4]> <S3: lm> 0.9972774
## 10 Belgium Europe <tibble [12 x 4]> <S3: lm> 0.9957497
## # ... with 132 more rows, and 10 more variables: adj.r.squared <dbl>,
## # sigma <dbl>, statistic <dbl>, p.value <dbl>, df <int>, logLik <dbl>,
## # AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
We can give the data variable directly to ggplot
country_model <- function(df) {
lm(lifeExp ~ poly(year,1), data = df)
}
by_country <- by_country %>%
mutate(model = purrr::map(data, country_model))
glance <- by_country %>%
mutate(glance = purrr::map(model, broom::glance)) %>%
unnest(glance)
glance %>%
filter(r.squared < 0.25) %>%
select(country,data) %>%
unnest() %>%
ggplot() +
geom_line(aes(year,lifeExp,color=country))
Interesting: with a degree-1 polynomial these 6 countries fall below an r-squared of 0.25, but with the degree-2 model only Rwanda remains. The quadratic term gives the model enough flexibility to fit those other 5 countries' data.
from jarnold: many of the stringr functions, e.g. str_split(), which takes a character vector and returns a list.
range(), fivenum(), quantile(), etc.
mtcars %>%
group_by(cyl) %>%
summarise(q = list(quantile(mpg))) %>%
unnest()
## # A tibble: 15 x 2
## cyl q
## <dbl> <dbl>
## 1 4 21.40
## 2 4 22.80
## 3 4 26.00
## 4 4 30.40
## 5 4 33.90
## 6 6 17.80
## 7 6 18.65
## 8 6 19.70
## 9 6 21.00
## 10 6 21.40
## 11 8 10.40
## 12 8 14.40
## 13 8 15.20
## 14 8 16.25
## 15 8 19.20
What's missing are the names of the quantiles (0%, 25%, 50%, 75%, 100%): unnest() drops the names from the vectors quantile() returned, so we lose which value corresponds to which quantile.
mtcars %>%
group_by(cyl) %>%
summarise_each(funs(list))
## # A tibble: 3 x 11
## cyl mpg disp hp drat wt qsec
## <dbl> <list> <list> <list> <list> <list> <list>
## 1 4 <dbl [11]> <dbl [11]> <dbl [11]> <dbl [11]> <dbl [11]> <dbl [11]>
## 2 6 <dbl [7]> <dbl [7]> <dbl [7]> <dbl [7]> <dbl [7]> <dbl [7]>
## 3 8 <dbl [14]> <dbl [14]> <dbl [14]> <dbl [14]> <dbl [14]> <dbl [14]>
## # ... with 4 more variables: vs <list>, am <list>, gear <list>,
## # carb <list>
Seems like this collapses each non-grouped variable into a list of its values per group, keeping one row per grouped value while still carrying all the other variables' data. I think this might streamline running analyses per group, but I'm not sure of the advantage beyond compacting the data into view.
from jarnold,
The lengths() function gets the lengths of each element in a list. It could be useful for testing whether all elements in a list-column are the same length. You could get the maximum length to determine how many atomic vector columns to create. It is also a replacement for something like map_int(x, length) or sapply(x, length).
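For example:
lengths(list(1:3, "a", NULL)) # 3 1 0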
Most common types would be character, integer, numeric, logical, and factor. Lists are non-atomic, meaning they can hold a mixture of the mentioned object types (including other lists).
Part IV Communicate
26. Introduction
No Exercises
See _MyNotebook.Rmd
Similar output: Both produce an html-rendered file.
Different output: the notebook's file ending is *.nb.html, whereas the R Markdown ending is *.html. Also, the notebook has a Code button in the top right to hide/show the code or download the Rmd.
Similar input: Same 3 components: YAML header, mixed prose and embedded code.
Different input: YAML header in the notebook has output: html_notebook and the RMarkdown variant is output: html_document.
Similar output: Some file format with content rendered.
Different output: html, word, or pdf document.
Similar input: Same content.
Different input: In the YAML, the output was either html_document, word_document, or pdf_document.
See BriefCV.Rmd
See BriefCV.Rmd
See Diamond-sizes.Rmd
See section Exercise 27.4.7 #1 description in Diamond-sizes.Rmd.
See section Exercise 27.4.7 #2 description in Diamond-sizes.Rmd.
I couldn’t get a global commas command unfortunately…
See section Exercise 27.4.7 #2 description in Diamond-sizes.Rmd.
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] microbenchmark_1.4-2.1 lubridate_1.6.0 stringr_1.2.0
## [4] forcats_0.2.0 viridis_0.4.0 viridisLite_0.2.0
## [7] ggbeeswarm_0.6.0 lvplot_0.2.0 ggstance_0.3
## [10] nycflights13_0.2.2 bindrcpp_0.2 maps_3.2.0
## [13] dplyr_0.7.2 purrr_0.2.3 readr_1.1.1
## [16] tidyr_0.6.3 tibble_1.3.3 ggplot2_2.2.1
## [19] tidyverse_1.1.1
##
## loaded via a namespace (and not attached):
## [1] beeswarm_0.2.3 reshape2_1.4.2 haven_1.1.0 lattice_0.20-35
## [5] colorspace_1.3-2 htmltools_0.3.6 yaml_2.1.14 rlang_0.1.1
## [9] foreign_0.8-69 glue_1.1.1 modelr_0.1.1 readxl_1.0.0
## [13] bindr_0.1 plyr_1.8.4 munsell_0.4.3 gtable_0.2.0
## [17] cellranger_1.1.0 rvest_0.3.2 codetools_0.2-15 psych_1.7.5
## [21] evaluate_0.10.1 knitr_1.16 vipor_0.4.5 parallel_3.4.1
## [25] broom_0.4.2 Rcpp_0.12.12 scales_0.4.1 backports_1.1.0
## [29] jsonlite_1.5 gridExtra_2.2.1 mnormt_1.5-5 hms_0.3
## [33] digest_0.6.12 stringi_1.1.5 grid_3.4.1 rprojroot_1.2
## [37] tools_3.4.1 magrittr_1.5 lazyeval_0.2.0 pkgconfig_2.0.1
## [41] xml2_1.1.1 assertthat_0.2.0 rmarkdown_1.6 httr_1.2.1
## [45] R6_2.2.2 nlme_3.1-131 compiler_3.4.1