Title: | Create Datasets with Identical Summary Statistics |
---|---|
Description: | Anscombe's quartet are a set of four two-variable datasets that have several common summary statistics but which have very different joint distributions. This becomes apparent when the data are plotted, which illustrates the importance of using graphical displays in Statistics. This package enables the creation of datasets that have identical marginal sample means and sample variances, sample correlation, least squares regression coefficients and coefficient of determination. The user supplies an initial dataset, which is shifted, scaled and rotated in order to achieve target summary statistics. The general shape of the initial dataset is retained. The target statistics can be supplied directly or calculated based on a user-supplied dataset. The 'datasauRus' package <https://cran.r-project.org/package=datasauRus> provides further examples of datasets that have markedly different scatter plots but share many sample summary statistics. |
Authors: | Paul J. Northrop [aut, cre, cph] |
Maintainer: | Paul J. Northrop <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1.0 |
Built: | 2024-11-23 05:33:27 UTC |
Source: | https://github.com/paulnorthrop/anscombiser |
Anscombe's quartet (Anscombe, 1973) are a set of four two-variable datasets that have several common summary statistics but which have very different joint distributions. This becomes apparent when the data are plotted, which illustrates the importance of using graphical displays in Statistics. This package enables the creation of datasets that have identical marginal sample means and sample variances, sample correlation, least squares regression coefficients and coefficient of determination. The user supplies an initial dataset, which is shifted, scaled and rotated in order to achieve target summary statistics. The general shape of the initial dataset is retained. The target statistics can be supplied directly or calculated based on a user-supplied dataset.
The main functions in anscombiser are
anscombise, which modifies a user-supplied dataset so that it shares sample summary statistics with Anscombe's quartet.
mimic, which modifies a user-supplied dataset so that it shares sample summary statistics with another user-supplied dataset.
See vignette("intro-to-anscombiser", package = "anscombiser") for an overview of the package.
Maintainer: Paul J. Northrop [email protected] [copyright holder]
Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician 27 (1): 17–21. doi:10.1080/00031305.1973.10478966
anscombise
and mimic
Provides Anscombe's Quartet as separate data frames.
anscombe1 anscombe2 anscombe3 anscombe4
All datasets are objects of class data.frame with 11 rows and 2 columns.
Anscombe's Quartet of 'Identical' Simple Linear Regressions: datasets::anscombe in the datasets package. The ith dataset is datasets::anscombe[, c(i, i + 4)].
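As an illustration of the indexing rule above, the following base R sketch rebuilds the four datasets from datasets::anscombe and tabulates a few of their shared summary statistics. The object names quartet and summaries are chosen here for illustration only and are not part of the package.

```r
# Column i of datasets::anscombe holds x_i and column i + 4 holds y_i
quartet <- lapply(1:4, function(i) datasets::anscombe[, c(i, i + 4)])

# Sample means, variances and correlation for each of the four datasets
summaries <- t(sapply(quartet, function(d) {
  c(mean_x = mean(d[, 1]), mean_y = mean(d[, 2]),
    var_x = var(d[, 1]), var_y = var(d[, 2]),
    cor_xy = cor(d[, 1], d[, 2]))
}))

# One row per dataset; the rows agree to about two decimal places
round(summaries, 2)
```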
Anscombe, Francis J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17-21. doi:10.2307/2682899
Modifies a dataset x so that it shares sample summary statistics with Anscombe's quartet.
anscombise(x, which = 1, idempotent = TRUE)
x |
A numeric matrix or data frame. Each column contains observations on a different variable. Missing observations are not allowed. |
which |
An integer in {1, 2, 3, 4}. Which of Anscombe's datasets to use as the target dataset. Obviously, this makes very little difference. |
idempotent |
A logical scalar. If |
The input dataset x is modified by shifting, scaling and rotating it so that its sample mean and covariance matrix match those of the Anscombe quartet.
The rotation is based on the square root of the sample correlation matrix. If idempotent = FALSE then this square root is based on the Cholesky decomposition of this matrix, using chol. If idempotent = TRUE the square root is based on the spectral decomposition of this matrix, using the output from eigen. This is a minimal rotation square root, which means that if the input data x already have exactly/approximately the required summary statistics then the returned dataset is exactly/approximately the same as the target dataset.
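The shift, scale and rotate idea can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: the helper match_moments and its argument spectral are hypothetical names, and the sketch works directly with covariance matrices rather than scaling first and then using the correlation matrix.

```r
# Hypothetical helper (not part of anscombiser): transform x so that its
# sample mean vector and covariance matrix equal target_mean and target_cov
match_moments <- function(x, target_mean, target_cov, spectral = TRUE) {
  x <- as.matrix(x)
  # Return a matrix A with t(A) %*% A = S, for symmetric positive definite S
  mat_sqrt <- function(S) {
    if (spectral) {
      e <- eigen(S, symmetric = TRUE)
      e$vectors %*% diag(sqrt(e$values), nrow(S)) %*% t(e$vectors)
    } else {
      chol(S)
    }
  }
  # Centre x, remove its covariance structure, then impose the target's
  centred <- sweep(x, 2, colMeans(x))
  whitened <- centred %*% solve(mat_sqrt(cov(x)))
  sweep(whitened %*% mat_sqrt(target_cov), 2, target_mean, "+")
}

# Usage: give simulated data the first Anscombe dataset's mean and covariance
set.seed(1)
x <- matrix(rnorm(40), 20, 2)
target <- datasets::anscombe[, c(1, 5)]
y <- match_moments(x, colMeans(target), cov(target))
all.equal(cov(y), cov(target), check.attributes = FALSE)
```

Either choice of square root reproduces the target moments; the two choices differ only in how the points are rotated on the way there.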
An object of class c("anscombe", "matrix", "array") with plot and print methods. This returned dataset has the following summary statistics in common with Anscombe's quartet.
The sample means of each variable.
The sample variances of each variable.
The sample correlation matrix.
The estimated regression coefficients from least squares linear regressions of each variable on each other variable.
The target and new summary statistics are returned as attributes old_stats and new_stats and the chosen Anscombe's quartet dataset as an attribute old_data.
mimic to modify a dataset to share sample summary statistics with another dataset.
datasets::anscombe for Anscombe's Quartet and anscombe for Anscombe's Quartet as 4 separate datasets.
input_datasets: input1 to input8 for some input datasets of the same size as those in Anscombe's quartet.
# Produce Anscombe-like datasets using input1 to input8
a1 <- anscombise(input1, idempotent = FALSE)
plot(a1)
a2 <- anscombise(input2)
plot(a2)
a3 <- anscombise(input3, idempotent = FALSE)
plot(a3)
a4 <- anscombise(input4, idempotent = FALSE)
plot(a4)
a5 <- anscombise(input5, idempotent = FALSE)
plot(a5)
a6 <- anscombise(input6)
plot(a6)
a7 <- anscombise(input7, idempotent = FALSE)
plot(a7)
a8 <- anscombise(input8, idempotent = FALSE)
plot(a8)

# Old faithful to new faithful
new_faithful <- anscombise(datasets::faithful, which = 4)
plot(new_faithful)
# Then check that the sample summary statistics are the same
plot(new_faithful, input = TRUE)

# Map of Italy
got_maps <- requireNamespace("maps", quietly = TRUE)
if (got_maps) {
  italy <- mapdata("Italy")
  new_italy <- anscombise(italy, which = 4)
  plot(new_italy)
}
Create an animation to show datasets that share sample summary statistics with Anscombe's quartet.
anscombise_gif( x, which = 1, idempotent = TRUE, theme_name = "classic", ease = "cubic-in-out", transition_length = 3, state_length = 1, wrap = TRUE )
x |
A list of input datasets. Each one must be a suitable argument
|
which , idempotent
|
Vectors that provide the arguments of the same names
to |
theme_name |
A character scalar used to set the
|
ease |
A character scalar passed to |
transition_length , state_length , wrap
|
Arguments passed to
|
For this function to work the packages ggplot2 and gganimate must be installed.
An object of class c("gganim", "gg", "ggplot") with an additional attribute new_data that is a data frame with 3 variables, x, y and dataset, containing the datasets output from anscombise.
The returned object may be displayed by typing its name, e.g., anim, or saved as a GIF file using anim_save, e.g., gganimate::anim_save("anscombe.gif", anim).
anscombise modifies a dataset so that it shares sample summary statistics with Anscombe's quartet.
input_datasets: input1 to input8 for some input datasets of the same size as those in Anscombe's quartet.
# Animate some Anscombe-like datasets produced using input1 to input8
x <- list(input1, input2, input3, input4, input5, input6, input7, input8)
idem <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
anim <- anscombise_gif(x, idempotent = idem)
Calculates a particular set of summary statistics for a dataset.
get_stats(x)
x |
a numeric matrix or data frame with at least 2 columns/variables. Each column contains observations on a different variable. Missing observations are not allowed. |
A named list of summary statistics containing
n: The sample size.
means: The sample means of each variable.
variances: The sample variances of each variable.
correlation: The sample correlation matrix.
intercepts, slopes, rsquared: Matrices whose (i,j)th entries are the estimated regression coefficients in a regression of x[, i] on x[, j] and the resulting coefficient of determination.
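The relationship between these components can be sketched in base R: for simple linear regression, the slopes, intercepts and coefficients of determination follow from the means, standard deviations and correlations. The helper pairwise_regressions below is a hypothetical name used for illustration and is not the package's get_stats; in particular, how get_stats fills the diagonal entries is not assumed here.

```r
# Hypothetical helper: pairwise simple linear regression summaries,
# where entry [i, j] refers to a regression of x[, i] on x[, j]
pairwise_regressions <- function(x) {
  x <- as.matrix(x)
  d <- ncol(x)
  m <- colMeans(x)
  s <- apply(x, 2, sd)
  r <- cor(x)
  # slope[i, j] = r[i, j] * s[i] / s[j]
  slopes <- r * outer(s, 1 / s)
  # intercept[i, j] = m[i] - slope[i, j] * m[j]
  intercepts <- matrix(m, d, d) - sweep(slopes, 2, m, "*")
  list(intercepts = intercepts, slopes = slopes, rsquared = r^2)
}

# Usage: compare with lm() for the first Anscombe dataset
xy <- datasets::anscombe[, c(1, 5)]
res <- pairwise_regressions(xy)
fit <- lm(y1 ~ x1, data = datasets::anscombe)
c(res$intercepts[2, 1], res$slopes[2, 1])
coef(fit)
```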
get_stats(anscombe[, c(1, 5)])
Provides input datasets from which anscombise will produce transformed datasets that behave like Anscombe's quartet of datasets, that is, with the same traditional statistical properties but different general behaviours. Use plot(input1), for example, to see the behaviours of the datasets.
input1 input2 input3 input4 input5 input6 input7 input8
All datasets are objects of class matrix (inherits from array) with 11 rows and 2 columns.
None. Created for use in 'anscombiser'.
Anscombe, Francis J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17-21. doi:10.2307/2682899
Extracts longitude and latitude values for a particular region from the world map supplied by the maps package.
mapdata(region = ".", map = "world", exact = FALSE, ...)
region |
Passed to |
map |
Passed to |
exact |
The argument |
... |
Additional arguments to be passed to |
A data frame with two columns: long and lat for longitude and latitude.
See the examples in mimic.
Modifies a dataset x so that it shares sample summary statistics with a target dataset x2.
mimic(x, x2, idempotent = TRUE, ...)
x , x2
|
Numeric matrices or data frames. Each column contains observations
on a different variable. Missing observations are not allowed.
|
idempotent |
A logical scalar. If |
... |
Additional arguments to be passed to |
The input dataset x is modified by shifting, scaling and rotating it so that its sample mean and covariance matrix match those of the target dataset x2.
The rotation is based on the square root of the sample correlation matrix. If idempotent = FALSE then this square root is based on the Cholesky decomposition of this matrix, using chol. If idempotent = TRUE the square root is based on the spectral decomposition of this matrix, using the output from eigen. This is a minimal rotation square root, which means that if the input data x already have exactly/approximately the required summary statistics then the returned dataset is exactly/approximately the same as the target dataset x2.
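To make the two kinds of matrix square root concrete, the following base R sketch computes both for a small correlation matrix. It only illustrates the linear algebra behind chol and eigen; the object names are illustrative and the sketch does not call mimic or reproduce its internals.

```r
# An example 2 by 2 correlation matrix
S <- matrix(c(1, 0.8, 0.8, 1), 2, 2)

# Cholesky square root: upper triangular R with t(R) %*% R = S
root_chol <- chol(S)

# Spectral (symmetric) square root: A = V diag(sqrt(lambda)) t(V), A %*% A = S
e <- eigen(S, symmetric = TRUE)
root_sym <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)

# Both reproduce S, but only the spectral root is symmetric
all.equal(t(root_chol) %*% root_chol, S)
all.equal(root_sym %*% root_sym, S)
isSymmetric(root_sym)
root_chol
root_sym
```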
An object of class c("anscombe", "matrix", "array") with plot and print methods. This returned dataset has the following summary statistics in common with x2.
The sample means of each variable.
The sample variances of each variable.
The sample correlation matrix.
The estimated regression coefficients from least squares linear regressions of each variable on each other variable.
The target and new summary statistics are returned as attributes old_stats and new_stats.
If x2 is supplied then it is returned as an attribute old_data.
anscombise modifies a dataset so that it shares sample summary statistics with Anscombe's quartet.
### 2D examples

# The UK and a dinosaur
got_maps <- requireNamespace("maps", quietly = TRUE)
got_datasauRus <- requireNamespace("datasauRus", quietly = TRUE)
if (got_maps && got_datasauRus) {
  library(maps)
  library(datasauRus)
  dino <- datasaurus_dozen_wide[, c("dino_x", "dino_y")]
  UK <- mapdata("UK")
  new_UK <- mimic(UK, dino)
  plot(new_UK)
}

# Trump and a dinosaur
if (got_datasauRus) {
  library(datasauRus)
  dino <- datasaurus_dozen_wide[, c("dino_x", "dino_y")]
  new_dino <- mimic(dino, trump)
  plot(new_dino)
}

## Examples of passing summary statistics

# The default is zero mean, unit variance and no correlation
new_faithful <- mimic(faithful)
plot(new_faithful)

# Change the correlation
mat <- matrix(c(1, -0.9, -0.9, 1), 2, 2)
new_faithful <- mimic(faithful, correlation = mat)
plot(new_faithful)

### A 3D example

new_randu <- mimic(datasets::randu, datasets::trees)
# The sample summary statistics are equal
get_stats(new_randu)
get_stats(datasets::trees)
Create an animation to show datasets that mimic a target dataset x2.
mimic_gif( x, x2, idempotent = TRUE, theme_name = "classic", ease = "cubic-in-out", transition_length = 3, state_length = 1, wrap = TRUE )
x |
A list of input datasets. Each one must be suitable argument
|
x2 |
A suitable argument |
idempotent |
A logical vector that provides the argument of the same
names to |
theme_name |
A character scalar used to set the
|
ease |
A character scalar passed to |
transition_length , state_length , wrap
|
Arguments passed to
|
For this function to work the packages ggplot2 and gganimate must be installed.
An object of class c("gganim", "gg", "ggplot") with an additional attribute new_data that is a data frame with 3 variables, x, y and dataset, containing the datasets output from mimic.
The returned object may be displayed by typing its name, e.g., anim, or saved as a GIF file using anim_save, e.g., gganimate::anim_save("anscombe.gif", anim).
mimic to modify a dataset to share sample summary statistics with another dataset.
input_datasets: input1 to input8 for some input datasets of the same size as those in Anscombe's quartet.
# Create 8 datasets that mimic Anscombe's first dataset
x <- list(input1, input2, input3, input4, input5, input6, input7, input8)
anim <- mimic_gif(x, anscombe1)
plot method for objects inheriting from class "anscombe".
## S3 method for class 'anscombe'
plot(x, input = FALSE, stats = TRUE, digits = 3, legend_args = list(), ...)
x |
an object of class |
input |
A logical scalar. Should the old, input data, that is, the
Anscombe's dataset chosen for |
stats |
A logical scalar. Should the sample summary statistics
|
digits |
An integer. The argument |
legend_args |
A list of arguments to be passed to
|
... |
Further arguments to be passed to |
This function is only applicable in 2 dimensions, that is, when length(attr(x, "new_stats")$means) = 2.
Nothing is returned.
See the examples in anscombise and mimic.
anscombise and mimic.
print method for class "anscombe".
## S3 method for class 'anscombe'
print(x, ...)
x |
an object of class "anscombe", a result of a call to |
... |
Additional optional arguments to be passed to
|
Just extracts the new dataset from x and prints it using print.
The argument x, invisibly.
anscombise and mimic
Creates a list of summary statistics to pass to mimic.
set_stats(d = 2, means = 0, variances = 1, correlation = diag(2))
d |
An integer that is no smaller than 2. |
means |
A numeric vector of sample means. |
variances |
A numeric vector of positive sample variances. |
correlation |
A numeric correlation matrix. None of the off-diagonal
entries in |
The vectors means and variances are recycled using rep_len to have length d.
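As a small illustration of this recycling rule, the base R sketch below applies rep_len directly. It does not call set_stats; the values of d, means and variances are made up for the example.

```r
d <- 3
# A length-2 vector of means is recycled to length d = 3
means <- rep_len(c(0, 10), d)
# A scalar variance is repeated d times
variances <- rep_len(1, d)
means
variances
```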
A list containing the following components.
means: a d-vector of sample means.
variances: a d-vector of sample variances.
correlation: a d by d correlation matrix.
# Uncorrelated with zero means and unit variances
set_stats()
# Sample correlation = 0.9
set_stats(correlation = matrix(c(1, 0.9, 0.9, 1), 2, 2))
A dataset that provides an image of Donald Trump's face.
trump
A matrix with 4885 rows and 2 columns: x and y.
This image was created by Accentaur from the Noun Project. https://thenounproject.com/term/donald-trump/727774/
plot(trump)