Generate a list of descriptive statistics. By default, the function calculates summary statistics such as mean, standard deviation, quantiles, minimum and maximum for continuous variables, and relative and absolute frequencies for categorical variables. It also calculates p-values from an appropriately chosen statistical test. For two-group comparisons, confidence intervals for appropriate summary measures of group differences are calculated as well. In particular, Wald confidence intervals from prop.test are used for categorical variables with 2 levels, confidence intervals from t.test are used for continuous variables, and confidence intervals for the Hodges-Lehmann estimator [1] from wilcox.test are used for ordinal variables.
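The three confidence interval sources named above can be tried directly in base R. The following is a minimal sketch on a two-species subset of iris; the exact arguments that descr passes to these functions are not stated here, so the calls below (including exact = FALSE and the derived long_sepal indicator) are assumptions made for illustration only.

```r
# Two-group example data: keep only two of the three iris species
two_species <- droplevels(subset(iris, Species != "virginica"))

# Continuous variable: confidence interval for the difference in means
ci_t <- t.test(Sepal.Length ~ Species, data = two_species)$conf.int

# Hodges-Lehmann estimator: wilcox.test only returns an interval
# when conf.int = TRUE is requested
ci_hl <- wilcox.test(Sepal.Length ~ Species, data = two_species,
                     conf.int = TRUE, exact = FALSE)$conf.int

# Categorical variable with 2 levels: interval for the difference in proportions
long_sepal <- two_species$Sepal.Length > 5.5   # hypothetical binary variable
successes <- tapply(long_sepal, two_species$Species, sum)
trials <- as.vector(table(two_species$Species))
ci_prop <- prop.test(x = successes, n = trials)$conf.int

ci_t
ci_hl
ci_prop
```

Each result is a numeric vector of length two (lower and upper limit) with a conf.level attribute.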
Usage
descr(
dat,
group = NULL,
group_labels = list(),
var_labels = list(),
var_options = list(),
summary_stats_cont = list(
  N = DescrTab2:::.N, Nmiss = DescrTab2:::.Nmiss,
  mean = DescrTab2:::.mean, sd = DescrTab2:::.sd,
  median = DescrTab2:::.median, Q1 = DescrTab2:::.Q1, Q3 = DescrTab2:::.Q3,
  min = DescrTab2:::.min, max = DescrTab2:::.max
),
summary_stats_numeric_ord = list(
  N = DescrTab2:::.factorN, Nmiss = DescrTab2:::.factorNmiss,
  mean = DescrTab2:::.factormean, sd = DescrTab2:::.factorsd,
  median = DescrTab2:::.factormedian, Q1 = DescrTab2:::.factorQ1, Q3 = DescrTab2:::.factorQ3,
  min = DescrTab2:::.factormin, max = DescrTab2:::.factormax
),
summary_stats_cat = list(),
format_summary_stats = list(
  N = function(x) format(x, digits = 3, scientific = 4),
  mean = function(x) format(x, digits = 3, scientific = 4),
  sd = function(x) format(x, digits = 3, scientific = 4),
  median = function(x) format(x, digits = 3, scientific = 4),
  Q1 = function(x) format(x, digits = 3, scientific = 4),
  Q3 = function(x) format(x, digits = 3, scientific = 4),
  min = function(x) format(x, digits = 3, scientific = 4),
  max = function(x) format(x, digits = 3, scientific = 4),
  CI = function(x) format(x, digits = 3, scientific = 4)
),
format_p = scales::pvalue_format(),
format_options = list(print_Total = NULL, print_p = TRUE, print_CI = FALSE,
combine_mean_sd = FALSE, combine_median_Q1_Q3 = FALSE, omit_factor_level = "none",
omit_Nmiss_if_0 = TRUE, omit_missings_in_group = TRUE, percent_accuracy = 0.1,
percent_suffix = "%", row_percent = FALSE, Nmiss_row_percent = FALSE,
absolute_relative_frequency_mode = c("both", "only_absolute", "only_relative"),
omit_missings_in_categorical_var = FALSE, categorical_missing_percent_mode =
c("no_missing_percent", "missing_as_regular_category",
"missing_as_separate_category"), caption = NULL, replace_empty_string_with_NA = TRUE,
categories_first_summary_stats_second = FALSE, max_first_col_width = 7.5),
test_options = list(paired = FALSE, nonparametric = FALSE, exact = FALSE, var_equal =
FALSE, indices = c(), guess_id = FALSE, include_group_missings_in_test = FALSE,
include_categorical_missings_in_test = FALSE, test_override = NULL,
additional_test_args = list(), boschloo_max_n = 200),
reshape_rows = list(
  `Q1 - Q3` = list(args = c("Q1", "Q3"), fun = function(Q1, Q3) paste0(Q1, " -- ", Q3)),
  `min - max` = list(args = c("min", "max"), fun = function(min, max) paste0(min, " -- ", max))
),
...
)
Arguments
- dat
Data frame or tibble. The data set to be analyzed. Can contain continuous or factor (also ordered) variables.
- group
name (as character) of the group variable in dat.
- group_labels
named list of labels for the levels of the group variable in dat.
- var_labels
named list of variable labels.
- var_options
named list of lists. For each variable, you can have special options that apply only to that variable. These options are specified in this argument. See the details and examples for more explanation.
- summary_stats_cont
named list of summary statistic functions to be used for numeric variables.
- summary_stats_numeric_ord
named list of summary statistic functions to be used for ordered factor variables which can be converted to numeric.
- summary_stats_cat
named list of summary statistic functions to be used for categorical variables.
- format_summary_stats
named list of formatting functions for summary statistics.
- format_p
formatting function for p-values.
- format_options
named list of formatting options.
- test_options
named list of test options.
- reshape_rows
named list of lists. Describes how to combine different summary statistics into the same row.
- ...
further arguments to be passed along.
Value
Returns a DescrList object, which is a named list of descriptive statistics that can be passed along to the print function to create pretty summary tables.
Labels
group_labels and var_labels need to be named lists of character elements. The names of the list elements have to match the variable names in your dataset (for group_labels, the levels of the group variable). The values of the list elements are the labels that will be assigned to these variables when printing.
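As an illustration of the naming rule, here is a small sketch using iris. The label texts are made up for the example, and the call to descr is guarded so the snippet also runs where DescrTab2 is not installed.

```r
# Names must match column names of the data set ...
var_labels <- list(
  Sepal.Length = "Sepal length (cm)",
  Petal.Length = "Petal length (cm)"
)
# ... or, for group_labels, the levels of the group variable
group_labels <- list(
  setosa = "I. setosa",
  versicolor = "I. versicolor",
  virginica = "I. virginica"
)

# Quick sanity checks before building the table
stopifnot(all(names(var_labels) %in% names(iris)))
stopifnot(all(names(group_labels) %in% levels(iris$Species)))

if (requireNamespace("DescrTab2", quietly = TRUE)) {
  tab <- DescrTab2::descr(iris, group = "Species",
                          var_labels = var_labels,
                          group_labels = group_labels)
  print(tab)
}
```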
Custom summary statistics
summary_stats_cont and summary_stats_cat are both named lists of functions. The names of the list elements are what will be displayed in the leftmost column of the descriptive table. These functions should take a vector and return a value.
Each summary statistic has to have an associated formatting function in the format_summary_stats list. The functions in format_summary_stats take a numeric value and convert it to a character string, e.g. 0.2531235 -> "0.2".
The format_p function converts p-values to character strings, e.g. 0.05 -> "0.05" or 0.000001 -> "<0.001".
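A sketch of how a custom statistic and its formatter might be wired in. The coefficient of variation cv and the helper fmt3 are hypothetical names introduced for this example. Because it is not stated whether partially specified lists are merged with the defaults, the sketch extends the default lists read from the function's formals instead of replacing them; that approach is an assumption, not a documented recipe.

```r
# Hypothetical custom summary statistic: coefficient of variation
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Formatter with the same settings as the defaults shown under Usage
fmt3 <- function(x) format(x, digits = 3, scientific = 4)
fmt3(0.2531235)  # "0.253"

if (requireNamespace("DescrTab2", quietly = TRUE)) {
  # Evaluate the default argument expressions to get the default lists
  default_stats <- eval(formals(DescrTab2::descr)$summary_stats_cont)
  default_formats <- eval(formals(DescrTab2::descr)$format_summary_stats)

  tab <- DescrTab2::descr(
    iris,
    summary_stats_cont = c(default_stats, list(CV = cv)),
    format_summary_stats = c(default_formats, list(CV = fmt3))
  )
  print(tab)
}
```

The list name CV is what would appear in the leftmost column, and the entry of the same name in format_summary_stats supplies its formatter.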
Formatting options
Further formatting options can be specified in the format_options list. It contains the following members:
- print_Total
(logical) controls whether to print the "Total" column. If print_Total = NULL, print_Total will be set to TRUE if test_options$paired == FALSE, else it will be set to FALSE.
- print_p
(logical) controls whether to print the p-value column.
- print_CI
(logical) controls whether to print the confidence intervals for group differences.
- combine_mean_sd
(logical) controls whether to combine the mean and sd rows into one mean ± sd row. This is a shortcut argument for the specification of an appropriate entry in the reshape_rows argument.
- combine_median_Q1_Q3
(logical) controls whether to combine the median, Q1 and Q3 rows into one median (Q1, Q3) row. This is a shortcut argument for the specification of an appropriate entry in the reshape_rows argument.
- omit_Nmiss_if_0
(logical) controls whether to omit the Nmiss row in continuous variables if there are no missings in the variable.
- omit_missings_in_group
(logical) controls whether to omit all observations where the group variable is missing.
- percent_accuracy
(numeric) A number to round to. Use (e.g.) 0.01 to show 2 decimal places of precision. If NULL, a heuristic is used that should ensure breaks have the minimum number of digits needed to show the difference between adjacent values (the default shown under Usage is 0.1). See the documentation of scales::label_percent.
- percent_suffix
(character) the symbol to be used where "%" is appropriate. Sensible choices are usually "%" (default) or "" (i.e., empty string).
- row_percent
(logical) controls whether percentages of regular categorical variables should be calculated column-wise (default) or row-wise.
- Nmiss_row_percent
(logical) controls whether percentages of the "Nmiss" statistic (number of missing values) should be calculated column-wise (default) or row-wise.
- absolute_relative_frequency_mode
(character) controls how to display frequencies. It may be set to one of the following options:
  - "both" will display absolute and relative frequencies.
  - "only_absolute" will only display absolute frequencies.
  - "only_relative" will only display relative frequencies.
- omit_missings_in_categorical_var
(logical) controls whether to omit missing values in categorical variables completely.
- categorical_missing_percent_mode
(character) controls how to display percentages in categorical variables with a (Missing) category. It may be set to one of the following options:
  - "no_missing_percent" omits a percentage in the missing category entirely.
  - "missing_as_regular_category" treats (Missing) as a regular category for %-calculation purposes. This means that if you have three categories, "A" with 10 counts, "B" with 10 counts and "(Missing)" with 10 counts, they will become "A": 10 (33%), "B": 10 (33%), "(Missing)": 10 (33%).
  - "missing_as_separate_category" calculates (Missing) percentages with respect to all observations (i.e. #(Missing) / N), but calculates all other category percentages with respect to the non-missing observations (e.g. #A / N_nonmissing). This means that if you have three categories, "A" with 10 counts, "B" with 10 counts and "(Missing)" with 10 counts, they will become "A": 10 (50%), "B": 10 (50%), "(Missing)": 10 (33%).
- caption
adds a table caption to the LaTeX, Word or PDF document.
- replace_empty_string_with_NA
(logical) controls whether empty strings ("") should be replaced with missing values (NA_character_).
- categories_first_summary_stats_second
(logical) controls whether the categories should be printed first in the summary statistics table.
- max_first_col_width
(numeric) controls the maximum width of the first column in LaTeX tables.
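The percentage arithmetic behind the two categorical_missing_percent_mode variants can be reproduced in a few lines of base R, using the same counts as in the description (10, 10 and 10). This only mirrors the stated formulas; it is not the package's internal code. The guarded call at the end assumes that a partially specified format_options list is completed with the defaults.

```r
counts <- c(A = 10, B = 10, "(Missing)" = 10)
n_all <- sum(counts)                                        # N
n_nonmissing <- sum(counts[names(counts) != "(Missing)"])   # N_nonmissing

# "missing_as_regular_category": every category is divided by N
regular <- counts / n_all
round(100 * regular)    # 33, 33, 33

# "missing_as_separate_category": (Missing) is divided by N,
# all other categories by the number of non-missing observations
separate <- c(counts[c("A", "B")] / n_nonmissing,
              counts["(Missing)"] / n_all)
round(100 * separate)   # 50, 50, 33

if (requireNamespace("DescrTab2", quietly = TRUE)) {
  tab <- DescrTab2::descr(
    iris, "Species",
    format_options = list(combine_mean_sd = TRUE, print_Total = FALSE)
  )
  print(tab)
}
```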
Test options
test_options is a named list with test options. Its members paired, nonparametric, and exact (logicals) control which test is chosen in the corresponding situation. For details, check out the vignette:
https://imbi-heidelberg.github.io/DescrTab2/articles/b_test_choice_tree_pdf.pdf. The test_options = list(test_override = "<some test name>") option can be specified to force usage of a specific test. This will produce errors if the data does not allow calculation of that specific test, so be wary. Use print_test_names() to see a list of all available test names.
If paired = TRUE is specified, you need to supply an index variable indices that specifies which data points in your dataset are paired. indices may either be a length one character vector that describes the name of the index variable in your dataset, or a vector containing the respective indices. If guess_id is set to TRUE (the default shown under Usage is FALSE), DescrTab2 will try to guess the ID variable from your dataset and report a warning if it succeeds.
See https://imbi-heidelberg.github.io/DescrTab2/articles/a_usage_guide.html#Paired-observations-1 for a bit more explanation.
The optional list additional_test_args can be used to pass arguments along to test functions, e.g. additional_test_args = list(correct = TRUE) will request continuity correction if available.
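A sketch of two test_options lists, using the sleep data set that ships with R (columns extra, group and ID). The paired example passes the indices as a vector and drops the ID column from the data; this is one of the two documented forms, chosen here so the ID is not itself treated as a variable to summarize. That partially specified lists are filled in with defaults is an assumption.

```r
# Request nonparametric tests instead of the default parametric ones
nonparam_opts <- list(nonparametric = TRUE)

# Paired analysis: one index per row, identifying which rows belong together
paired_opts <- list(paired = TRUE, indices = sleep$ID)

# Each subject must appear once per group for the pairing to make sense
stopifnot(all(table(sleep$ID, sleep$group) == 1))

if (requireNamespace("DescrTab2", quietly = TRUE)) {
  print(DescrTab2::descr(iris, "Species", test_options = nonparam_opts))

  sleep_no_id <- sleep[, c("extra", "group")]
  print(DescrTab2::descr(sleep_no_id, group = "group",
                         test_options = paired_opts))
}
```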
Customization for single variables
The var_options list can be used to conduct customizations that should only apply to a single variable and leave the rest of the table unchanged. var_options is a list of named lists. This means that each member of var_options is itself a list again. The names of the list elements of var_options determine the variables to which the options will apply.
Let's say you have an age variable in your dataset. To change descr options only for age, you will need to pass a list of the form var_options = list(age = list(<Your options here>)).
You can replace <Your options here> with the following options:
- label
a character string containing the label for the variable.
- summary_stats
a list of summary statistics. See section "Custom summary statistics".
- format_summary_stats
a list of formatting functions for summary statistics. See section "Custom summary statistics".
- format_p
a function to format p-values. See section "Custom summary statistics".
- format_options
a list of formatting options. See section "Formatting options".
- test_options
a list of test options. See section "Test options".
- test_override
manually specify the name of the test you want to apply. You can see a list of choices by typing print_test_names(). Possible choices are:
  - "Cochran's Q test"
  - "McNemar's test"
  - "Chi-squared goodness-of-fit test"
  - "Pearson's chi-squared test"
  - "Exact McNemar's test"
  - "Boschloo's test"
  - "Wilcoxon's one-sample signed-rank test"
  - "Mann-Whitney's U test"
  - "Kruskal-Wallis's one-way ANOVA"
  - "Student's paired t-test"
  - "Mixed model ANOVA"
  - "Student's one-sample t-test"
  - "Student's two-sample t-test"
  - "Welch's two-sample t-test"
  - "F-test (ANOVA)"
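A minimal sketch of a var_options list for iris, changing the label and the test for Sepal.Length only. The label text is made up, and the test name is taken from the list of possible choices above.

```r
# One inner list per variable; the outer names select the variables
var_options <- list(
  Sepal.Length = list(
    label = "Sepal length (cm)",
    test_override = "Kruskal-Wallis's one-way ANOVA"
  )
)

# The outer names have to be column names of the data set
stopifnot(all(names(var_options) %in% names(iris)))

if (requireNamespace("DescrTab2", quietly = TRUE)) {
  tab <- DescrTab2::descr(iris, "Species", var_options = var_options)
  print(tab)
}
```

All other variables would keep the table-wide settings.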
Combining rows
The reshape_rows argument offers a framework for combining multiple rows of the output table into a single one. reshape_rows is a named list of lists. The names of its member lists determine the name that will be displayed as the name of the combined summary stats in the table (e.g. "mean ± sd"). The member lists need to contain two elements: args, which contains the names of the summary statistics to be combined as characters, and fun, which contains a function to combine these summary stats. The argument names of this function need to match the character strings specified in args. Check out the default options for an exemplary definition.
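A sketch of one reshape_rows entry that merges mean and sd into a single row, modelled on the default definitions shown under Usage. The row name `mean (sd)` and the sample input strings are chosen for illustration. Passing a list like this presumably replaces the default `Q1 - Q3` and `min - max` entries, so they are repeated here; that behavior is an assumption.

```r
mean_sd_row <- list(
  args = c("mean", "sd"),
  # Argument names must match the strings in args
  fun = function(mean, sd) paste0(mean, " (", sd, ")")
)

reshape_rows <- list(
  `mean (sd)` = mean_sd_row,
  `Q1 - Q3` = list(args = c("Q1", "Q3"),
                   fun = function(Q1, Q3) paste0(Q1, " -- ", Q3)),
  `min - max` = list(args = c("min", "max"),
                     fun = function(min, max) paste0(min, " -- ", max))
)

# Try the combining function on its own with example values
do.call(mean_sd_row$fun, list(mean = "5.84", sd = "0.828"))  # "5.84 (0.828)"

if (requireNamespace("DescrTab2", quietly = TRUE)) {
  print(DescrTab2::descr(iris, reshape_rows = reshape_rows))
}
```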
References
[1] Hodges, J. L.; Lehmann, E. L. (1963). "Estimation of location based on ranks". Annals of Mathematical Statistics. 34 (2): 598-611. doi:10.1214/aoms/1177704172. JSTOR 2238406. MR 0152070. Zbl 0203.21105.
Examples
descr(iris)
#> Variable Totl p Test
#> 1 Sp.L
#> 2 N 150 <0.001 tt1
#> 3 mean 5.84
#> 4 sd 0.828
#> 5 median 5.8
#> 6 Q1 - Q3 5.1 -- 6.4
#> 7 min - max 4.3 -- 7.9
#> 8 Sp.W
#> 9 N 150 <0.001 tt1
#> 10 mean 3.06
#> 11 sd 0.436
#> 12 median 3
#> 13 Q1 - Q3 2.8 -- 3.3
#> 14 min - max 2 -- 4.4
#> 15 Pt.L
#> 16 N 150 <0.001 tt1
#> 17 mean 3.76
#> 18 sd 1.77
#> 19 median 4.35
#> 20 Q1 - Q3 1.6 -- 5.1
#> 21 min - max 1 -- 6.9
#> 22 Pt.W
#> 23 N 150 <0.001 tt1
#> 24 mean 1.2
#> 25 sd 0.762
#> 26 median 1.3
#> 27 Q1 - Q3 0.3 -- 1.8
#> 28 min - max 0.1 -- 2.5
#> 29 Spcs
#> 30 sets 50 (33.3%) >0.999 chi1
#> 31 vrsc 50 (33.3%)
#> 32 vrgn 50 (33.3%)
DescrList <- descr(iris)
DescrList$variables$results$Sepal.Length$Total$mean
#> NULL
print(DescrList)
#> Variable Totl p Test
#> 1 Sp.L
#> 2 N 150 <0.001 tt1
#> 3 mean 5.84
#> 4 sd 0.828
#> 5 median 5.8
#> 6 Q1 - Q3 5.1 -- 6.4
#> 7 min - max 4.3 -- 7.9
#> 8 Sp.W
#> 9 N 150 <0.001 tt1
#> 10 mean 3.06
#> 11 sd 0.436
#> 12 median 3
#> 13 Q1 - Q3 2.8 -- 3.3
#> 14 min - max 2 -- 4.4
#> 15 Pt.L
#> 16 N 150 <0.001 tt1
#> 17 mean 3.76
#> 18 sd 1.77
#> 19 median 4.35
#> 20 Q1 - Q3 1.6 -- 5.1
#> 21 min - max 1 -- 6.9
#> 22 Pt.W
#> 23 N 150 <0.001 tt1
#> 24 mean 1.2
#> 25 sd 0.762
#> 26 median 1.3
#> 27 Q1 - Q3 0.3 -- 1.8
#> 28 min - max 0.1 -- 2.5
#> 29 Spcs
#> 30 sets 50 (33.3%) >0.999 chi1
#> 31 vrsc 50 (33.3%)
#> 32 vrgn 50 (33.3%)
descr(iris, "Species")
#> Variable sets vrsc vrgn Totl p Test
#> 1 Sp.L
#> 2 N 50 50 50 150 <0.… F
#> 3 mean 5.01 5.94 6.59 5.84
#> 4 sd 0.352 0.516 0.636 0.828
#> 5 median 5 5.9 6.5 5.8
#> 6 Q1 - Q3 4.8 -- 5.2 5.6 -- 6.3 6.2 -- 6.9 5.1 -- 6.4
#> 7 min - max 4.3 -- 5.8 4.9 -- 7 4.9 -- 7.9 4.3 -- 7.9
#> 8 Sp.W
#> 9 N 50 50 50 150 <0.… F
#> 10 mean 3.43 2.77 2.97 3.06
#> 11 sd 0.379 0.314 0.322 0.436
#> 12 median 3.4 2.8 3 3
#> 13 Q1 - Q3 3.2 -- 3.7 2.5 -- 3 2.8 -- 3.2 2.8 -- 3.3
#> 14 min - max 2.3 -- 4.4 2 -- 3.4 2.2 -- 3.8 2 -- 4.4
#> 15 Pt.L
#> 16 N 50 50 50 150 <0.… F
#> 17 mean 1.46 4.26 5.55 3.76
#> 18 sd 0.174 0.47 0.552 1.77
#> 19 median 1.5 4.35 5.55 4.35
#> 20 Q1 - Q3 1.4 -- 1.6 4 -- 4.6 5.1 -- 5.9 1.6 -- 5.1
#> 21 min - max 1 -- 1.9 3 -- 5.1 4.5 -- 6.9 1 -- 6.9
#> 22 Pt.W
#> 23 N 50 50 50 150 <0.… F
#> 24 mean 0.246 1.33 2.03 1.2
#> 25 sd 0.105 0.198 0.275 0.762
#> 26 median 0.2 1.3 2 1.3
#> 27 Q1 - Q3 0.2 -- 0.3 1.2 -- 1.5 1.8 -- 2.3 0.3 -- 1.8
#> 28 min - max 0.1 -- 0.6 1 -- 1.8 1.4 -- 2.5 0.1 -- 2.5