IN SILICO ANALYSIS TECHNIQUES (BIOINFORMATICS) - STATISTICS AND EPIDEMIOLOGY

statistics : a discipline devoted to the collection, analysis, and interpretation of numerical data using the theory of probability, concerned particularly with methods for drawing inferences about characteristics of a population from examination of a random sample.

• number : a symbol, as a figure or word, expressive of a certain value or of a specified quantity determined by count
• coefficient : a unitless statistical parameter indicating the amount of change in an outcome under given conditions
• approximation : the act or process of bringing closer together or into apposition.
• measure (x) = m + (e + z); a specific extent or quantity of a substance
• equation : an expression made up of 2 members connected by the sign of equality
• value : a quantitative measurement of the activity, concentration, or some other quality of a substance
• expected value : in statistics, the value of an estimate that is the mean of its sampling distribution.
• class : in statistics, a subgroup of a population for which certain variables measured for individuals in the population fall within specific limits
• degrees of freedom (DF / ν) : the number of observations or frequency classes whose frequency can be set arbitrarily without altering the total of the observations or of the frequency classes. It is calculated by ...
• subtracting the number of estimated parameters from the total number of observations
• in a 2-way table, multiplying the (number of rows - 1) by the (number of columns - 1).
The number of ways the members of a sample can vary independently; a numerical index of a family of probability distributions that corresponds to the number of independent variables in the definition of each member, e.g., the chi-squared distribution with n degrees of freedom is the distribution of the sum of squares of n standard normal deviates
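The two rules above can be sketched in a few lines of Python; the helper names are illustrative, not from the source:

```python
# Degrees of freedom: a minimal sketch of the two calculation rules above.

def df_contingency(rows: int, cols: int) -> int:
    """DF for an r x c contingency table: (rows - 1) * (cols - 1)."""
    return (rows - 1) * (cols - 1)

def df_estimate(n_observations: int, n_estimated_params: int) -> int:
    """DF when parameters are estimated from the data."""
    return n_observations - n_estimated_params

# A 2 x 3 table has (2-1)*(3-1) = 2 degrees of freedom.
print(df_contingency(2, 3))   # 2
# A sample variance (one estimated parameter, the mean) from n = 10: DF = 9.
print(df_estimate(10, 1))     # 9
```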
• frequency (f) : the number of occurrences of a particular event or the number of members of a population or statistical sample falling in a particular class
• class frequency : in statistics, the number of variables contained in a class
• density or frequency distribution : a presentation, such as a table or graph, describing the relative frequency or theoretical probability of a random variable assuming any value in the range of possible values.
• variable : in mathematics, a symbol that represents an arbitrary number or an arbitrary element of a set.
• quantitative variable : one measured on a numerical scale; its frequency distributions are graphically represented with a
• ordinate (y) : in a 2-dimensional coordinate system, the distance of a point from the horizontal (x) axis, measured along a line parallel to the y-axis
• abscissa (x) : the horizontal coordinate in a 2-dimensional coordinate system; the horizontal distance of a point from y- (or vertical) axis
• histogram : a series of rectangles dividing the data into classes, the height of a rectangle indicating the number of values that are contained in that class (class frequency) and the width of each base being the size of the intervals into which the classes have been divided.
• age pyramid :
• frequency polygon : obtained from a histogram by placing a dot at the midpoint of the top of each bar, with a line connecting the dots outlining the polygon
• pointed diagram
• areal diagram
• stem-and-leaf plot : each number is divided into its first (and possibly subsequent) digits (stem) and the following digit (leaf)
• scatterplot / scatter diagram / scattergram : a plot in rectangular coordinates of paired observations of 2 random variables, each observation plotted as one point on the graph; the scatter or clustering of points provides an indication of the relationship between the 2 variables
• continuous variable : a variable that can assume the complete continuum of values (a theoretically infinite variety) through its distribution (e.g. numbers)
• continuous variable over a given interval (e.g. p)
• qualitative variable : frequencies are graphically represented with a
• separated bar histogram
• box plot : a graphic representation of a frequency distribution of a set of data; for each group is drawn a rectangle with upper and lower limits representing the interquartile range, horizontal line within the rectangle representing the median, and vertical tails (whiskers) extending above and below the rectangle representing the minimum and maximum values.
• areal or pie (cake) diagram
• scale : a scheme or device by which some property may be evaluated or measured, such as a linear surface bearing marks at regular intervals, representing certain predetermined units.
• ranked scale : a scale in which the adjacent categories are arranged according to a progressively ascending or descending magnitude, as an ordinal scale or interval scale
• rank : in statistics, the position of a sample observation (or population value) in the sequence of sample values (or population values) arranged in order, usually from lowest to highest
• nonlinear scale : one in which the divisions corresponding to the steps are unequal, e.g., a scale with divisions showing logarithmic or exponential growth or change.
• dimensional or interval scale : one used to classify data in which the values have intrinsic order and all intervals have an inherent and equal distance between, e.g., age or temperature scales
• categorical or discrete variable : an experimental variable that can assume only certain specific values in its distribution; the possible list of values is finite and often countable
• ordinal scale : a scale used to classify data into qualitative ordered categories, e.g., defining socioeconomic status as low, medium, or high; the values have a distinct order but intervals are created arbitrarily and lack an intrinsic numerical equality (e.g. strong, intermediate and mild smokers).
• nonordinal or nominal scale : the weakest qualitative, not quantitative or ordered, classification of the samples into separate categories so that each possible result belongs to only one category, with the categories not able to be ordered relative to each other, e.g., one dealing with religion or sex and not size, weight, or temperature
• binary or dichotomous scale : a nominal scale with 2 categories (e.g. dead/alive)
• polychotomous scale : a nominal scale with more than 2 categories (e.g. blood groups)
Cohen's coefficient of agreement (κ) for nominal scales (e.g. 2 pathologists interpreting the same specimens) (Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37-46)
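As a sketch of how Cohen's κ is computed (the confusion-matrix layout, function name, and counts are assumptions for illustration):

```python
# Cohen's kappa for two raters on a nominal scale: (po - pe) / (1 - pe),
# where po is the observed and pe the chance-expected agreement.

def cohens_kappa(table):
    """table[i][j] = number of items rater A put in class i and rater B in class j."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Two pathologists classifying 100 specimens as benign/malignant (invented):
table = [[40, 10],
         [5, 45]]
print(round(cohens_kappa(table), 2))  # 0.7
```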
• random variable : an outcome of a random process that has a numerical value.
• confounding variable / confounder : a third variable that can indirectly distort the statistical relationship between 2 variables under manipulation or observation
• confounding : interference by a third variable so as to distort the association being studied between two other variables, because of a strong relationship with both of the other variables; a relationship between two causal factors such that their individual contributions can not be separated
• Simpson's paradox : a form of extreme confounding such that an association between 2 variables is actually reversed after adjusting for a third.
• dependent variable (usually denoted by y) : in a mathematical equation or relationship between 2 or more variables, a variable whose value depends on those of others; e.g., in the formula x = 3y + z², x is the dependent variable.
• independent variable (usually denoted by x) : in a mathematical equation or relationship between 2 or more variables, any variable whose value determines that of others; e.g., in the formula x = 3y + z², y and z are the independent variables.
• outcome variable : one that measures consequences or results; it may be primary, ancillary, or incidental to a particular study.
• probit : a normal variate having mean 5 and standard deviation 1. In quantal biologic assays, the observed responses are often converted to probits (the fraction responding is converted to the probit that cuts off the same fraction of the area under the normal frequency curve) in order to fit a linear log dose-response curve, a procedure based on the assumption that the response thresholds are normally distributed.
• decision analysis : a statistical method used for delineating the probabilities of various outcomes by determining the probabilities of each option available at each point where a decision can be made; often graphed as a decision tree to display the array of choices and outcomes as nodes and branches.
• population : in statistics, a theoretical concept used to describe an entire group or collection of units, finite or infinite; from it a sample can be drawn
• parameter : in statistics, a value that specifies one of the various members of a family of probability distributions. A parameter is often thought of as the true value or population value as opposed to the observed value or sample value
• position parameters (in the normal distribution mean = mode = median), dispersion parameters, and shape parameters of the normal or gaussian distribution :
• mean (m) : in probability and statistics, the expected value (mathematical expectation) of a random variable, the limiting value to which the sample mean converges as the sample size is increased indefinitely (if the limit exists)
• population mean : the mean of the probability distribution characterizing a specified population; for a finite population, the arithmetic mean of the population values
• arithmetic mean : the sum of n numbers divided by n : m = lim(n→∞) [Σ(xᵢ)/n]
• geometric mean : the nth root of the product of n numbers, e.g., the geometric mean of [2, 8, 32] is (2 · 8 · 32)^(1/3) = 8
• harmonic mean : the reciprocal of the mean of the reciprocals of the individual values in a given set; e.g., for the set [10, 40, 60] the harmonic mean is 1 / [1/3 · (1/10 + 1/40 + 1/60)] = 21.2
• average absolute difference = 1/n · Σ|xᵢ - x̄|
• coefficient of variation (CV) : the standard deviation divided by the mean, sometimes multiplied by 100; a unitless quantity indicating the variability around the mean in relation to the size of the mean
• deviance / average quadratic difference = 1/n · Σ(xᵢ - x̄)²
• variance (s²) = 1/(n-1) · Σ(xᵢ - x̄)² : a measure of the variation shown by a set of observations, the average of the squared deviations from the mean; it is the square of the standard deviation. s²(x+a) = s²(x); s²(bx) = b²·s²(x) => s²(a+bx) = b²·s²(x)
• range : an interval in which values sampled from a population, or the values in the population itself, are known to lie = x_max - x_min
• standard deviation (SD / s) = √[1/(n-1) · Σ(xᵢ - x̄)²] : a measure of the amount by which each value deviates from the mean; equal to the square root of the variance. It is the most commonly used measure of dispersion of statistical data
• skewness : lack of symmetry of a probability distribution about the mean, or any measure of that lack of symmetry = Σ(xᵢ - x̄)³ / (n·s³); > 0 : right tail; = 0 : symmetrical; < 0 : left tail
• kurtosis : the degree of peakedness or flatness of a probability distribution, relative to the normal distribution with the same variance = Σ(xᵢ - x̄)⁴ / (n·s⁴); > 3 : leptokurtic (more heavily concentrated around the mean, i.e., having a sharper, narrower peak than the normal distribution with the same variance); = 3 : normal distribution; < 3 : platykurtic (less concentrated about the mean, i.e., having a broader, flatter peak than the normal distribution with the same variance)
• skew distribution : a frequency distribution that is asymmetric
• mode : the most frequently occurring value or item in a distribution; when data are grouped, it is the midpoint of the grouping with the highest frequency. A distribution with 2 peaks is bimodal
• median (Me) : any value that divides the probability distribution of a random variable in half, i.e., the probability of observing a value above the median and the probability of observing a value below the median are both less than or equal to one half. For a finite population or sample, the median is the middle value of an odd number of values (arranged in ascending order) or any value between the 2 middle values of an even number of values; in the latter case it is conventional to use the average of the 2 middle values. The median is the value for which the sum of |xᵢ - Me| is minimal. Its value doesn't change when values of x < Me or x > Me are changed, and it can also be used for ordinal variables
• quantile : any of the values that divide the range of an observed or theoretical probability distribution into a given number of equal, ordered parts. Each value divides the range into 2 specified parts, with the part below the value corresponding to a prescribed fraction p and the part above to 1 - p
• percentile : any one of the 99 values that divide the range of a probability distribution or sample into 100 intervals of equal probability or frequency, e.g., 45% of a population scores below the 45th percentile
• quartile : any of the 3 values that divide the range of a probability distribution into 4 parts of equal probability, i.e., the 1st (Q1), 2nd (Q2), and 3rd (Q3) quartiles are the 25th, 50th, and 75th percentiles
• quintile : any of the 4 values that divide the range of a probability distribution into 5 parts of equal probability, i.e., the 1st, 2nd, 3rd, and 4th quintiles are the 20th, 40th, 60th, and 80th percentiles
• interquartile range : the difference between the data values at the 75th and 25th percentiles (Q3 - Q1), encompassing the middle 50 percent of the data
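Most of the location measures above are available in Python's standard library; a minimal sketch, reusing the worked examples from the definitions:

```python
# Descriptive statistics sketch using only the standard library.
import statistics as st

data = [10, 40, 60]

mean = st.mean(data)                   # arithmetic mean
gmean = st.geometric_mean([2, 8, 32])  # nth root of the product
hmean = st.harmonic_mean(data)         # reciprocal of the mean of reciprocals
s = st.stdev(data)                     # SD with the n-1 denominator
cv = s / mean                          # coefficient of variation

print(round(gmean, 6))   # 8.0  (2 * 8 * 32 = 512, cube root = 8)
print(round(hmean, 1))   # 21.2
```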
• sample of size n : a subset of a population that is selected for inclusion in a research study.
• random sample : a sample chosen from a population in such a way that each choice is independent of the other choices and every member of the population has a fixed and determinate probability of being chosen (usually an equal probability).
• stratified sample : that in which the population is first divided into multiple mutually exclusive groups or strata prior to choosing the sample
• stratified random sample : a homogeneously stratified sample in which random samples are chosen within each stratum (usually in relation to sex)
• data : the material or collection of facts on which a discussion or an inference is based.
• censored data : in statistics, observations whose final outcomes are not completely determined in a study, as, for example, data for patients who have not yet reached the study's endpoint (e.g., relapse or death) when the data are analyzed or who drop out of the study before reaching that endpoint
• estimates from a limited sample : all tests increase in value (power) as the difference between the samples increases.
• consistency of an estimator : the property of approaching the value of a population parameter as the sample size increases ad infinitum
• sample mean (x̄) : the arithmetic mean is the best estimator of the population mean as it minimizes the squared differences (least-squares criterion) : deviance = Σ(xᵢ - m)² = Σ(xᵢ² - 2·m·xᵢ + m²) = Σ(xᵢ²) - 2·m·Σ(xᵢ) + n·m²; setting the derivative f'(m) = -2·Σ(xᵢ) + 2·n·m to zero gives m = Σ(xᵢ)/n.
• sample standard deviation (s) or average quadratic difference = √{Σ(xᵢ - x̄)²/(n-1)} = √{[Σ(xᵢ²) - (Σxᵢ)²/n]/(n-1)}; as the sample is smaller than the population, its variability underestimates that of the population, and n-1 is used (rather than n) as the denominator in order to correct the underestimate
• sample variance (s²) : distributes as χ²; for frequency data, χ² = Σ[(observed - expected)²/expected]
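A minimal sketch of the n-1 correction: the standard library exposes both denominators, `pstdev` dividing by n and `stdev` by n-1 (the data are invented):

```python
# The n-1 denominator corrects the sample's underestimate of population spread:
# statistics.pstdev divides by n, statistics.stdev by n-1 (slightly larger).
import statistics as st

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

population_sd = st.pstdev(sample)  # divides by n
sample_sd = st.stdev(sample)       # divides by n-1

print(population_sd)          # 2.0
print(round(sample_sd, 4))    # 2.1381
```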
• central limit theorem :
• if random samples of size n are taken from a population having a normally distributed variable with mean m and standard deviation s,
• the distribution of the sample means is approximately normal, regardless of the distribution of values within the original population from which the samples have been extracted. Approximately :
• 68% of sample means fall in the interval [m - s; m + s]
• 95% in [m - 2s; m + 2s]
• 99.7% in [m - 3s; m + 3s]
(where s here is the standard deviation of the sampling distribution, i.e. the standard error of the mean)
• the mean of all the (N choose n) possible sample means of samples of n members = m of the original population
• standard error of the mean (SEM) / average sampling error : the standard deviation of all the (N choose n) possible sample means of samples of n members. As n·x̄ = x₁ + x₂ + ... + xₙ => s²(n·x̄) = n·s² => s(n·x̄) = s·√n => SEM = s·√n/n = s/√n = √{Σ(xᵢ - x̄)²/[n·(n-1)]}. SEM is always lower than s : while the former describes the degree of uncertainty with which a sample mean estimates the actual population mean, the latter describes the variability of the population.
• P(m - z(α/2)·s/√n < x̄ < m + z(α/2)·s/√n) = (1-α)
SEM varies with n. To demonstrate a given effect Δ, the required sample size is n = (z(α/2)·s/Δ)²
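A short sketch of SEM and of the sample-size formula n = (z(α/2)·s/Δ)²; the helper names and example numbers are assumptions, and 1.96 is the two-sided z for α = 0.05:

```python
# SEM and the sample size needed to detect a difference delta (a sketch).
import math
import statistics as st

def sem(sample):
    """Standard error of the mean: s / sqrt(n)."""
    return st.stdev(sample) / math.sqrt(len(sample))

def required_n(sigma, delta, z_alpha_2=1.96):
    """n = (z * sigma / delta)^2, rounded up to a whole subject."""
    return math.ceil((z_alpha_2 * sigma / delta) ** 2)

print(round(sem([4, 6, 8, 10, 12]), 3))   # 1.414
# Detecting a shift of half a standard deviation needs about 16 subjects:
print(required_n(sigma=1.0, delta=0.5))   # 16
```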
• if the variable in the population is not normally distributed, the sampling distribution of means still approximates the normal distribution, and the approximation gets better as the sample size increases
• descriptive statistics : that measuring and describing characteristics of groups, without drawing inferences about the population in general.
• inferential statistics : that generalizing conclusions from sample data, using theories of probability to estimate population parameters.
• bootstrap : in statistics, a method for computing the distribution of values based on random resampling from the observed data
• sequential analysis : a statistical technique in which the sample size is not fixed in advance, rather, sampling is stopped as soon as significant results are observed. The criteria for stopping the trials at each sample size are set so that the overall probability (for all sample sizes) of falsely rejecting the null hypothesis at any step is held to a preset level.
• confidence interval (CI) : a type of statistical interval estimate for an unknown parameter: a range of values believed to contain the parameter, with a predetermined degree of confidence. Its endpoints are the confidence limits and it has a stated probability (the confidence coefficient (1-a)) of containing the true value of the population parameter. For example, if the confidence coefficient is 0.95, 95% of the confidence intervals so calculated for each of a large number of random samples would contain the parameter. Higher confidence coefficients require larger confidence intervals
• conservative confidence interval : a confidence interval having a confidence coefficient at least as great as a stated nominal value
• hypothesis test : an abstract procedure for determining whether a set of observations is consistent with a hypothesis under consideration; it is the theoretical basis of most statistical tests. A hypothesis test decides between 2 hypotheses, one stating that the effect under investigation does not exist (the null hypothesis, H0), and the other that some specified effect does exist (the alternative hypothesis, Ha or H1), based on the observed value of a test statistic whose sampling distribution is completely determined by H0. When the test statistic falls in a set of values known as the critical region, H0 is rejected. The level of probability of incorrectly rejecting H0 may be set before the data are collected, usually at 0.05 or 0.01; this is called the significance or confidence level / level of significance (a) = 1 - confidence coefficient. It is now more common to report the smallest a at which the null hypothesis can be rejected; this is called the significance probability / P value (the probability of obtaining by chance a result at least as extreme as that observed, even when H0 is true and no real difference exists; when P < 0.05 the sample results are usually deemed significant at a statistically important level and H0 is rejected).
• one-tailed test : a hypothesis test in which the critical region is one tail of the distribution of the test statistic and H0 is tested against a 1-sided alternative that includes deviations from H0 only in one direction, deviations in the other direction being of no consequence.
• two-tailed test : hypothesis test in which the critical region comprises both tails of the distribution of the test statistic and H0 is tested against a 2-sided alternative that includes deviation from H0 in both directions.
• analysis of variance (ANOVA) : a statistical method for analyzing the effects of each of one or more categorical (nominal, ordinal, or dichotomous) independent variables on a continuous dependent variable as well as on each other, examining more than 2 groups simultaneously; if the null hypothesis that the variables' effects do not differ and all outcomes are drawn from the same population is true, then the means of all outcome groups approximate each other. ANOVA doesn't tell which sample(s) differ(s) from the others, nor the significance of each difference. When a single independent variable is tested the method is called one-way ANOVA (valid only for normally distributed populations); when multiple independent variables are tested, N-way (factorial) ANOVA; when multiple dependent variables are tested, multivariate ANOVA (MANOVA). To test the hypothesis, the variability between group means is compared to that within groups using the Fisher-Snedecor (F) test : assuming the samples are homogeneous,
• the within-group (wg) variance s²wg = Σ(sᵢ²)/k, with νdenominator = k·(n-1)
• the between-groups (bg) variance s²bg = n·Σ(x̄ᵢ - X̄)²/(k-1), with k = number of samples and νnumerator = k-1
• νdenominator + νnumerator = k·n - 1
• F-ratio = F(νnumerator, νdenominator)(α) = s²bg/s²wg. If the F-ratio exceeds the critical value (the value beyond which only 5% of results fall), the variability between sample means is greater than expected on the basis of the variability within the single samples, so the null hypothesis that all observations came from the same population (equality hypothesis) is rejected and the alternative hypothesis is accepted; under the null hypothesis the F-ratio is expected to approximate 1.0
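The within/between-group formulas above can be sketched for k equal-sized groups; the data and function name are invented for illustration:

```python
# One-way ANOVA F ratio computed by hand for k equal-sized groups.
import statistics as st

def one_way_f(groups):
    k = len(groups)
    n = len(groups[0])                                # assumes equal group sizes
    group_means = [st.mean(g) for g in groups]
    s2_wg = st.mean([st.variance(g) for g in groups]) # within-group variance
    s2_bg = n * st.variance(group_means)              # between-groups variance
    df_num, df_den = k - 1, k * (n - 1)
    return s2_bg / s2_wg, df_num, df_den

groups = [[5, 6, 7], [8, 9, 10], [11, 12, 13]]
f, df1, df2 = one_way_f(groups)
print(round(f, 2), df1, df2)  # 27.0 2 6
```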
Conclusions from observations :
• effective treatment => reject H0 and accept Ha : if the treatment is actually effective, a true positive (1-β : the test power); if actually ineffective, a false positive (α : type I error, the significance or confidence level / level of significance)
• ineffective treatment => H0 not rejected : if the treatment is actually effective, a false negative (β : type II error); if actually ineffective, a true negative (1-α : the confidence coefficient)
When the sample size grows, the SEM decreases (SEM = s/√n) and consequently the variability of the estimate of the difference between the means of the 2 samples decreases relative to the actual difference of the means : this means we have less trust in the no-effect hypothesis (H0) for the group-discriminating agent (e.g. a drug), and it becomes probable that the 2 samples should be considered as extracted from 2 different populations.
• critical ratio : any of a class of tests of statistical significance in which a parameter is divided by its standard error; e.g., in Student's t-test, the critical ratio is the difference between 2 means divided by the standard error of that difference. The larger the ratio, the more likely the difference is significant.
• Student's t-test for independent samples (Student was the pseudonym of Gosset) : a statistical hypothesis test for a difference between the means of 2 groups based on the t-distribution : the probability distribution of the statistic t = (difference of sample means - 0) / (standard error of the difference between sample means, s(x̄₁ - x̄₂)) = (x̄₁ - x̄₂) / √(SEM₁² + SEM₂²) = (x̄₁ - x̄₂) / √(s₁²/n₁ + s₂²/n₂); for a single sample, t = (x̄ - m) / (s/√n),

• where x̄ and s are the mean and standard deviation of a sample of size n taken from a population with a normal distribution having mean m. It is symmetric about 0 and approaches the normal distribution as the sample size increases.
t distributes according to the bell-shaped symmetrical curve y = y₀ / [1 + t²/ν]^((ν+1)/2) (this is not a gaussian distribution : x̄ and m enter symmetrically, but s doesn't, as it distributes as √(χ²)).
tν = (x - x̄) / sₓ
Only when ν > 60 does t approximate Z (t₆₀(0.05) ≈ 2).
When σ is known, m - 1.96·σ/√n < x̄ < m + 1.96·σ/√n => x̄ - 1.96·σ/√n < m < x̄ + 1.96·σ/√n
• for large samples : m - 2·SEM < x̄ < m + 2·SEM => x̄ - 2·SEM < m < x̄ + 2·SEM
• for small samples : m - tν·SEM < x̄ < m + tν·SEM => x̄ - tν·SEM < m < x̄ + tν·SEM
For the null hypothesis H0 :
• nondirectional hypothesis (two-tail t test) (the most used)
• H0 : m₁ - m₂ = 0
• H1 : m₁ - m₂ ≠ 0
• directional hypothesis (one-tail t test)
• H0 : m₁ - m₂ ≤ 0 (or ≥ 0)
• H1 : m₁ - m₂ > 0 (or < 0)
... to increase the sensitivity of t, the s₁² and s₂² variances can be replaced by the combined (pooled) estimate of the variance (s²), equivalent to their weighted arithmetic mean : s² = [(n₁-1)·s₁² + (n₂-1)·s₂²] / [(n₁-1) + (n₂-1)] = [(n₁-1)·s₁² + (n₂-1)·s₂²] / (n₁ + n₂ - 2), so that t = (x̄₁ - x̄₂) / √(s²/n₁ + s²/n₂), with ν = n₁ + n₂ - 2 (= 2·(n-1) for equal-sized groups) degrees of freedom. If |t| < |tν(α)| tabled at a certain confidence coefficient for ν = (n₁ - 1) + (n₂ - 1) = n₁ + n₂ - 2, there is no proof that differences exist (H0 is not rejected). The α value corresponds to the probability of making by chance an observation at least as extreme when H0 is true : α is set low when the consequences of a type I error are severe. If (n₁ + n₂) grows, it becomes possible to detect smaller differences at any prefixed confidence coefficient.
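A sketch of the pooled two-sample t statistic under these formulas (the data are invented):

```python
# Two-sample t with the pooled variance estimate, per the formula above.
import math
import statistics as st

def pooled_t(x, y):
    n1, n2 = len(x), len(y)
    s2 = ((n1 - 1) * st.variance(x) + (n2 - 1) * st.variance(y)) / (n1 + n2 - 2)
    t = (st.mean(x) - st.mean(y)) / math.sqrt(s2 / n1 + s2 / n2)
    return t, n1 + n2 - 2  # t statistic and its degrees of freedom

t, df = pooled_t([5, 6, 7, 8], [8, 9, 10, 11])
print(round(t, 3), df)  # -3.286 6
# |t| = 3.29 > t6(0.05) = 2.447, so H0 is rejected at alpha = 0.05.
```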
Placing X̄ = (x̄₁ + x̄₂)/2 for 2 groups of equal size n, F-ratio = s²bg/s²wg = n·[(x̄₁ - X̄)² + (x̄₂ - X̄)²]/(2-1) / [1/2·(s₁² + s₂²)] = n·(x̄₁ - x̄₂)²/(s₁² + s₂²) = (x̄₁ - x̄₂)²/(s₁²/n + s₂²/n) = [(x̄₁ - x̄₂)/√(SEM₁² + SEM₂²)]² = t², i.e. for 2 groups the F-test and the t-test coincide.
When using Student's t-test to check for differences among the means of m (> 2) groups, the true P value (i.e. the probability of erring when affirming that the means of 2 groups extracted from the same population are different) may be estimated by adding the P values of the multiple tests (if the comparisons are not too many) or otherwise by multiplying the P value obtained by the number of t-tests practiced (k = C(m,2) = m·(m-1)/2). In such a case, in fact, the simple sum of the P values (i.e. the probability α of erroneously affirming that the drug is effective) is higher than the declared nominal level; by increasing the probability of rejecting the no-effect hypothesis H0, it increases the probability of declaring a therapy effective when there is no sound proof. To verify membership in a population of mean m, t = (x̄ - m)/SEM = (x̄ - m)/(s/√n). You cannot affirm equality, but only state how little the means differ, from non-significant (n.s.) t values.
• Student's t-test for paired samples : based on the average variation (d̄) in single individuals before and after a treatment (H0 : d̄ = 0) rather than on the difference in average response of 2 samples (one treated and the other not). s_d = √{Σ(d - d̄)²/(n-1)} => s_d̄ = s_d/√n => t = (d̄ - 0)/s_d̄, with ν = n-1. Pooling all individuals before and after the treatment, the variability in the responses of single individuals would mask the variation due to the treatment, the actual endpoint.
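The paired test above can be sketched as follows (the before/after data are invented):

```python
# Paired t statistic on before/after measurements of the same subjects.
import math
import statistics as st

def paired_t(before, after):
    d = [a - b for a, b in zip(after, before)]  # per-subject differences
    n = len(d)
    sem_d = st.stdev(d) / math.sqrt(n)          # standard error of mean difference
    return st.mean(d) / sem_d, n - 1            # t and its degrees of freedom

t, df = paired_t(before=[120, 130, 125, 140], after=[115, 124, 119, 135])
print(round(t, 2), df)  # -19.05 3
```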
• analysis of variance for paired data (repeated measures) : the same notation as one-way ANOVA is used, but the same n subjects undergo m treatments. The general mean of all observations is X̄ = ΣₜΣₛ(Xts)/(m·n)

• Total variability :
• between subjects (bs) : SSbs = m·Σ(S̄s - X̄)², with νbs = n-1
• within subjects (ws) : SSws = ΣₛΣₜ(Xts - S̄s)²
• between treatments (bt) : SSbt = n·Σ(T̄t - X̄)², with νbt = m-1
• residual variability : SSres = SSws - SSbt, with νres = (n-1)·(m-1)
The average response of each subject to the various treatments S̄s = Σₜ(Xts)/m
The average response to each treatment by the various subjects T̄t = Σₛ(Xts)/n
F = MSbt/MSres = (SSbt/νbt)·(νres/SSres) = [n·(n-1)·Σₜ(T̄t - X̄)²] / [ΣₛΣₜ(Xts - S̄s)² - n·Σₜ(T̄t - X̄)²]. If the value of the F statistic is high for νbt and νres, significant differences are isolated by using Bonferroni's correction (i.e. comparing t = (T̄i - T̄j)/√(2·MSres/n) with the critical value reported for P < αT/k, where k is the number of multiple comparisons), or with the SNK test or Dunnett's test, always substituting MSres for s²wg and using νres to determine the critical values of q and q'.
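The sums of squares above can be sketched for a small invented subjects-by-treatments table:

```python
# Repeated-measures ANOVA sums of squares (rows = subjects, columns = treatments).
import statistics as st

X = [[10, 12, 14],
     [11, 13, 16],
     [ 9, 11, 13]]           # n = 3 subjects, m = 3 treatments (invented data)
n, m = len(X), len(X[0])
grand = st.mean(v for row in X for v in row)
subj_means = [st.mean(row) for row in X]
treat_means = [st.mean(X[s][t] for s in range(n)) for t in range(m)]

ss_ws = sum((X[s][t] - subj_means[s]) ** 2 for s in range(n) for t in range(m))
ss_bt = n * sum((tm - grand) ** 2 for tm in treat_means)
ss_res = ss_ws - ss_bt       # residual = within-subjects minus between-treatments
F = (ss_bt / (m - 1)) / (ss_res / ((n - 1) * (m - 1)))
print(round(F, 2))  # 127.0
```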
• McNemar test for paired data with 1 degree of freedom whose results are measured on a nominal scale : when comparing rates in a sample before and after a treatment, or when comparing the results of 2 different diagnostic tests on the same sample, rows and columns in the contingency table are no longer independent as in bernoullian trials. The χ² test may then be used : ignore the subjects that responded in the same way (positive or negative both before and after) to the 2 treatments (as they don't produce useful information) and count the subjects that responded differently to the 2 treatments. The expected numbers of +/- and -/+ subjects should each be half that count => calculate χ² including Yates' correction for continuity (2 x 2 contingency table) and compare with the critical value χ²₁(α)

            Y+        Y-
  X+     ignored      B
  X-        C      ignored

χ²Y = (|B - C| - 1)²/(B + C), or z = [|B - 0.5·(B + C)| - 0.5] / √[n·p·(1-p)], with n = B + C and p = 1/2
Placing pT = xT/nT and pC = xC/nC, p = (xT + xC)/(nT + nC) =>
z = |pT - pC| / √[p·q·(1/nT + 1/nC)], or with Yates' correction zY = |(xT - 0.5)/nT - (xC - 0.5)/nC| / √[p·q·(1/nT + 1/nC)]
When (pT - pC) is statistically significant, the 95% confidence interval for the true difference is (pT - pC) ± 1.96·√(pT·qT/nT + pC·qC/nC)
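A minimal sketch of the corrected McNemar statistic, using only the discordant counts B and C (the counts are invented):

```python
# McNemar chi-square with Yates' continuity correction, 1 degree of freedom.

def mcnemar_chi2(b: int, c: int) -> float:
    """chi2_Y = (|B - C| - 1)^2 / (B + C)."""
    return (abs(b - c) - 1) ** 2 / (b + c)

# 25 subjects improved only after treatment, 8 only before:
chi2 = mcnemar_chi2(25, 8)
print(round(chi2, 2))  # 7.76; > 3.84, the critical chi2 for 1 DF at alpha = 0.05
```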
• codeviance = Σ[(xᵢ - x̄)(yᵢ - ȳ)]
• covariance = Σ[(xᵢ - x̄)(yᵢ - ȳ)]/n
• analysis of covariance (ANCOVA) : a statistical procedure used with 1 dependent variable and multiple independent variables of both categorical (ordinal, dichotomous, or nominal) and continuous types; it is a variation of analysis of variance that adjusts for confounding by continuous variables
• procedures for multiple comparisons between m samples when the F-test detects differences : ANOVA tests the H0 "all samples come from the same population", but does not indicate which sample(s) differ(s) from the others. Using the t-test for this purpose erroneously inflates the probability of registering an effect above the nominal value. Performing k = C(m,2) = m·(m−1)/2 statistical tests with a confidence coefficient α, the calculated t value has to be compared with the threshold value for the confidence coefficient α/k. Bonferroni's inequality states that the actual probability of concluding (at least once) that a difference exists is αtot < k·α => to keep the global error rate < α, compare the t values with the threshold for αtot/k. If α/k is not tabulated, the critical values are interpolated between tabulated ones. Taking the within-groups variance s²wg as the population variance estimate => t' = (x̄1 − x̄2) / √[s²wg · (1/n1 + 1/n2)], with νt' = νs²wg = m·(n−1) > νt. Lower threshold values => higher sensitivity. Bonferroni's correction is too conservative for comparing > 10 samples, since its threshold values derive from the global error rate. Once a significant F value has been calculated, the Student-Newman-Keuls (SNK) test can be used : the sample means are placed in increasing order and q = (x̄A − x̄B) / √[(s²wg/2) · (1/nA + 1/nB)], with νq = νt = m·(n−1). The critical value of q depends on αtot, ν, and p = (|rank of A − rank of B| + 1). The highest mean is compared with the lowest one, then with the next-to-lowest, and so on; the operation is then repeated replacing the highest mean with the second highest, and so on. If no significant difference exists between 2 means, none exists between the means placed between them and no further tests are required. The SNK test sets the error rate for all comparisons involving p means. In Tukey's test all samples are tested as if they were separated by the highest number of steps (i.e. p = m for every test) : it is less likely that Tukey's test will declare a difference significant. Scheffé's test is the most conservative (lowest probability of declaring comparisons significant).
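As a minimal sketch of the Bonferroni correction described above (the number of samples and the error rate are hypothetical):

```python
from math import comb

# Hypothetical example: m = 5 samples compared pairwise with t-tests.
m = 5
k = comb(m, 2)                     # number of pairwise comparisons: C(5,2) = 10
alpha_total = 0.05                 # desired global (family-wise) error rate
alpha_per_test = alpha_total / k   # Bonferroni threshold for each single t-test
```

Each of the k t values is then compared against the critical value for alpha_per_test rather than for alpha_total.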
• multiple comparisons with a single control group when the F-test detects differences : Bonferroni's test with k = m instead of C(m,2) => the lower critical values show it is easier to find a difference against a control group than when performing all possible comparisons. The analogue of the SNK test (and hence more sensitive) is Dunnett's test : q' = (x̄control − x̄A) / √[s²wg · (1/ncontrol + 1/nA)]
• analysis of rates and proportions of categorical variables (measured on a nominal scale instead of constant-interval scales) : in independent bernoullian trials, if n·p and n·(1−p) are > 5, the bernoullian distribution can be approximated by the gaussian or normal distribution => Z test : (difference of sample proportions) / (standard error of the difference between the sample proportions) = [(p̂1 − p̂2) − 0] / √[p̂1·(1−p̂1)/n1 + p̂2·(1−p̂2)/n2]

• If H0 : "the samples come from the same population" is true, p̂1 and p̂2 both estimate p = (k1 + k2) / (n1 + n2) => higher sensitivity and Z = (p̂1 − p̂2) / √[p̂·(1−p̂)·(1/n1 + 1/n2)] (no loss of degrees of freedom)
Yates' correction for continuity : ZY = [|p̂1 − p̂2| − (1/2)·(1/n1 + 1/n2)] / √[p̂·(1−p̂)·(1/n1 + 1/n2)]
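A minimal sketch of the pooled Z test just described (the counts are hypothetical):

```python
from math import sqrt

def z_pooled(k1, n1, k2, n2):
    """Z statistic for comparing two proportions using the pooled estimate
    p = (k1 + k2)/(n1 + n2), as in the formula above."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# hypothetical data: 30/100 successes in one sample vs 15/100 in the other
z = z_pooled(30, 100, 15, 100)
```

The resulting z (about 2.54) is then compared with the critical value of the standard normal distribution for the chosen α.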
To compare > 2 samples or possible outcomes, use contingency tables (nonparametric statistics), with ν = (number of rows − 1)·(number of columns − 1). Expected frequencies under H0 in a 2 × 2 table with row totals R1 and R2, column totals S1 and S2, and grand total N :
• outcome 1 : a = R1·S1/N ; b = R1·S2/N ; row total R1
• outcome 2 : c = R2·S1/N ; d = R2·S2/N ; row total R2
• column totals : S1 ; S2 ; N
• chi-square (χ²) test : any statistical hypothesis test that employs the chi-square (χ²) distribution (a theoretical probability distribution of the sum of the squares of a number (k) of normally distributed variables whose mean is 0 and standard deviation is 1; the parameter k is the number of degrees of freedom), especially 2 tests applied to categorical data:
• the χ²-test of goodness of fit, which tests whether an observed frequency distribution fits a specified theoretical model
• the χ²-test of independence or homogeneity, which tests whether 2 or more series of frequencies (the rows and columns of a contingency table) are independent.
In both cases the test statistic χ² = Σ[(observed frequency − expected frequency)²/(expected frequency under the null hypothesis in each cell)]. Squaring amplifies all differences greater than 1 unit.
• in a 2 × 2 contingency table χ² = [N·(|a·d − b·c| − N/2)²] / [(a+b)·(c+d)·(a+c)·(b+d)], with ν = 1; all 4 expected frequencies should be > 5 and Yates' correction for continuity should be applied : χ²Y = Σ[(|observed frequency − expected frequency| − 1/2)²/expected frequency]
• for larger tables the expected frequencies should all be > 0 and fewer than 20% of them may be < 5. The alternatives are to collect more data or to reduce the number of categories.
Under the null hypothesis, the sampling distribution of this χ² statistic approaches the χ² distribution as the sample size increases
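The 2 × 2 shortcut formula with Yates' correction can be sketched as follows (the table entries are hypothetical):

```python
def chi2_2x2_yates(a, b, c, d):
    """Yates-corrected chi-square for a 2x2 table [[a, b], [c, d]],
    using the shortcut formula N*(|ad - bc| - N/2)^2 / (R1*R2*S1*S2)
    given above."""
    n = a + b + c + d
    num = n * (abs(a * d - b * c) - n / 2) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# hypothetical 2x2 table: 10/20 vs 4/20 successes
chi2 = chi2_2x2_yates(10, 10, 4, 16)
```

The statistic (about 2.75 here) is compared with the χ² critical value for ν = 1; the p-value itself would need the χ² distribution function, which is not reproduced in this sketch.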
• exact test : a statistical test based on the actual probability distribution of the data in the study, rather than on an approximation of it
• Fisher exact test : a statistical hypothesis test of independence of rows and columns in a 2 x 2 contingency table based on the exact sampling distribution of the observed frequencies, useful when any expected value in the table is small.
• one-way :
• two-way : the 2 tails are identical only if R1 = R2 and S1 = S2
P = (R1! · R2! · S1! · S2!/N!)/(a! · b! · c! · d!)
Identify the cell with the lowest frequency and decrease its frequency by 1 unit at a time until 0, adjusting the other cells to keep the marginal totals fixed and finding the new Pi's as above (the numerator doesn't change). Repeat the operation with the other 2 cells, calculating Pi (not for tables with repeated values). PT = Σ(Pi ≤ P)
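The two-tailed procedure above (enumerate all tables with the same margins and sum the probabilities not exceeding the observed one) can be sketched as follows; the table values are hypothetical:

```python
from math import factorial as f

def table_prob(a, b, c, d):
    """Exact probability of a 2x2 table with fixed margins:
    P = R1!*R2!*S1!*S2! / (N!*a!*b!*c!*d!), as in the formula above."""
    r1, r2, s1, s2 = a + b, c + d, a + c, b + d
    n = r1 + r2
    return (f(r1) * f(r2) * f(s1) * f(s2)) / (f(n) * f(a) * f(b) * f(c) * f(d))

def fisher_two_sided(a, b, c, d):
    """Two-tailed Fisher exact P: sum the probabilities of every table with
    the same margins whose probability does not exceed that observed."""
    r1, s1, n = a + b, a + c, a + b + c + d
    p_obs = table_prob(a, b, c, d)
    p_total = 0.0
    for a2 in range(max(0, r1 + s1 - n), min(r1, s1) + 1):
        b2, c2 = r1 - a2, s1 - a2
        d2 = n - r1 - c2
        p = table_prob(a2, b2, c2, d2)
        if p <= p_obs + 1e-12:      # small slack for float rounding
            p_total += p
    return p_total

# hypothetical table [[3, 1], [1, 3]]
p = fisher_two_sided(3, 1, 1, 3)
```

For this table the possible tables have probabilities 1/70, 16/70, 36/70, 16/70, 1/70, so the two-tailed P is 34/70.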
• power of a statistical test (1−β) : when demanding more convincing evidence of an effect before accepting H1, the probability α of making a type I error should be reduced, but this increases the risk β of not detecting an effect when one exists, owing to the decreased power (1−β). Power is also directly proportional to the size of the treatment effect to be detected (i.e. it is easier to detect larger than smaller differences) : the latter is evaluated by the unitless non-centrality parameter φ = δ/swg, where δ is the minimal difference one wants to detect between each pair. Power function (once α is fixed; the power curves are not reproduced here)
• So t = (x̄1 − x̄2)/√(s²/n1 + s²/n2); if n1 = n2 = n (i.e. best power) => t = (x̄1 − x̄2)/[s·√(2/n)] = δ/[s·√(2/n)] = φ·√(n/2)
By increasing sample size, n increases => the critical values of t (for a given α) decrease, while the t values increase => the t distribution is centred on higher values => 1−β is increased without increasing α.
The power of an ANOVA is instead related to a non-centrality parameter φ = (δ/s)·√[n/(2k)] = √{n·Σ(μi − μ̄)²/(k·s²)}, where δ is the minimal difference one wants to detect between each pair, n is the size of each group and k is the number of groups. Once νn and α are fixed => power function.
The power of a contingency table having r rows and c columns whose totals are R and C is related to the non-centrality parameter φ = √{(N/[(r−1)·(c−1) + 1])·Σ[(Pij − RiCj)²/(RiCj)]}. The power is determined with the above-mentioned power function for νn = (r−1)·(c−1) and νd = +∞
To achieve a given power 1−β we need to obtain φ from the power function => the sample size N required to plan a trial can be obtained from the above-mentioned formulas : the standard deviation may be obtained from pilot studies or assumptions. To limit the number of subjects required for a trial, it is sometimes claimed that one wants to detect an effect greater than that actually relevant for clinical purposes => reduced 1−β : only negative studies with a sufficiently large sample size allow definitive conclusions. To obtain the same α and β levels with smaller samples you may use sequential trials : at the end of each stage you may decide whether to accept or reject H0 or to add another subject (avoiding repeated t-tests).
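The planning step just described can be sketched with the common normal-approximation formula n ≈ 2·[(z(1−α/2) + z(1−β))·σ/δ]² per group; δ and σ below are hypothetical:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for comparing two means, using
    the normal approximation (a sketch of the planning step above;
    delta is the minimal difference to detect, sigma the assumed SD)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed alpha quantile
    z_b = NormalDist().inv_cdf(power)           # power quantile
    return ceil(2 * ((z_a + z_b) * sigma / delta) ** 2)

# hypothetical planning values: detect delta = 5 with sigma = 10
n = n_per_group(delta=5.0, sigma=10.0)
```

A t-based calculation would require slightly larger n; this normal sketch is the usual first approximation.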
If a confidence interval contains 0, there is no evidence to reject H0 at P < α, while in the opposite case there is enough evidence. The interval also shows whether statistical significance has been achieved thanks to the actual importance of the effect or only thanks to the sample size
Confidence intervals (CI) at the 100(1−α)% level :
• for the difference of the means : (x̄1 − x̄2) − tα·s(x̄1−x̄2) < μ1 − μ2 < (x̄1 − x̄2) + tα·s(x̄1−x̄2)
• for the population mean : x̄ − tα·sx̄ < μ < x̄ + tα·sx̄
• for the difference of proportions : (p̂1 − p̂2) − zα·s(p̂1−p̂2) < p1 − p2 < (p̂1 − p̂2) + zα·s(p̂1−p̂2). If np < 5, the binomial distribution no longer matches the gaussian distribution and the limits are determined exactly as intersections of the function. If p̂ = 0 => 0 < p < 3/n (approximate rule)
• for the whole population (confidence limits) : if the sample is small, the "2s rule" underestimates the interval of assumable values because both x̄ and s are just estimates, hence x̄ − kα·s < x < x̄ + kα·s, where kα is a function of sample size, α, and the population fraction included. kα > tα > zα. Once α is fixed, ...
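A minimal sketch of the confidence interval for a difference of proportions (normal approximation, hypothetical counts), which also illustrates the "contains 0" criterion above:

```python
from math import sqrt
from statistics import NormalDist

def diff_prop_ci(k1, n1, k2, n2, alpha=0.05):
    """100(1-alpha)% CI for p1 - p2 via the normal approximation,
    per the formula above (unpooled standard error)."""
    p1, p2 = k1 / n1, k2 / n2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - z * se, (p1 - p2) + z * se

# hypothetical data: 30/100 vs 15/100
lo, hi = diff_prop_ci(30, 100, 15, 100)
```

Here the interval excludes 0, which matches rejecting H0 at P < 0.05.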
• Kolmogorov-Smirnov test : a statistical test of goodness of fit of a sample to a specified theoretical distribution function, based on the size of the maximum difference between the cumulative distribution functions of the sample and theoretical distributions and using the exact sampling distribution of this difference to determine the significance level. The test can also be used to determine whether two samples are drawn from the same population by examining the maximum difference between the cumulative distribution functions of the two samples.
• likelihood ratio test : in statistics, a test using the ratio of the maximum value of the likelihood function from one statistical model to that from another model, a smaller ratio indicating a stronger relationship between the variables
• log-rank test (Peto R, Peto J. Asymptotically efficient rank invariant test procedures. J R Stat Soc [A] 1972;135:185-206) : a statistical test used to test the null hypothesis that 2 groups have the same distribution of survival by analyzing and comparing the number of observed and expected deaths for each group each time a death occurs in either group
• nonparametric test : one using nonparametric statistics; nonparametric tests are often less powerful than parametric tests but are valid in cases where parametric tests are not.
• rank sum test / Mann-Whitney U test / Mann-Whitney-Wilcoxon test / Wilcoxon rank sum test : a nonparametric statistical test for ordinal data, testing the null hypothesis that two samples are drawn from the same population versus the alternative hypothesis that the two samples are drawn from two populations having probability distributions of the same shape but different locations. It is based on the value of the rank sum statistic, which is calculated as the sum of the ranks of each sample after the observations in both samples are jointly ranked in ascending order; if and only if the null hypothesis is true, the average ranks of the 2 samples will be similar
• sign test : a nonparametric statistical test based on a null hypothesis that by chance the experimental group should outperform the control group for half the outcome variables and vice versa. Results are scored as a series of pluses and minuses awarded to the experimental group depending on its performance relative to that of the control group, a binomial distribution of scores with p = 0.5 being expected under the null hypothesis.
• Wilcoxon signed rank test : a nonparametric statistical test for ordinal data, comparing two populations of data by examining the differences between matched pairs in the two populations. It is based on the signed rank statistic, calculated by arranging all samples in order without regard to which population they are drawn from, identifying pairs, assessing the difference in rankings for the members of each pair, and summing these differences for all pairs. If the null hypothesis is true and there is no difference between the two populations, the median difference in rankings between matched pairs in the population approximates zero.
• Kruskal-Wallis H test : a nonparametric test for ordinal data, comparing 3 or more groups simultaneously: all data are ranked numerically and then the rank values are summed and averaged for each group. If the null hypothesis that all groups are drawn from the same population is true, then the mean ranks should be similar across all groups.
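The joint-ranking step shared by the rank-based tests above can be sketched as follows (data are hypothetical); tied values receive the mean of the ranks they occupy, as described:

```python
def rank_sum(sample1, sample2):
    """Rank sum statistic for the Mann-Whitney / Wilcoxon rank sum test:
    sum of the ranks of sample1 after pooling and ranking both samples
    (tied values share the mean of the ranks they occupy)."""
    pooled = sorted(sample1 + sample2)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1                       # j is one past the run of ties
        ranks[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    return sum(ranks[x] for x in sample1)

# hypothetical ordinal data
T = rank_sum([1, 3, 5], [2, 4, 6])
```

Under H0 the rank sum T is close to its expected value n1·(n1 + n2 + 1)/2; here T = 9, exactly the expected value for two samples of 3.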
• the 2x2x2 contingency table :
• the Cochran-Mantel-Haenszel test assumes a common odds ratio and tests the null hypothesis that X and Y are conditionally independent given Z. In short, the purpose of the CMH test is to test whether the response is conditionally independent of the explanatory variable when adjusting for the control variable.
• the Mantel-Haenszel estimator measures the strength of association by estimating the common odds ratio. In a 2×2 table, a single odds ratio relates the odds of success in row 1 to those in row 2. In the 2×2×2 table, however, there are 2 odds ratios, so an overall odds ratio must be calculated to measure the strength of association. In short, the purpose of the MH estimator is to estimate the average conditional association between the explanatory and the response variable.
• the Breslow-Day statistic tests the null hypothesis of a homogeneous odds ratio, i.e. whether the odds ratio between X and Y is the same across the Z categories. It is a test of homogeneous association. In short, when there is more than one explanatory variable (usually one is the explanatory variable (X) and the other is the control variable (Z)), three analyses should be done : testing the conditional independence of X and Y (CMH test), estimating the strength of their association (M-H estimator), and testing the homogeneity of the odds ratio (B-D test)
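The first two of those three steps can be sketched directly from their textbook formulas (the strata below are hypothetical); the Breslow-Day test is omitted as it needs an iterative expected-count calculation:

```python
def cmh(tables):
    """Cochran-Mantel-Haenszel statistic (with continuity correction) and
    Mantel-Haenszel common odds ratio over a list of 2x2 strata
    [[a, b], [c, d]] -- a sketch of the stratified analysis above."""
    num = var = or_num = or_den = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        num += a - (a + b) * (a + c) / n                       # a - E(a)
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
        or_num += a * d / n                                    # MH numerator
        or_den += b * c / n                                    # MH denominator
    chi2 = (abs(num) - 0.5) ** 2 / var     # compared with chi-square, df = 1
    return chi2, or_num / or_den

# two hypothetical 2x2 strata
chi2, or_mh = cmh([[[10, 5], [5, 10]], [[8, 4], [4, 8]]])
```

Both strata here point the same way, so the common odds ratio (4.0) summarizes them sensibly; the Breslow-Day step would first check that homogeneity.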
• verification of relationships among variables : predicting the values assumed by a dependent variable as the independent variable(s) vary
• regression analysis : interpretation of a finite population of data by exploring the relationship between several variables using the principle of regression : a functional relationship between the mean value of a random dependent or response variable (y) and the corresponding values of one or more variables identified by the experimenter (the independent or explicative variables / predictors (x)). The regression curve is a curve describing the relation between the average value of one variable (the dependent variable) and the values of one or more independent variables; the regression curve of Y on X is the graph of the average value of Y associated with each value of X.
• simple regression : only 2 variables
• linear regression : the statistical procedure for fitting a straight regression line to observed data, usually by minimizing the sum of the squared deviations of the observed values of the dependent variable from the regression line (least-squares regression). Given a value of x, μy|x is the mean of all possible values of y = f(x), arranged in a normal distribution with constant σy|x.
• μy|x = α + β·x (mean line). The 2 population parameters, the intercept α and the slope β (true values), can be estimated with the least-squares criterion (i.e. the values a and b that minimize the deviance between the observed values and the values estimated by the mean line)
• g(a,b) = Σ[(yi − ŷi)²] = Σ[(yi − a − b·xi)²] = Σ(yi² + a² + b²·xi² − 2a·yi − 2b·xi·yi + 2a·b·xi) = Σ(yi²) + n·a² + b²·Σ(xi²) − 2a·Σ(yi) − 2b·Σ(xi·yi) + 2a·b·Σ(xi)
• ∂g/∂a = 2a·n − 2·Σ(yi) + 2b·Σ(xi) = 0
• ∂g/∂b = 2b·Σ(xi²) − 2·Σ(xi·yi) + 2a·Σ(xi) = 0
• a = [Σ(yi) − b·Σ(xi)]/n, with n = number of observations in the sample
• b·Σ(xi²) − Σ(xi·yi) + [Σ(xi)·Σ(yi)]/n − b·[Σ(xi)]²/n = 0 => b·{Σ(xi²) − [Σ(xi)]²/n} = Σ(xi·yi) − [Σ(xi)·Σ(yi)]/n =>
• b = [n·Σ(xi·yi) − Σ(xi)·Σ(yi)] / {n·Σ(xi²) − [Σ(xi)]²} =>
• sb = [1/√(n−1)]·sy|x/sx
• a = {Σ(yi)·Σ(xi²) − Σ(xi)·Σ(xi·yi)} / {n·Σ(xi²) − [Σ(xi)]²} = ȳ − b·x̄
• sa = sy|x·√{1/n + x̄²/[(n−1)·sx²]}
• a ~ N(α, σa)
• b ~ N(β, σb)
Different samples have different regression lines. The other population parameter, the standard error of the estimate (the variability of the population around the mean line) (σy|x), can be estimated by sy|x = √(Σ{[yi − (a + b·xi)]²}/(n−2)) = √[(n−1)/(n−2)·(sy² − b²·sx²)]
If β = 0, there is no relation between the dependent and independent variables. As b is normally distributed, t = (b − β)/sb; under H0, with ν = n−2 => t = b/sb => 100(1−α)% confidence interval : b − tα·sb < β < b + tα·sb. For any x value, the estimate ŷ on the regression line has a confidence band that, for geometrical reasons, is wider at the extremities than in the central region
• sŷ = sy|x·√{1/n + (x − x̄)²/[(n−1)·sx²]} => 100(1−α)% confidence interval for the regression line : ŷ − tα·sŷ < y < ŷ + tα·sŷ
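The least-squares estimates derived above can be sketched numerically (the data points are hypothetical):

```python
from math import sqrt

def least_squares(xs, ys):
    """Least-squares intercept a and slope b from the normal equations
    derived above, plus the standard error of the estimate s_{y|x}."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = sqrt(ss_res / (n - 2))        # residual variability around the line
    return a, b, s_yx

# hypothetical data lying near y = 2x
a, b, s_yx = least_squares([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1])
```

With these numbers the fit is a ≈ 0, b ≈ 2.01, and s_yx (about 0.116) measures the scatter around the line.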
Comparison between 2 regression lines : look for differences in
• slopes only : t = (b1 − b2) / s(b1−b2), where s(b1−b2) =
• √(sb1² + sb2²) if n1 = n2
• √{s²y|xp /[(n1−1)·sx1²] + s²y|xp /[(n2−1)·sx2²]} if n1 ≠ n2
• using the combined variance estimate s²y|xp = [(n1−2)·s²y|x1 + (n2−2)·s²y|x2]/(n1 + n2 − 4)
• intercepts only : t = (a1 − a2) / s(a1−a2)
• both (coincidence) : analyze whether approximating the 2 data collections with 2 different regression lines produces smaller residuals than approximating all the data with a single regression line
• variance around the common line : s²y|best = [(n1 + n2 − 2)·s²y|xc − (n1 + n2 − 4)·s²y|xp]/2 = (SSres,c − SSres,p)/2, where SSres = sum of squared differences
• F = s²y|best/s²y|xp, with νn = 2 and νd = n1 + n2 − 4. If F is higher than the critical value, the 2 groups are drawn from populations with different mean lines
• logistic regression : a multivariate statistical method used for modeling the probability of occurrence of a dichotomous outcome as a function of multiple independent variables; it always yields a probability between 0 and 1.
• multiple regression : a form of linear regression or other regression method analyzing the effects of > 2 variables (i.e. > 1 independent variable) simultaneously.
• multiple linear regression (if the dependent variable y is quantitative and continuous)
• multiple logistic regression : P = odds / (1 + odds)
• Even if variables vary together, if none can be defined as dependent, no causality can be established but only the strength of their correlation or covariation
• multiple correlation : among > 2 variables
• simple correlation : among 2 variables
• curved correlation
• exponential correlation
• parabolic correlation
• linear correlation
Correlation coefficient : a statistical measure which when squared gives the degree of association between the values of 2 random variables. Most correlation coefficients are normalized so that they have values between +1 (which indicates perfect correlation) and -1 (which indicates perfect inverse correlation); a value of 0 indicates no correlation. As the absolute value of the correlation coefficient increases, so does the strength of correlation.
• the true theoretical correlation coefficient for a population is symbolized ρ
• the sample correlation coefficient, computed from experimental data, estimates the theoretical one and is symbolized r.
• Pearson's product-moment correlation coefficient : the most common correlation coefficient for normally distributed variables linked by a linear relationship along an interval scale; it is the covariance of 2 random variables divided by the product of their standard deviations
• r = Σ[(xi − x̄)·(yi − ȳ)] / √{Σ[(xi − x̄)²]·Σ[(yi − ȳ)²]}
• if |r| = 1 : perfect correlation
• +1 : direct correlation (same behaviour)
• -1 : reverse correlation (opposite behaviour)
• if 0 < |r| < 1 : imperfect correlation
• 0 < |r| < 0.2 : very poor correlation
• 0.2 < |r| < 0.4 : poor to moderate correlation
• 0.4 < |r| < 0.6 : moderate to good correlation
• 0.6 < |r| < 0.8 : strong correlation
• 0.8 < |r| < 1 : very strong correlation
• if |r| = 0 : no correlation
It is not necessary to arbitrarily fix which variable is the independent one.
• r = √{1 − [Σ(yi − ŷi)²]/[Σ(yi − ȳ)²]} = √[1 − (SSres/SStot)] = √{1 − [(n−2)·s²y|x]/[(n−1)·sy²]}
• determination coefficient r² = by|x · bx|y, with 0 ≤ r² ≤ 1
Verification of H0 : "ρ = 0"
• t = b/sb = [r·(sy/sx)]·[√(n−1)·sx]/sy|x = r·sy·√(n−1)/√[(n−1)·(1−r²)·sy²/(n−2)] = r/√[(1−r²)/(n−2)] = r·√[(n−2)/(1−r²)]
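The coefficient and its t statistic can be sketched numerically (the data pairs are hypothetical):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient (formula above)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

def t_for_r(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))

r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

Here r ≈ 0.775 and t ≈ 2.12 with ν = 3, to be compared with the tabulated t critical value.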
• rank correlation coefficient : the correlation coefficient of 2 variables calculated after ranks have been substituted for actual values
• Kendall's rank correlation coefficient / Kendall's tau (t) : a rank correlation coefficient used when multiple independent variables represent ordinal data in a limited number of grades, such as the categories none, mild, moderate, and severe, so that multiple samples can be assigned to each grade
• Spearman's rank correlation coefficient or rho (rs) : a nonparametric rank correlation coefficient used when both variables represent ordinal data in an unlimited ranking (a graduated scale without arithmetic relationships between successive classes, unlike an interval scale), such as class standing, so that each sample is assigned a unique rank. Once the values of the 2 variables have been ordered increasingly, di is the difference between the ranks of the 2 variables in each observation : rs = 1 − [6·Σ(di²)]/(n³ − n). When 2 values are identical, both are assigned a rank equal to the mean of the ranks they would have received had they been distinct.
• for n ≤ 50 => tabulated critical values
• for n > 50 => use the critical values of t = rs/√[(1 − rs²)/(n − 2)], with ν = n − 2.
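The rs formula above can be sketched as follows, assuming no tied values (ties would need the mean-rank rule described above); the data are hypothetical:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation via rs = 1 - 6*sum(d_i^2)/(n^3 - n),
    assuming no ties within either variable."""
    n = len(xs)
    rank_x = {v: i + 1 for i, v in enumerate(sorted(xs))}
    rank_y = {v: i + 1 for i, v in enumerate(sorted(ys))}
    d2 = sum((rank_x[x] - rank_y[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d2 / (n ** 3 - n)

# hypothetical paired observations
rs = spearman_rho([86, 97, 99, 100, 101], [2, 20, 28, 27, 50])
```

Only one pair of ranks is swapped here, giving rs = 0.9.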
• estimate : to measure or calculate a statistic for characterization of a population parameter.
• interval estimate : a statistical estimate that states with a specified degree of confidence that the parameter lies within a specified interval
• point estimate : a statistical estimate that specifies a value for the parameter
• biased estimate : a point estimate that is not unbiased, i.e., a point estimate that for some reason should tend to be wrong in a given direction.
• unbiased estimate : a point estimate having a sampling distribution with a mean equal to the parameter being estimated; i.e., the estimate will be greater than the true value as often as it is less than the true value
• consistent estimate : a statistic that converges to the true value of the parameter being estimated (population parameter) as the sample size increases; i.e., the estimate value can be made as statistically close to the true value as desired by taking a large enough sample
• maximum likelihood estimate : the estimate of a parameter describing a population distribution that makes the likelihood function take its maximum value; an estimate of the parameter that maximizes the probability of obtaining the sample values actually observed
• multivariate analysis : any of various statistical methods for analyzing more than two variables simultaneously.
• bivariate analysis : any of various statistical methods for analysis of the association between one independent and one dependent variable
• log-linear analysis : a form of multivariate analysis useful for examining the effects of multiple independent variables, at least some of which are categorical, on a nominal dependent variable; it is used to construct models for the evaluation of relationships between categorical variables
• bayesian statistics : somewhat controversial statistical methodology that, unlike conventional statistics, which treats population parameters as fixed (though unknown) values, treats parameters as random variables with a specified probability distribution, termed the prior (or a priori) distribution. Bayes' theorem is then used to convert the probability distribution of an observable statistic (treated as a conditional probability for a given parameter value) to a conditional probability distribution of the parameter values for a given value of the observable statistic. This distribution is termed the posterior (or a posteriori) distribution because it assigns a probability to each parameter value that depends on the observed data. The controversial point is the prior distribution, which represents a subjective opinion of the experimenter as to the a priori credibility of the various parameter values; for example, in estimating the probability of the presence of a particular disease given a positive test result, the prior distribution represents the experimenter's judgment of the prevalence of the disease in the population under study.
• alternative law (tertium non datur), valid for any events A and B : P(A) = P(A and S) = P[A and (B U not B)] = P[(A and B) U (A and not B)] = P(A and B) + P(A and not B), where S is the certain event
• composed or conjuncted probabilities theorem : P(A and B) = P(B) . P(A|B) = P(B and A) = P(A) . P(B|A)
• independent events : if P(A) = P(A|B) or P(B) = P(B|A) => product rule : P(A and B) = P(B and A) = P(A) . P(B)
• Bayes' theorem : a theorem used to interconvert conditional probabilities:

P(B|A) (a posteriori) = [P(B) (a priori) · P(A|B)]/P(A) = [P(B) · P(A|B)] / [P(A and B) + P(A and not B)] = [P(B) · P(A|B)] / [P(A|B) · P(B) + P(A|not B) · P(not B)] = 1 − [P(not B) · P(A|not B)] / [P(B) · P(A|B) + P(not B) · P(A|not B)]

where :

• P(A) and P(B) are the probabilities of 2 events, A and B
• P(A|B) and P(B|A) are the conditional probabilities of A given B and of B given A
For example, if A denotes a positive laboratory test result and B denotes the actual presence of disease in a tested patient, then P (A|B) is the diagnostic sensitivity of the test (true positive rate) and P (B) is the prevalence of the disease (P(A) is the frequency of positive test results). P(B|A) is the predictive value of a positive test, the probability that a patient testing positive will actually have the disease. The denominator of the equation, representing the sum of the true positives and false positives, is sometimes simplified to the equivalent function P(A), representing all those with positive results, both true and false.
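The diagnostic example above can be sketched numerically (prevalence, sensitivity, and specificity are hypothetical):

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value P(disease | positive test) via Bayes'
    theorem, as in the diagnostic example above."""
    p_pos_given_d = sensitivity
    p_pos_given_nd = 1 - specificity                 # false-positive rate
    num = prevalence * p_pos_given_d                 # true positives
    den = num + (1 - prevalence) * p_pos_given_nd    # all positives = P(A)
    return num / den

# hypothetical test: 1% prevalence, 90% sensitivity, 95% specificity
p = ppv(0.01, 0.90, 0.95)
```

Despite the good test characteristics, the low prevalence drives the predictive value down to about 15%, which is exactly the point of the prior distribution's role described above.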
• nonparametric statistics : a statistical methodology that can be used on data without making the assumption that the data are drawn from a population with a normal or other specified distribution
• semiparametric statistics : statistical methodology that combines both parametric and nonparametric elements; used for estimating population parameters when a function is unknown, e.g. the distribution function of a random variable that has not been observed.
• parametric statistics : statistical methodology that depends upon assumptions about the distribution of the data, e.g., that the data approximate a normal distribution and are homoscedastic.
• normal reference values : a set of values of a quantity measured in the clinical laboratory that characterize a specified population in a defined state of health. The values obtained from a statistical sample are used to establish a reference interval that covers 95% of the values of the healthy general population or of specific subpopulations differing in age and sex. These concepts were originally and are still widely referred to as normal values and the normal range, but the use of these terms is now discouraged because of their implication that values falling outside of the reference interval are abnormal or unhealthy, which has led to much confusion. It must be remembered that, by definition, 5% of healthy individuals fall outside of the reference interval.
• Cox proportional hazards model (Cox DR. Regression models and life-tables. J R Stat Soc [B] 1972;34:187-220) : a method of analysis of multiple factors (variables) that influence an actuarial curve of the risk of a given negative outcome such as disease occurrence or death. A hazard rate is computed for each separate variable (e.g., among those patients with that factor who had not previously suffered a negative outcome, how many subsequently suffered the negative outcome during a short interval) and cumulative hazard rates are computed for the combinations of variables that exist in actual situations.
• statistical functions : in mathematics, a rule that assigns to each member of one set (the domain) a value in another set (the range)
• cumulative distribution function (cdf) : a mathematical function that defines the probability distribution of a random variable by giving for each random variable X the probability of observing a value less than or equal to a specified value x
• dose-effect curve / dose-response curve : a graphic representation of the effect (such as therapeutic response or the incidence of cancer) plotted against the dose of an agent (such as a drug or x-rays), showing the relationship of the effect to changes in the dose of the agent
• dose-frequency curve : a graphic representation of the relationship of the number of responses (such as cases of cancer) in a population to changes in the dose of an agent
• dose-intensity curve : a graphic representation of the relationship of the intensity of effect (such as amount of vasodilation) in an individual to changes in the dose of an agent.
• probability distribution : a mathematical function that assigns to each measurable event (E) in a sample space the probability that the event will occur.
• probability (P) : the likelihood of occurrence of a specified event
• law of large numbers : any of several theorems dealing with the convergence of the sample average to the population mean as the sample size is increased
• classical definition : P(E) = k/N, the ratio of favorable outcomes (k) to all possible, equally likely outcomes (N); 0 (impossible event) ≤ P(E) ≤ 1 (certain event)
• statistical or frequentist definition : P(E) = limN→+∞ nrel(E) = limN→+∞ [nabs(E)/N], i.e. the long-run relative frequency at which an event occurs in a sequence of random independent trials under identical conditions, as the number of trials approaches infinity. This does not mean that by increasing N the absolute count nabs comes nearer to N·P(E) : only the relative frequency converges
• subjective definition : the congruous price Np that one is ready to pay to receive the income N if E comes true and the income 0 if E comes false. The income of the gambler is q = 1-p, where p and q are the aliquots of the bet, acceptable for both the bank and the gambler
• axiomatic definition
• complementary event (Ē) : P(E) + P(Ē) = 1
• likelihood : a function of data that specifies, for each value of an unknown parameter describing a population distribution, the probability of observing the values sampled.

• Properties :
• multiplicative property (for independent events) : P(E1 and E2) = P(E1) · P(E2)
• additive property : P(E1 U E2) =
• = P(E1) + P(E2) for incompatible or disjoint events (mutually exclusive events, i.e. P(E1 and E2) = 0)
• = P(E1) + P(E2) − P(E1 and E2) in general (= P(E1) + P(E2) − P(E1) · P(E2) for independent events)
• extractions :
• with replacement : at the following extraction P(E') = P(E)
• without replacement of the extracted item : at the following extraction P(E') > P(E), or P(E') = 0 once all items of that kind have been extracted
• probability density function : in statistics, a mathematical function that describes the distribution of measurements for a population; a curve that describes a population. Its integral is the cumulative distribution function, so the probability that an individual measurement will fall between two numbers a and b is equal to the proportion of the area under the curve between points a and b, with the entire area under the function being 1
• expected values for a random variable
• discrete variable :
• μ(x) = Σ xi · p(xi)
• σ²(x) = Σ (xi − μ)² · p(xi)
• continuous variable :
• μ(x) = ∫ x · f(x) dx
• σ²(x) = ∫ (x − μ)² · f(x) dx
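The discrete case can be sketched with a fair die (the textbook example; nothing here is specific to this document):

```python
# Expected value and variance of a discrete random variable (a fair die),
# using mu = sum(x * p(x)) and sigma^2 = sum((x - mu)^2 * p(x)).
xs = [1, 2, 3, 4, 5, 6]
p = 1 / 6
mu = sum(x * p for x in xs)                   # 3.5
var = sum((x - mu) ** 2 * p for x in xs)      # 35/12, about 2.9167
```

The continuous case replaces the sums with the corresponding integrals over the density f(x).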
• successions (S) or progressions
• arithmetic successions (linear law) : Sn = a1·n + [n·(n−1)/2]·d
• recursive definition : an+1 = an + d (increment per time unit)
• direct definition : an = a1 + (n−1)·d
• geometric successions (exponential law) : Sn = a1·[(1 − q^n)/(1 − q)]; if 0 < q < 1, limn→+∞ Sn = a1/(1 − q)
• recursive definition : an+1 = an · q
• direct definition : an = a1 · q^(n−1)
The value of q (or d) doesn't depend on the starting time but on the interval of observation : Cn/Cm = (C0·q^n)/(C0·q^m) = q^(n−m)
Yearly interest : qyr = (qmo)¹² => qmo = (qyr)^(1/12)
Compounding a yearly interest rate k :
• yearly : C = C0·(1 + k)
• monthly : C = C0·(1 + k/12)¹²
• weekly : C = C0·(1 + k/52)⁵²
• daily : C = C0·(1 + k/365)³⁶⁵
• instantaneous interest : C = C0·limn→+∞(1 + k/n)^n = C0·e^k; e.g. for k = 0.12, C = C0·e^0.12 = C0·1.12750 (convergent series)
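The convergence of compounding toward e^k can be checked numerically (using the k = 0.12 example above):

```python
from math import exp

# Compounding a yearly rate k = 0.12 at increasing frequency, as above;
# the limit of (1 + k/n)^n as n grows is e^k.
k = 0.12
yearly = (1 + k) ** 1
monthly = (1 + k / 12) ** 12
daily = (1 + k / 365) ** 365
continuous = exp(k)          # the limit e^k, about 1.12750
```

Each refinement of the compounding interval increases the result slightly, approaching the instantaneous value from below.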
• limx→0[(1 − k·x/a)^(1/x)] = limx→+∞[(1 − k/(a·x))^x] = e^(−k/a)
Radioactive decay : k(t) = k0·(1/2)^(t/t1/2) = k0/2^x, where x = number of half-lives = t/t1/2
Number of different bytes = dispositions with repetition of the 2 numerals 0 and 1 over the 8 bits = 2⁸ = 256
• factorial notation (!) : n! = n·(n−1)·(n−2)·...·1 ~ n^n·e^(−n)·√(2πn) (Stirling / De Moivre formula)
• permutation (Pn) : n!
• permutation of n objects of which 1 repeated r times (PR(n,r)) = n!/r!
• disposition (Dn)
• disposition with repetition (DR(n,k)) = nk
• combinations (Cn,k) :
• binomial coefficient : the number of different sets of size k that can be chosen from a set of n objects; denoted nCx = (nx) = n!/[x!(n-x)!]
• C(n,k) = D(n,k) / P(k) = [n . (n-1) . ... . (n-k+1)] / k! . [(n-k)!/(n-k)!] = n!/[k!(n-k)!] = (n k) = (n n-k) = (n-1 k) + (n-1 k-1)
• combinations with repetition (CR(n,k)) = (n+k-1k) = (n+k-1)!/[k!(n-1)!]
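The counting formulas above map directly onto Python's standard library (`math.perm`, `math.comb`); a minimal sketch:

```python
import math

# Dispositions (k-permutations), combinations, and combinations with
# repetition, matching the formulas above.
def D(n, k):       # dispositions: n!/(n-k)!
    return math.perm(n, k)

def C(n, k):       # combinations: n!/[k!(n-k)!]
    return math.comb(n, k)

def CR(n, k):      # combinations with repetition: (n+k-1 choose k)
    return math.comb(n + k - 1, k)
```

The Pascal identity (n k) = (n-1 k) + (n-1 k-1) can be verified directly with these helpers.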
• bernoullian event : one with only 2 possible outcomes (commonly called success and failure)
• Bernoulli or binomial distribution : the probability distribution that describes the frequencies of the different possible combinations of 2 mutually exclusive outcomes of a bernoullian event in a series of n independent Bernoulli trials; it is given by expansion of the binomial (p + q)^n, where one of the 2 alternative outcomes has probability p and the other q = 1 - p throughout all trials
• Bernoulli theorem : in an experiment involving probability, the larger the number of trials, the closer the observed probability of an event approaches its theoretical probability.
• P(n,k) = probability of exactly k successes (i.e. n-k failures) in n bernoullian trials = binomial coefficient (number of positions in which the successes may occur) · probability of the successes · probability of the failures = {n! / [k!(n-k)!]} · p^k · (1-p)^(n-k)
• expected values :
• m(E) = Σ x_i · p(x_i) = n · [0 · (1-p) + 1 · p] = np
• s²(E) = Σ (x_i - m)² · p(x_i) = np(1-p)
• if E' = relative frequency of successes (E/n) :
• m(E') = m(E)/n = p
• s²(E') = s²(E)/n² = p(1-p)/n
It cannot be used for sampling without replacement, as p is then no longer constant : anyway, if the event space (q) is quite large, the error is minimal. If n is large and p is small (so that only small k matter), n!/(n-k)! ~ n^k and (1-p)^(n-k) ~ e^(-np) => Poisson's approximation : P(n,k) = e^(-np) · (np)^k/k! = e^(-m) · m^k/k!, where np = m = s² (since p is small, 1-p ~ 1). For 0 < k < n the bernoullian distribution of P(n,k) values is approximated by y = e^(-(k-np)²/[2np(1-p)]) / [2πnp(1-p)]^(1/2).
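The Poisson approximation can be checked against the exact binomial probability; a sketch under the stated conditions (large n, small p; the specific numbers are my own example):

```python
import math

def binom_pmf(n, k, p):
    # exact binomial probability P(n,k) = C(n,k) * p^k * (1-p)^(n-k)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(m, k):
    # Poisson approximation: e^-m * m^k / k!, with m = n*p
    return math.exp(-m) * m ** k / math.factorial(k)

# For large n and small p the two nearly coincide (here m = np = 2):
n, p, k = 1000, 0.002, 3
exact = binom_pmf(n, k, p)
approx = poisson_pmf(n * p, k)
```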
• Poisson distributionref : the probability distribution that describes counts of events randomly distributed in time or space, such as radioactive decay or blood cell counts. The probability of observing exactly k events in a fixed time period or region is :
• f(k) = (λ^k · e^(-λ)) / k!
where λ is the average density of events in a period or region of that size and e is the base of natural logarithms (2.718). The mean and variance of the distribution are both equal to λ, thus the coefficient of variation for a Poisson distribution is 1/λ^(1/2) (the variability of the count is inversely proportional to the square root of the average count).
The waiting time between Poisson events follows the exponential density :
f(x) =
• 0, for x < 0
• e^(-x/k)/k, for x > 0, with k > 0 (mean waiting time k)
If p = 1-p = 0.5, you have a symmetrical, gaussian-like distribution (in fact (n k) · 0.5^n = (n n-k) · 0.5^n) => symmetrical with respect to k = n/2.
For n mutually exclusive events with probabilities p_i (multinomial) : (p_1 + p_2 + ... + p_n)^N = Σ N!/(k_1! · k_2! · ... · k_n!) · p_1^k_1 · p_2^k_2 · ... · p_n^k_n. When the various p_i are known, you can determine the probability that exact numbers k_i of occurrences fall in each event.
Assigning the value x = 1 to each success and x = 0 to each failure, the binomial distribution has
• mean (m) = (Σx)/n = [k · 1 + (n-k) · 0]/n = k/n, which estimates p; for the number of successes, m = np
• standard deviation (s) = {[Σ(x-m)²]/n}^(1/2) = {[k · (1-p)² + (n-k) · (0-p)²]/n}^(1/2) = [(k/n) · (1-p)² + (1-k/n) · p²]^(1/2) = [p(1-p)² + (1-p)p²]^(1/2) = [p(1-p)]^(1/2) per trial; for the number of successes, s = [n · p · (1-p)]^(1/2)
Newton binomial development : (a+b)^n = Σ [(n k) · a^(n-k) · b^k] = a^n + n·a^(n-1)·b + [n(n-1)/2]·a^(n-2)·b² + ... + b^n
Tartaglia-Pascal triangle : coefficients of the expansion of Newton's binomial (always n+1 terms) :
• n = 0 : 1
• n = 1 : 1 1
• n = 2 : 1 2 1
• n = 3 : 1 3 3 1
• n = 4 : 1 4 6 4 1
(n k) = (n-1 k) + (n-1 k-1)
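The triangle rows can be generated from the recurrence (n k) = (n-1 k) + (n-1 k-1); a minimal sketch (`pascal_row` is a name of my choosing):

```python
# Row n of the Tartaglia-Pascal triangle, built by the recurrence
# (n k) = (n-1 k) + (n-1 k-1); row n lists the coefficients of (a+b)^n.
def pascal_row(n):
    row = [1]
    for _ in range(n):
        row = [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]
    return row
```

Each row sums to 2^n, since setting a = b = 1 in the binomial development gives (1+1)^n.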
When np and n(1-p) > 5, the binomial distribution can be approximated by the ...
• gaussian or normal distribution : a symmetric, bell-shaped probability distribution (bell-shaped or gaussian curve) having the density function ...
• f(x) = 1/[s·(2π)^(1/2)] · e^(-[(x-m)/s]²/2)
... where :
• x is the abscissa
• f(x) is the ordinate
• e is the base of natural logarithms (2.718)
• m is the mean = ∫_(-∞)^(+∞) x·f(x) dx
• s is the standard deviation
• s² is the variance = ∫_(-∞)^(+∞) (x-m)²·f(x) dx
• peak = 1/[s·(2π)^(1/2)], at x = m
The normal distribution is entirely dependent on m and s; it is symmetric about the mean, with both tails extending to infinity; and the mean, the median, and the mode are identical. Roughly speaking, the normal distribution characterizes a random variable that is the sum of a large number of independent random effects. More precisely, it is typically the limiting distribution of a standardized sum of an infinite series of random variables with finite variance, each making a negligible contribution to the total variance. For this reason it is common statistical practice to assume that random sampling distributions of statistical measures are approximately normal and apply tests (e.g., t-test, analysis of variance) based on the normal distribution. As ∫_(-∞)^(+∞) e^(-[(x-m)/s]²/2) dx = s·(2π)^(1/2), we introduce a normalization factor so that ∫_(-∞)^(+∞) f(x) dx = 1 (certain event). Each single value has P = 0. Conditions of normality can be assayed by
• comparing mean and median
• comparing 84th percentile with standard deviation
• checking whether the standard deviation equals or exceeds the mean although the population contains only positive values (a sign of non-normality)
• checking whether there are off-scale (outlier) values that make the distribution asymmetrical (a sign of non-normality)
• looking at the fit of the value distribution to a straight line on a normal probability chart
• performing a chi-square test between observed and expected values on the basis of a
• standard normal distribution : the normal distribution with m = 0 and s = 1, i.e. Z value = (x-m)/s => x = s·Z + m. s is used as the unit of measure of differences from the average. f(Z) = 1/(2π)^(1/2) · e^(-Z²/2), with peak = 1/(2π)^(1/2). The number of elements in [a;b] is N · 1/(2π)^(1/2) · ∫_[(a-m)/s]^[(b-m)/s] e^(-Z²/2) dZ. Considering a distribution with m = 0 and s unknown, f(x) = 1/[s·(2π)^(1/2)] · e^(-x²/2s²). Setting x²/2s² = t and applying the chain rule dy/dx = dy/dt · dt/dx :
• f'(x) = 1/[s·(2π)^(1/2)] · e^(-x²/2s²) · (-2x/2s²) = -1/[s³·(2π)^(1/2)] · x · e^(-x²/2s²), which vanishes at the peak x = 0
• f''(x) = -1/[s³·(2π)^(1/2)] · [e^(-x²/2s²) + x · e^(-x²/2s²) · (-2x/2s²)] = -e^(-x²/2s²)/[s³·(2π)^(1/2)] · (1 - x²/s²), with flexus (inflection points) at x = m ± s.
• the average of the sum of 2 random variables is m_(x+y) = m_x + m_y.
• m_(x-y) = m_x - m_y
• m_(kx) = k · m_x
• s²_(kx) = k² · s²_x
• the variance of the difference (or the sum) of 2 independent random variables is the sum of the variances of the 2 populations from which they have been extracted : s²_(x-y) = s²_(x+y) = s²_x + s²_y => s_(x-y) = s_(x+y) = (s²_x + s²_y)^(1/2). In fact, when we sum values lying on the same side of the mean, or subtract values lying on opposite sides, the results lie even farther from the average. If the variables are extracted from the same population distributed according to a standard normal distribution, the standard deviation of the population of the differences is 2^(1/2) times (i.e. about 40%) larger than the original.
• knowing that ∫_0^(+∞) x^(2n) · e^(-ax²) dx = [1 · 3 · 5 · ... · (2n-1)]/(2^(n+1) · a^n) · (π/a)^(1/2), and considering a = 0.5 and n = 1 :
• ∫_(-∞)^(+∞) e^(-z²/2) dz = 2 · ∫_0^(+∞) e^(-z²/2) dz = 2 · {[z · e^(-z²/2)]_0^(+∞) - ∫_0^(+∞) (-z² · e^(-z²/2)) dz} = 2 · {0 - 0 + 1/(2² · 0.5) · (π/0.5)^(1/2)} = (2π)^(1/2)
• integration by parts : ∫[f(x) · g'(x)] dx = f(x) · g(x) - ∫[f'(x) · g(x)] dx
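The rule s²(x±y) = s²(x) + s²(y) for independent variables is easy to verify by simulation; a minimal sketch using two standard normal samples (sample size and seed are arbitrary choices of mine):

```python
import random

# Monte Carlo check that the variance of both the sum and the difference
# of two independent standard normal variables is 1 + 1 = 2.
random.seed(1)
N = 100_000
xs = [random.gauss(0, 1) for _ in range(N)]
ys = [random.gauss(0, 1) for _ in range(N)]

def var(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

v_sum = var([a + b for a, b in zip(xs, ys)])   # expected ~ 2
v_diff = var([a - b for a, b in zip(xs, ys)])  # expected ~ 2
```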
Z values are tabulated : they are calculated with 2 decimal numerals (e.g. x.xx). The first 2 numerals are listed in the first column, while a separate column exists for the second decimal.
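Instead of table lookup, the same probabilities can be computed from the error function, since Φ(z) = [1 + erf(z/√2)]/2; a minimal sketch (function names are mine):

```python
import math

# Standard normal cumulative distribution function via the error function.
def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Proportion of a normal population with mean m and sd s lying in [a; b].
def between(a, b, m, s):
    return phi((b - m) / s) - phi((a - m) / s)
```

For example, `between(-1, 1, 0, 1)` reproduces the familiar ~68% of values within one standard deviation of the mean.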
• exponential distribution : a skewed probability distribution with right tail extending to infinity and having the density function
• f(x) = λ·e^(-λx)
for x ≥ 0 and λ > 0. The mean is 1/λ and the variance is (1/λ)². The mode is at 0 and the larger the parameter λ, the more clustered the distribution toward 0. The exponential distribution arises in medicine and reliability as the time to mortality/morbidity or failure; λ is often interpreted as the force of mortality or failure.
• F-distribution : the distribution of the ratio of 2 independent chi-squared variables, each divided by its degrees of freedom; the exact sampling distribution of the ratio of variances of two independent samples from identical normal distributions.
• log-normal distribution : a distribution of a random variable x such that y = ln x has a normal distribution; it is often used to model incubation times for diseases.
• Weibull distribution : a skewed unimodal probability distribution for nonnegative variables, characterized by the parameters of shape and scale. It can be used for negatively skewed data; common uses include modeling lifespans of materials and modeling incubation times for diseases such as AIDS.
• the most appropriate statistical test for ...
• study : a research project
• epidemiological studies
• observational study
• aggregated data : descriptive or ecological study : a statistical study exploring hypotheses by comparing groups, rather than individuals, e.g., comparing rates of breast cancer and levels of fat intake by country
• individualized data :
• case study : one that identifies and samples individuals with a particular disease or condition, noting characteristics of the disease or condition and persons afflicted. Case studies are often used to call attention to new diseases or to diseases entering new populations
• case-cohort study : an epidemiologic study in which samples of cases and surviving noncases of the condition being studied are drawn from the same cohort of a cohort study. The cases and noncases are matched for duration of survival and their histories are compared
• cross-sectional, transverse or prevalence study (screening) : one employing a single point of data collection for each participant or system being studied; used for examining phenomena expected to be static through the period of interest
• longitudinal study : one in which participants, processes, or systems are studied over time, with data being collected at multiple intervals. The 2 main types are ...
• cohort or prospective study : a longitudinal epidemiologic study in which the groups of individuals (cohorts) are selected on the basis of factors that are to be examined for their effects on outcomes, e.g., the effect of exposure to a specific risk factor on the eventual development of a particular disease, and are then followed over a period of time to determine the incidence rates of the outcomes in question in relation to the original factors. Prospective usually implies a cohort selected in the present and followed into the future, but the cohort method can also be applied to existing longitudinal historical data, such as insurance or medical records: a cohort is identified and classified as to exposure to some factor at some date in the past and followed up to the present to determine incidence rates of the outcome. This is called a historical prospective study, prospective study of past data, or retrospective cohort study
• case-control or retrospective study : a longitudinal epidemiologic study in which participating individuals are classified as either having (cases) or lacking (controls) some outcome and their histories are examined for the presence of specific factors possibly associated with that outcome. Cases and controls are often matched with respect to certain demographic or other variables but need not be. As compared to prospective studies, retrospective studies suffer from drawbacks: although they can measure the odds ratio, which often approximates relative risk, they cannot reveal true incidence rates or attributable risk. Also, large biases can be introduced both in the selection of controls and in the recall of past exposure to risk factors. The advantage of the retrospective study is its small scale, usually short time for completion, and its applicability to rare diseases, which would require study of very large cohorts in prospective studies
• overmatching : matching of too many variables or matching variables too closely when selecting cases and controls for study, so that true causal relationships between variables may be obscured, irrelevant variables may be included, or the study may become too complex and specific for appropriate controls to be reasonably obtained.
• nested case-control study : a case-control study nested within a cohort study: a population is identified and baseline data obtained and stored; after following the cohort for some time, a case-control study is performed using the small percentage of people who develop the disease, with controls selected as a sample of those surviving members without disease who were at risk at the time of occurrence of each case. Baseline data can be analyzed solely for this small subset of the original population, and recall bias and other hazards of case-control studies are avoided
• experimental study (trials)
• experiment : a procedure done in order to discover or to demonstrate some fact or general truth.
• control experiment : an experiment that is made under standard conditions, to test the correctness of other observations
• check or crucial experiment : an experiment so designed and so prepared for by previous work that it will definitely settle some point.
• clinical trial : an experiment performed on human beings in order to evaluate the comparative efficacy of 2 or more therapies. The randomized controlled trial, which uses an appropriate control group (placebo or sham treatment or the standard well-established therapy) for comparison with the experimental therapy and random allocation of patients to the experimental and control groups, is generally considered to yield the strongest scientific evidence of any well-designed trial. Another element of well-designed trials is called blinding
• single blind : pertaining to a clinical trial or other experiment in which subjects do not know which treatment they are receiving
• double blind : pertaining to a clinical trial or other experiment in which neither the subject nor the person administering treatment knows which treatment the subject is receiving. The term double mask is sometimes preferred to avoid confusion associated with the use of the term blind.
• triple blind : pertaining to a clinical trial or other experiment in which neither the subject nor the person administering treatment nor the person evaluating the response to treatment knows which treatment any particular subject is receiving.The term triple mask is sometimes preferred to avoid confusion associated with the use of the term blind.
• therapeutic trials • crossover trial : a multipart clinical trial in which each subject is tested with each (or most) of the treatments being compared in turn, in random order
• phase I trial : a clinical trial on normal volunteers, designed to determine the biological activities and range of toxicity or other safety factors of a given therapy
• phase II trial : a clinical trial on a small group of patients, designed to determine the effectiveness of the given regimen in treating the disorder in question
• phase III trial : a clinical trial using a large sample of patients, designed to compare the overall course of their disorder under the new treatment with its course untreated and treated with standard therapies previously used; studies are also done on the relative morbidities of the different treatments
• phase IV trial : additional studies done after a drug has been approved for distribution or marketing, which could include examination of long-term effects, adverse effects, or specific aspects of a drug's action.
• preventive trials
• individualized trial
• collective trial
Most trials present baseline comparability in a table often unduly large, and about half the trials inappropriately use significance tests for baseline comparison. Methods of randomisation, including possible stratification, are often poorly described. There is little consistency over whether to use covariate adjustment, and the criteria for selecting baseline factors for which to adjust are often unclear. Most trials emphasise the simple unadjusted results, and covariate adjustment usually makes a negligible difference. Two-thirds of the reports present subgroup findings, but mostly without appropriate statistical tests for interaction. Many reports put too much emphasis on subgroup analyses that commonly lack statistical power. Clinical trials need a predefined statistical analysis plan for uses of baseline data, especially covariate-adjusted analyses and subgroup analyses. Investigators and journals need to adopt improved standards of statistical reporting, and exercise caution when drawing conclusions from subgroup findingsref.
• data sources
• population censuses
• de jure censuses
• de facto censuses
• death records
• birth records
• hospital sources
• clinical charts
• discharge records
• hospital registries
• pathology registries
• workers' associations
• retirees' associations
• insurance companies
• educational institutions
• statistical institutions
• medical sources
• medical sources
• sampling :
• randomization : assignment of experimental subjects to treatment groups according to some known probability distribution governed by chance, so that the distribution of subjects within each group should vary only by chance
• simple randomization
• stratified randomization
• cluster randomization
• ascertainment : the selection of samples (such as markers, individuals, populations) through a process that often deviates from random sampling and can therefore introduce bias.
• systematic : step-by-step selection at regular intervals (e.g. every k-th member of a list)
• data collection :
• personal measurements
• environmental measurements
• codified questionnaires
• postal delivery
• hand (de manu) delivery
• interview
• direct interview (de visu)
• telephone interview
• questions
• closed questions
• open questions
• semi-closed questions
• data evaluation :
• illegitimate error : error in the number of significant decimal digits retained (e.g. when rounding π)
• validity or accuracy : the closeness of the expected value to the true value of the measured or estimated quantity; a measure that depends on both precision and bias.
• precision : in statistics, the extent to which a measurement procedure gives the same results when repeated under identical conditions on the same sample. Under certain conditions, may be called reliability, repeatability, or reproducibility (sample variability)
• absolute error :
• systematic error / bias (e) : reproducible inaccuracy; error in a measurement process that is predictable or in the same direction in all measurements; it may not be detectable by statistical methods
• selection bias
• Berkson's bias : the hospitalized population is not a random sample of the general population, and neither are their diseases
• sample selection bias
• control selection bias
• information bias (data wrongly collected)
• participation bias
• surveillance bias
• knowledge of exposure
• recall bias
• interviewer bias
Relation between the absolute error of a direct measure and the absolute error of an indirect measure (e.g. radius and area) : y_0 ± e(y_0) = f(x_0 ± e(x_0)) => e(y_0) = f(x_0 ± e(x_0)) - f(x_0) = {[f(x_0 ± e(x_0)) - f(x_0)]/e(x_0)} · e(x_0) ~ f'(x_0) · e(x_0), i.e. e(y_0) is proportional to both the derivative and e(x_0)
• quadratic model : e²(y_0) = [f'(x_0)]² · e²(x_0) => e(y_0) = {[f'(x_0)]² · e²(x_0)}^(1/2)
Error propagation, general law : if B = f(x, y, z, ...) => e(B) = ∂f/∂x · e(x) + ∂f/∂y · e(y) + ∂f/∂z · e(z) + ...
• sum
• of identical addenda : e(n·a) = n · e(a)
• of different addenda : e(a + b + ... + z) = e(a) + e(b) + ... + e(z), but this direct model does not take into account the compensation of random errors. Hence a quadratic model is preferred : e(a + b + ... + z) = [e²(a) + e²(b) + ... + e²(z)]^(1/2). In this way e(a + b) results greater than e(a) and e(b), but smaller than their direct sum (as the hypotenuse is longer than either leg but shorter than their sum)
• difference :
• e(a - a) = 0
• e(a - b - ... - z) = [e²(a) + e²(b) + ... + e²(z)]^(1/2)
• product
• of equal factors (powers) : d(a^n) = n · d(a)
• of different factors : e(a·b) = e(a)·b + e(b)·a + e(a)·e(b) (the last is a negligible quadratic term). Dividing all members by (a·b) => e(a·b)/(a·b) = e(a)/a + e(b)/b => d(a·b) = d(a) + d(b)
• quotient : d(a/b) = [d²(a) + d²(b)]^(1/2)
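The quadratic propagation rules above can be sketched as follows (the circle-area example illustrates the derivative rule e(y) = f'(x)·e(x); function names are mine):

```python
import math

# Quadratic-model propagation of absolute errors through a sum/difference.
def err_sum(*errs):
    # e(a + b + ...) = sqrt(e(a)^2 + e(b)^2 + ...)
    return math.sqrt(sum(e * e for e in errs))

# Quadratic-model propagation of relative errors through a quotient.
def rel_err_quotient(*ds):
    # d(a/b) = sqrt(d(a)^2 + d(b)^2)
    return math.sqrt(sum(d * d for d in ds))

# Derivative rule for an indirect measure: area of a circle A = pi*r^2,
# so e(A) = |dA/dr| * e(r) = 2*pi*r * e(r).
def err_area(r, er):
    return 2 * math.pi * r * er
```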
• nondifferential or random error (z) : indefiniteness or error in a measurement process that varies unsystematically or unpredictably from measurement to measurement; its magnitude may be quantifiable by statistical methods. It is significant when measurements are few, but lim_(n→∞) [Σz/n] = 0
• relative error : e / m. A unitless quantity. When e is small, m can be replaced by x.
• standard error = s/n^(1/2); in statistics, a measure of the variability that the calculated parameter estimate shows as repeated random samples are taken from the same population
• false-positive or type I error : in a hypothesis test, the rejection of the null hypothesis (H0) when it is true; the probability of a type I error (the significance level / level of significance) is denoted by a.
• conservative test : a test having a type I error probability that is at most a stated nominal level
• b, false-negative or type II error : in a hypothesis test, failing to reject the null hypothesis (H0) when it is false; the probability of a type II error is denoted by b.
• 1 - b = power of the test
• z analysis :
• single
• ....
• meta-analysis : any systematic method that uses statistical analysis to integrate the data from a number of independent studies
• multivariate :
• epidemiological indexes
• in longitudinal studies
• 2 x 2 contingency table (rows = exposure, columns = disease) :
                     disease (D+)   no disease (D-)   total
exposure (E+)        a              b                 a + b
no exposure (E-)     c              d                 c + d
total                a + c          b + d             n = a + b + c + d
• I_E- = incidence among the non-exposed = c/(c+d)
• risk : a danger or hazard, the probability of suffering harm or other unfavorable outcome.
• absolute risk (R) = IE+ = a/(a+b)
• relative risk (RR) = R / I_E- = P(D+|E+)/P(D+|E-) = [a/(a+b)] / [c/(c+d)] = [a · (c+d)] / [c · (a+b)] = the ratio of the incidence rate among individuals with a given risk factor to the incidence rate among those without it. When the disease is rare (i.e. a << b and c << d), a+b ~ b and c+d ~ d, and hence RR ~ OR
• standard error_ln(RR) = [1/a - 1/(a+b) + 1/c - 1/(c+d)]^(1/2)
• CI95% : the RR distribution is not gaussian, while y = ln(RR) is approximately normally distributed
• ln(RR) - 1.96 · standard error_ln(RR) < ln(RR) < ln(RR) + 1.96 · standard error_ln(RR)
• e^(ln(RR) - 1.96 · standard error_ln(RR)) < RR < e^(ln(RR) + 1.96 · standard error_ln(RR))
• add 0.5 to each value if at least one value = 0 (otherwise the standard error would not be defined)
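Putting the RR formulas together, a minimal sketch (the table counts are my own example; the 0.5 continuity correction follows the rule above):

```python
import math

# Relative risk and its 95% CI from a 2x2 table with rows = exposure
# (E+: a, b; E-: c, d) and columns = disease (D+, D-).
def relative_risk_ci(a, b, c, d):
    if 0 in (a, b, c, d):                  # continuity correction
        a, b, c, d = (v + 0.5 for v in (a, b, c, d))
    rr = (a / (a + b)) / (c / (c + d))
    se = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))  # SE of ln(RR)
    lo = math.exp(math.log(rr) - 1.96 * se)
    hi = math.exp(math.log(rr) + 1.96 * se)
    return rr, lo, hi

# Example: 30/100 diseased among exposed vs 10/100 among unexposed -> RR = 3
rr, lo, hi = relative_risk_ci(30, 70, 10, 90)
```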
• RA (individual attributable risk) = R - IE- = a/(a+b) - c/(c+d)
• RAE (attributable risk in exposed ones) = RA / R = (IE+ - IE-) / IE+
• FE (aetiological fraction) = PAR / I_tot
• attributable risk : the amount or proportion of incidence of disease or death (or risk of disease or death) in individuals exposed to a specific risk factor that can be attributed to exposure to that factor; the difference in the risk for unexposed versus exposed individuals. The term is sometimes incorrectly used to denote ...
• population attributable risk (PAR) = I_tot - I_E- = RA · P(exposure to the risk factor) : in a total population, the proportion of a disease incidence, or risk of the disease, that can be attributed to exposure to a specific risk factor; the difference between the risk in the total population and the risk in the unexposed group.
• competing risk : an event that removes a subject from being at risk for the outcome under study; e.g., death from automobile accident is a competing risk that removes a subject from the risk of heart disease.
• empiric risk : the probability that a trait will occur or recur in a family, based solely on experience rather than on knowledge of the causative mechanism
• genetic risk : the probability that a trait will occur or recur in a family, based on knowledge of its pattern of genetic transmission
• in prevalence and case-control studies
• odd = P / (1-P)
• odds ratio (OR) = [P(D+|E+)/(1-P(D+|E+))] / [P(D+|E-)/(1-P(D+|E-))] = {[a/(a+b)]/[b/(a+b)]} / {[c/(c+d)]/[d/(c+d)]} = (a/b) / (c/d) = (a · d) / (b · c) = the ratio of the disease odds among individuals with a given risk factor to the disease odds among those without it
• standard error_ln(OR) = (1/a + 1/b + 1/c + 1/d)^(1/2)
• CI95% : the OR distribution is not gaussian, while y = ln(OR) is approximately normally distributed
• ln(OR) - 1.96 · standard error_ln(OR) < ln(OR) < ln(OR) + 1.96 · standard error_ln(OR)
• e^(ln(OR) - 1.96 · standard error_ln(OR)) < OR < e^(ln(OR) + 1.96 · standard error_ln(OR))
• add 0.5 to each value if at least one value = 0 (otherwise the standard error would not be defined)
• matched odds ratio (mOR)
• Mantel-Haenszel matched odds ratio
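The analogous computation for the odds ratio, as a sketch (same hypothetical counts as in the risk formulas; the Mantel-Haenszel matched version is not implemented here):

```python
import math

# Odds ratio OR = (a*d)/(b*c) with 95% CI on the log scale,
# SE of ln(OR) = sqrt(1/a + 1/b + 1/c + 1/d).
def odds_ratio_ci(a, b, c, d):
    if 0 in (a, b, c, d):                  # continuity correction
        a, b, c, d = (v + 0.5 for v in (a, b, c, d))
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    return or_, lo, hi
```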
• rate (R) :
• standard error_R = (k · R/N)^(1/2), where k is the person-time multiplicative factor and N = (number of persons · number of years)
• CI95% : R - 1.96 · standard error_R < R < R + 1.96 · standard error_R
• indicated with T- = negative test; T+ = positive test (event dichotomization fixed by a given threshold), ...
• specificity (Sp) = probability of a negative test in a healthy subject = P(T-|D-) = P(T- and D-) / P(D-) ~ (T- and D-)/D-
• sensitivity (Se) = probability of a positive test in an affected subject = P(T+|D+) = P(T+ and D+) / P(D+) ~ (T+ and D+)/D+
• a test is better than another if it has both higher Sp and higher Se : an ideal test should have Sp = Se = 1
• likelihood ratio : an index of diagnostic marker tests, the odds of a disease given a specified test value relative to the odds of the disease in the study population. It can be calculated for either a positive or a negative test, the former (LR+) being the ratio of the sensitivity to the false-positive error rate and the latter (LR-) being the ratio of the false-negative error rate to the specificity. Depending on how it is written, it can be viewed either as a risk ratio or an odds ratio
• predictive value : the conditional probability that a clinical test result correctly identifies a patient as having or not having a disease, i.e., the predictive value of a positive test (positive predictive value (PPV)) is the probability that a person with a positive test is a true positive (i.e., does have the disease) and the predictive value of a negative test (negative predictive value (NPV)) is the probability that a person with a negative test does not have the disease. The predictive value of a screening test is determined by the sensitivity and specificity of the test, and by the prevalence of the condition for which the test is used.
• negative predictive value (Pv-) = P(D-|T-) = P(D- and T-) / P(T-) = P(D-) . P(T-|D-) / [P(D-) . P(T-|D-) + P(D+) . P(T-|D+)] = P(D-) . P(T-|D-) / [P(D-) . P(T-|D-) + P(D+) . (1- Se)]
• positive predictive value (Pv+) = P(D+|T+) = P(D+ and T+) / P(T+) = P(D+) . P(T+|D+) / [P(D+) . P(T+|D+) + P(D-) . P(T+|D-)] = P(D+) . P(T+|D+) / [P(D+) . P(T+|D+) + P(D-) . (1-Sp)]
2 x 2 contingency table :
        T-                 T+                 total
D-      Sp·x               (1-Sp)·x           x
D+      (1-Se)·(n-x)       Se·(n-x)           n - x
total   (negative tests)   (positive tests)   n
• accuracy = probability of a right test = P(T- and D-) + P(T+ and D+) = P(D-) . P(T-|D-) + P(D+) . P(T+|D+)
• bias = P(T+)/P(D+)
Epidemiology :
• historical notes:
• John Graunt (1662, plague) :
• John Snow (1854, cholera in London from the Broad Street pump) :
• statistical associations :
• association => inference
• indirect : A and B may both be caused by X
• not causal : A and B are linked by a confounding factor X
• causal => action : A causes B but not vice versa. Supporting evidence includes :
• temporal sequence
• biological plausibility
• strength and grade of association (dose-response relationships)
• consistency of the association (meta-analysis)
• reversibility of incidence after primary prevention
• lack of confounding factors
• cause :
• necessary ...
• ... and sufficient
• ... but not sufficient
• not necessary ...
• ... but sufficient
• ... nor sufficient (risk factor : a clearly defined occurrence or characteristic that has been associated with the increased rate of a subsequently occurring disease; causality may or may not be implied)
• predisposing
• precipitating
• reinforcing
Subgroup analysis : when treatments have identical efficacy, the probability of finding at least one "statistically significant" interaction test when 10 independent interaction tests are undertaken is 40%. If the overall trial results fail to demonstrate that the new treatment is better than the conventional treatment, it may still be better in certain patients (say, women). And if the new treatment is demonstrated to be superior, the magnitude of the benefit may vary according to sex. Both scenarios should be formally investigated by means of an "interaction test" of the null hypothesis that the relative efficacy of the two treatments is the same in women and in men. An interaction is called quantitativeref1, ref2 when the new treatment is superior for both subgroups but its relative benefit differs between the subgroups. The clinical implications are usually more important for a qualitativeref1, ref2 interaction, in which the new treatment is superior in one subgroup but no different from or inferior to conventional treatment in another subgroup. An alternative, but problematicref1, ref2, ref3, approach to investigating subgroups is to test the hypothesis that there is no treatment difference separately in women and in men. However, even if both sex-specific treatment differences are statistically significant, this approach does not address the question of whether the magnitude of benefit depends on sex. Moreover, subdividing the data into subgroups reduces the study's power to detect treatment differences, because not only are the sample sizes reduced, but the number of statistical tests needed is also more than double that required to test for an interaction. One way to correct for the inflated false positive rate when multiple subgroup analyses are conducted is to apply a stricter criterion than the usual P = 0.05 for judging the significance of each interaction testref (Bailar JC III, Mosteller F, eds. Medical uses of statistics. 2nd ed. Waltham, Mass.: NEJM Books, 1992). If K independent tests are conducted, one way to ensure that the overall chance of a false positive result is < 5% (0.05) is for each test to use a criterion of 1 - (0.95)^(1/K), or about 0.05/K, to assess statistical significance. For example, if 10 tests are conducted, each one should use 0.005 as the threshold for significance. False positive rates are also inflated when the multiple interaction tests are not independent of one another; since corrections for this problem require information about the correlation among the testsref, the criteria for statistical significance used for independent tests are commonly applied, even though these criteria may be conservative. In the 20 subgroup analyses conducted by Bhatt et al., only one interaction test, for symptomatic versus asymptomatic patients, gives an uncorrected P value < 0.05 (0.045). Had the interaction tests been assessed with a criterion of 0.05/20 (0.0025) to account for the fact that 20 were conducted, none would have come close to reaching statistical significance. Instead of assessing an uncorrected P value against a stricter criterion for significance to account for multiple subgroup analyses, one can sometimes correct the P value so that it can be compared with the usual criterion of P = 0.05. When K independent interaction tests are performed, the appropriate correction for the smallest of the resulting P values, say P*, is 1 - (1 - P*)^K. This formula can be modified for correlated tests, and if applied without modification, it will usually be conservative. Its application to the analyses by Bhatt et al. gives a corrected P value of 0.60 for the interaction test of whether the relative efficacy of clopidogrel depends on symptomatic status. The inflation of false positive rates by the application of multiple statistical tests applies to both prespecified and post hoc subgroup analyses.
The important distinction is that the number of prespecified subgroup analyses is known and determined before the data are examined (though in some cases, important details such as how variables such as age will be categorized are not specified in advance). In contrast, when a report presents the results of post hoc subgroup analyses, it may be unclear why and how the subgroups were selected and how many other subgroups were analyzed. Post hoc subgroup analyses undertaken because of an intriguing trend seen in the results or selective reporting of certain subgroup analyses can be especially misleadingref. Authors and medical journals have a responsibility to ensure that the reporting of subgroup analyses is transparent. Ignorance of the total number of subgroup analyses, which ones were prespecified and which were post hoc, and whether any were suggested by the data makes it very difficult to interpret the reported results. When an interaction test for a baseline variable fails to reach the appropriate threshold for significance, conclusions about a differential treatment benefit related to this variable should be avoided or presented with caution. When subgroup analyses are properly conducted, presentation of their results can be informative, especially when the treatments being compared are used in practice. When reporting subgroup analyses, it is best not to present P values for within-subgroup comparisons, but rather to give an estimate of the magnitude of the treatment difference and a corresponding confidence intervalref. These confidence intervals should not be used to infer whether a treatment difference in a subgroup is statistically significant, on the basis of whether the interval excludes the hypothesis of equality between treatment groups, since such analyses suffer from the same problems as the use of multiple statistical tests. 
Rather, they should be interpreted as providing a plausible range of treatment differences consistent with the trial results. Overstating the results of subgroup analyses can misinform future research and lead to suboptimal clinical practice. Yet avoiding any presentation of subgroup analyses because of their history of being overinterpreted is a steep price to pay for a problem that can be remedied by more responsible analysis and reporting. Ultimately, medical research and patients are best served when subgroup analyses are well planned and appropriately analyzed and when conclusions and recommendations about clinical practice are guided by the strength of the evidenceref.
• clinical conditions
• Homo sapiens clinical conditions may be ...
• sporadic disease : neither endemic nor epidemic; occurring occasionally in a random or isolated manner.
• endemic (disease) / endemia : one present or usually prevalent in a population or geographical area at all times; such diseases usually have low mortality
• hyperendemic (disease) : an endemic disease equally prevalent in all age groups of a population.
• hypoendemic (disease) (0 < prevalence < 10%)
• mesoendemic (disease) (10% < prevalence < 50%)
• holoendemic (disease) : an endemic disease occurring at a high level in a population so that most of the children are affected, the adults in the same population then being less so (prevalence > 50%)
• endemic index :  the percentage of persons in any locality affected with an endemic disease
• epidemic (disease) : an infectious or other disease that suddenly affects individuals in a population or geographical area clearly in excess of the number of cases normally expected; said especially of infectious diseases but applied also to any disease, injury, or other health-related event occurring in such outbreaks
• pandemic (disease)
• ecdemic (disease) : of or pertaining to an infectious disease introduced into a population or geographic area from without
Clinical conditions of other Metazoa may be ...
• enzootic disease
• epizootic disease
• panzootic disease
Infectious diseases provide a particularly clear illustration of the spatiotemporal underpinnings of consumer-resource dynamics. The paradigm is provided by extremely contagious, acute, immunizing childhood infections. Partially synchronized, unstable oscillations are punctuated by local extinctions. This, in turn, can result in spatial differentiation in the timing of epidemics and, depending on the nature of spatial contagion, may result in traveling waves. Measles epidemics are one of a few systems documented well enough to reveal all of these properties and how they are affected by spatiotemporal variations in population structure and demography. In the 1970s, Cliff proposed that the equations used to calculate how planets are attracted to each other could also be used to predict how people with contagious disease would move. The shared idea is that a bigger place, whether planet or city, is more attractive. Infection is more likely to hop from a small town to the capital than to a nearby small town: this can be used to predict where a disease is likely to move next. The model has four parameters: the likelihood that someone will travel to a distant place instead of a close one; the likelihood that, if someone travels, they will go to a place of a particular size; the transmission rate of the visitor in the visited place (for example, children visiting family are likely to be around fewer children than when they are in school); and a final factor capturing the varying rates of travel of people from small and large towns.
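The gravity idea can be sketched as a toy coupling function; this is a simplified illustration of the general concept, not Cliff's actual model, and the exponents are hypothetical free parameters:

```python
def gravity_coupling(pop_i, pop_j, distance, tau1=1.0, tau2=1.0, rho=2.0):
    """Toy gravity-model coupling between two communities: attraction
    grows with both population sizes and decays with distance; the
    exponents tau1, tau2, rho would be fitted to movement data."""
    return (pop_i ** tau1) * (pop_j ** tau2) / (distance ** rho)

# A distant capital can out-attract a nearby small town, which is why
# infection often hops from a small town to the capital first.
big_far = gravity_coupling(5_000, 1_000_000, 100)   # town -> capital
small_near = gravity_coupling(5_000, 8_000, 10)     # town -> neighbor
print(big_far > small_near)  # True
```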
• index case : in epidemiology of contagious disease, the first case of a disease within a given outbreak, as opposed to subsequent cases.
• susceptible-exposed-infectious-recovered (SEIR) model
• susceptible-infected-recovered (SIR) model : a basic SIR model of disease dynamics consists of the 3 host categories (stratified by age, a, at time, t), thus: N(a, t) = X(a, t) + Y(a, t) + Z(a, t), where X, Y, and Z correspond to populations of each of the modeled categories (SIR); an additional latent category (infected but not yet infectious), H, may be added. A basic set of differential equations describing this framework is beyond the scope of this correspondence, but includes the age-specific host death rate (μ), recovery rate (υ), disease-induced mortality rate (α), and a "force of infection" component (λ). An ideal host species for a directly transmitted viral disease would have a negligible disease-induced mortality rate α(a), age-specific or otherwise, hence υ(a) would be an irrelevant component of the epidemiological dynamics. In such a situation, the total number of hosts, N, is not equivalent to the traditional equation shown above; rather Z ≈ 0 and Y would be a proportion of X, and because α(a) ≈ 0 (i.e., μ + α ≈ μ) the entire host population N can be uniformly modeled
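The X/Y/Z (S/I/R) bookkeeping above can be made concrete with a minimal numerical sketch; the parameter values and the Euler integration are our own illustrative choices, not from the text:

```python
# Minimal SIR integration: dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I,
# dR/dt = gamma*I, with S, I, R as fractions of a closed population.

def sir_step(s, i, r, beta, gamma, dt=0.1):
    """One Euler step of the basic SIR equations."""
    new_infections = beta * s * i
    recoveries = gamma * i
    return (s - new_infections * dt,
            i + (new_infections - recoveries) * dt,
            r + recoveries * dt)

s, i, r = 0.99, 0.01, 0.0
for _ in range(2000):                      # 200 time units
    s, i, r = sir_step(s, i, r, beta=0.5, gamma=0.25)

# With R0 = beta/gamma = 2 the epidemic burns out before everyone is
# infected, and S + I + R stays equal to 1 (a closed population).
print(abs(s + i + r - 1.0) < 1e-9, i < 1e-3, s > 0.05)  # True True True
```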
• basic reproductive ratio (R0) of the virus represents the number of secondary cases produced when an infected individual is introduced into a well-mixed local population of wholly susceptible individualsref (R. M. Anderson, R. M. May, Infectious Diseases of Humans: Dynamics and Control (Oxford Univ. Press, Oxford, 1991)). It is well established that, as the proportion of susceptibles in the population, s, drops (as individuals become infected, then recover), the number of secondary cases per infection, R, also drops: R = sR0. If R < 1, as is currently the case for H5N1 virus in humans, an infection will not cause a major epidemic. But if R is even modestly greater than unity, a novel infection may spread locally, with potential for further spread in the absence of controlref. Although technically this should represent the intrinsic reproductive rate of an organism, in terms of its representation in disease dynamics, R0 generally is symbolic of infections produced by an infected, infectious host or victim, given that evolutionarily a virus is dead if it does not colonize a new host. In order to establish itself in a host (or victim) population, or establish an equilibrium, a virus would within this framework have R0 > 1. Departures from equilibrium lead to the effective reproductive rate (R), discounted by the fraction of the host population that is susceptible (x*), where R = R0x* = 1 at equilibrium. R0 must be proportional to the number (or density) of hosts candidate for infection such that R0 = population size (N) / threshold magnitude (NT). For the condition R0 > 1, the corollary condition N > NT must hold. R0 is a measure of transmissibility and of the stringency of control measures required to stop an epidemic. In each generation of the outbreak, each case will lead to X new cases; the larger X is, the more rapid the growth in the number of cases in each following generation.
Thus, the earlier an outbreak is detected, the fewer primary cases there are to identify and prevent from transmitting to others. The higher the number of cases, the greater the risk of exhausting the resources available for appropriate control measures.
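The relation R = sR0 and the generation-by-generation growth described above reduce to simple arithmetic; a sketch with invented numbers:

```python
def effective_r(r0, susceptible_fraction):
    """Effective reproduction number R = s * R0 once only a fraction s
    of the population remains susceptible."""
    return r0 * susceptible_fraction

def cases_in_generation(initial_cases, r, generation):
    """Expected new cases in the n-th generation if each case produces
    R secondary cases on average."""
    return initial_cases * r ** generation

print(effective_r(3.0, 0.5))            # 1.5 : epidemic can still grow
print(effective_r(3.0, 0.25))           # 0.75 : below 1, it dies out
print(cases_in_generation(1, 2.0, 10))  # 1024.0 by the 10th generation
```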
Long-range spatial spread is in general facilitated by high infectivity, a long infectious period, and (at least in human influenza) a period of transmission before symptoms become apparent and quarantine measures can be taken. This contrasts with severe acute respiratory syndrome (SARS), in which the period of infectiousness begins with the onset of symptoms, allowing quarantine measures to be taken before maximum infectiousness is attainedref. The fate of an epidemic can also depend strongly on heterogeneities in R0, particularly on the role of "superspreaders" early in the epidemicref. For example, some superspreader individuals may be more infective per contact for some reason; other superspreaders may not have higher per-contact transmission, but have many more contacts and therefore greatly multiply the rate of spread (R. M. Anderson, R. M. May, Infectious Diseases of Humans: Dynamics and Control (Oxford Univ. Press, Oxford, 1991)). If superspreaders become infected early in an outbreak, the epidemic is more likely to take off, which can have substantial implications for disease controlref. In evolutionary terms and from the perspective of the pathogen, the host species barrier for infection can be thought of as a fitness valley lying between 2 distinct fitness peaks representing donor and recipient hosts, respectively. The more mutations required for a virus to move between these peaks, the deeper the valley and the less likely that this can occur in a single step, particularly if adaptation involves changes at multiple loci, as in the case of avian influenza virus transmitting in human populationsref. Such a model has two important implications for our understanding of viral disease emergence. 
• time series susceptible-infected-recovered (TSIR) model : on the basis of a gravity coupling model and a TSIR model for local dynamics, a metapopulation model for regional measles dynamics can capture all the major spatiotemporal properties in prevaccination epidemics of measles in England and Walesref ref ref
• prevention : it is not uncommon to find vaccination coverages reported as greater than 100%.  That occurs when the estimates for the denominator are off (they use population estimates that are lower than actual) or when the numerator includes more than planned for (for example, if they are doing coverages of the 0-5 year-old age group and 5-10 year-olds are vaccinated and counted in the vaccinations given numerator... or when they plan to vaccinate a town, and there are people who live in another town and come to this town for vaccinations).
• epidemiological measures :
• types :
• number of events
• quotient : a number obtained as the result of division; a number indicating how many times one number is contained in another
• ratio : an expression of the quantity of one substance or entity in relation to that of another; the relationship between two quantities expressed as the quotient of one divided by the other
• cross-product or odds ratio (OR) : the ratio of the probability of occurrence of one event to that of its alternative; it is often used in epidemiological analysis as it closely approximates relative risk
• rate (a.k.a. tasso in Italy) : in epidemiology and demography, the frequency of an event in a specified population; correctly applied only to fractions for which all of the cases contributing to the numerator are also counted in the denominator and for which the denominator is the entire population at risk. Rates are often multiplied by a factor to give the number of events per 1000, 10,000, or 100,000 population ...
• rates (tassi)
• crude rate :
• specific rate :
• proportional rate : specific rate / crude rate
• survival analysis : statistical analysis that evaluates the timing of events, particularly survival but also by extension other nonrecurrent events occurring in a cohort over time, such as relapse, death, or marriage. It involves following the cohort, plotting the occurrence of events, and calculating their probabilities for each time interval
• product-limit estimate / Kaplan-Meier survival curve : a consistent estimate of the survival curve that can be computed from randomly censored data. At each patient death (or other endpoint) the conditional probability of survival during the interval since the last death is calculated as the number of patients observed to survive beyond that point (i.e., those who have not yet died and have not left the trial for other reasons) divided by the number at risk. The value of the survival curve at that point is calculated as the product of the conditional probabilities of survival for all of the intervals up to that point (Kaplan EL & Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 53:457-481 (1958))
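A minimal sketch of the product-limit computation just described (the survival times are invented; `event=False` marks censoring):

```python
def kaplan_meier(observations):
    """observations: list of (time, event) pairs, event=True for a death
    and False for censoring. Returns [(time, survival)] at each death
    time, multiplying the conditional survival probabilities."""
    observations = sorted(observations)
    n_at_risk = len(observations)
    survival = 1.0
    curve = []
    i = 0
    while i < len(observations):
        t = observations[i][0]
        deaths = sum(1 for tt, ev in observations if tt == t and ev)
        removed = sum(1 for tt, _ in observations if tt == t)
        if deaths:
            # conditional probability of surviving past time t
            survival *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
        i += removed
    return curve

# 6 patients: deaths at t = 2, 5, 8; censored at t = 3, 6, 9
obs = [(2, True), (3, False), (5, True), (6, False), (8, True), (9, False)]
print([(t, round(s, 4)) for t, s in kaplan_meier(obs)])
# [(2, 0.8333), (5, 0.625), (8, 0.3125)]
```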
• vital statistics : the data, usually collected by governmental bodies, detailing the rates of birth, death, disease, marriage, and divorce in a population
• respondence :
• sex ratio : the proportion of one sex to the other; by tradition the number of males in a population to the number of females, usually stated as the number of males per 100 females
• morbidity : the incidence or prevalence of a disease or of all diseases in a population
• morbosity or morbidity rate : a rate in which the numerator is a number of cases of a disease and the denominator is the number of people at risk for the disease; it can denote either of the more precise terms
• standardized morbidity ratio (SMR) : a ratio like a standardized mortality ratio except that cases of disease rather than deaths are the observed data
• prevalence rate (P(D+)) = P(D+ and T+) + P(D+ and T−) = P(D+ and T+)/Se = [P(T+) − P(D−)·P(T+|D−)]/P(T+|D+) = [P(T−) − P(D−)·P(T−|D−)]/P(T−|D+) : the number of people in a population who have a disease or other condition at a given time: the numerator of the rate is the number of existing cases of the condition at a specified time and the denominator is the total population. Time may be a point or a defined interval; if unspecified it is traditionally the former
• point (puntual) prevalence : measured at a single point in time
• period prevalence : measured over a defined interval
• incidence rate : the probability of developing a particular disease during a given period of time; the numerator of the rate is the number of new cases during the specified time period and the denominator is the population at risk (i.e. not previously affected) during the period
• cumulative incidence rate : the proportion of an initially disease-free population developing a disease over a fixed interval, calculated by cumulating the proportions developing the disease within short subintervals
• method of Gooleyref
• person/time incidence rate :
• prevalence/incidence : at steady state, prevalence ≈ incidence × mean duration, so the ratio prevalence/incidence estimates the average duration of a chronic degenerative disease
• morbility :
• fecundity : ability to produce offspring rapidly and in large numbers. In demography, the physiological ability to reproduce, as opposed to fertility
• birth rate : number of live births in a geographic area in a defined period, usually 1 year / average total population or the midyear population in the area during the period x 1,000. Specific birth rates for subsets of the population may also be calculated, e.g., an age-specific birth rate, limited to the population of females of a defined age range
• fertility rate : a measure of fertility in a defined population over a specified period of time, usually 1 year
• general fertility rate : live births in a geographic area in a year / 1000 women of childbearing age (15 to 44 (49) years)
• more specific rates
• females of a given parity
• females of a particular decade in age
• completed rate for females who have finished childbearing
• abortivity rate : number of fetuses expelled by pregnancy week 20 / number of pregnancies x 1,000
• maternal or puerperal mortality rate : a rate in which the numerator is the number of maternal deaths ascribed to puerperal causes in 1 year; the number of live births in that year is often used as the denominator although to make a true rate the denominator should be the number of pregnancies (live births and fetal deaths)
• mortality or death rate : a rate expressing the number of deaths in a population at risk. The crude death rate is the ratio of the number of deaths in a geographic area in 1 year divided by the average or midyear population in the area during the year. An age-specific death rate is the ratio of the number of deaths occurring in a specified age group in one year to the average or midyear population of that group. A cause-specific death rate is the ratio of the number of deaths due to a specified cause in one year to the average or midyear total population
• standardized mortality ratio (SMR) : the ratio of the number of observed deaths in a study population to the number of expected deaths in that population. The expected deaths are calculated by classifying the study group by demographic variables such as age, sex, or race; computing the expected deaths for each class by multiplying the number of individuals in the study group in that class by the class-specific death rate in a standard reference population; and adding the expected deaths in all classes
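The SMR computation just described can be sketched as follows (the age classes, counts, and reference rates are invented):

```python
def smr(observed_deaths, study_counts, reference_rates):
    """Standardized mortality ratio: observed deaths divided by the
    deaths expected if the study group experienced the reference
    population's class-specific death rates."""
    expected = sum(n * rate for n, rate in zip(study_counts, reference_rates))
    return observed_deaths / expected

counts = [1000, 2000, 500]          # study group size per age class
ref_rates = [0.001, 0.005, 0.020]   # reference death rate per age class
# expected deaths = 1 + 10 + 10 = 21; SMR > 1 means excess mortality
print(round(smr(33, counts, ref_rates), 2))  # 1.57
```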
• proportionate mortality ratio (PMR) : in occupational epidemiology, the ratio of observed deaths due to a specific cause in an occupational cohort to the expected deaths due to that cause, as determined by the proportion of deaths from the cause in the general population or comparison population, multiplied by 100
• stillbirth or fetal mortality or death rate or ratio / natimortality : number of fetal deaths in 1 year / total number of both live births and fetal deaths in that year x 1,000
• fetal death ratio : fetal deaths in 1 year / number of live births in that year
• early fetal mortality (before pregnancy week 21) x 1,000
• late fetal mortality (after pregnancy week 21) x 1,000
• perinatal mortality rate : sum of fetal deaths at ≥ 28 weeks of gestation (stillbirths) and deaths of infants < 7 days of age in one time period and population / sum of the number of live births and fetal deaths at ≥ 28 weeks of gestation (stillbirths) in that same time period and population x 1,000. It is directly related to income per inhabitant.
• infant mortality rate : number of deaths in 1 year of children < 1 year of age / number of live births in that year x 1,000. It is inversely related to income per inhabitant and maternal education level. 6/1,000 in Italy.
• neonatal mortality rate : number of deaths in one year of children < 28 days of age / number of live births in that year x 1,000
• early neonatal mortality (in extrauterine week 1) x 1,000
• late neonatal mortality (in extrauterine week 2 to 4) x 1,000
• postneonatal mortality rate : number of deaths in a given year of children between the 28th day of life and the first birthday / difference between the number of the live births and neonatal deaths in that year x 1,000; the denominator is sometimes simplified, less correctly, to the number of live births. The ratio is sometimes approximated as the difference between the infant mortality rate and the neonatal mortality rate
• Gompertz's law : at advanced ages the risk of dying increases geometrically with age: the death rate at age x may be computed by the formula qx = q0·e^(ax), where qx is the death rate at age x, q0 is the death rate at age 0, and a is a constant. From middle age on, actual death rates closely approximate the curve that corresponds to this formula. It is also used to describe the growth of cancer
• Lexis or lexian distribution : mortality plotted vs. age, with one peak representing infant mortality and another representing average life expectancy. The latter is a subnormal distribution within which a normal distribution can be identified: the exceeding area represents preventable mortality.
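Gompertz's law qx = q0·e^(ax) is easy to illustrate numerically; the baseline rate and slope below are invented, not fitted to real data:

```python
import math

def gompertz_rate(q0, a, x):
    """Death rate at age x under Gompertz's law: q_x = q_0 * e^(a*x)."""
    return q0 * math.exp(a * x)

q0, a = 0.0001, 0.085            # hypothetical parameters
doubling_time = math.log(2) / a  # risk of dying doubles every ln(2)/a years
print(round(doubling_time, 1))   # 8.2
# the geometric property: the rate doubles over each doubling_time
ratio = gompertz_rate(q0, a, 40) / gompertz_rate(q0, a, 40 - doubling_time)
print(round(ratio, 6))           # 2.0
```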
• Lexis trials : n sets of s trials each, with the probability of success p constant within a set but varying from set to set; the variance of the number of successes per set is then σ² = s·p·q + s·(s − 1)·σp², where p and q are the mean success and failure probabilities and σp² is the variance of p across sets
• Lexis ratio (L) = σ/σB, where σ is the standard deviation observed in a set of Lexis trials and σB is the standard deviation expected under Bernoulli trials. If L < 1, the trials are said to be subnormal, and if L > 1, the trials are said to be supernormal
• case fatality rate or ratio (CFR) / lethality rate : the proportion of persons contracting a disease who die of that disease: the numerator is the number of deaths caused by a disease and the denominator is the number of diagnosed cases of the disease
• population mortality data :
• life tables :
• the first vertical column (designated x) contains the age intervals chosen by the tabulator. In the case of human populations with a maximum life span of > 100 years, a 1-year age interval is usually chosen
• the second vertical column (designated dx) records the number of individuals that die during each age interval
• the third vertical column (designated lx) lists the number of individuals alive at the beginning of the age interval
• the fourth vertical column (designated qx) contains the age-specific death rates
• the fifth vertical column (designated ex) records the life expectancy : the number of years, based on statistical averages, that a given person of a specific age (or entering any given age interval), class, or other demographic variable may be expected to continue living. Also the average length of survival expected for an organism from a given point in its life cycle. For a newborn it is named average life expectancy at birth
• USA : according to the latest census there are > 50,000 living centenarians, and the average American's life expectancy has risen to 77.2 years, an increase of > 30 years during the course of the last century
• Italy :
• 78 years for males
• 82 years for females
• Japan : life expectancy is longest in the world for both sexes:
• 78.32 for men in 2002
• 85.23 years for women
• UK : although traditionally lower, male life expectancy is rising at a faster rate than women's. A male born in 2002 could expect to live to about 76, while his sister would live until 81, but it is likely that by 2010 life expectancy for both men and women will start to converge at about 81. Heavy drinking in young women has more than tripled in the past 17 years, and the proportion of young women aged 16-24 drinking more than the recommended weekly limit is now almost the same as for men: 10% for women in 2002 compared with 12% for men. Also, government antismoking drives do not seem to have been as successful in cutting the number of female smokers as in cutting the number of male smokers. Although the proportion of male smokers decreased from 51% in 1974 to 28% in 2002, the rate of decline in women has been slower: from 41% to 26%. These lifestyle changes have affected death rates. Deaths from lung cancer in men have halved since 1973; deaths from lung cancer in women have increased by 45%. Obesity is one area in which women's health can claim at least to be deteriorating at a slower rate than men's. In 1994, 15% of adult men had a body mass index of more than 30, compared with 18% of women. But, by 2002, this had risen to 21% compared with 22% of women, representing a levelling of the differences between men's and women's lifestylesref
There are 2 kinds of life tables :
• cohort life tables contain data collected over the entire time span of a particular cohort. Such data permit life expectancies to be calculated rather than projectively estimated
• period life tables are used when it is not possible to readily follow a cohort until the last member dies, which is often the case for species with long life spans, such as humans. Indeed, human cohort tables are available, but they are incomplete except for cohorts born well over 100 years ago. Because a period life table contains data from many different cohorts, it does not provide accurate information for any particular cohort. For example, in populations where the age-specific death rate has been progressively declining for an extended period of time, as has occurred in the USA during the 20th century, the projected life expectancies from data in a period life table are likely to be an underestimate for many of the cohorts comprising the table. The point to be emphasized is that life expectancies from period life tables are projections based on the mortality characteristics at the time of compiling the table, and thus may not be an accurate forecast of the actual remaining mean length of life of any of the cohorts in the table
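The life-table columns listed above (x, dx, lx, qx, ex) can be computed directly from a cohort's death counts; a minimal sketch with an invented cohort of 1,000:

```python
def life_table(deaths_per_interval):
    """deaths_per_interval: d_x for each age interval, accounting for
    the whole cohort. Returns rows (x, d_x, l_x, q_x, e_x), crudely
    counting each death as occurring mid-interval when computing e_x."""
    alive = sum(deaths_per_interval)          # l_0: the whole cohort
    rows = []
    for x, d in enumerate(deaths_per_interval):
        q = d / alive                         # age-specific death rate
        e = sum((k + 0.5) * dk
                for k, dk in enumerate(deaths_per_interval[x:])) / alive
        rows.append((x, d, alive, round(q, 3), round(e, 2)))
        alive -= d
    return rows

for row in life_table([100, 50, 150, 300, 400]):
    print(row)
# e.g. the first row is (0, 100, 1000, 0.1, 3.35): average life
# expectancy at birth here is 3.35 age intervals
```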
• survival curves (x axis, age; y axis, percent of population alive) provide readily comprehensible visual information. The change in shape of the survival curve in recent years is termed rectangularization
• 5-year survival : an expression of the number of survivors with no trace of disease 5 years after each has been diagnosed or treated for the same disease.
• relative survival rate : a statistical comparison between the rate of patients in a cohort surviving for a certain length of time and the survival rate of a comparable group in the general population.
The 2 factors that determine the age structure of the world population are life expectancy and birth rate. In the case of nations, there is the additional factor of migration into and out of the nation :
• changes in life expectancy : given the progressive decrease at all ages in the age-specific death rate, life expectancy has increased during this period. Japan has experienced the greatest increase in life expectancy (for men, from 68 years in 1965 to 76 years in 1990; for women from 73 years in 1965 to 83 years in 1990). Increases in life expectancy during the 20th century have resulted primarily from prevention of death from infectious diseases at young ages, thereby affecting all age strata of the population to almost the same extent. However, if life expectancy continues to rise in the 21st century, it will primarily influence the older age strata of the population, because currently, in developed countries, few deaths occur at young ages. In the developed nations, future increases in life expectancy will primarily be due to the prevention of such age-associated diseases as cancer, coronary heart disease, and stroke.
• changes in birth rate : industrialization and urbanization during the 19th century reduced the number of progeny per family. Occasional deviations from this trend, such as the high birth rate in the USA in the years immediately after WWII, generated what is referred to as the "Baby Boom Generation". In fact, the birth rate has decreased markedly in both developed and developing nations over the last half of the 20th century. During the period 1950-1955, the birth rate in developed countries was 2.8 per woman and in the developing nations it was 6.2 per woman; by the period of 1985-1990, the rate had fallen to 1.9 per woman in the developed nations and 3.9 per woman in the developing nations. The UN has projected a further decline in birth rate in the developing nations during the first part of the 21st century, with a birth rate of 2.3 per woman by 2025. Leading factors are socioeconomic factors that promote family planning (e.g., the enormous increase of women in the work force), the increased access to effective contraceptives, and the legalization of abortion in some nations (e.g., in the Soviet Union, it played a major role). In the developing nations the same factors, as well as official government policies favoring small family size, have been responsible for the reduction in birth rate. Thus it is the decrease in birth rate, not the increase in life expectancy, that is the reason for the marked rise in the 20th century in the fraction of the population that is elderly. In addition, as stated earlier, the projected further increase in life expectancy during the 21st century will be another factor in the production of populations with high percentages of elderly, particularly the old-old.
• changes in gender composition : during the 20th century, life expectancy for women in the USA and other developed nations has increased to a greater extent than life expectancy for men. The projected further increase in life expectancy for men in the developed nations during the coming decades is a little greater than for women, thus narrowing the gender difference. For the 1976 population of Caucasian men and women in the USA, at all age intervals women had a lower age-specific death rate than men, but the rate of increase in the age-specific death rate with increasing calendar age and the mortality rate doubling time did not differ between men and women, suggesting that men and women age at the same rate, but men at all ages are more vulnerable than women of the same age. This results in many more women than men in the population of people of advanced age.
Changing age structure will have a negative impact on society :
• work force : economists view the 2 periods of life in which an individual is not in the work force as periods of dependency (young-age dependency and old-age dependency), in that the individual is not producing but is consuming goods and services. The number in the population under age 21 divided by the number between the ages of 20 and 65 provides the young-age dependency ratio. The number of people over 65 divided by the number between 20 and 65 provides the old-age dependency ratio. The total dependency ratio is, of course, the sum of the young-age dependency ratio and the old-age dependency ratio. Economists tend to ignore the fact that many retired people provide goods and services as volunteers and unpaid helpers. In addition, many aged 65-plus have not retired, or have rejoined the work force on a part-time or full-time basis. Nevertheless, the amount of goods and services needed for the old is 3-fold that for the young
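The dependency ratios defined above reduce to simple arithmetic; a sketch with invented population counts:

```python
def dependency_ratios(under_21, working_age, over_65):
    """Young-age, old-age, and total dependency ratios; working_age is
    the 20-65 group the text describes."""
    young = under_21 / working_age
    old = over_65 / working_age
    return young, old, young + old

young, old, total = dependency_ratios(30_000, 60_000, 15_000)
print(young, old, total)  # 0.5 0.25 0.75
```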
• diseases : a major reason for the high costs involved in meeting the needs of old people is disease. Age-associated diseases tend to be chronic, often progressing with increasing severity over a period of years. Others do not result in death but interfere with functioning of the individual over a long period of time. In the age range 65-74, men have a higher prevalence than women of most fatal chronic diseases, while women have a greater prevalence than men of most nonfatal chronic diseases. The old are less able to cope with acute diseases, such as influenza (RR for hospital admission = 2 for people aged 65-74 and 2.5 for people aged > 75 years; average length of stay in the hospital was 18% longer for those 65-74 years old and 31% longer for those 75 years old and above than for those 45-64 years old)
• daily living assistance : the elderly lose the ability to carry out
• one or more instrumental activities of daily living (IADL), such as shopping for and preparing meals, managing money, doing laundry, using the telephone, doing heavy housework, doing light housework, and outside mobility.
• one or more of activities of daily living (ADL), such as using the toilet, bathing, getting in and out of bed, dressing, eating, and inside mobility. In the age range 65-74, almost 12% have difficulty with one or more ADL, and of these, about 75% will receive help in dealing with the deficit. For those aged > 85, > 50% have difficulty with one or more of the ADL, and of these > 85% receive help in dealing with the deficit. Much of the IADL and ADL loss is due to chronic disease, but other aspects of aging not classified as disease are also involved in ADL limitation.
• need for biomedical breakthrough : in the USA currently family members provide > 80% of the help needed by the elderly for ADL and IADL. However, the increasing number of never-marrieds and childless couples and the small families of the "Baby Boomers" will mean a markedly reduced pool of family caregivers in the future. The impact will be the greatest for elderly women because of the higher prevalence of nonfatal chronic diseases in women than in men; the longer life expectancy of women; and the likelihood of widowhood, not only because women have a greater life expectancy than men, but also because husbands in the "boomer" generation are usually older than their wives. Currently, about 75% of elderly men and 40% of elderly women live with their spouse, and although this may change somewhat in the 21st century, the basic pattern is likely to remain the same. The governments of the developed nations are beginning to have difficulty financing their support programs for the elderly : providing benefits based on biological age may help; indeed, many over 65 years of age can continue in the work force. Of course, to use such an approach in a manner that is fair will require that gerontologists develop better means of measuring biological age than are currently available. Although it is clear that age-associated diseases, such as cancers and heart disease, can be delayed or prevented by these behavioral changes, one must consider the possibility that in the long term these changes could increase costs. The reason for this apparent paradox is that in the absence of these fatal diseases, senescence will continue; and at more advanced ages, deterioration, including that due to nonfatal chronic disease, could result in a more prolonged period of dependency than prior to the behavioral change. Biomedical approaches aimed at lessening those aspects of aging that make the support of the elderly so costly are
• prevention or amelioration of the debilitating aspects of chronic diseases, particularly the nonfatal diseases, that afflict the elderly
• development of procedures that broadly reduce age-associated deterioration, i.e., measures that decrease the extent of senescence
• attack or case rate : in the analysis of acute outbreaks of disease, the proportion of persons exposed to the disease during the outbreak who become ill
• secondary attack rate : the attack rate in a closed exposed group, such as a household. The index case, who brings the group to the attention of the investigator, and also other initial cases occurring too early to be related to the index case, are excluded from both the numerator and the denominator when possible.
• infection-morbidity rate (IMR) : the number in the risk group who showed symptoms per 100 or per 1,000 infected/truly exposed/seroconverted persons; rather like an attack rate
• back-calculation : a statistical method that uses the current incidence and the length of incubation of a disease to estimate the cumulative incidence of the disease and project the number of cases that will occur in the future.
• fatality rate : the death rate in a specific group of persons simultaneously affected by some event, such as a natural disaster
• crude rate : one giving the total number of events occurring in an entire population over a period of time, without reference to any of the individuals or subgroups within the population
• specific rate :  a rate that applies to a specific demographic subgroup, e.g., individuals of a specific age, sex, or race, giving the total number of events in relation only to that subgroup
• adjusted or standardized rate : a fictitious summary rate statistically adjusted to remove the effect of a demographic or other influential variable such as age or sex, thus permitting unbiased comparison between groups having different underlying compositions with respect to these variables
• growth curve : the curve obtained by plotting increase in size or numbers against the elapsed time, as a measure of the growth of a child, or the multiplication of microorganisms.
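The adjusted (standardized) rate defined above is usually computed by direct standardization: a weighted average of the group-specific rates, with weights taken from a standard population. The sketch below illustrates the idea; the age bands, rates, and population counts are invented for illustration, not taken from the source.

```python
# Direct standardization: weight each stratum-specific rate by the share
# of a standard population in that stratum, then sum. This removes the
# effect of differing age structures when comparing two populations.
# All numbers below are illustrative.

def direct_standardized_rate(specific_rates, standard_pop):
    """specific_rates: per-person rates by stratum; standard_pop: stratum sizes."""
    total = sum(standard_pop)
    return sum(r * n for r, n in zip(specific_rates, standard_pop)) / total

# Age-specific death rates (deaths per person-year) in two populations
rates_a = [0.001, 0.005, 0.020]   # young, middle, old
rates_b = [0.002, 0.006, 0.015]

# Hypothetical standard population: people in each age band
standard = [50000, 30000, 20000]

adj_a = direct_standardized_rate(rates_a, standard)  # 0.006
adj_b = direct_standardized_rate(rates_b, standard)  # 0.0058
print(adj_a, adj_b)
```

Because both adjusted rates use the same standard weights, the comparison between the two populations is no longer biased by their different age compositions.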
• demography : models for population growth
• Fibonacci sequence (Leonardo Fibonacci, Pisa, 1202) : from month 2 each pair generates a pair of pups per month; at month n, C(n) = C(n-1) + C(n-2)
• Malthus model : unconstrained growth; the population multiplies by a constant factor each generation, so growth is exponential
• Verhulst model : susceptible/infected and population growth : when the number of infected people is no longer negligible, encounters between infected individuals become more probable and no longer increase the spread of the epidemic
• C1 = C0 + C0 = 2C0, where C0 = n/N
• C2 = 2C1 = 4C0
• Ci+1 - Ci = Ci . (1 - Ci) => Ci+1 = Ci . (2 - Ci)
• logistic curve : an S-shaped curve that describes population growth under limiting conditions as a function of time; when the population is low, growth begins slowly, then becomes rapid and increases exponentially, finally slowing down and reaching equilibrium as the population reaches the maximum that the environment can support : n = N/(1 + B . e^(-ct))
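The three growth models above can be sketched numerically. The code below (a minimal illustration; the parameter values N, B, and c are arbitrary, not from the source) generates the Fibonacci pair counts, iterates the Verhulst recurrence Ci+1 = Ci . (2 - Ci), and evaluates the closed-form logistic curve n = N/(1 + B . e^(-ct)).

```python
import math

# Fibonacci sequence: from month 2 each pair produces a new pair per month,
# so C(n) = C(n-1) + C(n-2), with C(1) = C(2) = 1.
def fibonacci(n):
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

# Verhulst recurrence on the population fraction C_i = n_i / N:
# Ci+1 - Ci = Ci * (1 - Ci)  =>  Ci+1 = Ci * (2 - Ci).
# Growth is near-exponential while C is small, then saturates at C = 1
# (the carrying capacity) as encounters become more probable.
def verhulst(c0, steps):
    c = c0
    for _ in range(steps):
        c = c * (2 - c)
    return c

# Closed-form logistic (S-shaped) curve: n(t) = N / (1 + B * exp(-c * t)).
# Parameter values here are arbitrary examples.
def logistic(t, N=1000.0, B=99.0, c=0.5):
    return N / (1 + B * math.exp(-c * t))

print(fibonacci(12))       # 12th month's pair count
print(verhulst(0.01, 10))  # fraction approaches the carrying capacity 1
print(logistic(0.0))       # starts low, at N / (1 + B)
```

Iterating the Verhulst recurrence from a small starting fraction shows both regimes in one run: early doublings (the Malthus limit) followed by saturation near C = 1.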
Tracking travelers is difficult, so researchers came up with the idea of studying them indirectly by tracing how money circulates through the economy. In the study, scientists traced the whereabouts of nearly half a million dollar bills on the Where's George US bill-tracking site. Users register their money and then spend it; they can monitor the money's movement online as it changes hands. Researchers found that most of the money (57%) traveled between 30 miles and 500 miles over about nine months in the USA. About a quarter of the bills moved > 500 miles. By analyzing the movement of money -- and human travel -- over different distances, the scientists found that the money followed a predictable pattern. The method could be used to create more realistic disease models that track the spread of germs and perhaps prevent outbreaks
Statistical methods in medical publications : exact methods should be used as extensively as possible in the analysis of categorical data. For analysis of measurements, nonparametric methods should be used to compare groups when the distribution of the dependent variable is not normal. Results should be presented with only as much precision as is of scientific value. For example, measures of association, such as odds ratios, should ordinarily be reported to two significant digits. Measures of uncertainty, such as confidence intervals, should be used consistently, including in figures that present aggregated results. Except when one-sided tests are required by study design, such as in noninferiority trials, all reported P values should be 2-sided. In general, P values larger than 0.01 should be reported to 2 decimal places, those between 0.01 and 0.001 to 3 decimal places; P values < 0.001 should be reported as P<0.001. Notable exceptions to this policy include P values arising in the application of stopping rules to the analysis of clinical trials and genetic-screening studies. In manuscripts that report on randomized clinical trials, authors should provide a flow diagram in CONSORT format and all of the information required by the CONSORT checklist. When restrictions on length prevent the inclusion of some of this information in the manuscript, it should be provided in a separate document submitted with the manuscript.
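The P value rounding policy above can be expressed as a small formatting helper. This is only a sketch of the stated rules (the function name and the handling of the boundary P = 0.01, folded into the 3-decimal band, are my choices, not from the source):

```python
def format_p(p):
    """Format a two-sided P value per the reporting rules above:
    P < 0.001        -> "P<0.001"
    0.001 <= P <= 0.01 -> 3 decimal places
    P > 0.01         -> 2 decimal places
    """
    if p < 0.001:
        return "P<0.001"
    if p <= 0.01:           # boundary P = 0.01 kept in the 3-decimal band
        return f"P={p:.3f}"
    return f"P={p:.2f}"

print(format_p(0.0004))  # P<0.001
print(format_p(0.0042))  # P=0.004
print(format_p(0.27))    # P=0.27
```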
Web resources : The Virtual Library - Epidemiology
Bibliography :
• Colton T. : Statistica in Medicina, Piccin, Padova
• Norman, G.R., Streiner, D.L. : Biostatistica, Casa Editrice Ambrosiana
• Daniel, W.W. : Biostatistica, EdiSES Napoli
• Camussi, A. : Metodi statistici per la sperimentazione biologica, Zanichelli
• Armitage, P., Berry, G. : Statistica Medica, McGraw-Hill
• Glantz, S.A. : Statistica per discipline biomediche, McGraw-Hill

Copyright © 2001-2014 Daniele Focosi. All rights reserved