
'Statistics' class notes, Laney College, Fall 2018

These notes, sometimes incomplete, were taken from a lecture class based on CMU's OLI course:
CMU - Carnegie Mellon University
OLI - Open Learning Initiative

The purpose of these notes is to help the writer (me) retain the information and to provide support for 're-entry', the moment in life where I really need to apply parts of this.

Attribution: many of the diagrams are images captured from the CMU material. Sometimes the supporting links end up in Wikipedia which I $upport monthly. I don't believe any of this material comes directly from Dr. Liang's fine lectures.

See Conventions
See My Review of the course.
See Layout of course materials

ADM
p31.3, Average Deviation from the Mean
alpha, α
α is a synonym for significance level.
This value is compared to the pValue to determine whether the alternate hypothesis is supported by the collected data. The significance level is chosen early, during the design of the sampling effort. A smaller pValue is a stronger indication that the issue sought by the effort is true. A common value for alpha is 0.05, which pairs with a 95% confidence level (alpha = 1 - confidence level).
applet
TODO: THERE ARE SEVERAL APPLETS. Show typical link. calc,gen model data...
This 'applet' is frequently used in the OLI material to find the probability corresponding to a known Z-score.

I strongly recommend that the reader not get too attached to the applet - which 'goes away' at the end of the class. Alternatives to the applet can be found below in the discussion about 'p4z'.

A useful location of the applet is: App p65sht4
apps
Pinot:: ../p/slideShow/drafts/normalDistrib.xlsx (also on thumb drive ??and yahoo website??) TODO
  • 1st block (upper left) calcs P left of 'x value'
  • 2nd block "Normal Dist, Between x's". uses mean, std dev from A4,A5. Uses A11,A12 for xLeft, xRight
  • 3rd block. F3 to F7. calcs 'x value' given probability,mean,std dev

p66,1/7,top std normal curve. find P for any normal curve; pages > 1/7 show op

association
p39
wikip, "association (statistics)" / Correlation: "In statistics, dependence or association is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the broadest sense correlation is any statistical association, though in common usage it most often refers to how close two variables are to having a linear relationship with each other."
bell-shaped
p15. aka 'symmetric with central peak'

categorical variable
see 'variable, categorical'. The name of the entire study, "Cereals", p13, is not a variable at all.
central limit theorem
p111.4 "for large samples, the sampling distribution of sample means is approx normal". Even if the original data is skewed
In practical terms this seems to mean that, when dealing with the means of samples, a sample size of about 30 is usually all that's needed for the sample means to be approximately normally distributed.
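A small simulation can make the theorem concrete. This is a sketch under assumptions not in the course text: the skewed "population" is an exponential distribution with mean 1.0, and the seed, sample size and repeat count are arbitrary choices.

```python
import random
import statistics

random.seed(2018)  # fixed seed so the run is repeatable

# A strongly right-skewed "population": exponential with mean 1.0 (assumed).
def draw_sample(n):
    return [random.expovariate(1.0) for _ in range(n)]

n = 30            # observations per sample
repeats = 5000    # number of samples drawn

sample_means = [statistics.mean(draw_sample(n)) for _ in range(repeats)]

center = statistics.mean(sample_means)    # should land near the population mean, 1.0
spread = statistics.stdev(sample_means)   # should land near sigma/sqrt(n) = 1/sqrt(30)

print(round(center, 3), round(spread, 3))
```

Even though individual values are badly skewed, the sample means cluster around 1.0 with a spread close to sigma/sqrt(n).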
CI
CI='confidence interval'
claim
p75sht1 SOON relates to 'hypothesis'
conditional distributions
p39
conditional percentages
p39.6 calc'd separately each val of variable, explanatory.
confidence interval

A "C.I.", "CI", or "Confidence Interval" can be described as (xmin, xmax) or as (x_middle +/- amount). 'xmin', 'xmax' bound the range of interest. See top of p83sht5 for an example: (35.0 , 42.5) or ( low bound, high bound ). There's also a percentage value given with the interval, eg: 95%, as in "about 95% of intervals built this way from random samples will contain the population value; this one runs from 35.0 to 42.5".
p83.0 "REQUIRES a random sample"; p91sht2/4 btm.
(p115) "We are 95% conf that the mean SAT math score in this state is between 467.2 ..."
The Zc z-score, introduced in m14, is about the 'c' (=confidence level).
'stdErr' is like stdDev but describes how much a sample statistic (mean or proportion) varies from sample to sample.

examples:

  • p83sht2 "1508 adults, margin of error 2.5% points, 95% confidence"
  • p83sht3,"learn by doing". resultant confidence interval was "(27%+-4%) or (23%, 31%)" //don't overthink it !
  • p83sht3,"did I get this". (21% +- 0.2%) or (20.8%,21.2%) at 95%
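The (x_middle +/- amount) form can be computed directly. A minimal sketch for a proportion, assuming the normal model applies; the function name and the worst-case p_hat = 0.5 are illustrative choices, not from the course text.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """Return (low, high) for a sample proportion using the normal model."""
    z_c = NormalDist().inv_cdf((1 + confidence) / 2)   # about 1.96 for 95%
    margin = z_c * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Poll of 1508 adults as in the p83sht2 example; p_hat = 0.5 is an assumed worst case.
low, high = proportion_ci(0.5, 1508)
margin = (high - low) / 2
print(round(margin, 3))   # about 0.025, i.e. 2.5 percentage points
```

With 1508 respondents the margin comes out near the quoted 2.5 percentage points.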
degrees of freedom
m19, p116sht1 btm; 'df'; df = n-1
shows up in problems using the T-model
dependent, independent
p51
disjoint
module 10 (?). when events A and B have no outcomes in common, they are disjoint.
if disjoint: P(A or B) = P(A) + P(B)
distributions
module 4 p14 early
dotplots
p14.2 There's a dot for every measurement, whereas a histogram shows 'bins'. A dotplot is better than a histo at showing shape, center, spread. //p17.0, histo p17.2
double-blind experiment
p39, test info hidden from participant to reduce or eliminate bias. wikip: both subjects and test conductors are uninformed
empirical rule
for normal curves. p62.8, p63.0
  • 0.68 within 1 sd // 1.645 sd belongs to 0.90, not 0.68
  • 0.95 within 2 sd, actually 1.96 sd
  • 0.997 within 3 sd // 2.576 sd belongs to 0.99
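The rule's numbers can be checked with Python's standard library. A minimal sketch; the helper name area_within is made up for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def area_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))

# Going the other way: how many sd's capture exactly 90%, 95%, 99%?
for c in (0.90, 0.95, 0.99):
    print(c, round(std_normal.inv_cdf((1 + c) / 2), 3))
```

The first loop gives the rounded 68 / 95 / 99.7 figures; the second shows where 1.645, 1.96 and 2.576 come from.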
Excel's functions
there's an exercise at the end of histo 3/4 TODO: point to slideShow

NORM.DIST(_,____,______,0) with 0 as last arg returns the height of the curve, not a probability; use 1. see sideBySide.xlsx right side
NORM.DIST(x,mean,stdDev,1) returns P(X <= x), the area left of x, given mean and stdDev. for normal curve
NORM.DIST(135,100,15,1) returns 0.990185
--------------------------------
NORM.INV(0.990185,100,15) rets 135, inverse of NORM.DIST call above.
-----------------------------------------------------------
T.DIST(x,df,TRUE) returns P(T <= x), the left-tail area for Student's T distribution
T.DIST(x,df,FALSE) returns the height of the T curve at x, not a probability
T.DIST.2T(x,df) returns the 2-tailed area for Student's T distribution (x must be >= 0)
T.DIST.RT(x,df) returns the right-tail area for Student's T distribution
The 3rd arg of T.DIST is 'cumulative' (TRUE/FALSE), not 'tails'; any nonzero number counts as TRUE:
T.DIST(1.96, 18,-1) gave 0.9672
T.DIST(1.96, 18, 0) gave 0.0626
T.DIST(1.96, 18, 1) gave 0.9672
T.DIST(1.96, 18, 2) "
T.DIST(1.96, 18, 3) "
-----------------------------------------------------------
T.INV(probability, deg_freedom ) returns the t with 'probability' to its left; for 95% conf and df=18, |t| = 2.10 where z was 1.96
T.INV(0.025, 18 ) gave -2.10092
T.INV(0.025, 39 ) gave -2.02269 or 0.023
-----------------------------------------------------------
T.TEST(array1, array2, tails, type )
-----------------------------------------------------------
AVERAGE(cell_a:cell_b) from 'a' to 'b', inclusive (a comma instead of the colon would average just those 2 cells)
STDEV(cell_a:cell_b) from 'a' to 'b', inclusive
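For use without Excel, the normal-curve and summary functions above have rough counterparts in Python's standard library. This is a sketch, with made-up data for the AVERAGE/STDEV part. T.DIST and T.INV have no standard-library counterpart; scipy.stats.t (its cdf and ppf methods) would be one option, not shown here.

```python
from statistics import NormalDist, mean, stdev

iq = NormalDist(mu=100, sigma=15)

p_left = iq.cdf(135)          # like NORM.DIST(135,100,15,1)
x_back = iq.inv_cdf(p_left)   # like NORM.INV(p_left,100,15)
height = iq.pdf(135)          # like NORM.DIST(135,100,15,0): curve height, not a probability

values = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data
avg = mean(values)            # like AVERAGE(range)
s = stdev(values)             # like STDEV(range): sample std dev, divides by n-1

print(round(p_left, 6), round(x_back, 2), avg, round(s, 3))
```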

expected value
p54.7, same as mean.
explanatory variable
see 'variable, explanatory' TODO: link
exploration, cycle of
module 4, see diagram p38.9
exploratory data analysis
mentioned p46.5, 2nd part of CopyCat case p71.0 top // cheating
five-number summary
module 6, == minimum, quartile 1, median, quartile 3, and maximum
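A minimal sketch of the five numbers in Python. It assumes quartiles are taken as the medians of the lower and upper halves (leaving out the middle value when n is odd); other tools use slightly different quartile rules. The data is made up.

```python
from statistics import median

def five_number_summary(data):
    """Min, Q1, median, Q3, max, with Q1/Q3 as medians of the lower/upper halves."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]                 # values below the middle position
    upper = xs[len(xs) - half:]       # values above the middle position
    return min(xs), median(lower), median(xs), median(upper), max(xs)

scores = [1, 3, 4, 6, 7, 8, 9, 12, 15]   # made-up data
print(five_number_summary(scores))
```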
given
p43.3
p79,3/7: it's the word given to the vertical bar in P(pHat >= 0.15 | p = 0.10)
histograms
m4, link frequency, relative frequency.
made fm (link dotplot), see p16.4 especially 97 dots becoming ht of a bin
changing bin size, start point can distort histo. p17, histo p17.4
avoid 'pancake', 'skyscraper'
hypothesis
p39 ??, reading m15 now but the term has been around ?since m13?. Early m15 notes that 'claims' will morph into hypotheses.
hypothesis, alternative, Ha
p95.5. hypothesis about the value of the parameter. Claim in the research question about the value of the parameter. The alternative hypothesis says the parameter is “greater than” or “less than” or “not equal to” the value we assume to be true in the null hypothesis.

Examples
p95.8 The proportion of smokers among adults who have a degree < 0.22

WHEN AN EXPERIMENT IS DEFINED, THERE IS BOTH A NULL AND AN ALTERNATIVE. "WE CAN EITHER REJECT THE NULL HYP OR FAIL TO REJECT IT. WE NEVER CONFIRM NULL", P96.8. eg: p96.3, H0 "still 62MB", Ha "> 62MB"

How likely it is that in a sample of 375 we find that as low as 16.5% have used marijuana, when the true rate is actually 21.5%. P96.8

How likely it is in a random sample of 1,500 students to observe students studying an average of at most 27 or at least 33 hours per week outside of class, if the mean number is actually 30 hours per week.

hypothesis, conclusion
p96.4.
good to include the pValue in the conclusion. "the data do not provide signif evidence that the propor of comm colleges without X is lessThan 25%".
p97sht2
hypothesis, null, 'H0'
p95.5
hypothesis about the value of the parameter.
We assume the null hypothesis true then see if we can reject it.
"We never accept the null hypothesis or state that it is true.", p96.7
A 'true' result means there is (effectively) no relation between the defined parts.
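A 'false' null hyp means there is a relation.
"pValue", p96.4: the probability, assuming the null hyp is true, of a sample result at least as extreme as the one observed. It reflects how much variability to expect in random samples when the null hyp is true.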

Examples
p95.7 The proportion of smokers among adults who have a degree == 0.22
p95.8 The mean IQ of Raider fans is the same as Niner fans.

The null hypothesis is a general statement or default position that there is no relationship between two measured phenomena, or no association among groups...
...the field of statistics gives precise criteria for rejecting a null hypothesis.
null intro, wikip
null details, wikip
hypothesis testing
p95.8, 96.1, p95.0
confidence interval: wrt population parameter(s)
estimate value or difference in popu param
hypothesis test: wrt population parameter(s)
test a claim about a population parameter(s) or difference in them

The process of forming hypotheses, collecting data, and using the data to draw a conclusion about the hypotheses. summary p97sht3/
  • get or make the research question
  • determine hypotheses, 'null' and 'alternate'
  • collect the data. (random sample. calc statistic (mean or proportion).
    Formulate the exact test, using the statistic obtained from the data, eg: "How likely is it that, in a sample of 'n', we'll find that (as much, at least etc) the statistic is true while the null's value is also true"
  • assess the evidence
  • state a conclusion, ie. accept Ha or conclude "don't know"
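The steps above can be sketched for a single proportion, using the marijuana example quoted earlier (sample of 375, 16.5% observed, 21.5% assumed under the null). The function name, the left-tailed alternative and alpha = 0.05 are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(p_hat, p0, n):
    """Return (z, left-tail pValue) for Ha: p < p0, using the normal model."""
    std_err = sqrt(p0 * (1 - p0) / n)      # spread assumed under the null
    z = (p_hat - p0) / std_err             # (statistic - parameter) / stdErr
    p_value = NormalDist().cdf(z)          # area to the left of z
    return z, p_value

z, p_value = one_proportion_z_test(p_hat=0.165, p0=0.215, n=375)
alpha = 0.05                               # assumed significance level
print(round(z, 2), round(p_value, 4), p_value <= alpha)
```

A pValue below alpha means rejecting H0 in favor of Ha; otherwise the conclusion is "fail to reject".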
hypothetical two-way
p39,p43.1 purpose is to answer complex questions
independence/dependence
2 tests:
  1. if P( other | issue ) ~= P( other ), ie. conditional prob sorta== marginal prob # can ratio smaller/larger and see how close to 1.0 as meas of equality. (P( issue ) * P( other | issue ) = P(issue and other) always holds, so by itself it is not a test.)
  2. if P( issue and other ) ~= P(issue) * P(other)
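A sketch of an independence check on a two-way table: compare P(other | issue) with P(other), and P(issue and other) with P(issue) * P(other). The counts are made up for illustration, not taken from the course.

```python
# Made-up two-way table of counts: rows are gender, columns are major.
counts = {
    ("female", "HealthSci"): 120, ("female", "InfoTech"): 80,
    ("male",   "HealthSci"): 60,  ("male",   "InfoTech"): 140,
}
total = sum(counts.values())

def p_marginal(value, position):
    """P(row value) if position == 0, P(column value) if position == 1."""
    return sum(c for key, c in counts.items() if key[position] == value) / total

p_female = p_marginal("female", 0)
p_health = p_marginal("HealthSci", 1)
p_joint = counts[("female", "HealthSci")] / total    # P(female and HealthSci)
p_health_given_female = p_joint / p_female           # P(HealthSci | female)

# Independent only if each pair printed below is (nearly) equal.
print(p_health_given_female, p_health)
print(p_joint, p_female * p_health)
```

In this made-up table the pairs differ noticeably, so gender and major would be called dependent.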
independent
(2nd wrapup to m10).
When the knowledge of the occurrence of one event A does not affect the probability of another event, B, the events are 'independent' and
P(A and B) = P(A) * P(B)
inference
module 4, diagram p38.9
the eventual goal of exploratory analysis p46.5
In inference, we use a sample to draw a conclusion about a population. p95.0
inference, types of
p83.0
confidence interval. use when goal is estimate population parameter.
hypothesis tests. use to test a claim about a population parameter
inflection point
p62sht2 TODO
IQR
p26.3, InterQuartile Range
interval
== confidence interval(?!), p83sht2(m13). I think I saw it before m13
Z-score interval...
matched pair (design)
module 19. cool stuff
margin of error, ME
mentioned p83sht2
p83sht4. report a margin of error based on the standard error.

p115,1/4: for proportions, mOfErr() = 2 * sqrt(p*(1-p)/n)
for means, mOfErr() = 2 * (stdDev==sigma) / sqrt(n)
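Both margin-of-error formulas are "2 standard errors", i.e. 2 * sqrt(p*(1-p)/n) for proportions and 2 * sigma / sqrt(n) for means. A minimal sketch; the function names and the example numbers are made up.

```python
from math import sqrt

def margin_of_error_proportion(p, n):
    """Approximate 95% margin of error for a sample proportion: 2 standard errors."""
    return 2 * sqrt(p * (1 - p) / n)

def margin_of_error_mean(sigma, n):
    """Approximate 95% margin of error for a sample mean: 2 standard errors."""
    return 2 * sigma / sqrt(n)

print(round(margin_of_error_proportion(0.5, 400), 3))   # 2 * sqrt(0.25/400) = 0.05
print(round(margin_of_error_mean(15, 100), 1))          # 2 * 15 / 10 = 3.0
```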

wikip:
a statistic expressing the amount of random sampling error in a survey's results. The larger the margin of error, the less confidence one should have that the poll's reported results are close to the "true" figures; that is, the figures for the whole population. Margin of error is positive whenever a population is incompletely sampled and the outcome measure has positive variance (that is, it varies).

The term "margin of error" is often used in non-survey contexts to indicate observational error in reporting measured quantities.

marginal proportion
p39 (module 8)
The margins are simply the Column Totals and Row Totals, outlined in heavy black. The gray area is measured data.

A lot can be simply answered by forming ratios between different entries, sometimes between 2 margin numbers, sometimes between a data entry and a margin number.
The 'fat pets' proportion = 235/1200 = 0.196

marginal percentage
p39.
Dogs in the table above were 760/1200 = 0.633 = 63.3% of the pets surveyed.
marginal probability
p39, eg: prob that a random pet is a cat: P(cat) from the above table, is 440/1200 = 0.367
See 'probability marginal' and p41,s1 top
matched pairs design
p123,sht1
math model
p78.0 has:
  • math model center: mean of the sample proportions is p, the population's proportion
  • math model spread: the sd of the sample proportions is sqrt(p(1-p) / n), the "standard error"
  • math model shape: A normal curve is good if the 'normality test' passes.
mean
aka 'expected value', 'typical value'. p22

p23.5: Use mean for center only for distributions that are reasonably symmetric with a central peak. When outliers are present, the mean is usually not a good choice; use the median instead. module 5 Mean&Median: p23.9: both mean and median are good when the histo is sort of symmetric.

model
p53; see 'math model' above.
μ, 'mu'
population's average. 'mu'; p67 1/2 .4
as opposed to the mean of a sample (xBar).
normal curve, conditions
p78.4 The 'conditions' for using a normal curve are just the 'normality test' below.
normality test
This test consists of two short calculations:
  1. np >= 10 ; or, in more detail, n * p = sampleSize_n * probability >= 10;
  2. n(1-p) >= 10
'np' is the expected number of successes and
'n(1-p)' is the expected number of the other outcome.

If these conditions are met, we can assume a 'normal model' will work and, therefore, we can use the techniques which accompany normal distributions.
example 1: If we expect 70% of a sample of 40 participants to be overweight;
np=40(0.70) is 28 which is larger than 10.
We expect 30% (1.00 - 0.70) of the 40 to not be overweight.
n(1-p) = 40(0.30) = 12 which is also larger than 10.

example 2: If you try sample size n=40, probability p=0.90, the n(1-p) calculation becomes 40(0.10)=4 which is less than 10, so the test fails.
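The two checks fit in a few lines of Python. A minimal sketch; the function name is made up, and the threshold of 10 is the one used in these notes.

```python
def normal_model_ok(n, p, threshold=10):
    """True when both expected counts, n*p and n*(1-p), reach the threshold."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    return expected_successes >= threshold and expected_failures >= threshold

print(normal_model_ok(40, 0.70))   # 28 and 12 are both at least 10
print(normal_model_ok(40, 0.90))   # 36 passes but 4 does not
```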

outliers
m4,p26.4 deviations from the pattern
module 6 the 1.5 * IQR rule
outliers, extreme. more than 3 sigma away from mean
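A sketch of the 1.5 * IQR rule: anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is flagged. It assumes Q1 and Q3 are the medians of the lower and upper halves of the sorted data; the data is made up.

```python
from statistics import median

def iqr_fences(data):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    xs = sorted(data)
    half = len(xs) // 2
    q1 = median(xs[:half])               # median of the lower half
    q3 = median(xs[len(xs) - half:])     # median of the upper half
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

ages = [21, 23, 24, 25, 26, 27, 29, 30, 58]   # made-up data
low, high = iqr_fences(ages)
outliers = [x for x in ages if x < low or x > high]
print(low, high, outliers)
```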
p
the population proportion, a parameter (population, not sample).
p0, p-zero
used in m16,p102 as probability (proportion) used in the null hypothesis
p4z
'p4z' is the name of a python function which returns a 'p' (probability) for a given Z value ("P for Z").

A person can extract it from a "low res" table, online or on paper.
Other alternatives are: Excel's NORM.DIST function, 'asking' online...
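One possible shape for such a function, using only the standard library. This is a sketch and not the statClass.py code, which is not shown in these notes; z4p is a hypothetical companion name for the reverse lookup.

```python
from statistics import NormalDist

def p4z(z):
    """Probability to the left of a z-score under the standard normal curve."""
    return NormalDist().cdf(z)

def z4p(p):
    """Inverse lookup: the z-score with probability p to its left (hypothetical helper)."""
    return NormalDist().inv_cdf(p)

print(round(p4z(1.96), 4))   # area left of z = 1.96
print(round(z4p(0.70), 4))   # z with 70% of the area to its left
```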

p-hat, pHat
The proportion for a sample, not the whole population; table p74,6/7
P( ) syntax
p39, m8
P(statement) = proportion // see p40.8, P(HealthSci given female)
P(male AND Info Tech), p41
P(girl | predict girl) // the | is same as 'given'
reviewed on p68 1/2 or's and's
parameter
p74
'a number (mean or proportion) for a population, not a sample', while a 'statistic' is the mean or proportion calculated from a sample, not a population.
if variable is categorical, parameter and statistic are proportions
if variable is quantitative, parameter and statistic are means
population (not sample): p, mu, sigma; sample (not pop): p-hat,x-bar,'s'
poisson
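(beyond the present course). a distribution for counts of events in a fixed interval; right-skewed when the average count is small. home prices and incomes are also right-skewed, but they are not counts.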
wikip, Poisson dist
probability
p40.2,p40.5, m4, value range 0 to 1
properties,rules: 1. they add. P(A or B) = P(A) + P(B) when A and B are disjoint; otherwise subtract the double-counted cell: P(A or B) = P(A) + P(B) - P(A and B), p50.4
  • P(not A) = 1 - P(A)
  • always true: P(A and B) = P(A) x P(B | A)
  • when independent: P(B | A) = P(B)
  • if P(A and B) = P(A) * P(B) then independent
probability distribution
p49.0 intro to
probability model
p74. describes the long-run behavior of sample measures.
ref'd p109, beginning. If you have a normal curve that fits your data well, the curve (mean, std dev) are a 'probability model'.
probability model is written "(mean, std dev)"; it assumes a Normal distribution.
probability types
p42sht4/4 probability, conditional; p39,p42.4 "probability of a categorical variable taking on a particular value given the condition that the other categorical variable has some particular value." eg: the percentages in <this row> are based on the condition that the student is male.

probability, conditional; p39, see 'probability, conditional' m8,p40,4 discussing it. P(HealthSci given female); female is the condition

probability, joint; p39,p42.4 eg: P(female AND HealthSci) "probability that the two categorical variables each take on a specific value. "

probability, marginal; p39,p42.4, probability of a categorical variable taking on a particular value without regard to the other categorical variable, eg: P(female).

probability, empirical
Near start of m9. vs Theoretical P
empirical P will approach theoretical P for a large sample. P(event) = rel freq from long series of repetitions. m9.9 of sht3
probability, theoretical
1st real section in m9
vs probability, empirical
probability notation
p39,p40.2 p40.5 P(female AND HealthSci)
ref 1 (https://en.wikipedia.org/wiki/Notation_in_probability_and_statistics)
sometimes written as Pr( )
proportion
TODO. explain utility when handling chapter 13 stuff.
pValue, 'P-value'
The pValue is the calculated area, the probability, lying in the tail(s) beyond the Z-score computed from the sample, assuming the null hypothesis is true. Alpha, the 'significance level', marks the cutoff: the result is statistically significant when the pValue is less than or equal to alpha. Alpha equals 1 minus the chosen confidence level and is chosen before the data is collected.
p103sht5, topic is "how to determine pValues (w/ OLI applet)".
for < (left-tailed): NORM.DIST(zScore,0,1,TRUE)
for > (right-tailed): 1 - NORM.DIST(zScore,0,1,TRUE)
for NotEqual (two-tailed): 2 * NORM.DIST(-ABS(zScore),0,1,TRUE)
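The same tail areas can be found without Excel or the applet. A minimal Python sketch; the function name and the 'left' / 'right' / 'two' labels are made up for illustration.

```python
from statistics import NormalDist

def p_value_from_z(z, tail):
    """tail is 'left' (Ha: <), 'right' (Ha: >) or 'two' (Ha: not equal)."""
    left_area = NormalDist().cdf(z)
    if tail == "left":
        return left_area
    if tail == "right":
        return 1 - left_area
    if tail == "two":
        return 2 * NormalDist().cdf(-abs(z))
    raise ValueError("tail must be 'left', 'right' or 'two'")

for tail in ("left", "right", "two"):
    print(tail, round(p_value_from_z(-1.96, tail), 3))
```

For the two-tailed case the z-score is folded to its negative side first, so the doubled area never exceeds 1.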
The term 'P-value' looks like a calculation subtracting value from P. So I prefer to write it as 'pValue'.
quartile
module 6. based on word 'quarter'
p66.7: when 'quartile' used w/ bell curve, the 1st Q is the area left such that probability is 0.25
quantitative variable
see 'variable, quantitative'
random variables
"usually written in upper case" Particular realizations of a random variable are written in corresponding lower case letters
range
modules 5, 6 range, overall, p15
relationship
p39 //correlation? 'linear function'
related variables? (issue) theme in m8.
relative frequency
TODO. used p79sht1
research question
"statistical investigations begin w/ research questions that require us to test a claim", p96.1.

examples of types:
  • average student course load less than ? quant => param is a mean. _claim_ uses word 'mean': does 'mean' course load...
  • do majority of students qualify for loans? categorical => param is a proportion. variable is 'qualify for loans'
  • do female and male students have different GPAs? compares 2 population means. quant.
  • are athletes more likely than non-ath to get advising? compares 2 popu proportions. var 'receive advice'

response variable
see 'variable, response'
risks
p42.1 probability, but for a negative outcome
p42sht1, risk reduced. neg chg in risk.
reference risk. often the path using a placebo
p42sht4 eqn at btm of page very good
robust
m19. p122,sht1/5. CIs and hypothesis tests are 'robust' if they're mostly insensitive when the conditions for use are shaky.
rossman
Rossman Chance applets. seen in class 10/29. their app, One Proportion inference
sample size, n
increasing 'n' will reduce the std deviation (aka standard error). p76.0

Increasing the sampleSize 4 times reduced the standard deviation by a factor of 2. p76.8

The lower equation shows how to set the sample size given P, and std err (or SD).

sampling distribution
p78sht1;
p109sht5: sampling distribution of xBar
related: see central limit theorem
memorize!

Spread! dif terms for Cat vs Quant. xBar for sample mean... p0 called 'p'.

SD
see Standard Deviation
sigma
see Standard Deviation
significance level, 'alpha', 'α'
p96.3sht4: "If the pValue is <= alpha, we accept Ha, the alternate hypothesis". "If the pValue > alpha, we fail to reject the null" - meaning we don't know...
p96.6 "significant difference",
??(seen in m14, m15 (?)), α-level, p96.7
p125.7, General Guidelines for Choosing a Level of Significance;
(see type errors)
If the consequences of a type I error are more serious, choose a small level of significance (α). If the consequences of a type II error are more serious, choose a larger level of significance (α).
But remember that 'alpha' is the probability of committing a type I error.
In general, we pick the largest level of significance that we can tolerate as the chance of a type I error.
SRS, Simple Random Sample
TODO
simulations
possible where the data is truly random and can be modeled without special knowledge about the data.
skew
m7,p15.1 "right skewed" means the distribution has a long tail on the right; most of the data sits on the left.
standard deviation, 'stdDev', 'SD', 'σ'
181203: Accumulated notions about 'stdDev'
Very rarely is 'stdDev' actually calculated in a typical statistics problem; rather, its value is just given to the reader as a part of a publication about a study. The information at the level of the published study can't be changed. Similar data can be collected and each glob of the new data is called a 'sample', for which a sample stdDev, 's', can be calculated. The 'stdErr' is a different thing: it describes how much a statistic (mean or proportion) varies across such samples, and is estimated as s/sqrt(n). The collection of samples retaken on the same basic source doesn't make a population stdDev.

"similar to the average deviation from the mean", p32.1, sqrt( Σ(x-xBar)²/(n-1) )
p116top, "SAMPLE stdDev", called 's', replaces σ. sigma/sqrtN becomes s/sqrtN
Note that the following differs from the graphic...

p55.3; for a random variable: sqrt( SUM( (x-mu)² * p(x) ) )

for sample proportions. see doc/pMinusP2.png, p78.3, sqrt(p*(1-p)/n)

2 stdErr eqns, sqrt(p(1-p)/n) and σ/sqrt(n), imply that σ = sqrt(p(1-p)) as shown. This is the stdDev of a variable coded 1 for success and 0 for failure.

Fascinating... my gut doesn't tell me it's true...
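A quick numeric check can help the gut along. The sketch below codes successes as 1 and failures as 0 (made-up data with p = 0.3) and compares the ordinary population standard deviation of those values with sqrt(p(1-p)).

```python
from math import sqrt
from statistics import pstdev

# Code each observation as 1 (success) or 0 (failure); made-up data with p = 0.3.
outcomes = [1] * 3 + [0] * 7
p = sum(outcomes) / len(outcomes)

sd_from_data = pstdev(outcomes)        # population std dev of the 0/1 values
sd_from_formula = sqrt(p * (1 - p))

print(round(sd_from_data, 4), round(sd_from_formula, 4))
```

The two numbers agree, which is why the proportion formula is just sigma/sqrt(n) with sigma = sqrt(p(1-p)).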

From wikip:

... a measure that is used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.

In addition to expressing the variability of a population, the standard deviation is commonly used to measure confidence in statistical conclusions. For example, the margin of error in polling data is determined by calculating the expected standard deviation in the results if the same poll were to be conducted multiple times. This derivation of a standard deviation is often called the "standard error" of the estimate or "standard error of the mean" when referring to a mean. It is computed as the standard deviation of all the means that would be computed from that population if an infinite number of samples were drawn and a mean for each sample were computed.

It is very important to note that the stdDev of a population and the stdErr of a statistic derived from that population (such as the mean) are quite different but related (related by the inverse of the square root of the number of observations). The reported margin of error of a poll is computed from the standard error of the mean (or alternatively from the product of the standard deviation of the population and the inverse of the square root of the sample size, which is the same thing) and is typically about twice the standard deviation—the half-width of a 95 percent confidence interval.

standard error, 'stdErr', 'SE', 's'
(m12,78) "the std dev of the sample propors, sqrt(p(1-p)/n) [where p is the mean of the sample propors], is also called the std err"
has the same basic idea as 'standard deviation' but this term, 'stdErr', measures the spread of a statistic (mean or proportion) across samples, while 'standard deviation' measures the spread of individual values.

p115,2/4,btm: "b/c normal model, 95% is within 2 stdDevs", so use 2 stdErrs (1 stdErr = sigma/sqrt(n))

From wikip:

The standard error (SE) of a statistic (usually an estimate of a parameter) is the standard deviation of its sampling distribution or an estimate of that standard deviation. If the parameter or the statistic is the mean, it is called the standard error of the mean (SEM).

The sampling distribution of a population mean is generated by repeated sampling and recording of the means obtained. This forms a distribution of different means, and this distribution has its own mean and variance. Mathematically, the variance of the sampling distribution obtained is equal to the variance of the population divided by the sample size. This is because as the sample size increases, sample means cluster more closely around the population mean.

Therefore, the relationship between the standard error and the standard deviation is such that, for a given sample size, the stdErr = pop.σ / sqrt(n=smpl size). In other words, the standard error of the mean is a measure of the dispersion of sample means around the population mean.

standard normal distribution
p65.0: "distribution of z-scores is also a normal density curve" their example shows 2 bell curves, one w/ x axis measured in foot length, the other curve in SD. and it shows the corresponding areas (probabilities) are equal.
statClass.py
python code. many functions, far from done
statistic
p74.7
"a single measure of some attribute of a sample (e.g. its arithmetic mean value). It is calculated by applying a function (statistical algorithm) to the values of the items of the sample, which are known together as a set of data".
wikip
statistical significance
p96.6
== "statistically significant" == "significantly different"
"When the pValue is less than (or equal to) 0.05, we also say that the difference between the actual sample statistic and the assumed parameter value is statistically significant. In the previous example, the pValue is less than 0.05, so we say the difference between the sample mean (75 MB) and the assumed mean from the null hypothesis (62 MB) is statistically significant. You will also see this described as a significant difference. A significant difference is an observed difference that is too large to attribute to chance. In other words, it is a difference that is unlikely when we consider sampling variability alone. If the difference is statistically significant, we reject H0."
success
p78.2 true when datum matches desired category (eg 'female').
"category of interest" is their definition.
I would prefer 'matches' over 'successes'.
T-model
m19, p115 "Student's T" distribution is wider and, generally lower, than the Normal curve.
If population stdDev unknown, must use T-model (?)
If sample size small, < 30, check for outliers. And consider disclaimer "On the basis of the sample, we are assuming that the variable is distributed without strong skew or extreme outliers in the population. The conclusion from this test is valid only if this assumption is true". p122,sht3

wikip notes that it's used "when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown."

See the Excel entry, which describes the relevant functions.
example
n = 40  # df=39, 1 less
sigma not given (in 1st of 2 examples)
so, w/o sigma, use T's
xBar +/- t * s/sqrt(n)  # sqrt(n), not sqrt(df)
# need 't' for 95% conf so each of 2 tails has 2.5% or 0.025
T.INV(0.025, 39) => -2.023; use |t| = 2.023, which is greater than 1.96, the Z for 95% conf
Something = xBar +/- 2.023 * 0.3/sqrt(40)
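The last line of the example can be run as a short Python sketch. The t value is passed in from a table or Excel, since the standard library has no T distribution; s = 0.3, n = 40 and t = 2.023 come from the example, while x_bar = 5.0 is made up because the example does not give it.

```python
from math import sqrt

def t_interval(x_bar, s, n, t_crit):
    """Confidence interval for a mean: x_bar +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# s, n and t_crit are from the example above; x_bar = 5.0 is a made-up value.
low, high = t_interval(x_bar=5.0, s=0.3, n=40, t_crit=2.023)
print(round(low, 3), round(high, 3))
```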
T-score
online calc takes df and significance level (typ 0.05)
T-table
This printable table from wikia.com can be used to perform calculations w/o Excel or other such technologies. A printed copy of this table is kept with the small folder of calculation aids.
TODO. write up how to use... SOON build table into statClass.py
Online calculator gives 'p4tdf' probability for T-x, df (degrees of freedom, n-1)

Student's t-distribution Calculator

The T function math

T-test
m19
"The t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.", wikip.

"A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistics (under certain conditions) follow a Student's t distribution. The t-test can be used, for example, to determine if two sets of data are significantly different from each other." wikip

table margins
p39. Has nothing to do with data being 'marginal'.
This simply refers to values appearing in the margins, on the edges, of a table, eg: sums across and down. Row Totals, Column Totals.
Tc
p117 "critical T-value = output of table(df, confidence level)" or python table
two-way table
p38, tbl showing 2 issues, one horz, one vert, w/ margins.
initially for analyzing relationship between 2 categorical vars (p39 top)
explanatory values vert, response values horz; see p38sht3
summarize'd. m8,p39sht/9 ...'conditional percentages'
two-way table, hypothetical
p43sht1 built fm other, known P's. eg: P that person has disease a test suggests...
type errors
Type I Error: H0 is true but we accepted Ha // mistakenly rejected H0 (and accepted Ha)
Type II Error: H0 is false but we failed to reject it // mistakenly thought H0 good. p125,m19 (?)
typical
within 1 std dev of the mean. see 'unusual'
uniform
p15 not much variability
unusual
more than 2 std deviations away from the mean. syn: "surprising"
variability
p15.4, p35.1
variable, categorical
p39 eg: bodyImage={underwt,overwt,ok_wt}, gender{m,f}
variable, discrete random variable
"When the range of X is finite or countably infinite, the random variable is called a discrete random variable". Term seen in discussions on standard deviation. wikip
variable, explanatory
p39.5, gender. say, choose value 'male' // must ID this var. "want totals fm it to calc %'s " see also 1st page of p42. treatment =
variable, quantitative
p39 eg: #students in class
variable, response
p39.5 eg: comparing body image from explanatory variable (eg 'male') // p42, 1st. {heart attack, not}
work-backwards problems
p66.5: given z-score, find X.
Use app or table to convert P to Z. Then use Mean and SD to make X. eg: XL sock fits largest 30% of men. What's smallest foot len which fits XL.
  1. note foot lengths. mean 11", SD 1.5"
  2. mark off P=0.70 on app's window; find Z = 0.5244 printed
  3. [ 0.5244 * (SD=1.5") = 0.79"] + [mean==11"] = 11.79"
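The sock problem can be checked in Python without the applet. A minimal sketch using the mean and SD from the example; the variable names are made up.

```python
from statistics import NormalDist

foot_length = NormalDist(mu=11.0, sigma=1.5)   # inches, from the sock example

# The largest 30% of feet start where 70% of the area lies to the left.
z = NormalDist().inv_cdf(0.70)
x_from_z = 11.0 + z * 1.5              # X = mean + Z * SD
x_direct = foot_length.inv_cdf(0.70)   # same answer in one step

print(round(z, 4), round(x_from_z, 2), round(x_direct, 2))
```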
xBar, Xbar, x-bar
this is another 'mean value'. just for samples (?). therefore this is a 'statistic'. The mean for the population is μ , mu.
Zc
p91sht2/4 btm.
'Zc' was introduced in m14. Note that 'stdErr' is the familiar standard error (for samples. using proportions, not quantitative).

The 'c' subscript on the Z is to emphasize that the z-score is connected to the confidence interval. The following from p92,1/4

Confidence   Zc
90%          1.645
95%          1.960
99%          2.576
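The Zc values can be regenerated for any confidence level. A minimal sketch; the function name is made up.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Zc such that the central `confidence` area lies between -Zc and +Zc."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}  {z_critical(level):.3f}")
```

The (1 + confidence) / 2 step converts the central area into the area to the left of +Zc, which is what the inverse lookup expects.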

z-scores
181125 note: TODO: eqn at p65.9 below doesn't show the eventual eqn (m12 or so). Interesting that, on p79sht3, the text shows Z as (statistic-parameter)/stdErr. The text up to this pt (?) has been calling the numerator 'pHat - p' // pHat - p0 (?). Calling the numerator terms 'statistic' and 'parameter' may reveal the path to quant problems...

The difference between pHat (the sample proportion) and p (the population proportion, which is the mean of the sampling distribution) is scaled by the std error (the denominator) to produce 'Z' values.
p65.9: z = (x - mu)/sigma ; x = mu + z * sigma

p65.5 shows zscores used as 'flags' and the prob is the AREA between them. And there's an app. Good sample problems p66sht6, towards the end

p64sht2/5,btm; the _UNITS_ are std deviations. z = (value-mean)/standard deviation
(88 - (mean==82))/5 = 6/5 = 1.2
-2.4 z-score * (sd==5) + (mean==82) = -12 + 82 = 70
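Both directions of the z-score conversion, as a minimal Python sketch using the numbers from the p64 example (mean 82, sd 5); the function names are made up.

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def x_from_z(z, mean, sd):
    """Undo a z-score: convert back to the original units."""
    return mean + z * sd

print(z_score(88, 82, 5))       # (88 - 82) / 5
print(x_from_z(-2.4, 82, 5))    # -2.4 * 5 + 82
```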

see applet, p65sht4,top

Conventions used here

My references to page numbers in the CMU materials will be most valuable if the reader signs up for the class online. It cost me $20, a pittance given the depth and breadth of the material.

(My) inconsistent page numbering conventions:
"m12" refers to 'module number 12'
Note that a numbered 'page' in the course material generally consists of more than one of what we'd call 'pages'. The '91' in p91 is OLI's page number and you can type in '91' when you want to see that page. But page 91, if printed, generates several sheets of paper. 'p91sht3' would be the 3rd printed page of p91. At times I just noted that something was 30% or so thru the material and I'd note it as p96.3 . I believe that sometimes meant 30% down a sheet, other times 30% of the way thru all of p96's material.
"p91" points to page 91
"p96.3" is page 96 about 30% thru that page (or material)
"p96sht2" is printed sheet 2 of page 96


'sqrt' means 'square root'
'*' means multiplication, (not 'x').
np is two variable names, 'n' and 'p'. Putting them together implies multiplication, that is, 'np' is equivalent to 'n * p'.

my Review

I think CMU's materials for this class were 'the best' at teaching the nuts and bolts, but several students had trouble seeing the 'big picture' (me included). The online text did present big picture ideas, but figuring out the best approach to a particular problem turned out to be difficult (imho). I wrote other 'calc' file(s) to support true calculation sequences. If they look usable, I'll provide links to those too.

I coded some of the calculation functions in python without any plan to really base an app on the functions. I felt this coding effort would force me to confront the operations at a more detailed level.

The feature of the course materials I liked the most was the frequent 'tests' to see that the reader had comprehended the just-finished passage. I was very impressed to see and use that feature.

The alphabetic organization is just my preference when collecting notes before I see any other, more favorable organization. This organizational bias also comes from decades of writing software, where the definition of the terms was all-important.

The Unit and Module layout of the course

references:
ref1. wikip, 'notation'...
ref2. handout, week ending 9/28

#Unit 1: Introduction to Concepts in Statistics Course
module01 Intro to Concepts in Statistics Course
module02 Learning Strategies
module03 The Big Picture
----------------
#Types of Statistical Studies and Producing Data.  ?
#Unit 2: Summarizing Data Graphically and Numerically
module04 Distributions of Quantitative Data
checkpoint: Distributions of Quantitative Data
----------------
module05 Measures of Center
module06 Measures of Spread about the Median
checkpoint: Quantifying Variability Relative to the Median
----------------

:===== =============================================================

ref p115 top says this is similar to the stuff in m18.
"when we used a sample proportion to estimate a population propor

module07 Quantifying Variability Relative to the Mean
checkpoint: Quantifying Variability Relative to the Mean
checkpoint: Summarizing Data Graphically and Numerically


Unit 3: Relationships in Categorical Data with Intro to Probability
module08 Two-Way Tables
p38
checkpoint: Relationships in Categorical Data with Intro to Probability



:===== =============================================================

Unit 4: Probability and Probability Distributions
module09 Probability and Distributions, pp46-57
checkpoint: Probability and Probability Distributions
----------------
module10 Continuous Random Variables, pp59-68
checkpoint: Continuous Random Variables
checkpoint: Probability and Probability Distributions
----------------
Unit 5: Linking Probability to Statistical Inference
module11 Intro to Statistical Inference, pp70-72
p70
p71
p72
module12 Distribution of Sample Proportions, parameters vs statistics, pp74-81
    categorical vars. so parameters are proportions
    testing a 'claim' //eg: "a majority of students qualify for loans"
p74 parameters vs statistics
p75
p76
p77
p78
p79
p80
p81 wrap up

module13 Intro to Statistical Inference, pp83-87
----------------
Unit 6: Inference for One Proportion
module14: Estimating a population proportion, pp89-93
module15: hypothesis testing
p95  distinguish 1 popu mean, 1 popu proportion,2 popu means, or 2 popu proportions
     h0,ha, forming, stating
     null hyp
p96  determine hyps,collect,assess,state
     teenager smart phone usage over time, alpha, 40 hour work week.
     blacks and marijuana, student study time per week.
p97  steps (again). a little more detail (?)
     yellow boxed steps
p98  pValue (more).
     obama, death penalty popularity, tea party, portion of students who work,
     choosing the level of significance,
p99  2 types of errors. vv_type_errors
     data usage on smart phones, obama, cell phones and brain cancer, telepathy,

module16: hypothesis test for a population proportion
p102 defines, clarifies claims for this section.
     shows 1 or 2 populations, show quant==means vs cat==proportions
p103 pValue calc; determining H0 Ha, 'test statistic'== z-score
     summarizes steps in making a Hyp
p104 Hyp test; more on pValues
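
The p103/p104 step of turning a test statistic (z-score) into a pValue can be sketched as below. This is my own helper, not OLI's procedure verbatim; the 'alternative' labels ("less", "greater", "two-sided") are names I made up for the three forms of Ha.

```python
from statistics import NormalDist


def p_value(z, alternative="two-sided"):
    """pValue for a z test statistic under the standard normal curve.
    `alternative` is the direction of Ha (labels are my own)."""
    std_normal = NormalDist()
    if alternative == "less":       # Ha: parameter < null value
        return std_normal.cdf(z)
    if alternative == "greater":    # Ha: parameter > null value
        return 1 - std_normal.cdf(z)
    if alternative == "two-sided":  # Ha: parameter != null value
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")


alpha = 0.05
p = p_value(2.1, "greater")  # hypothetical z-score
print(round(p, 4), "reject H0" if p < alpha else "fail to reject H0")
```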

----------------
Unit 7: Inference for Means
module17: Distribution of Sample Means
p108 intro
p109sht1,2 birth wts. categorical = LOW_WT/not; quant is weight.
     #s used to find confidence interval to estimate propor for country.
p110 sigma/sqrt(n)
     pell grants. $2600 = mean, sd = $400. se = sigma/sqrt(n=20)
p111 sample means drawn from skewed data are approx normally distributed (given large enough samples)
     central limit theorem. sample size 30 usually enuf
     Pell grants.
p112 intervals of Z scores
     basketball player hts. normally dist.
     converting Z's to P statement
     teacher salaries. salaries skewed.
p113
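
The p110 note (se = sigma/sqrt(n)) can be sketched with the Pell grant numbers above (mean $2600, sd $400, n = 20). The function name and the $2700 cutoff are my own, and treating the sample means as normally distributed is an assumption here: n = 20 is below the 'usually enuf' 30, so it would need the grant amounts themselves to be roughly normal.

```python
from math import sqrt
from statistics import NormalDist


def std_error_of_mean(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)


se = std_error_of_mean(400, 20)
print(round(se, 2))

# Sampling distribution of the mean, ASSUMED normal (see note above).
sample_means = NormalDist(mu=2600, sigma=se)

# Hypothetical question: chance that a sample mean exceeds $2700.
print(round(1 - sample_means.cdf(2700), 4))
```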

module18: Estimating a Population Mean
module19: Hypothesis Test for a population mean
module20: Inference for a difference between population means
----------------
Unit 8: Inference for Two populations
module21: Distribution of differences in sample proportions
module22: Estimate the difference between population proportions
module23: hypothesis test for a difference in population proportions
----------------