Statistics
'Statistics' class notes, Laney College, Fall 2018
These notes, sometimes incomplete, were taken from a lectured class based
on one of CMU's OLI classes:
CMU - Carnegie Mellon University
OLI - Open Learning Initiative
The purpose of these notes is to help the writer (me) retain the
information and to provide support for 'reentry', the moment in life
where I really need to apply parts of this.
Attribution: many of the diagrams are images captured from the CMU
material. Some of the supporting links point to Wikipedia, which I $upport
monthly. I don't believe any of this material comes directly from Dr. Liang's
fine lectures.
See Conventions
See My Review of the course.
See Layout of course materials
 ADM
 p31.3, Average Deviation from the Mean
 alpha, α
 α is a synonym for significance level.
This value is compared to the pValue to determine whether the
alternate hypothesis is supported by the collected data. The significance level is chosen early, during the
design of the sampling effort. A smaller pValue is a stronger indication that the issue
sought by the effort is true. A common value for alpha is 0.05, for 95% confidence (TODO check this)
 applet
 TODO: THERE ARE SEVERAL APPLETS. Show typical link. calc,gen model data...
This 'applet' is frequently used in the OLI material to find the probability
corresponding to a known Zscore.
I strongly recommend that the reader not get too attached to
the applet, which 'goes away' at the end of the class. Alternatives to
the applet can be found below in the discussion about 'p4z'.
Useful locations of the applet: App p65sht4
 apps

Pinot:: ../p/slideShow/drafts/normalDistrib.xlsx (also on thumb drive ??and yahoo website??) TODO
 1st block (upper left) calcs P left of 'x value'
 2nd block "Normal Dist, Between x's.
uses mean, std dev from A4,A5. Uses A11,A12 for xLeft, xRight
 3rd block. F3 to F7. calcs 'x value' given probability,mean,std dev
p66,1/7,top std normal curve. find P for any normal curve; pages > 1/7 show op
 association
 p39
"association (statistics)"
Correlation.., wikip
"In statistics, dependence or association is any statistical relationship, whether causal
or not, between two random variables or bivariate data. In the broadest sense correlation
is any statistical association, though in common usage it most often refers to
how close two variables are to having a linear relationship with each other."
 bellshaped

p15. aka 'symmetric with central peak'
 categorical variable
 see variable, categorical
The name of the entire study, "Cereals", p13, is not a variable at all.
 central limit theorem
 p111.4
"for large samples, the sampling distribution of
sample means is approx normal". Even if the original data is skewed
In practical terms this seems to mean that, when dealing with the means of samples,
a set of 30 samples is usually all that's needed to have a normal distribution.
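A quick simulation can illustrate this (my own sketch, stdlib only; the exponential population and the sample size n=30 are assumptions, not from the course): means of samples drawn from a right-skewed distribution come out centered on the population mean with spread close to sigma/sqrt(n).

```python
# Sketch (mine, not from the course): sampling means from a skewed
# (exponential) population; with n=30 the distribution of means is
# already approximately normal.
import random
import statistics

random.seed(1)

def sample_mean(n=30):
    # exponential with rate 1 has mean 1 and sd 1, and is right-skewed
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

means = [sample_mean() for _ in range(2000)]
print(round(statistics.fmean(means), 2))  # close to the population mean, 1.0
print(round(statistics.stdev(means), 2))  # close to sigma/sqrt(n) = 1/sqrt(30) ~ 0.18
```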
 CI
 CI='confidence interval'
 claim
 p75sht1 SOON
relates to 'hypothesis'
 conditional distributions
 p39
 conditional percentages
 p39.6 calc'd separately each val of variable, explanatory.
 confidence interval

A "C.I.", "CI", or "Confidence Interval" can be described as (xmin, xmax) or as (x_middle +/ amount).
'xmin', 'xmax' bound the range of interest. see top of p83sht5 for eg
Example: (35.0 , 42.5) or ( low bound, high bound ) and there's also a percentage value
given with the interval, eg: 95%, as in "95% of any samples of the population data will give
values falling between 35.0 and 42.5".
p83.0 "REQUIRES a random sample"; p91sht2/4 btm.
(p115) "We are 95% conf that the mean SAT math score in this state is between 467.2 ..."
The Z_{c} zscore, introduced in m14, is about the 'c' (=confidence interval).
'stdErr' is like stdDev but for samples.
examples:
 p83sht2 "1508 adults, margin of error 2.5% points, 95% confidence"
 p83sht3,"learn by doing". resultant confidence interval was "(27%+4%) or (23%, 31%)" //don't overthink it !
 p83sht3,"did I get this". (21% + 0.2%) or (20.8%,21.2%) at 95%
 degrees of freedom
 m19, p116sht1 btm; 'df'; df = n-1
shows up in problems using the Tmodel
 dependent, independent
 p51
 disjoint
 module 10 (?). when events A and B have no outcomes in common, they are disjoint.
if disjoint: P(A or B) = P(A) + P(B)
 distributions
 module 4
p14 early
 dotplots
 p14.2
There's a dot for every measurement; whereas, a histogram shows 'bins'
dotplot better than histo at showing shape, center, spread. //p17.0, histo p17.2
 doubleblind experiment
 p39, test info hidden from participant to reduce or eliminate bias
wikip, subjects and test conductors uninformed
 empirical rule
 for normal curves. p62.8, p63.0
 0.68 within 1 sd
 0.95 within 2 sd, more precisely 1.96 sd
 0.997 within 3 sd
(the 1.645 and 2.575 I had noted here are the z-scores for 90% and 99% confidence, not part of the empirical rule)
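The empirical rule can be checked exactly (my sketch, not from the course): for a normal curve, the probability within k standard deviations of the mean is erf(k/sqrt(2)).

```python
# Exact central probability for a normal curve, within k std deviations.
import math

def prob_within(k_sd):
    return math.erf(k_sd / math.sqrt(2))

print(round(prob_within(1), 2))   # 0.68
print(round(prob_within(2), 2))   # 0.95  (exactly 0.95 needs k=1.96)
print(round(prob_within(3), 3))   # 0.997
```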
 Excel's functions
 there's an exercise at the end of histo 3/4
TODO: point to slideShow
NORM.DIST(_,____,______,0) don't use 0 as last arg (0/FALSE returns the curve height, not a probability). see sideBySide.xlsx right side
NORM.DIST(x,mean,stdDev,1) returns P(X <= x) given mean and stdDev, for a normal curve. last arg 1/TRUE means cumulative
NORM.DIST(135,100,15,1) returns 0.990185

NORM.INV(0.990185,100,15) rets 135, inverse of NORM.DIST call above.

T.DIST(x,df,TRUE) returns P(X <= x), left-tailed, for Student's T distribution; FALSE returns the curve height
T.DIST.2T(x,df) returns P for the two-tailed Student's T distribution
T.DIST.RT(x,df) returns the right-tailed Student's T probability
T.DIST(1.96, 18, 1) gave 0.9672
T.DIST(1.96, 18, 0) gave 0.0626
T.DIST(1.96, 18, 2) "
T.DIST(1.96, 18, 3) "
(the third argument is 'cumulative', a TRUE/FALSE flag, not a tails count; any nonzero value is treated as TRUE, which is why 1, 2, and 3 all return 0.9672)

T.INV(probability, deg_freedom ) "returns left tail" t = -2.10 where z was -1.96
T.INV(0.025, 18 ) gave -2.10092
T.INV(0.025, 39 ) gave -2.02269

T.TEST(array1, array2, tails, type )

AVERAGE(cell_a:cell_b) averages the range from 'a' to 'b', inclusive
STDEV(cell_a:cell_b) sample standard deviation of the range from 'a' to 'b', inclusive
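Python's standard library has equivalents of the NORM functions above (my sketch; statistics.NormalDist needs Python 3.8+):

```python
# stdlib equivalents of the NORM.DIST / NORM.INV calls noted above.
from statistics import NormalDist

nd = NormalDist(mu=100, sigma=15)
p = nd.cdf(135)       # like NORM.DIST(135,100,15,1)
x = nd.inv_cdf(p)     # like NORM.INV(p,100,15), the inverse call
print(round(p, 4))    # ~0.9902
print(round(x, 1))    # 135.0
```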
 expected value
 p54.7, same as mean.
 explanatory variable
 see 'variable, explanatory' TODO: link
 exploration, cycle of
 module 4, see diagram p38.9
 exploratory data analysis
 mentioned p46.5, 2nd part of CopyCat case p71.0 top // cheating
 fivenumber summary
 module 6, == minimum, quartile 1, median, quartile 3, and maximum
 given
 p43.3
p79,3/7: 'given' is the word for the vertical bar in P(pHat >= 0.15 | p = 0.10)
 histograms
 m4, link frequency, relative frequency.
made fm (link dotplot), see p16.4 especially 97 dots becoming ht of a bin
changing bin size, start point can distort histo. p17, histo p17.4
avoid 'pancake', 'skyscraper'
 hypothesis
 p39 ??,
reading m15 now but the term has been around ?since m13?.
Early m15 notes that 'claims' will morph into hypotheses.
 hypothesis, alternative, Ha
 p95.5.
hypothesis about the value of the parameter.
Claim in the research question about the value of the parameter.
The alternative hypothesis says the parameter is “greater than” or
“less than” or “not equal to” the value we assume to be true in the null hypothesis.
Examples
p95.8 The proportion of smokers among adults who have a degree < 0.22
WHEN AN EXPERIMENT IS DEFINED, THERE IS BOTH A NULL AND AN ALTERNATIVE
"WE CAN EITHER REJECT THE NULL HYP OR FAIL TO REJECT IT. WE NEVER CONFIRM NULL", P96.8
eg: 96.3, H0 "still 62MB", Ha "> 62MB"
How likely it is that in a sample of 375 we find that as low as 16.5% have used marijuana,
when the true rate is actually 21.5%. P96.8
How likely it is in a random sample of 1,500 students to observe students
studying an average of at most 27 or at least 33 hours per week outside
of class, if the mean number is actually 30 hours per week.
 hypothesis, conclusion
 p96.4.
good to include the pValue in the conclusion.
"the data do not provide signif evidence that the propor of comm colleges without X is lessThan 25%".
p97sht2
 hypothesis, null, 'H0'
 p95.5
hypothesis about the value of the parameter.
We assume the null hypothesis true then see if we can reject it.
"We never accept the null hypothesis or state that it is true.", p96.7
A 'true' result means there is (effectively) no relation between the defined parts.
A 'false' null hyp means there is a relation.
"pValue" == 96.4: how much variability to expect in random samples when the null hyp true
Examples
p95.7 The proportion of smokers among adults who have a degree == 0.22
p95.8 The mean IQ of Raider fans is the same as Niner fans.
The null hypothesis is a general statement or default position that
there is no relationship between two measured phenomena, or no
association among groups...
...the field of statistics gives precise criteria for rejecting a
null hypothesis.
null intro, wikip
null details, wikip
 hypothesis testing
 p95.8, 96.1, p95.0
confidence interval: wrt population parameter(s)
estimate value or difference in popu param
hypothesis test: wrt population parameter(s)
test a claim about a population parameter(s) or difference in them
The process of forming hypotheses, collecting data, and using the data
to draw a conclusion about the hypotheses. summary p97sht3/
 get or make the research question
 determine hypotheses, 'null' and 'alternate'
 collect the data. (random sample. calc statistic (mean or proportion).
Formulate the exact test, using the statistic obtained from the data, eg:
"How likely is it that, in a sample of 'n', we'll find that (as much, at least etc)
the statistic is true while the null's value is also true"
 assess the evidence
 state a conclusion, ie. accept Ha or conclude "don't know"
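The steps above can be sketched as a one-proportion z-test (my code, not the course's; the numbers are the marijuana example from these notes: pHat=0.165, p0=0.215, n=375, with a left-tailed Ha):

```python
# Sketch of a one-proportion z-test following the steps above.
# pHat=0.165, p0=0.215, n=375 are the notes' marijuana example.
import math
from statistics import NormalDist

def one_prop_ztest(p_hat, p0, n):
    std_err = math.sqrt(p0 * (1 - p0) / n)  # spread assuming the null is true
    z = (p_hat - p0) / std_err
    p_value = NormalDist().cdf(z)           # left-tailed (Ha: p < p0)
    return z, p_value

z, p_value = one_prop_ztest(0.165, 0.215, 375)
print(round(z, 2), round(p_value, 3))       # z ~ -2.36, pValue ~ 0.009
```

With alpha = 0.05, a pValue this small means we reject the null hypothesis.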
 hypothetical twoway
 p39,p43.1 purpose is to answer complex questions
 independence/dependence
 2 tests:
 if P( issue ) * P( other | issue ) ~= P(issue and other)
marginal prob * conditional prob sorta== joint Prob
# can ratio smaller/larger and see how close to 1.0 as meas of equality
 if P( issue and other ) ~= P(issue) * P(other)
 independent
 (2nd wrapup to m10).
When knowledge of the occurrence of one event, A, does not affect the
probability of another event, B, the events are 'independent' and
P(A and B) = P(A) * P(B)
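A tiny check of the multiplication rule (my sketch; the probabilities are made up, chosen so the events come out independent):

```python
# Demo of the independence test: P(A and B) ~= P(A) * P(B).
import math

p_a, p_b = 0.5, 0.4   # marginal probabilities (made up)
p_joint = 0.20        # joint probability (made up)

print(math.isclose(p_joint, p_a * p_b))  # True -> independent
print(p_joint / (p_a * p_b))             # ratio test: close to 1.0
```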
 inference
 module 4, diagram p38.9
the eventual goal of explanatory analysis p46.5
In inference, we use a sample to draw a conclusion about a population. p95.0
 inference, types of
 p83.0
confidence interval. use when goal is estimate population parameter.
hypothesis tests. use to test a claim about a population parameter
 inflection point
 p62sht2
TODO
 IQR
 p26.3, InterQuartile Range
 interval
 == confidence interval(?!), p83sht2(m13).
I think I saw it before m13
Zscore interval...
 matched pair (design)
 module 19. cool stuff
 margin of error, ME
 mentioned p83sht2
p83sht4. report a margin of error based on the standard error.
p115,1/4: for proportions, mOfErr() = 2 * sqrt(p*(1-p)/n)
for means, mOfErr() = 2 * (stdDev==sigma) / sqrt(n)
wikip:
a statistic expressing the amount of random sampling
error in a survey's results.
The larger the margin of error, the less confidence one should have that the poll's
reported results are close to the "true" figures; that is, the figures for the whole
population. Margin of error is positive whenever a population is incompletely sampled
and the outcome measure has positive variance (that is, it varies).
The term "margin of error" is often used in nonsurvey contexts to indicate
observational error in reporting measured quantities.
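Both margin-of-error formulas above as python (my sketch; the "2" is the rough 95% z, and sigma=15, n=100 in the means example are made-up numbers):

```python
# The two margin-of-error formulas from p115,1/4.
import math

def moe_proportion(p, n):
    # for proportions: 2 * sqrt(p(1-p)/n)
    return 2 * math.sqrt(p * (1 - p) / n)

def moe_mean(sigma, n):
    # for means: 2 * sigma / sqrt(n)
    return 2 * sigma / math.sqrt(n)

print(round(moe_proportion(0.5, 1508), 3))  # ~0.026, near the 2.5% of the poll example
print(round(moe_mean(15, 100), 1))          # 3.0 (sigma=15, n=100 are made up)
```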
 marginal proportion
 p39 (module 8)
The
margins are simply the Column Totals and Row Totals, outlined
in heavy black. The gray area is measured data.
A lot can be simply answered by forming ratios between
different entries, sometimes between 2 margin numbers, sometimes
between a data entry and a margin number.
The 'fat pets' proportion = 235/1200 = 0.195
 marginal percentage
 p39.
Dogs in the table above were 760/1200 = 0.633 = 63.3% of the pets surveyed.
 marginal probability
 p39,
eg: prob that a random pet is a cat: P(cat) from the above table, is 440/1200 = 0.367
See 'probability marginal' and p41,s1 top
 matched pairs design
 p123,sht1
 math model
 p78.0 has:
 math model center: mean of the sample proportions is p, the population's proportion
 math model spread: A sample's sd ('s') is sqrt(p(1-p) / n) "standard error"
 math model shape: A normal curve is good if the 'normality test' passes.
 mean
 aka 'expected value', 'typical value'. p22
p23.5: Use mean for center only for distributions
that are reasonably symmetric with a central peak.
When outliers are present, the mean is usually not a good choice  over using the median.
module 5 Mean&Median:p23.9: both mean and median are good. histo sort of sym.
 model
 p53; see 'math model' above.
 μ, 'mu'
 population's average. 'mu'; p67 1/2 .4
as opposed to the mean of a sample (xBar).
 normal curve, conditions
 p78.4
The 'conditions' for using a normal curve are just the 'normality test' below.
 normality test

This test consists of two short calculations:
 np >= 10 ; or, in more detail, n * p = sampleSize_n * probability >= 10;
 n(1-p) >= 10
'np' is the expected number of successes and
'n(1-p)' is the expected number of the other outcome.
If these conditions are met, we can assume a 'normal model' will work and, therefore,
we can use the techniques which accompany normal distributions.
example 1: If we expect 70% of a population of 40 participants to be overweight;
np = 40(0.70) = 28, which is larger than 10.
We expect 30% (1.00 - 0.70) of the 40 to not be overweight.
n(1-p) = 40(0.30) = 12, which is also larger than 10.
example 2: If you try sample size n=40, probability p=0.90, the
n(1-p) calculation becomes 40(0.10)=4 which is not larger than 10.
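The two checks can be wrapped in one function (my sketch), using examples 1 and 2 above:

```python
# The normality test: both expected counts must be at least 10.
def normal_model_ok(n, p):
    return n * p >= 10 and n * (1 - p) >= 10

print(normal_model_ok(40, 0.70))   # True:  np=28, n(1-p)=12
print(normal_model_ok(40, 0.90))   # False: n(1-p)=4
```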
 outliers
 m4,p26.4 deviations from the pattern
module 6 the 1.5 * IQR rule
outliers, extreme. more than 3 sigma away from mean
 p
 parameter (population, not sample).
 p0, pzero
 used in m16,p102
as probability (proportion) used in the null hypothesis
 p4z
 'p4z' is the name of a python function which
returns a 'p' (probability) for a given Z value ("P for Z").
A person can extract it from a "low res" table, online or on paper.
Other alternatives are: Excel's NORM.DIST function, 'asking' online...
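A minimal p4z in the spirit described (my own sketch, not the author's actual statClass.py function), built on the stdlib error function:

```python
# "P for Z": area under the standard normal curve left of z.
import math

def p4z(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(p4z(1.96), 4))   # 0.975
print(round(p4z(0.0), 2))    # 0.5
```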
 phat, pHat
 The proportion for a sample, not the whole population; table p74,6/7
 P( ) syntax
 p39, m8
P(statement) = proportion // see p40.8, P(HealthSci given female)
P(male AND Info Tech), p41
P(girl | predict girl) // the | is the same as 'given'
reviewed on p68 1/2: or's, and's
 parameter

p74
'a number (mean or proportion) for a population, not a sample.'
while a 'statistic' is the mean or proportion calculated from a sample, not a population'
if variable is categorical, parameter and statistic are proportions
if variable is quantitative, parameter and statistic are means
population (not sample): p, mu, sigma; sample (not pop): phat,xbar,'s'
 poisson
 (beyond the present course). for rightskewed
distributions. home prices, incomes
wikip, Poisson dist
 probability
 p40.2,p40.5, m4, value range 0 to 1
properties,rules: 1. they add. P(A or B) = P(A) + P(B) except when they start to double count a cell, p50.4
 P(not A) = 1 - P(A)
 always true: P(A and B) = P(A) * P(B | A)
 when independent: P(B | A) = P(B)
 if P(A and B) = P(A) * P(B) then independent
 probability distribution
 p49.0 intro to
 probability model

p74. describes the longrun behavior of sample measures.
ref'd p109, beginning. If you have a normal curve that fits your data well,
the curve (mean, std dev) are a 'probability model'.
probability model is written "(mean, std dev)"; it assumes a Normal distribution.
 probability types
 p42sht4/4
probability, conditional; p39,p42.4 "probability of a categorical variable taking on a
particular value given the condition that the other categorical variable has
some particular value." eg: the percentages in <this row> are based on
the condition that the student is male.
m8,p40.4 discussing conditional probability: P(HealthSci given female); female is the condition
probability, joint; p39,p42.4 eg: P(female AND HealthSci) "probability that the
two categorical variables each take on a specific value. "
probability, marginal; p39,p42.4, probability of a categorical variable taking on a
particular value without regard to the other categorical variable, eg: P(female). ;
 probability, empirical

Near start of m9. vs Theoretical P
empirical P will approach theoretical P for a large sample
P(event) = rel freq from long series of repetitions. m9.9 of sht3
 probability, theoretical

1st real section in m9
vs probability, empirical
 probability notation

p39,p40.2 p40.5 P(female AND HealthSci)
ref 1 (https://en.wikipedia.org/wiki/Notation_in_probability_and_statistics)
sometimes written as Pr( )
 proportion
 TODO. explain utility when handling chapter 13 stuff.
 pValue, 'Pvalue'

The pValue is the calculated area, the probability, lying outside the
Zscore area defined by alpha, the 'significance level'.
A successful experiment has a pValue less than alpha, so a small pValue is desired.
The alpha value is a function of the chosen confidence level and is
chosen before the data is collected.
p103sht5, topic is "how to determine pValues (w/ OLI applet)".
for < (left-tailed): NORM.DIST(zScore,0,1,TRUE); for > (right-tailed): 1 - NORM.DIST(zScore,0,1,TRUE)
for NotEqual (two-tailed): 2 * NORM.DIST(-ABS(zScore),0,1,TRUE)
The term 'P-value' looks like a calculation subtracting 'value' from 'P'. So
I prefer to write it as 'pValue'.
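stdlib versions of the one- and two-tailed recipes (my sketch; note the two-tailed form uses the negative tail, -|z|):

```python
# pValue from a z-score, one- and two-tailed.
from statistics import NormalDist

def p_value_one_tailed(z):
    # for Ha with '<' (left tail); for '>' use 1 - NormalDist().cdf(z)
    return NormalDist().cdf(z)

def p_value_two_tailed(z):
    return 2 * NormalDist().cdf(-abs(z))

print(round(p_value_one_tailed(-1.645), 3))  # 0.05
print(round(p_value_two_tailed(1.96), 3))    # 0.05
```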
 quartile
 module 6. based on word 'quarter'
p66.7: when 'quartile' used w/ bell curve, the 1st Q is the area left such that
probability is 0.25
 quantitative variable

see 'variable, quantitative'
 random variables
 "usually written in upper case"
Particular realizations of a random variable are written in corresponding lower case letters
 range
 modules 5, 6
range, overall, p15
 relationship
 p39 //correlation? 'linear function'
related variables? (issue) theme in m8.
 relative frequency
 TODO. used p79sht1
 research question

"statistical investigations begin w/ research questions that require us to test a claim", p96.1.
examples of types:
average student course load less than ? quant => param is a mean
_claim_ uses word 'mean': does 'mean' course load...
do majority of students qualify for loans? categorical => param is a proportion
variable is 'qualify for loans'
do female and male students have different GPAs? compares 2 population means. quant.
are athletes more likely than nonath to get advising? compares 2 popu proportions. var 'receive advice'
 response variable
 see 'variable, response'
 risks
 p42.1 probability, but for a negative outcome
p42sht1, risk reduced. neg chg in risk.
reference risk. often the path using a placebo
p42sht4 eqn at btm of page very good
 robust
 m19. p122,sht1/5. CIs and hypothesis tests are 'robust'
if they're mostly insensitive when the conditions for use are shaky.
 rossman

Rossman Chance applets. seen in class 10/29.
their app, One Proportion inference
 sample size, n

increasing 'n' will reduce the std deviation (aka standard error). p76.0
Increasing the sampleSize 4 times reduced the standard deviation by a factor of 2. p76.8
The lower equation shows how to set the sample size given P, and std err (or SD).
 sampling distribution
 p78sht1;
p109sht5: sampling distribution of xBar
related: see central limit theorem
memorize!
Spread! different terms for Cat vs Quant. xBar for sample mean... p0 called 'p'.
 SD
 see Standard Deviation
 sigma
 see Standard Deviation
 significance level, 'alpha', 'α'

p96.3sht4: "If the pValue is <= alpha, we accept Ha, the alternate hypothesis".
"If the pValue > alpha, we fail to reject the null"  meaning we don't know...
p96.6 "significant difference",
??(seen in m14, m15 (?)), αlevel, p96.7
p125.7,
General Guidelines for Choosing a Level of Significance;
(see type errors)
If the consequences of a type I error are more serious, choose a small level of significance (α).
If the consequences of a type II error are more serious, choose a larger level of significance (α).
But remember that 'alpha' is the probability of committing a type I error.
In general, we pick the largest level of significance that we can tolerate as the chance of a type I error.
 SRS, Simple Random Sample
 TODO
 simulations
 possible where the data is truly random
and can be modeled without special knowledge about the data.
 skew
 m7,p15.1 "right skewed" means the distribution is lower on the right.
 standard deviation, 'stdDev', 'SD', 'σ'

181203: Accumulated notions about 'stdDev'
Very rarely is 'stdDev' actually calculated in a typical statistics problem;
rather, its value is just given to the reader as a part of a
publication about a study. The information at the level of the published study
can't be changed. Similar data can be collected and each glob of the new data
is called a 'sample' for which the equivalent of a 'stdDev' can be calculated  but
it must be called the 'stdErr'. The collection of samples retaken on the same
basic source doesn't make a stdDev.
"similar to the average deviation from the mean", p32.1, sqrt( Σ(xxBar)^{2}/(n1) ),
p116top, "SAMPLE stdDev", called 's', replaces σ. sigma/sqrtN becomes s/sqrtN
Note that the following differs from the graphic...
p55.3; sqrt( SUM( (x - xBar)^2 * p(x) ) )
for sample proportions. see doc/pMinusP2.png, p78.3, sqrt(p*(1-p)/n)
2 stdErr eqns (=sqrt(p(1-p)/n)) and (=σ/sqrt(n)) imply
that σ = sqrt(p(1-p)) as shown.
Fascinating... my gut doesn't tell me it's true...
From wikip:
... a measure that is used to quantify the amount of variation or dispersion
of a set of data values. A low standard deviation indicates that the
data points tend to be close to the mean (also called the expected value)
of the set, while a high standard deviation indicates that the data points
are spread out over a wider range of values.
In addition to expressing the variability of a population, the
standard deviation is commonly used to measure confidence in statistical
conclusions. For example, the margin of error in polling data is determined
by calculating the expected standard deviation in the results if the same poll
were to be conducted multiple times. This derivation of a standard deviation
is often called the "standard error" of the estimate or "standard error of the mean"
when referring to a mean. It is computed as the standard deviation of all the
means that would be computed from that population if an infinite number of samples
were drawn and a mean for each sample were computed.
It is very important to note that the stdDev of a population
and the stdErr of a statistic derived from that population
(such as the mean) are quite different but related (related by the inverse
of the square root of the number of observations). The reported
margin of error of a poll is computed from the standard error of the mean
(or alternatively from the product of the standard deviation of the population
and the inverse of the square root of the sample size, which is the same thing)
and is typically about twice the standard deviation—the half-width of a
95 percent confidence interval.
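The sample formula above, sqrt(Σ(x-xBar)²/(n-1)), matches the stdlib's statistics.stdev (my sketch; the data list is made up):

```python
# Sample standard deviation by hand vs the stdlib.
import math
import statistics

data = [2.0, 4.0, 4.0, 5.0, 5.0, 7.0]   # made-up sample
x_bar = sum(data) / len(data)
s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / (len(data) - 1))
print(round(s, 3))                              # same as statistics.stdev(data)
print(math.isclose(s, statistics.stdev(data)))  # True
```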
 standard error, 'stdErr', 'SE', 's'

(m12,78 "the std dev of the sampling propor (sqrt( p(1-p)/n )) [where p is mean of sample propors] is also called the std err")
has the same basic equation as 'standard deviation' but this term,
'stdErr', only applies to samples while 'standard
deviation' pertains to the full population.
p115,2/4,btm: "b/c normal model, 95% ~ 2 stdDevs, 2 stdErrs (1 stdErr = sigma/sqrt(n))"
From wikip:
The standard error (SE) of a statistic (usually an estimate of a parameter) is
the standard deviation of its sampling distribution or an estimate of that
standard deviation. If the parameter or the statistic is the mean, it is called
the standard error of the mean (SEM).
The sampling distribution of a population mean is generated by repeated sampling and
recording of the means obtained. This forms a distribution of different means,
and this distribution has its own mean and variance. Mathematically, the variance
of the sampling distribution obtained is equal to the variance of the population
divided by the sample size. This is because as the sample size increases,
sample means cluster more closely around the population mean.
Therefore, the relationship between the standard error and the standard deviation
is such that, for a given sample size, the stdErr = pop.σ / sqrt(n=smpl size).
In other words, the standard error of
the mean is a measure of the dispersion of sample means around the population mean.
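The stdErr = sigma/sqrt(n) relationship can be checked by simulation (my sketch; the normal population with sigma=10 and samples of n=25 are assumptions, not from the course):

```python
# The spread of many sample means should be about sigma / sqrt(n).
import math
import random
import statistics

random.seed(2)
sigma, n = 10.0, 25

means = [statistics.fmean(random.gauss(50, sigma) for _ in range(n))
         for _ in range(4000)]
print(round(statistics.stdev(means), 1))  # ~ sigma/sqrt(n) = 2.0
print(sigma / math.sqrt(n))               # 2.0
```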
 standard normal distribution

p65.0: "distribution of zscores is also a normal density curve"
their example shows 2 bell curves, one w/ x axis measured in foot length, the other curve in SD.
and it shows the corresponding areas (probabilities) are equal.
 statClass.py

python code. many functions, far from done
 statistic
 p74.7
"a single measure of some attribute of a sample (e.g. its arithmetic mean value).
It is calculated by applying a function (statistical algorithm) to the values of
the items of the sample, which are known together as a set of data".
wikip
 statistical significance
 p96.6
== "statically significant" == "significantly different"
"When the pValue is less than (or equal to) 0.05, we also say that the
difference between the actual sample statistic and the assumed parameter
value is statistically significant. In the previous example, the pValue
is less than 0.05, so we say the difference between the sample mean (75 MB)
and the assumed mean from the null hypothesis (62 MB) is statistically
significant. You will also see this described as a significant difference.
A significant difference is an observed difference that is too large to attribute
to chance. In other words, it is a difference that is unlikely when we consider
sampling variability alone. If the difference is statistically significant,
we reject H0."
 success
 p78.2 true when datum matches
desired category (eg 'female').
"category of interest" is their definition.
I would prefer 'matches' over 'successes'.
 Tmodel
 m19, p115
"Student's T" distribution is wider and, generally lower, than the Normal curve.
If population stdDev unknown, must use Tmodel (?)
If sample size small, < 30, check for outliers. And
consider disclaimer "On the basis of the sample, we are assuming that the variable
is distributed without strong skew or extreme outliers in the population. The
conclusion from this test is valid only if this assumption is true". p122,sht3
wikip notes
that it's used
"when estimating the mean of a normally distributed population in situations where
the sample size is small and population standard deviation is unknown."
See the Excel entry, which describes the relevant functions.
example
n = 40 # df=39, 1 less
sigma not given (in 1st of 2 examples)
so, w/o sigma, use T's
xBar +/- t * s/sqrt(n) # or sqrt(df) ?
# need 't' for 95% conf so each of 2 tails has 2.5% or 0.025
T.INV(0.025, 39) => -2.023, whose magnitude is greater than 1.96, the Z for 2 sigma
Something = xBar +/- 2.023 * 0.3/sqrt(40)
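A rough version of the example's CI (my sketch; the tiny table of 95% critical t values is hand-copied from a standard t-table and should be checked before relying on it, and xBar=5.0 is made up since the notes don't give one; s=0.3 and n=40 are from the example):

```python
# T-model 95% CI: xBar +/- t* * s/sqrt(n), with a mini t-table.
import math

T95 = {10: 2.228, 18: 2.101, 29: 2.045, 39: 2.023}  # df -> t* (hand-copied)

def t_ci_95(x_bar, s, n):
    t_star = T95[n - 1]              # df = n - 1
    moe = t_star * s / math.sqrt(n)  # margin of error
    return (x_bar - moe, x_bar + moe)

low, high = t_ci_95(x_bar=5.0, s=0.3, n=40)  # xBar is a made-up value
print(round(low, 2), round(high, 2))         # ~ (4.9, 5.1)
```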
 Tscore

online calc
takes df and significance level (typ 0.05)
 Ttable

This printable table from wikia.com
can be used to perform calculations w/o Excel or other such technologies.
A printed copy of this table is kept with the small folder of calculation aids.
TODO. write up how to use... SOON build table into statClass.py
 Ttest

m19
"The ttest is any statistical hypothesis test in which the test statistic
follows a Student's tdistribution under the null hypothesis.", wikip.
"A ttest is most commonly applied when the test statistic would follow a
normal distribution if the value of a scaling term in the test statistic
were known. When the scaling term is unknown and is replaced by an estimate
based on the data, the test statistics (under certain conditions) follow a
Student's t distribution. The ttest can be used, for example, to determine
if two sets of data are significantly different from each other."
wikip
 table margins

p39. Has nothing to do with data being 'marginal'.
This simply refers to values appearing in the margins, on the edges, of a table,
eg: sums across and down. Row Totals, Column Totals.
 Tc
 p117
"critical Tvalue = output of table(df, confidence level)" or python table
 twoway table

p38, tbl showing 2 issues, one horz, one vert, w/ margins.
initially for analyzing relationship between 2 categorical vars (p39 top)
explanatory values vert, response values horz; see p38sht3
summarize'd. m8,p39sht/9 ...'conditional percentages'
 twoway table, hypothetical

p43sht1: built from other, known P's, eg: P that person has disease given a test suggests...
 type errors

Type I Error: H0 is true but we accepted Ha // mistakenly rejected H0 (and accepted Ha)
Type II Error: We failed to reject null but didn't have enough to accept Ha // mistakenly thought H0 good
p125,m19 (?)
 typical
 within 1 std dev of the mean.
see 'unusual'
 uniform
 p15 not much variability
 unusual
 more than 2 std deviations away from the mean. syn: "surprising"
 variability
 p15.4, p35.1
 variable, categorical

p39 eg: bodyImage={underwt,overwt,ok_wt}, gender{m,f}
 variable, discrete random variable

"When the range of X is finite or countably infinite, the random variable is called a
discrete random variable". Term seen in discussions on standard deviation.
wikip
 variable, explanatory

p39.5, gender. say, choose value 'male' // must ID this var. "want totals fm it to calc %'s "
see also 1st page of p42. treatment =
 variable, quantitative
 p39 eg: #students in class
 variable, response

p39.5 eg: comparing body image from explanatory variable (eg 'male')
// p42, 1st. {heart attack, not}
 workbackwards problems
 p66.5: given zscore, find X.
Use app or table to convert Z to SD. Then use Mean and SD to make X
eg: XL sock fits largest 30% of men. What's smallest foot len which fits XL.
 note foot lengths. mean 11", SD 1.5"
 mark off P=0.70 on app's window; find Z = 0.5244 printed
 [ 0.5244 * (SD=1.5") = 0.79" ] + [mean==11"] = 11.79
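The sock example's "P -> Z -> X" steps collapse into one stdlib call (my sketch; statistics.NormalDist needs Python 3.8+):

```python
# Work-backwards: given a probability, find the x value directly.
from statistics import NormalDist

foot = NormalDist(mu=11.0, sigma=1.5)  # mean 11", SD 1.5" from the example
x = foot.inv_cdf(0.70)                 # smallest length in the top 30%
print(round(x, 2))                     # ~11.79  (z ~ 0.5244)
```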
 xBar, Xbar, xbar

this is another 'mean value'. just for samples (?). therefore this is a 'statistic'.
The mean for the population is μ , mu.
 Z_{c}
 p91sht2/4 btm.
'Zc' was introduced in m14. Note that 'stdErr' is the
familiar standard error (for samples. using proportions, not quantitative).
The 'c' subscript on the Z is to emphasize that the zscore is connected
to the confidence interval. The following from p92,1/4
Confidence | Zc
90%        | 1.645
95%        | 1.960
99%        | 2.576
 zscores

181125 note: TODO: eqn at p65.9 below doesn't show the eventual eqn (m12 or so)
interesting that, on p79sht3, the text shows Z as (statisticparameter)/stdErr.
The text up to this pt (?) has been calling the numerator 'pHat  p' // pHat  p0 (?).
Calling the numerator terms 'statistic' and 'parameter' may reveal
the path to quant problems...
The difference between pHat (the sample proportion) and the assumed population proportion
is scaled by the std error (the denominator) to produce 'Z' values.
p65.9:
z = (x - mu)/sigma ; x = mu + z * sigma
p65.5 shows zscores used as 'flags' and the prob is the AREA between them. And there's an app
good sample problems p66sht6, towards the end
p64sht2/5,btm; the _UNITS_ are std deviations: z = (value - mean)/standard deviation
(88 - (mean==82))/5 = 6/5 = 1.2
-2.4 zscore * (sd==5) + (mean==82) = -12 + 82 = 70
see applet, p65sht4,top
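The two conversions above in python (my sketch), using the notes' numbers (mean 82, SD 5):

```python
# z from x, and x from z, with mean 82 and sd 5 from the notes.
mean, sd = 82.0, 5.0

z = (88 - mean) / sd
print(z)               # 1.2

x = -2.4 * sd + mean   # converting a z of -2.4 back to an x value
print(round(x, 1))     # 70.0
```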
Conventions used here
My references to page numbers in the CMU materials will be most valuable if
the reader signs up for the class online. It cost me $20, a pittance given the
depth and breadth of the material.
(My) inconsistent page numbering conventions:
"m12" refers to 'module number 12'
Note that a numbered 'page' in the course material generally consists of more
than one of what we'd call 'pages'. The '91' in p91 is OLI's page number and
you can type in '91' when you want to see that page. But page 91, if printed,
generates several sheets of paper. 'p91sht3' would be the 3rd printed page of
p91. At times I just noted that something was 30% or so thru the material and
I'd note it as p96.3 . I believe that sometimes meant 30% down a sheet, other
times 30% of the way thru all of p96's material.
"p91" points to page 91
"p96.3" is page 96 about 30% thru that page (or material)
"p96sht2 is printed sheet 2 of page p96
'sqrt' means 'square root'
'*' means multiplication (not 'x').
np is two variable names, 'n' and 'p'. Putting them together implies multiplication,
that is, 'np' is equivalent to 'n * p'.
my Review
I think CMU's materials for this class were 'the best' at teaching the nuts
and bolts but several students had troubles seeing the 'big picture' (me included).
The online text did present big picture ideas but figuring out the best approach to a
particular problem turned out to be difficult (imho). I wrote separate "calc"
file(s) to support true calculation sequences. If they look usable, I'll provide
links to those too.
I coded some of the calculation functions in python without any
plan to really base an app on the functions. I felt this coding effort would force
me to confront the operations at a more detailed level.
The feature I liked most in the course materials was the frequent 'tests'
to see that the reader had comprehended the just-finished passage. I was very
impressed to see and use that feature.
The alphabetic organization is just my preference when collecting notes before
I see any other, more favorable organization. This organizational bias
also comes from decades writing software, where the definition of terms
was all-important.
The Unit and Module layout of the course
references:
ref1. wikip, 'notation'...
ref2. handout, week ending 9/28
#Unit 1: Introduction to Concepts in Statistics Course
module01 Intro to Concepts in Statistics Course
module02 Learning Strategies
module03 The Big Picture

#Types of Statistical Studies and Producing Data. ?
#Unit 2: Summarizing Data Graphically and Numerically
module04 Distributions of Quantitative Data
checkpoint: Distributions of Quantitative Data

module05 Measures of Center
module06 Measures of Spread about the Median
checkpoint: Quantifying Variability Relative to the Median

:===== =============================================================
ref p115 top says this is similar to the stuff in m18.
"when we used a sample proportion to estimate a population proportion..."
module07 Quantifying Variability Relative to the Mean
checkpoint: Quantifying Variability Relative to the Mean
checkpoint: Summarizing Data Graphically and Numerically
Unit 3: Relationships in Categorical Data with Intro to Probability
module08 TwoWay Tables
p38
checkpoint: Relationships in Categorical Data with Intro to Probability
:===== =============================================================
Unit 4: Probability and Probability Distributions
module09 Probability and Distributions, pp4657
checkpoint: Probability and Probability Distributions

module10 Continuous Random Variables, pp5968
checkpoint: Continuous Random Variables
checkpoint: Probability and Probability Distributions

Unit 5: Linking Probability to Statistical Inference
module11 Intro to Statistical Inference, pp7072
p70
p71
p72
module12 Distribution of Sample Proportions, parameters vs statistics, pp7481
categorical vars. so parameters are proportions
testing a 'claim' //eg: "a majority of students qualify for loans"
p74 parameters vs statistics
p75
p76
p77
p78
p79
p80
p81 wrap up
module13 Intro to Statistical Inference, pp8387

Unit 6: Inference for One Proportion
module14: Estimating a population proportion, pp8993
module15: hypothesis testing
p95 distinguish 1 popu mean, 1 popu proportion, 2 popu means, or 2 popu proportions
h0,ha, forming, stating
null hyp
p96 determine hyps,collect,assess,state
teenager smart phone usage over time, alpha, 40 hour work week.
blacks and marijuana, student study time per week.
p97 steps (again). a little more detail (?)
yellow boxed steps
P98 pValue (more).
obama, death penalty popularity, tea party, portion of students who work,
choosing the level of significance,
p99 2 types of errors. vv_type_errors
data usage on smart phones, obama, cell phones and brain cancer, telepathy,
module16: hypothesis test for a population proportion
p102 defines, clarifies claims for this section.
shows 1 or 2 populations, show quant==means vs cat==proportions
p103 pValue calc; determining H0 Ha, 'test statistic'== zscore
summarizes steps in making a Hyp
p104 Hyp test; more on pValues
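Module 16's 'test statistic' z for one proportion follows the
(statistic - parameter)/stdErr pattern noted earlier: z = (pHat - p0)/sqrt(p0*(1-p0)/n),
with the standard error computed under the null value p0. A minimal sketch (the
function name and example numbers are mine):

```python
import math

def one_prop_z(p_hat, p0, n):
    # z = (statistic - parameter) / standard error,
    # where the standard error uses the null value p0
    se = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se

# hypothetical claim "a majority qualify": H0 p = 0.5, sample 57% of 100
z = one_prop_z(0.57, 0.5, 100)
```

The resulting z is then turned into a pValue (via the applet, a table, or
software) and compared to alpha.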

Unit 7: Inference for Means
module17: Distribution of Sample Means
p108 intro
p109sht1,2 birth wts. categorical = LOW_WT/not; quant is weight.
#s used to find confidence interval to estimate proportion for country.
p110 sigma/sqrt(n)
pell grants. $2600 = mean, sd = $400. se = sigma/sqrt(n=20)
p111 samplings of skewed data are normal
central limit theorem. sample size 30 usually enough
Pell grants.
p112 intervals of Z scores
basketball player hts. normally dist.
converting Z's to P statement
teacher salaries. salaries skewed.
p113
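The p110 note above gives the standard error of the sample mean as
se = sigma/sqrt(n). With the Pell grant figures from these notes
(mean $2600, sd $400, n = 20), a quick Python check (function name is mine):

```python
import math

def se_of_mean(sigma, n):
    # standard error of the sample mean: sigma / sqrt(n)
    return sigma / math.sqrt(n)

# Pell grant example from p110: sd = $400, n = 20
se = se_of_mean(400, 20)   # roughly $89.44
```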
module18: Estimating a Population Mean
module19: Hypothesis Test for a population mean
module20: Inference for a difference between population means

Unit 8: Inference for Two populations
module21: Distribution of differences in sample proportions
module22: Estimate the difference between population proportions
module23: hypothesis test for a difference in population proportions
