Methods of Experimental Research (LIMV08001)

Methoden van Experimenteel Onderzoek (MTMV08008)

2011-12, period 2, November-January



  1. [2012.02.08] As you know, the supporting website at will be terminated later this year. Please save what you need from the site (including data and useful materials) on your local computer. The website at may become unavailable without further notice!
  2. [2012.02.08] All the work has now been graded. Unfortunately there are some administrative hurdles in registering your grades. Don't worry: everybody has passed! Your grades are (or will be) visible in Osiris within a few days. For the MSc LW students this may take a bit longer; your grades are meanwhile available at this link.
  3. [2012.02.08] Assignment 8: Note that Simpson's Paradox is about the direction of a difference, and not about its significance.
  4. [2012.01.05] The R add-on package car (Companion to Applied Regression) has a useful function leveneTest to test for homogeneity of variance:
    leveneTest( breaks~tension*wool, data=warpbreaks )
    Levene's Test for Homogeneity of Variance (center = median)
          Df F value  Pr(>F)  
    group  5   2.891 0.02322 *
          48
    The significant F(5,48), with p=.02, tells us that these variances are indeed not homogeneous. See help(leveneTest) for more information.
  5. [2012.01.03] Grades for the first part of the course (35% of total grade) are now available at the surfgroepen course website, see under Shared documents > extra > test1res.txt.
  6. [2012.01.03] Tip: In R you can import a data set directly from a web-based file, by calling function url in the file specification: see Lab Session 2 for an example. You can copy-and-paste the file's URL address from your browser to the read.table command.
  7. [2011.11.25] Slides from previous classes are on the surfgroepen course website, see under Shared documents > slides.
  8. [2011.11.18] For students in Clinical Linguistics etc (Logopediewetenschap): the Dutch article about EPVs (Logopedie en Foniatrie, Nov 2011, p.328-336) nicely illustrates various methodological considerations in developing a clinical screening test.
  9. [2011.11.10] Information added about supporting website at
  10. [2011.11.01] The two courses named in the title have been merged. Classes will be taught jointly, in English, for the two groups of students.



Hugo Quené
e-mail h dot quene AT uu nl,
Trans 10, room 2.12
office hours Tue 14:00-16:00 and by appointment


Reading materials are indicated below for each session. These may be found in digital form on the UU Library pages, on the course website in WebCT, or in paper form in the UU Library.

Some recommended additional reading materials about research methodology and data analysis are: Butler (1985), Maxwell & Delaney (2004), StatSoft (2004), Johnson (2008, with helpful examples in R), Rosenthal & Rosnow (2008), and Moore et al. (2009) [details].


The most recent schedule is available on the course schedule page.


This course requires basic insight into, and experience with, statistics and data analysis, including hypothesis testing, t tests, and analysis of variance. Such knowledge is typically acquired in one or two introductory statistics courses, and/or from an introductory statistics textbook.

You should be comfortable with most questions in statistics exams or self-assessment tests, such as the one by Jones.


The course has weekly class meetings on Wednesday afternoon. In addition, there are computer lab sessions on Thursdays/Fridays, in which we will practice data analysis techniques for the weekly assignments.
The focus in this course is on independent study, assignments, and peer review, and less on class meetings.
The course will be taught in English.

Before each class meeting you'll have to do the following:

  1. complete assignments about the topics covered in the last meeting;
  2. hand in your assignments (see below), by Sunday 18:00h at the latest;
  3. review and judge the assignments of a fellow student, by Tuesday 18:00h at the latest;
  4. read and study new materials.

During a class meeting we will discuss your work, using your mutual reviews, and new topics will be introduced.

After each class meeting, assignments have to be handed in on the supporting site [], so that all information is available to everybody.
Put your work in one document per week; this document has to be in PDF format. Name your document LASTNAMEN.pdf (use your last name and the assignment number N). Upload your document to the appropriate folder on the additional teamsite. This should be done by Sunday 18:00h at the latest. You should plan some time between the weekly lab session (Thu/Fri) and Sunday evening for completing the weekly assignments.
Retrieve the document of your selected peer student for this week, and write a review of her/his work in a separate document. Name your review document LASTNAMErevNREVIEWED.pdf (replace with your own last name, the assignment number N, and the name of the reviewed student). Place your review in the same folder as the assignments on the group bulletin board, by Tuesday 18:00h at the latest.
Before the next class session, you should read the review of your assignment. Notice that everybody's cooperation is required to make this schedule work! Failure to meet deadlines will cause problems "downstream", so make sure to finish and upload your work on time.

For the most part of the course, there will also be a "data lab" on Thu/Fri, to practice and rehearse your skills in data analysis.

Peer Review

Peer review, commenting on the work of a peer or colleague, is serious business. You can learn more about it through these web pages:


Your final grade is determined by the weekly assignments (35%+35%) and the final assignment (30%). Your collected work and class participation in the first part of the course will be graded halfway through the course (weight 35%), and similarly for the second part of the course (also weight 35%). This means that your weekly assignments and reviews will not be graded weekly! It is your responsibility to bring up questions and to ask for clarification about your work during class meetings. Remember to use the other students' assignments and peer reviews as well.
The final assignment determines 30% of your final grade.


session 1: Wed 16 Nov

Experimentation. General methodology. The experimental method. Testing hypotheses. How to peer-review.

  • H. Quené (2010). How to design and analyze language acquisition studies. In: S. Unsworth & E. Blom (Eds.) Experimental Methods in Language Acquisition Research (pp.269-284). Amsterdam: Benjamins. [preprint].
  • Butler, Ch. (1985) Statistics in Linguistics. s.l.: Blackwell. [out of print, but see the web version]. Chapter 6.
  • This course requires and presumes previous knowledge of statistics, equivalent to an introductory statistics course. You may test yourself by means of this exam (tentamen) of the Statistiek course.
  • Make sure that you have an account on the Solis UU network.
  • Browse the various websites listed below. Make sure to browse the Research Methods Knowledge Base.
Write clearly, correctly, and concisely. Make a document in PDF format with a maximum length of about 2000 words. Submit your work to the supporting site at under folder 1one.
  1. Visit the University library — you could even do this physically. The location at Drift 27 is convenient and holds excellent collections.
    Take a recent printed issue (2010 or 2011) of an experimental linguistics journal, such as Journal of Phonetics, Journal of Memory and Language, Phonetica, etc, and select an article that reports an experiment.
    (a) Which questions does the study attempt to answer?
    (b) Which independent and dependent variables are involved in the study?
    (c) Describe the design of the experiment.
  2. Surf to the Online Statistics website. Read Chapter 11 (in version 2) about "Logic of Hypothesis Testing", all sections.
    Answer the questions in the section "Interpreting Significant Results" and in the section "Interpreting Non-Significant Results". How many questions did you answer correctly?
    Print out, study, and memorize the section "Misconceptions".
  3. This last assignment is not for peer review but for independent study. Now is the perfect time to brush up your statistical skills. Answer the exams (tentamina) of my Statistics course (see above). Afterwards, check your answers against those provided on the course webpage. Determine which parts of your statistics proficiency are still deficient. Design a plan of action to remedy your shortcomings during this teaching period.

Thu 17 / Fri 18: no lab session

session 2: Wed 23 Nov

The logic of experimental design. Validity. Confounds. Strong and weak designs.


Additional readings:

  • about medical experiments: Searching for Clarity: A Primer on Medical Studies, by Gina Kolata, New York Times, 30 Sept 2008, F1.
  • about survey methods: How the poll was conducted, New York Times, 15 Oct 2008, A22.
  • article mentioned in class, about effects of 9/11 events on language use: Fivush, R., Edwards, V. J., & Mennuti-Washburn, J. (2003). Narratives of 9/11: relations among personal involvement, narrative content and memory of the emotional impact over time. Applied Cognitive Psychology, 17(9), 1099-1111. [doi:10.1002/acp.988]


For this assignment you have to provide the experimental design of a prospective (future) study of your own. You could, for example, select an idea for your master's thesis, a research project for one of your classes, or a follow-up study building on a previous experiment. Your prospective study should in principle be suitable for publication in a top peer-reviewed journal in your field; this means that not only the question being addressed, but also the design and methodology need to be very good! Your experimental design and methods should be adequate to provide answers to your question.

Give a brief introduction about the issues your study attempts to answer, and describe and motivate the experimental design and methods. Which are the dependent and independent variables? Discuss the construct validity of your manipulations (treatments) and observations. Describe and classify your design according to the schemes in the reading materials (within-subject, split-plot, etc). Can you give some estimate of the expected effect size? And if so, what would be the power of your study? How many units (children, participants, sentences, items) do you need to achieve that power? Think about plausible alternative explanations, and other threats to the validity of your study, and how to neutralize these threats in your design.

As before, your elaborations have to result in a PDF document to be placed (or announced) on the group webpage (see above). Write clearly, correctly, and concisely (you'll probably need about 2 or 3 pages of text).

lab 1: Thu 24 / Fri 25 Nov

Introduction. Practicalities. Working with SPSS (and with R). Descriptive statistics. Inferential statistics: t tests and ANOVA.

In this course we will introduce and support two programs for data analysis.
SPSS can be used in the computer labs, and it can be obtained for a low fee under the UU campus license, from the surfspot web store.
R is a more recent program, more flexible than SPSS. R is quickly gaining in popularity, and becoming the standard in academic research. It can be obtained as open-source software from; for an introduction see my tutorial.

We will use these data sets: a toy data set (created by this R script); teacher ratings from a single-shot study (reflect on possible threats to internal validity).

class 3: Wed 30 Nov

Significance, power, effect size, hypothesis testing. Dependent and independent t test.

Reading materials:
  • Find your Introduction in Statistics textbook, and re-read the relevant chapters about hypothesis testing, Type I and Type II errors, significance, and power.
  • Lenth, R.V. (2001). Some Practical Guidelines for Effective Sample Size Determination. The American Statistician, 55(3), 187-193. [online via UBU, after logging in with Solis ID]
  • H. Quené (2010). How to design and analyze language acquisition studies. In: E. Blom & S. Unsworth (Eds.) Experimental Methods in Language Acquisition Research (pp.269-284). Amsterdam: Benjamins. ISBN 978-90-272-1997-8. [preprint].
Additional reading:
  • Cohen, J. (1992). A Power Primer. Psychological Bulletin, 112(1), 155-159. [from after logging in with Solis ID]
Assignments:
  1. a. Download the data set mer3between to your computer. This data set contains outcomes of throws with two dice, numbered 1 and 2. Each throw was performed by a different participant, i.e. a between-subjects design. The research question to evaluate is whether there is a difference between the two dice under study. Formulate your H0 and your H1. Formulate your possible Type I and Type II errors in terms of detecting a difference between the two dice.
    b. Read a random sample of N=60 cases from this data set into your statistical package. (Alternatively, you could read all cases and randomly discard all but 60 cases.)
    c. Use the appropriate t test to answer the research question, using α=.05, two-sided. Draw a clear conclusion.
    d. Determine the standard deviations, s1 and s2, and determine the difference between the dice, M2-M1.
    e. Calculate (manually) the post-hoc effect size for your sample, d=(M2-M1)/spooled, where you may assume that spooled=1.329.
    f. Go to the power applets by Lenth (see above), and select the appropriate t test. What was the observed power in your study, using your answers from part d above? Also determine the minimum sample size, using your answers from part d. above, to obtain a power of .80 in detecting "your" difference. Also determine the minimum sample size to obtain a power of .90.
    g. Use the same applet to determine the minimum difference between means of the dice that can be assessed with "your" numbers of cases (note the possibility of different numbers of cases in the two groups) with a power above 0.80.
    h. Discuss your findings. Explain why you did or did not find a significant difference, while your sample size was or was not sufficient.
  2. a. Download the data set mer3within to your computer. This data set contains outcomes of two throws with two dice, numbered 1 and 2. Both throws were performed by the same participant, i.e. a within-subjects design. The research question to evaluate is again whether there is a difference between the two dice under study. Again formulate your H0 and your H1, and formulate your possible Type I and Type II errors in terms of detecting a difference between the two dice.
    b. Read a random sample of N=60 cases from this data set into your statistical package. (Alternatively, you could read all cases and randomly discard all but 60 cases.)
    c. Use the appropriate t test to answer the research question, using α=.05, two-sided. Draw a clear conclusion.
    d. Determine the difference between the dice, M2-M1, as well as the standard deviation of this difference, sdiff.
    e. Calculate (manually) the post-hoc effect size for your sample, d=(M2-M1)/sdiff.
    f. Go to the power applets by Lenth (see above), and select the appropriate t test. What was the observed power in your study, using your answers from part d above? Also determine the minimum sample size, using your answers from part d. above, to obtain a power of .80 in detecting "your" difference. Also determine the minimum sample size to obtain a power of .90.
    g. Use the same applet to determine the minimum difference between means of the dice that can be assessed with "your" numbers of cases with a power above 0.80.
    h. Discuss your findings. Explain why you did or did not find a significant difference, while your sample size was or was not sufficient.
  3. Discuss the similarities and differences in outcomes between the two studies, with regard to sample size, power, effect size, and experimental design.
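If you prefer R over Lenth's applets, the built-in function power.t.test performs the same power calculations. A minimal sketch, assuming a hypothetical observed difference of 0.5 between the dice and the pooled standard deviation 1.329 given in part e (substitute your own numbers from parts d and e):

```r
# Power calculations for parts f-g, sketched with power.t.test().
# delta = 0.5 is a hypothetical stand-in for the observed difference M2-M1;
# sd = 1.329 is the pooled standard deviation given in part e.

# Observed (post-hoc) power with n = 30 per group, between-subjects design:
pw <- power.t.test(n = 30, delta = 0.5, sd = 1.329, sig.level = 0.05,
                   type = "two.sample", alternative = "two.sided")

# Minimum n per group to detect the same difference with power .80:
n80 <- power.t.test(delta = 0.5, sd = 1.329, sig.level = 0.05,
                    power = 0.80, type = "two.sample")$n

# For the within-subjects design of assignment 2, use type = "paired"
# and supply sd = s_diff instead of the pooled standard deviation.
```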

effect size

If we are comparing two group means, as in a two-sample t test, then the effect size d is defined as: d = (m1-m2)/s (Cohen, 1969, p.18; m represents a group mean).

A value of d=.2 is regarded as small, d=.5 as medium, d=.8 as large. It is left to the researcher to classify intermediate values (ibid., p.23-25).
The difference in body length between girls of 15 and 16 years old has a small effect size, just as male-female differences in sub-tests of an IQ test. "A medium effect size is conceived as one large enough to be visible to the naked eye," e.g. the difference in body length between girls of ages 14 and 18. Large effect sizes are "grossly perceptible", e.g. the difference in body length between girls of ages 13 and 18, or the difference in IQ between PhD graduates and freshman students.

If we are comparing k groups of means, as in an F test (ANOVA), then the effect size f is defined as: f = sm/s, where sm in turn is defined as the standard deviation of the k different group means (ibid., p.268). If k=2, then d=2f (ibid., p.278). These rules apply only if all groups are of the same size; otherwise different criteria apply.

A value of f=.10 is regarded as small, f=.25 as medium, f=.40 as large. Again, it is left to the researcher to classify intermediate values (ibid., p.278-281).
Small-sized effects can also be meaningful or interesting. Large differences may correspond to small effect sizes, due to measurement error, disruptive side effects, etc. Medium effect sizes are observed in IQ differences between house painters, mechanics, carpenters, butchers. Large effect sizes are observed in IQ differences between house painters, mechanics, carpenters, (railroad) engine drivers, and lab technicians.
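These definitions can be checked numerically. A minimal sketch in R, with hypothetical group means and standard deviation (the numbers are made up for illustration):

```r
# Cohen's effect sizes d and f, computed from hypothetical group means.
m <- c(100, 105)   # k = 2 hypothetical group means
s <- 15            # common within-group standard deviation

# s_m is the standard deviation of the k group means (divisor k, not k-1):
sm <- sqrt(sum((m - mean(m))^2) / length(m))
f  <- sm / s                # effect size f
d  <- (m[2] - m[1]) / s     # effect size d

# With k = 2 and equal group sizes, d = 2f, as stated above.
```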

Adapted from: Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences (1st ed.). New York: Academic Press.

Additional reading: Rosenthal, R., R. L. Rosnow, & Rubin, D.B. (2000). Contrasts and Effect Sizes in Behavioral Research: A correlational approach. Cambridge: Cambridge University Press. ISBN 0-521-65980-9.

QQ plot of normally distributed data

A Normal Probability plot (NP plot or QQ plot) is a great tool to verify whether a variable has a normal distribution. Such a plot can be made in SPSS by choosing Analyze > Descriptive Statistics > Explore, or by means of the Examine command in a Syntax window. But what is "normal"? Here is an SPSS script to generate a QQ Plot of random data from a normal distribution. To run the script, you also need this dummy data file. Repeat the marked commands in the script, at least 8 times, to train your eyes to QQ plots of normally distributed data. Also notice the reported Kolmogorov-Smirnov Test of Normality and its significance.

An easier way might be to use R, e.g. by means of a web-based interface, so you don't have to install R on your computer. One such web interface for R can be found at. In R or in the web interface, enter the following R command:
qqnorm( rnorm(100) )
This will result in a random sample from a normal distribution, with sample size n=100, and these data are then used as the argument for the QQ plot command.
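In R itself, the exercise with the SPSS script can be repeated as follows (a sketch; shapiro.test is used here as a stand-in for the Kolmogorov-Smirnov test that SPSS reports):

```r
# Repeat these lines at least 8 times, to train your eyes to
# QQ plots of normally distributed data.
x <- rnorm(100)    # random sample of n = 100 from a standard normal
qqnorm(x)          # normal probability (QQ) plot
qqline(x)          # reference line through the quartiles
shapiro.test(x)    # a formal test of normality; p should usually exceed .05
```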

Thu 1 / Fri 2 Dec: no lab

session 4: Wed 7 Dec

Linear regression. True scores, error scores, error of measurement. Correlation. Reliability.

  • Ferguson, G. A., & Takane, Y. (1989). Statistical Analysis in Psychology and Education (6th ed.). New York: McGraw-Hill. Chapter 24 "Errors of Measurement", pp.466-478.
  • Trochim, W.M. (2002). Measurement. In: Research Methods Knowledge Base (Web Center for Social Research Methods).


Let us assume that we have 2 observations for each of 5 persons. These observations are about the perceived body weight, as judged by two 'raters' or judges, x1 and x2. The data are as follows:

	person  x1  x2
	 1      60  62
	 2      70  68
	 3      70  71
	 4      65  65
	 5      65  63

Because we have only two measures (variables), there is only one pair of measures to compare in this example. Very often, however, there are more than two judges involved, and hence many more pairs.

First, let us calculate the correlation between these two variables x1 and x2. This can be done in SPSS with the Correlations command (Analyze > Correlate > Bivariate, check Pearson correlation coefficient). This yields r=.904, and the average r (over 1 pair of judges) is the same.

If you need to compute r manually, one method is to first convert x1 and x2 to Z-values [(x-mean)/s], yielding z1 and z2. Then r = SUM(z1×z2) / (n-1).

This value of r corresponds to a (standardized) Cronbach's Alpha of (2×.904)/(1+.904) = .950 (with N=2 judges). Cronbach's Alpha can be obtained in SPSS by choosing Analyze > Scale > Reliability Analysis. Select the "items" (or judges) x1 and x2, and select model Alpha. The output states: Reliability Coefficients [over] 2 items, Alpha = .9459 [etc.]; this raw alpha differs slightly from the standardized value, because the two judges' variances are not equal.
If the same average correlation r=.904 had been observed over 4 judges (i.e. over 4×3 pairs of judges), then that would have indicated an even higher inter-rater reliability, viz. alpha = (4×.904)/(1+3×.904) = .974.
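A minimal sketch in R of the computations above, using the five persons' data (this correlation-based value is the standardized alpha, which comes out slightly above the raw alpha of .9459 that SPSS reports):

```r
# Inter-rater reliability from the average inter-rater correlation.
x1 <- c(60, 70, 70, 65, 65)   # judgements by rater 1
x2 <- c(62, 68, 71, 65, 63)   # judgements by rater 2

r <- cor(x1, x2)                      # Pearson r, approx .904
alpha2 <- (2 * r) / (1 + r)           # standardized alpha for N = 2 judges
alpha4 <- (4 * r) / (1 + 3 * r)       # alpha if mean r = .904 held for 4 judges
round(c(r = r, alpha2 = alpha2, alpha4 = alpha4), 3)
```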

Exactly the same reasoning applies if the data are provided not by 2 raters judging the same 5 objects, but by 2 test items "judging" a property of the same 5 persons. Both approaches are common in language research. Although SPSS only mentions items and inter-item reliability, the analysis is equally applicable to raters or judges, and to inter-rater reliability.

Note that both judges (items) may be inaccurate. A priori, we do not know how good each judge is, nor which judge is better. We know, however, that their reliability of judging the same thing (true body weight, we hope) increases with their mutual correlation.

Now, let's consider the same data in a different context. We have one measuring instrument for the abstract concept x that we try to measure. The same 5 objects are measured twice (test-retest), yielding the data given above. In this test-retest context there is always just one correlation, and the idea of inter-rater reliability does not apply. We find that rxx=.904.

This reliability coefficient r = s²T / s²x provides us with an estimate of how much of the total variance is due to variance in the underlying, unknown, "true" scores. In this example, 90.4% of the total variance is estimated to be due to variance of the true scores. The complementary part, 9.6% of the total variance, is estimated to be due to measurement error. If there were no measurement error, then we would predict a perfect correlation (r=1); if the measurements contained only error (and no true-score component at all), then we would predict zero correlation (r=0) between x1 and x2.
In this example, we find that
se = sx × sqrt(1-.904) = sqrt(15.484) × sqrt(.096) = 1.219
check: s²x = 15.484 = s²T + s²e = s²T + (1.219)²,
so s²T = 15.484 - 1.486 = 13.997
and indeed r = .904 = s²T / s²x = 13.997 / 15.484.

Supposedly, x1 and x2 measure the same property x. To obtain s²x, the total observed variance of x (as needed above), we cannot use x1 exclusively nor x2 exclusively. The total variance is obtained here from the two standard deviations:
s²x = sx1 × sx2
s²x = 4.18330 × 3.70135 = 15.484
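The same arithmetic can be reproduced in R; a minimal sketch with the five persons' data from above:

```r
# Error of measurement in the test-retest interpretation of the data.
x1 <- c(60, 70, 70, 65, 65)   # test
x2 <- c(62, 68, 71, 65, 63)   # retest

r   <- cor(x1, x2)            # test-retest reliability, approx .904
s2x <- sd(x1) * sd(x2)        # total observed variance, approx 15.484
se  <- sqrt(s2x * (1 - r))    # standard error of measurement, approx 1.219
s2T <- s2x - se^2             # estimated true-score variance, approx 14.0
# check: s2T / s2x recovers the reliability coefficient r
```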

In general, a reliability coefficient smaller than .5 is regarded as low, between .5 and .8 as moderate, and over .8 as high.

session 4 (continued)

Your answers and solutions to the questions below have to be handed in as described above. As always, write clearly, correctly, and concisely.
  1. We have constructed a test consisting of 4 items, with an average inter-item correlation of 0.4.
    a. How many inter-item correlations are there, between 4 items? (Ignore the trivial correlation of an item with itself.)
    b. Compute the Cronbach Alpha reliability coefficient of this test of 4 items.
    Now we add a new 5th item.
    c. How many new inter-item correlations are added to the correlation matrix when a 5th item is added to the test?
    Unfortunately the coding of this item happens to be incorrect, that is, the scale was reversed for this new item. The inter-item correlation of this 5th item with each of the 4 older items is -0.4 (note the negative sign).
    d. What is the average inter-item correlation after adding this 5th test item?
    e. Compute the Cronbach Alpha coefficient of the longer test of 5 items.
    f. Compare and discuss the reliability and usefulness of the shorter and of the longer test.
  2. A student weighs an object 6 times. The object is known to weigh 10 kg. She obtains readings on the scale of 9, 12, 5, 12, 10, and 12 kg. Describe the systematic error and the random errors characterizing the scale's performance.
    Adapted from: R.L. Rosnow & R. Rosenthal (2002). Beginning Behavioral Research: A conceptual primer (4th ed.). Upper Saddle River, NJ: Prentice Hall. Ch.6, Q.7, p.159.
  3. Let us assume that in this course, in addition to writing a peer review, you would also have to grade each other's work as part of the peer review process. Grades would have to be on the Dutch scale from 1 (bad) to 10 (good). Discuss the reliability and validity of this method to assess student performance. What are the possible threats to reliability and validity, and how could these be reduced?

lab 2: Thu 8 / Fri 9 Dec


We will attempt to determine the reliability of a set of data (4 variables, 30 units), and perhaps work on the above assignments.

If you want to analyze these data in R, the following commands may be useful:

R> reldata <- read.table( file=url(""), header=TRUE) # retrieve from url
R> require(psych) # make sure that package psych is available
R> alpha(reldata) # compute cronbach alpha, function alpha is defined in psych

session 5: Wed 14 Dec

Multiple regression, multivariate analyses. Collinearity. Factor Analysis.

  • Moore, McCabe & Craig (2009). Chapter 11 "Multiple Regression". Available on the group page for this course, under Shared documents > extra.
  • optional: Peck & Devore (2008) Statistics: The Exploration and Analysis of Data. Chapter 14 "Multiple Regression Analysis".
  • optional: chapter on Multiple Regression, from the excellent online statistics textbook at StatSoft, Inc.
correlation cartoon
Your answers and solutions to the questions below have to be handed in as described above. As always, write clearly, correctly, and concisely.
  1. Answer the following questions: Moore, McCabe & Craig (2009), Chapter 11: Exercises 1, 2, 3, 4, 30.
    Data for the last question are available here in plain text format (the first line of this file contains variable names, SC stands for self-concept).

Forward, Backward or stepwise?

For question 30 the FORWARD method is most appropriate. This means that you start with an empty model (only the intercept b0), to which predictors are added step by step. After each addition of a predictor, you check whether the model performs significantly better than before (e.g. by checking whether R² increases).
The questions are about the increment in R² by adding a predictor. The relevant information is easier to find in the SPSS output if you specify the FORWARD method.
The STEPWISE method is almost identical, "except that each time a predictor is added to the equation, a removal test is made of the least useful predictor" (Field, 2009, p.213).
The BACKWARD method starts with a full model containing all predictors, from which useless predictors are removed step by step. After each removal of a predictor, you (or your software) check whether the model performs significantly worse than before (e.g. by checking whether R² decreases). If it does, then the removed predictor is in fact useful in the regression model.
As a bonus, you could check what happens if you exclude case #51 from the data set, e.g. by marking it as a missing value. This is quite easy if you keep the regression command in a Syntax window for repeated use.
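In R, stepwise model selection is available through the built-in function step. Note one difference from SPSS: step adds or removes predictors by AIC rather than by an F test on the increment in R-squared, so the selected models may occasionally differ. A sketch on simulated data:

```r
# FORWARD selection with step(): start from the intercept-only model and
# add predictors one at a time (by AIC in R, rather than SPSS's F test).
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 2 * x1 + 0.5 * x2 + rnorm(n)   # x3 is a useless predictor

empty <- lm(y ~ 1)                   # intercept-only starting model
fwd <- step(empty, scope = y ~ x1 + x2 + x3,
            direction = "forward", trace = 0)
summary(fwd)$r.squared               # R-squared of the selected model

# direction = "backward" from the full model, or "both" for STEPWISE,
# work analogously.
```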


The chapter by Moore, McCabe & Craig draws heavily on American concepts. In the USA, your achievements are all that counts, in life as well as in study. The US grading system ranges from A+ (excellent) to F (fail).
For admission to a university, two things are taken into account: (a) your average grades in the final years of high school (HSM, HSS, HSE), and (b) your score on a national admissions exam, the Scholastic Aptitude Test (SAT), comparable to the Dutch CITO test. Top-class universities, like Harvard, Yale, and Stanford, use both parameters in selection. You have to be the best in your class (but your classmates are strongly competing for this honor), plus you need a minimal score on your SAT.
During your academic study, all your grades and results contribute to your Grade Point Average (GPA), a weighted average grade. This GPA is generally used as an indication of academic achievement and success. The authors attempt to predict the GPA from the previously obtained indicators (a) and (b).


Why is it "regression"? This has to do with heredity, the field of biology where regression was first developed by Francis Galton (cousin of Charles Darwin) in the late 19th century.
Take a sample of fathers, and note their body length (X). Wait for one full generation, and measure the body length of each father's oldest adult son (Y). Make a scattergram of X and Y. The best-fitting line through the observations has a slope of less than 1 (typically about .65). This is because the sons' length Y tends to "regress to the mean" — outlier fathers tend to produce average sons, and average fathers also tend to produce average sons. Galton called this phenomenon "regression towards mediocrity". Thus the best-fitting line is a "regression" line because it shows the degree of regression to the mean, from one generation to the next. (Note that any slope larger than 0 suggests a hereditary component in the sons' body length, Y.)
Questions: Which variable has the larger variance, X or Y? Does the variation in body length increase or decrease (regress) over generations? Why?

partial correlation

The partial correlation between X1 and X2, with X3 removed from both, is given by:
r12.3 = ( r12 - r13·r23 ) / sqrt[ (1 - r13²)(1 - r23²) ]
  • Ferguson, G. A., & Takane, Y. (1989). Statistical Analysis in Psychology and Education (6th ed.). New York: McGraw-Hill. p.495.
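The formula can be checked in R against an equivalent definition: the partial correlation r12.3 equals the correlation between the residuals of x1 and of x2 after regressing each on x3. A sketch with simulated data:

```r
# Partial correlation: formula vs. regression residuals.
set.seed(2)
x3 <- rnorm(50)
x1 <- x3 + rnorm(50)   # x1 and x2 are correlated mainly through x3
x2 <- x3 + rnorm(50)

r12 <- cor(x1, x2); r13 <- cor(x1, x3); r23 <- cor(x2, x3)
r12.3 <- (r12 - r13 * r23) / sqrt((1 - r13^2) * (1 - r23^2))

# Same value via residuals: partial out x3 from both variables.
r.alt <- cor(resid(lm(x1 ~ x3)), resid(lm(x2 ~ x3)))
```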

lab 3: only Fri 16 Dec

Multiple regression.

We will analyze the prices of energy bars, using data from Peck & Devore (2008), 6th edition, example 14.11, p.653. If there is time we can also work on the homework assignments.

inspecting residuals

A regression model may also be evaluated by inspecting its residuals, i.e. the observed score minus the predicted score. If the model is approximately correct, then the residuals should be normally distributed around zero, for the whole range of observations (because of the assumption that errors are independent, and distributed normally, with mean zero). This can be inspected by specifying residual plots, in SPSS by means of the SPSS Regression command or menu:

The first subcommand produces the following scattergram, which shows that the residuals tend to be more negative for lower GPAs (i.e. predictions are too high for lower GPAs). The worst-performing students obtain GPAs that are lower than predicted from their IQ and SC. Can you imagine why?
multiple regression residuals, figure 2
The second subcommand produces a QQ Normal plot as we've seen before. If normally distributed, then the residuals should vary in random fashion around a straight line:
multiple regression residuals, figure 1
This plot confirms that the residuals vary in somewhat non-random fashion, and hence that the residuals are not quite normally distributed.

In R the residuals may be inspected by means of qqnorm(resid(mymodel)), etc.
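A slightly fuller sketch of residual inspection in R, with simulated data standing in for the GPA example (mymodel and the data here are made up for illustration):

```r
# Inspecting residuals of a regression model.
set.seed(3)
x <- rnorm(80)
y <- 1 + 2 * x + rnorm(80)
mymodel <- lm(y ~ x)

plot(fitted(mymodel), resid(mymodel))   # residuals vs. predicted scores
abline(h = 0, lty = 2)                  # residuals should straddle zero

qqnorm(resid(mymodel))                  # QQ normal plot of the residuals
qqline(resid(mymodel))
```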

Christmas Break

No classes in weeks 51 and 52 (19 to 31 Dec)

session 6: Wed 4 Jan

ANOVA: general principles, one-way, effect size. Post-hoc tests, multiple-test problem, Bonferroni adjustments.


Additional Readings:

Your answers and solutions to the questions below have to be handed in as described above. As always, write clearly, correctly, and concisely.
  1. In a study of cardiovascular risk factors, joggers who run at least 15 miles per week were compared with a control group described as "generally sedentary". Both men and women participated in this study. The design is a 2×2 between-subjects ANOVA, with Group and Sex as two factors. There were 200 participants for each combination of factors. One of the dependent variables is the rate of heartbeat of a participant, after 6 minutes on a treadmill, expressed in beats per minute.
    Data from this study are available here in SPSS format, or as plain text (the latter file contains variable names in the first line).
    (a) What do you think of the construct validity? Please comment.
    (b) Is it allowed to conduct an analysis of variance on these data? Motivate your answer with relevant statistical considerations.
    (c) Conduct a two-factor (or "two-way") ANOVA on these data.
    (d) Write a summary of the results of this study, including the (partial) effect size (η and/or η2 and/or ηp2). Draw your conclusions clearly.
    (e) From each cell (combination of factors), draw a random sample of n=20 individuals, out of the 200 in that cell. Explain how you have performed the random sampling. Repeat the two-way ANOVA on this smaller data set.
    (f) Discuss the similarities and differences in results between (d) and (e).
    This exercise is adapted from: Moore, D.S., & McCabe, G.P. (2003). Introduction to the Practice of Statistics (4th ed.). New York: Freeman. Example 13.8, pp.813-816.
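For question (e) above, the per-cell random sampling could be done in R along the following lines (a sketch; the data frame and variable names are assumptions, illustrated here with a simulated frame of the same structure):

```r
set.seed(2012)   # set and report a seed, so your sampling is reproducible

# simulate the structure of the data: 200 participants per cell
hr <- expand.grid(id = 1:200,
                  group = c("runner", "control"),
                  sex = c("M", "F"))

# draw a random sample of n=20 rows from each group x sex cell
smaller <- do.call(rbind,
  lapply(split(hr, list(hr$group, hr$sex)),
         function(cell) cell[sample(nrow(cell), 20), ]))

table(smaller$group, smaller$sex)   # 20 observations per cell
```

The same split-sample-rbind idiom works on the real data file once it is read in with read.table or equivalent.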

lab 4: Thu 5 / Fri 6 Jan

  • Perform one-way ANOVA of the datasets warpbreaks (number of breaks, wool type, tension condition) and Pitt_Shoaf1.txt (participant ID, condition, reaction time) that are provided in the Surfnet group (under Shared Documents > Extra).
  • Explore one-way ANOVA by means of this Java applet. For example, what happens if you add outliers or change variances?
  • Introduction to two-way ANOVA, interaction, fixed and random effects, error terms.
  • Work on above assignment involving two-way ANOVA about runners and sitters.

Adjusting t or adjusting df?

If the two samples have unequal variances, then the t test statistic may become inflated: the computed t value is larger than it should be. Consequently H0 may be rejected while in fact it should not be rejected. This is known as a Type I error. To prevent this error, we should decrease the t test statistic by some amount. In practice, however, it is easier to decrease not the t value itself, but its associated degrees of freedom. In this way we pretend that the t value is based on fewer observations than it actually is. Thus we are more conservative in testing our hypotheses.

The figure below shows the critical values of t (on the vertical axis) for a range of df (on the horizontal axis).
[figure: critical t values]
As you can see, decreasing the value of the t statistic with unchanged df (down arrow) has a similar effect as decreasing the df with unchanged t (left arrow). Either adjustment would here result in an insignificant outcome, and H0 would not be rejected. Because it is easier to compute the adjustment in df (length of left arrow) than the adjustment in t (length of down arrow), we commonly adjust the degrees of freedom, and not the t value, when we need to be more conservative.

We will encounter the same reasoning with the F values used in ANOVA; those adjustments are known as the Huynh-Feldt and Greenhouse-Geisser corrections to the degrees of freedom.
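In R this df adjustment happens automatically: t.test applies the Welch correction by default. A sketch with simulated samples of unequal variance:

```r
set.seed(1)
x <- rnorm(30, mean = 0,   sd = 1)   # small variance
y <- rnorm(30, mean = 0.5, sd = 3)   # much larger variance

# Welch t test (default): df adjusted downward, well below n1 + n2 - 2
t.test(x, y)$parameter

# classical t test assuming equal variances: df = 30 + 30 - 2 = 58
t.test(x, y, var.equal = TRUE)$parameter
```

Comparing the two df values shows the "left arrow" of the figure in action: the Welch procedure trades the hard-to-compute adjustment of t for an easy adjustment of df.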

session 7: Wed 10 Jan

ANOVA with subjects as random factor. Repeated Measures ANOVA.


Additional Readings:

  • Johnson (2008), Ch. 4 on Psycholinguistics, which also introduces ANOVA methods (book details below).
  • Additional ANOVA Topics, by Burt Gerstman (sections on RM ANOVA and following)
Compare these notes from similar courses in experimental research methods, at other universities.
Assignments:
Your answers and solutions to the questions below have to be handed in as described above. As always, write clearly, correctly, and concisely.
  1. Conduct a Repeated Measures ANOVA of the data from a split-plot design, as provided in file md593wide.txt. These are imaginary response times to the same task under three treatment conditions. Each participant is tested under all treatment conditions. Participants are from two groups, of young and old persons. Hence group is a between-subjects factor, and treatment is a within-subjects factor.
    Of course, you should start out with some exploratory data analysis, and have a look at the interaction pattern, and verify whether the data meet the assumptions for Repeated Measures ANOVA. You should also evaluate and discuss the effect sizes of the main effects and interactions.
    If you want to do this in R, then you should use the same data in "long" format, in md593long.txt. (Conversion between long and wide data formats can be done with the reshape function in R.)
    These data are from Maxwell & Delaney (2004, p.593, Tables 12.7 and 12.15).
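The wide-to-long conversion mentioned above can be done with R's reshape function; a sketch with a tiny made-up data frame (the column names are assumptions, not necessarily those of md593wide.txt):

```r
# wide format: one row per participant, one column per treatment condition
wide <- data.frame(subj  = 1:4,
                   group = c("young", "young", "old", "old"),
                   t1 = c(420, 450, 510, 530),
                   t2 = c(400, 440, 500, 520),
                   t3 = c(390, 430, 490, 515))

# long format: one row per participant x treatment combination
long <- reshape(wide, direction = "long",
                varying = c("t1", "t2", "t3"), v.names = "rt",
                timevar = "treatment", idvar = "subj")
head(long)
```

Going the other way, direction = "wide" converts a long data frame back; SPSS expects the wide layout for its Repeated Measures procedure, whereas R's aov expects the long layout.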

lab 5: Thu 12 / Fri 13 Jan

RM-ANOVA in SPSS and in R. Converting between wide and long data layouts. Interpreting results.
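In R, a repeated measures ANOVA for a split-plot design like the one above can be specified with an Error term for subjects. A sketch with simulated long-format data (the column names subj, group, treatment, rt are assumptions):

```r
set.seed(42)
# simulate long-format data: 10 subjects x 3 treatments
long <- expand.grid(subj = factor(1:10), treatment = factor(1:3))
long$group <- factor(ifelse(as.numeric(long$subj) <= 5, "young", "old"))
long$rt <- 450 + rnorm(nrow(long), sd = 20)

# group is between-subjects, treatment is within-subjects:
# Error(subj/treatment) declares treatment as nested within subjects
fit <- aov(rt ~ group * treatment + Error(subj/treatment), data = long)
summary(fit)
```

The Error() term makes aov compute separate error strata, so that group is tested against between-subject variation and treatment against the subject-by-treatment variation.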

session 8: Wed 18 Jan

Logistic regression (GLM).

  • Moore, McCabe & Craig (2009), Chapter 14 "Logistic Regression", only available online, on groups webpage.
  • optional: Generalized Linear Models, from StatSoft, Inc — this is not an easy text! Concentrate on the first part, until "Types of Analyses". The sections on matrix algebra may be skipped. Make notes about your questions and problems with this text.
  • optional: Johnson (2008), Chapter 5 "Sociolinguistics".
Links: odds and log odds curves
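The relation between probability, odds, and log odds (the logit scale on which logistic regression works) can be illustrated in R:

```r
p     <- c(0.1, 0.25, 0.5, 0.75, 0.9)
odds  <- p / (1 - p)    # odds = p / (1 - p)
logit <- log(odds)      # log odds: 0 at p = 0.5, symmetric around it
round(cbind(p, odds, logit), 2)

# going back: plogis() is the inverse of the logit transformation
round(plogis(logit), 2)   # recovers p
```

Note the symmetry: p = 0.25 and p = 0.75 have odds of 1/3 and 3, hence log odds of equal magnitude but opposite sign.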
Your answers and solutions to the questions below have to be handed in as described above. As always, write clearly, correctly, and concisely.
  1. Answer the following questions: Moore, McCabe & Craig (2009), Chapter 14, Exercises 26, 28, 30, 43 (exercise numbers revised for 6th ed.). In order to speed up your work on exercise 14.43, I've put the data on the web, in a plain text data file. The first line contains the names of the variables. Data (N=2900) start on line 2, and are coded as follows:
    	# hospital:  0=hosp.A, 1=hosp.B
    	# outcome:   0=died, 1=survived
    	# condition: 0=poor, 1=good
    Variables are separated by commas.
    In your logistic regression, the variables hospital and condition must be treated as categorical variables. For easier interpretation of the results, I prefer to use the zero codes as references or baselines (in SPSS choose Reference: First).
    SPSS does not provide you with 95% confidence intervals; you need to calculate these by hand. The Wald statistic in the SPSS output is the same as the test statistic for β as defined on p.46 in the reading material.
    Contrary to our regular schedule, you do not need to write a peer review for this assignment!
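The by-hand 95% confidence interval mentioned above is computed from the coefficient b and its standard error SE in the output: the CI for b is b ± 1.96·SE, and exponentiating its bounds gives the CI for the odds ratio. A numerical sketch (b and SE are made-up values, not the assignment's answer):

```r
b  <- 0.85   # hypothetical logistic regression coefficient (log odds scale)
se <- 0.20   # hypothetical standard error of b

ci.b <- b + c(-1, 1) * 1.96 * se   # 95% CI for the coefficient
round(ci.b, 3)                     # 0.458 1.242

round(exp(c(b, ci.b)), 2)          # odds ratio and its 95% CI
```

If the CI for b excludes 0 (equivalently, the CI for the odds ratio excludes 1), the predictor is significant at the 5% level, matching the Wald test in the SPSS output.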

lab 6: Thu 19 / Fri 20 Jan


We're going to analyze the fictional data in file polder.txt.
[figure: observed outcomes and estimated probabilities of the new language variant, broken down by speaker's sex and age]
The above figure shows the estimated probabilities and observed outcomes of the fictitious responses: whether a person speaks the new 'polder' language variant (hit, resp=1) or the old standard language variant (resp=0). The estimated probabilities come from a model having two predictors, viz. sex (F=female, M=male) and age (continuous, centered to median). Dashed lines represent 95% confidence intervals. The observed outcomes (pink=female, blue=male) are plotted with a small jitter to prevent visual overlap of observations.
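A model of this kind can be fitted in R with glm; a sketch using simulated data of the same structure (the lab session uses the real polder.txt file, and the coefficient values below are made up):

```r
set.seed(7)
n     <- 200
sex   <- factor(sample(c("F", "M"), n, replace = TRUE))
age   <- runif(n, 18, 80)
age.c <- age - median(age)   # center age on its median, as in the figure

# simulate binary responses from made-up coefficients
resp <- rbinom(n, 1, plogis(-0.5 + 0.8 * (sex == "F") - 0.05 * age.c))

# logistic regression with sex and centered age as predictors
model <- glm(resp ~ sex + age.c, family = binomial)
summary(model)$coefficients   # estimates are on the log odds scale
```

The fitted probabilities, predict(model, type = "response"), are what the smooth curves in the figure display for each sex across the age range.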

final assignment

For your final assignment you can choose either one of the two assignments described below. Deadline is Thu 26 Jan 2012, 23:59 h.

option one

This final assignment is to submit a revised or improved version of one previous assignment of this course. You're free to choose which one you want to revise.
As always, the revised paper should be (as much as possible) a running text, not a collection of incomplete sentences and statistical output.
In the revised version you have to accommodate the comments of your reviewer — if you agree of course. Also use the reading materials and hyperlinks provided.
You may discuss the reviewer's comments in the text of your revised version. But perhaps you find it easier to write a coherent (revised) text on your own, plus a second document with revision notes, in which you discuss the reviewer's comments explicitly, stating which comments you have taken into account, which comments you have ignored, and why.

option two

There are considerable similarities between analysis of variance (ANOVA) and multiple regression (MR), especially in designs without repeated measurements. You can read more about these similarities in the sources given below. Your assignment is to analyze a given dataset with both methods, and to discuss the differences and similarities among the two methods. The ANOVA must use a single independent variable named opleiding (type of study: 1=alfa, 2=beta, 3=gamma). The MR must use the so-called dummy factors named isalfa, isbeta, isgamma (0=false, 1=true, for each dummy factor), or a subset of these dummies. (Note that the given dataset already contains the categorical factor as well as the associated dummy factors.) Each row or unit represents a single participant of a fictional survey about students' work load. The dependent variable studietijd represents the time (in hour/week) a student spends on study-related activities. In your analyses, do not forget to inspect all relevant relationships between the factor(s) and the DV, to test whether assumptions are met, and to inspect residuals of all models.
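The equivalence of the two methods can be seen directly in R; a sketch with simulated data using the variable names from the assignment (the real analysis of course uses the provided dataset):

```r
set.seed(2012)
# simulated study-time data for three types of study
opleiding  <- factor(rep(c("alfa", "beta", "gamma"), each = 30))
studietijd <- 30 + 2 * (opleiding == "beta") + 5 * (opleiding == "gamma") +
              rnorm(90, sd = 4)

# dummy factors, with alfa as the baseline category
isbeta  <- as.numeric(opleiding == "beta")
isgamma <- as.numeric(opleiding == "gamma")

a <- anova(aov(studietijd ~ opleiding))           # one-way ANOVA
m <- summary(lm(studietijd ~ isbeta + isgamma))   # MR with two dummies

# the overall F tests are identical
c(ANOVA.F = a[["F value"]][1], MR.F = unname(m$fstatistic[1]))
```

With k groups, only k − 1 dummies are needed: the omitted category (here alfa) becomes the baseline, and each dummy coefficient estimates that group's deviation from it.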


deadline for final assignment: Thu 26 Jan 2012, 23:59 h

Please send your work in PDF to hquene .at. gmail .dot. com, with subject line "assignment MER".

Further reading and browsing