How to analyze/model a test that demonstrates that a factor affects perception

I am having trouble figuring out the proper analysis for an experiment I designed, and am hoping for some guidance. What follows is a rough analogy to what happens in the test. It's not the actual test, but it is far easier for me to describe than the actual test, which has technical terms pertaining to audio and music that can be very confusing.

Let's say we have two sets of cubes. In one set of 10, each cube is 2 cm on a side, and the cubes are otherwise visually identical. In the other set of 10, each cube is 4 cm on a side, and the cubes are again otherwise visually identical. In each set, the cubes range in weight from 1 kg to 2 kg, in 0.1 kg increments, with the weight 1.5 kg omitted: [1, 1.1, 1.2, 1.3, 1.4, 1.6, 1.7, 1.8, 1.9, 2]

In the test, for a single round, the participant first picks up and returns to the test platform a "reference" cube of 1.5 kg. The reference cube that they are given may be either the larger or smaller size. Then they pick up a cube selected for them from one of the two sets, and return that to the test platform. Lastly, they write down whether they think the 2nd cube is lighter or heavier than the first cube. During the course of the test, the test rounds cover every possible sequence. Thus we have four types of comparisons:
  1. 1st cube small, 2nd cube small
  2. 1st cube small, 2nd cube large
  3. 1st cube large, 2nd cube small
  4. 1st cube large, 2nd cube large
And in each type, every one of the set of 10 weights will be presented to the subject once. The participant does 40 comparisons, total. Let's say we test 100 participants in total.
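As a minimal sketch, the design described above can be enumerated in Python to confirm the trial count. The variable names (`trials`, `ref_size`, `test_size`, `test_kg`) are made up for illustration and are not part of the actual test.

```python
from itertools import product

REFERENCE_KG = 1.5  # weight of the reference cube in every round

# Second-cube weights: 1.0 to 2.0 kg in 0.1 kg steps, with 1.5 kg omitted
weights = [w / 10 for w in range(10, 21) if w != 15]
sizes = ["small", "large"]

# One trial per (reference size, second-cube size, second-cube weight)
trials = [
    {"ref_size": ref, "test_size": test, "test_kg": kg}
    for ref, test in product(sizes, sizes)
    for kg in weights
]

print(len(weights), "weights per comparison type")
print(len(trials), "comparisons per participant")
```

Each participant would contribute one "lighter"/"heavier" response per entry in `trials`.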

In the results, let's say that when comparing like-to-like sizes, the data show that correctly identifying whether the second cube is lighter or heavier gets harder as the second cube's weight approaches 1.5 kg. If we tabulate the "misses", the second cubes weighing 1.4 kg and 1.6 kg will have the most "misses", while the second cubes weighing 1 kg or 2 kg will have the most "correct" responses.

Now, let's suppose that when we handle cubes of different sizes, our brain tricks us into thinking the larger one is heavier than it actually is (by maybe 0.3 kg), and this is reflected in the responses about relative weights. Thus the results skew to reflect the mistaken perception. For example, the large-sized 1.2 kg to 1.4 kg cubes will now more often be misjudged as heavier than the small-sized 1.5 kg reference cube, and the large-sized 1.6 kg cube will be much less likely to be mistaken for being lighter than the reference cube.
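One way to picture the expected skew is a small Python sketch in which the chance of answering "heavier" follows a logistic curve in the perceived weight difference. The logistic form, the `spread_kg` value, and the function name `p_heavier` are all assumptions chosen for illustration; only the 0.3 kg bias comes from the description above.

```python
import math

def p_heavier(test_kg, ref_kg=1.5, size_bias_kg=0.0, spread_kg=0.15):
    """Probability of answering 'heavier' under an assumed logistic model.

    size_bias_kg is the extra weight the second cube is perceived to have
    because of its size; spread_kg (hypothetical) controls how noisy
    judgments are near the reference weight.
    """
    perceived_diff = (test_kg + size_bias_kg) - ref_kg
    return 1 / (1 + math.exp(-perceived_diff / spread_kg))

for kg in (1.2, 1.4, 1.6):
    same = p_heavier(kg)                        # like-to-like sizes
    skewed = p_heavier(kg, size_bias_kg=0.3)    # large cube vs small reference
    print(f"{kg:.1f} kg  same size: {same:.2f}  large vs small ref: {skewed:.2f}")
```

Under these assumptions, a positive bias shifts the whole curve, so light large cubes are called "heavier" more often and the 1.6 kg large cube is rarely called "lighter".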

Is this a non-parametric scenario? I was initially tempted to compare the mode, median or mean between the four "types" of comparisons listed earlier, but I'm not confident that this qualifies as a data set where that is valid.

Is my thesis that the cube-size should have no influence on the relative weight judgment task success, that only the weight matters?

How do we measure our confidence that test results which demonstrate a skew are due to the influence of the cube-size and not a random occurrence?

I apologize if this is rather convoluted or improperly presented. I've only had one course, decades ago, in basic statistics, and am having trouble finding anything similar in design to this experiment in my old text book.

TS Contributor
If you treat the comparisons as study subjects, then you have 40 subjects. For each subject, the dependent variable is the proportion of correct responses, and the predictors are the weight of the second cube relative to the reference cube, the size of the second cube (0=small, 1=large), and the size of the reference cube (0=small, 1=large). Maybe the interaction between the two size variables and/or between the weight and size variables should be included. You could then perform a multiple linear regression.
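A rough Python sketch of such a regression over the 40 comparison cells follows. The outcome values are placeholders generated from a known linear rule, not real data, and `fit_ols` is a hypothetical helper that solves the normal equations in plain Python; in practice a library such as NumPy or statsmodels would likely be used instead.

```python
def fit_ols(X, y):
    """Least squares via the normal equations (X'X) b = X'y, Gauss-Jordan."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

REFERENCE_KG = 1.5
weights = [w / 10 for w in range(10, 21) if w != 15]

X, y = [], []
for ref_large in (0, 1):
    for test_large in (0, 1):
        for w in weights:
            diff = w - REFERENCE_KG  # weight relative to the reference cube
            X.append([1.0, diff, test_large, ref_large, test_large * ref_large])
            # Placeholder proportion correct from a made-up rule, NOT real data
            y.append(0.7 + 0.2 * diff + 0.1 * test_large - 0.1 * ref_large)

names = ["intercept", "weight_diff", "test_large", "ref_large", "test_x_ref"]
coefs = fit_ols(X, y)
for name, b in zip(names, coefs):
    print(f"{name:12s} {b:+.3f}")
```

With real proportions in `y`, the size and interaction coefficients would be the quantities of interest; because the placeholder rule is exactly linear, the fit here simply recovers the numbers that were put in.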

Alternatively, one could maybe think about a multilevel model.

With kind regards

my thesis [is] that the cube-size should have no influence on the relative weight judgment task success
Restate the thesis as a null hypothesis and an accompanying alternative hypothesis:

H0: There is no relationship between size difference and judgement accuracy
H1: There is a relationship between size difference and judgement accuracy

Partition the test results into two sets, { size differed, identical size }
Then follows a comparison of two sets of test results: Are the averages for these two sets identical or are they different?
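One hedged way to ask whether the two averages differ by more than chance is a permutation test, sketched below in Python. The accuracy values are invented placeholders, not results from the experiment, and `permutation_p_value` is a hypothetical helper name.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the two sets
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_perm + 1)

# Placeholder per-participant accuracies (made up for illustration only)
same_size = [0.90, 0.85, 0.95, 0.80, 0.90, 0.85, 0.95, 0.90, 0.85]
diff_size = [0.70, 0.65, 0.75, 0.60, 0.70, 0.65, 0.80, 0.70, 0.60]

p = permutation_p_value(same_size, diff_size)
print(f"difference in means: {mean(same_size) - mean(diff_size):.3f}, p = {p:.4f}")
```

Because every participant contributes to both sets, a paired version (randomly flipping the sign of each participant's difference) may be the better fit; the unpaired form is shown only because it matches the two-set partition described above.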

There are four possible outcomes to hypothesis testing
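The four outcomes are the standard combinations of whether the null hypothesis is actually true and whether the test rejects it. A small Python sketch, with the hypothetical function name `classify`, spells them out:

```python
def classify(h0_is_true, rejected_h0):
    """Name the outcome for one combination of truth and test decision."""
    if h0_is_true and rejected_h0:
        return "Type I error (false positive)"
    if not h0_is_true and not rejected_h0:
        return "Type II error (false negative)"
    return "correct decision"

for h0_is_true in (True, False):
    for rejected_h0 in (True, False):
        truth = "H0 true" if h0_is_true else "H0 false"
        decision = "reject H0" if rejected_h0 else "fail to reject H0"
        print(f"{truth:9s} | {decision:18s} | {classify(h0_is_true, rejected_h0)}")
```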
I wish to thank you both for your helpful replies!

@Karabiner I see your point about first breaking the analysis into 40 different test subjects and obtaining a dependent variable for each. My "homework" is to research the term "multilevel" and to work through how I might apply multiple linear regression to the results of the 40 subjects.

@AngleWyrm I have seen this chart before; it is in my textbook. I also appreciate the restatement of the thesis into hypothesis and null hypothesis. My "homework" is to clarify for myself how different results would fit into the different quadrants. I'm wondering if calculating "averages" is a suggestion meant to be taken literally, or if it is a stand-in for the type of analysis needed (as suggested by @Karabiner). My task would certainly be simplified if I could simply give the different averages! I think the following charts (from a "pilot" test using 9 participants) show that the average for the like-to-like cases is near zero, and that the averages for the unlike cases skew consistently based on size.

Note: the actual test involves audio perception, but when I tried to explain the parameters to folks not versed in the terminology, a lot of confusion ensued. I came up with the device of using size and weight in the hope that it would be more intuitive, so that the focus could stay on the data analysis task. Also, I didn't want to get side-tracked into explaining/justifying the "0" result, which gives us 44 tests, not the 40 I originally stated.

Thus, the X-axis shows the "weight difference" from the first cube to the second cube. The Y axis shows the number of "incorrect" responses (the judgement task gets more difficult as the perceived weight difference gets smaller). My pilot had 9 participants.