Hello all,

Apologies for posting an elementary query, but my stats is very rusty. Not looking for an explicit solution, necessarily, just a pointer in the right direction. (And if I've posted to the wrong sub-forum, I'd be grateful for suggestions.)

I have N records. Each contains M real values (an individual's known characteristics) and one measurement of that individual's result on a particular test. I know that N>>M. As a simple example, suppose I have the age/height/weight of 1000 individuals (thus M=3, N=1000), as well as each person's time t_run on a 10km run at maximum effort. Importantly, in some cases, the person could not complete the run at all, so t_run for those records is undefined.

I would appreciate any help with understanding the following:

1. Assuming this data is representative of some (larger) population, what is a reasonable way to predict someone's test result (here, 10km time) as a function of known characteristics (here, age/height/weight)? Since N>>M, one idea I had was to compute the least-squares coefficients, k_m (for m = 1 .. M), such that t_run_predicted = k_1*age + k_2*height + k_3*weight.
2. What is the correct term for the approach described in (1) -- linear regression? correlation analysis? (I just need to figure out where to start looking.)
3. I'm concerned that setting t_run = (infinity) for those records where the test subject was unable to complete the run will cause problems (e.g. undefined matrix inverse and/or pseudoinverse). Would setting t_run as, say, 10X the largest t_run recorded by anyone who completed the run be a reasonable workaround?
4. I'm uncertain if the problem is linear in the known characteristics. For example, the run time might be roughly linear in height and weight, but quadratic in age. Is there a standard approach for estimating the best exponents (orders?) in such a polynomial, if any or all of them are not unity? (Again, I'm not necessarily asking for the answer -- just what this analysis is *called*, so I can try to teach myself how to do it.)
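
To make the least-squares idea concrete, here is a rough Python/NumPy sketch of what I have in mind. Everything in it is an assumption for illustration: the data is synthetic, the function name fit_least_squares and the "true" coefficients are made up, and unfinished runs are stored as NaN and simply skipped. I don't know whether skipping them is statistically sound (that is exactly my third question); it just avoids putting infinity into the solver.

```python
import numpy as np

def fit_least_squares(X, t):
    """Return k minimizing ||X @ k - t||^2, using only rows where t is finite."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    finished = np.isfinite(t)  # NaN or inf marks a run that was not completed
    k, *_ = np.linalg.lstsq(X[finished], t[finished], rcond=None)
    return k

# Synthetic stand-in for the real records (all values are invented).
rng = np.random.default_rng(0)
N = 1000
age = rng.uniform(18, 70, N)
height = rng.uniform(150, 200, N)
weight = rng.uniform(50, 110, N)
X = np.column_stack([age, height, weight])

true_k = np.array([0.4, 0.1, 0.3])           # hypothetical coefficients
t_run = X @ true_k + rng.normal(0, 1.0, N)   # minutes, with some noise
t_run[rng.random(N) < 0.05] = np.nan         # about 5% did not finish

# Fit t_run_predicted = k_1*age + k_2*height + k_3*weight (no constant term,
# to match the formula in my first question).
k = fit_least_squares(X, t_run)
t_pred = X @ k

# A quadratic dependence on age can be tried by adding age**2 as a column;
# the fit is still linear in the unknown coefficients.
X_quad = np.column_stack([age, age**2, height, weight])
k_quad = fit_least_squares(X_quad, t_run)
```

With N>>M the system is heavily overdetermined, so the solver returns one coefficient per column (three for X, four for X_quad). What this does not tell me is which exponents to try in the first place, which is the part of the fourth question I'm stuck on.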

Thanks very much in advance for any pointers or suggestions!

-Heywood
