GaryM

New Member
#1
Hi all, I hope all is great with everyone. I'm new on here, so sorry to get straight down to my query; I hope someone might be able to point me in the right direction. Thank you.

So:
I have a prospects direct mail file and I'm testing a new message (one change). To ensure it's the message change, and not differences between the files, that drives my results, I want to make sure that the test and control files are similar (i.e. not significantly different from each other) on my key variable(s).
The test and control files will be randomly split 50:50, using a random number generator, from the total prospect file, which could range from 200 to 5,000 records in total.

I believe I have two independent samples for which I want to test the difference in means using a two-tailed test (H0: μ1 = μ2, H1: μ1 ≠ μ2 for the population means). With sample sizes > 30, I was planning to use a Z-test.
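Roughly what I have in mind, in Python (the file name and the "age" column are just placeholders for my real file and key variable; I've used Welch's t-test here because scipy makes it easy, and with samples this size it should be practically the same as a z-test):

import pandas as pd
from scipy import stats

# "prospects.csv" and "age" are placeholders for my real file and key variable
prospects = pd.read_csv("prospects.csv")

# random 50:50 split using a seeded shuffle
shuffled = prospects.sample(frac=1, random_state=42).reset_index(drop=True)
half = len(shuffled) // 2
test, control = shuffled.iloc[:half], shuffled.iloc[half:]

# two-sided two-sample test on the key variable (Welch's t-test, no equal-variance assumption)
t_stat, p_value = stats.ttest_ind(test["age"], control["age"], equal_var=False)
print(t_stat, p_value)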

Q1: How do I choose an acceptable critical value? How close to 'similar', or how different, is acceptable?
As long as the p-value is greater than 0.05 (i.e. a 95% confidence level), is that enough to accept H0, that there is no difference between the two means? Or, for this particular check, would a 90% level (or lower) be better, since the files would then only pass when the means are closer and therefore less likely to be different? Thoughts?

Q2: With that in mind, do I really only want to make sure they are not significantly different from each other, or is there merit in forcing them to be as similar as possible (i.e. sorting on the key variable and selecting every 2nd record as the test 50%)? Does adding such bias render the overall test meaningless, or would it still be valid?

Q3: Would stratified sampling be a better solution, maintaining proportions within key variables?

Many thanks in advance.

Gary
 
#3
Generally, it is better to use the t-test if you don't know the population standard deviation. The t-distribution tends to the Z (normal) distribution as n tends to infinity, so when n > 30 the z-test will also give you a reasonable result, but I assume that rule of thumb dates from the time people used tables instead of computers ...

To reduce the standard deviation of your sample data you can plan your test and sample randomly within criteria (for example, if one of your variables is gender and you know the odds are 50:50, you can plan for that instead of a totally random draw coming out at, say, 40:60).

I don't understand whether the test you mention is to ensure the data is similar in both groups, or to compare your parameters between groups?
 
#4
Thanks Obh, I'm not sure I totally understand your points, but I can answer your question:

I am comparing the means of a key variable in the test and control files to make sure the files are not significantly different, OR that they are similar enough to accept that any differences are normal and are not driving potential differences in performance between the two files. I want any difference in performance between test and control to be due to the message change alone, and to limit all other possible causes.

As far as the t-test vs z-test is concerned: is there a population standard deviation when comparing two samples? Is that not the SD of the total prospects list from which the two samples were split? If so, I can calculate that, as my two samples make up 100% of the list.

So:
- Should I still use the t-test?
- Is a critical value of 0.05 the best to use in this case to check there are no differences between the files, or would a higher value be advised to ensure differences are limited?
- If the latter, can I just sort on my key variable score and take every 2nd record as my test, thus ensuring similarity, or does this bias invalidate my test?

Thank you.
 
#5
If you know the standard deviation of the entire population you should use the z-test; if you calculate the standard deviation only from the sample you should use the t-test.
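Roughly, the two statistics look like this (σ is the population standard deviation, s the standard deviation estimated from a sample):

z = (x̅1 - x̅2) / √(σ1²/n1 + σ2²/n2), with σ1, σ2 known for the population
t = (x̅1 - x̅2) / √(s1²/n1 + s2²/n2), with s1, s2 calculated from the samples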

Before that ... can you plan your sample so the key variable is similar in both groups, instead of taking it randomly and then testing it?
 
#6
If you know the standard deviation of the entire population you should use the z-test; if you calculate the standard deviation only from the sample you should use the t-test.

Before that ... can you plan your sample so the key variable is similar in both groups, instead of taking it randomly and then testing it?
That is in essence my question, yes. Is it valid and correct to force them to be similar (i.e. sort on the key variable and take every 2nd record as the test), or does the potential bias introduced mean this is never recommended, and pure random samples or stratified random samples are always recommended instead?

Any thoughts on setting the critical value if the split is random and I am testing the difference between the two groups? Is 0.05 high enough to judge them as not significantly different, in your opinion?
 
#7
I don't understand: why every second record?

A simple example (hopefully I understand you correctly):

Let's assume your main variable is color: 40% red, 60% blue.
You want to take a 10% sample and change the message in this sample (to check whether the message change will influence Y).
If, instead of taking a random 10% sample from the entire population, your sample is 10% taken randomly from the reds and 10% taken randomly from the blues, it won't be a biased sample.
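A rough sketch of that in Python (the file name and "color" column are example names only; groupby sampling needs pandas 1.1 or newer):

import pandas as pd

# "prospects.csv" and the "color" column are example names only
population = pd.read_csv("prospects.csv")

# take 10% within each color, so the sample keeps the 40/60 red/blue proportions
sample = (population
          .groupby("color", group_keys=False)
          .sample(frac=0.10, random_state=42))

print(sample["color"].value_counts(normalize=True))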
 
#8
I don't understand: why every second record?

A simple example (hopefully I understand you correctly):

Let's assume your main variable is color: 40% red, 60% blue.
You want to take a 10% sample and change the message in this sample (to check whether the message change will influence Y).
If, instead of taking a random 10% sample from the entire population, your sample is 10% taken randomly from the reds and 10% taken randomly from the blues, it won't be a biased sample.
_____________________________
Hi obh, yes, I believe you are referring to stratified random sampling, which is one of my options.
As to why every second record: I am splitting the whole prospect file into two 50% files, so sorting on my key variable (say age) and taking every 2nd (even) record as the test and every odd record as the control will give me two files with very similar average ages (as long as there is not an extremely large step between two adjacent records). I may want to do this because I know, for example, that under-40s respond far better than over-40s.
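For example, something along these lines (file and column names are just placeholders):

import pandas as pd

prospects = pd.read_csv("prospects.csv")  # placeholder file name

# sort on the key variable, then alternate records between test and control
ordered = prospects.sort_values("age").reset_index(drop=True)
test = ordered.iloc[::2]       # even-positioned records
control = ordered.iloc[1::2]   # odd-positioned records

# the average ages should come out almost identical
print(test["age"].mean(), control["age"].mean())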

However, my issue and query here is: does selecting/sorting this way introduce bias that should be avoided? Is it bad practice, or an acceptable and reliable process to eliminate the chance that differences between the test and control files drive the results, as opposed to the impact of the message change?

Or, if I should avoid such a process due to the bias introduced, I could simply carry out a pure random split and test the difference between the two files on average age.

Or, as outlined by you, justifiably force the proportions within the test and control files using stratified sampling: splitting out the under-40s and over-40s, carrying out pure random sampling within each of the two age groups, and splitting each 50:50 into test and control, so the proportion of under and over 40s in the test file is the same as in the control file.
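Something like this, I think (again, the names are placeholders):

import pandas as pd

prospects = pd.read_csv("prospects.csv")  # placeholder file name
prospects["age_band"] = prospects["age"].apply(lambda a: "under 40" if a < 40 else "40 and over")

test_parts, control_parts = [], []
for _, band in prospects.groupby("age_band"):
    shuffled = band.sample(frac=1, random_state=42)  # pure random order within the band
    test_parts.append(shuffled.iloc[::2])            # half of the band to test
    control_parts.append(shuffled.iloc[1::2])        # half of the band to control

test = pd.concat(test_parts)
control = pd.concat(control_parts)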


1 - I know pure random and stratified random sampling are valid techniques, but I just need to understand the best placement for the critical value: 0.05 or higher? Or is any p-value above 0.05 high enough for the files to be judged similar enough not to impact the result?

2 - Or is it valid to force the similarity by sorting on the key variable, or does the potential bias introduced make this selection process for our test and control files inadvisable?

Cheers.
 
#9
"Hi obh, yes i believe you are referring to stratified random samples which is one of my options.
As to why every second record - I am however the whole prospect file into two 50% files therefore sorting on my key variable (say age) and taking every 2nd (even) record as the test and every odd record as my control will provide me with two files with very similar ave ages (as long as there is not an extremely large stepped reduction between two records). I may want to do this as I know for example that under 40 year old's respond far better than over 40's.


However, my issue and query here is, does selecting / sorting this way introduce bias that should be avoided? is it bad practice or an acceptable and reliable process to eliminate the chance of the differences between test and control files driving the results as opposed to the impact of the message change."

Or, if I should avoid such a process due to the bias introduced, I could simply carry out a pure random split and test the difference between the two files on average age.

I assume that in most cases taking every second record will be okay. If you have a periodic process it may not be good, like an extreme example of sending 2 mails a day, 1 red mail every morning and 1 blue mail every evening.

Or, as outlined by you, justifiably force the proportions within the test and control files using stratified sampling: splitting out the under-40s and over-40s, carrying out pure random sampling within each of the two age groups, and splitting each 50:50 into test and control, so the proportion of under and over 40s in the test file is the same as in the control file.

I think this is a better method.

1 - I know pure random and stratified random sampling are valid techniques, but I just need to understand the best placement for the critical value: 0.05 or higher? Or is any p-value above 0.05 high enough for the files to be judged similar enough not to impact the result?

If you use stratified random sampling you won't need to check the similarity. The significance level is the maximum allowed chance of a type I error (rejecting a correct H0), under the assumption that H0 is correct and we want to reject it only if we are pretty sure it isn't correct. That is not the case in your question.
You are more interested in the power of the test to reject an incorrect H0, and that relates to the required effect size you want the test to identify.
I don't think this is the correct direction ...
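For example, a back-of-the-envelope power calculation (the effect size, alpha and power here are made-up numbers, only to show the idea):

from statsmodels.stats.power import TTestIndPower

# records needed per group to detect a small effect (Cohen's d = 0.1)
# with alpha = 0.05 and 80% power, two-sided - illustrative numbers only
n_per_group = TTestIndPower().solve_power(effect_size=0.1, alpha=0.05,
                                          power=0.8, alternative="two-sided")
print(round(n_per_group))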

2 - Or is it valid to force the similarity by sorting on the key variable, or does the potential bias introduced make this selection process for our test and control files inadvisable?
If you mean stratified sampling, then it is valid.
 
#10
Thanks again for your continued responses, I do appreciate them and they are helping. So are you saying:

1- It's not bad practice to sort on the identified key variable and select the test records by taking every 2nd record, but stratified sampling is nevertheless a better method?

2- And, with stratified sampling there is no need to carry out a pre-mailing test to ensure that the selected test and control files are not too (significantly) dissimilar, right? But why is there no need to still check that there are no significant differences between the test and control files?

3- The significance level is the maximum allowed chance of a type I error (rejecting a correct H0), under the assumption that H0 is correct and we want to reject it only if we are pretty sure it isn't correct. That is not the case in your question.
You are more interested in the power of the test to reject an incorrect H0, and that relates to the required effect size you want the test to identify.
I don't think this is the correct direction ...

Not sure I fully followed this, but I think you agree that by testing the control and test files I am trying to avoid a Type II error, i.e. accepting H0 (the files are not significantly different) when they actually are different. But you don't think this is the correct direction. Could you please advise what you mean by 'not the correct direction', and what is?

Thanks.
 
#11
1- It's not bad practice to sort on the identified key variable and select the test records by taking every 2nd record, but stratified sampling is nevertheless a better method?
I thought you meant taking every second record totally at random, without sorting.
Sorting will probably be fine as well (for the first attribute it will behave like stratified sampling, for the rest like random sampling).
Stratified sampling will probably be the best.


2- And, with stratified sampling there is no need to carry out a pre-mailing test to ensure that the selected test and control files are not too (significantly) dissimilar, right? But why is there no need to still check that there are no significant differences between the test and control files?
Correct, since you ensure it by the stratified sampling method...

3- The significance level is the maximum allowed chance of a type I error (rejecting a correct H0), under the assumption that H0 is correct and we want to reject it only if we are pretty sure it isn't correct. That is not the case in your question.
You are more interested in the power of the test to reject an incorrect H0, and that relates to the required effect size you want the test to identify.
I don't think this is the correct direction ...

Not sure I fully followed this, but I think you agree that by testing the control and test files I am trying to avoid a Type II error, i.e. accepting H0 (the files are not significantly different) when they actually are different. But you don't think this is the correct direction. Could you please advise what you mean by 'not the correct direction', and what is?

The significance level (alpha) is the maximum allowed probability of a type 1 error (rejecting a correct H0).
Beta is the maximum allowed probability of a type 2 error (rejecting a correct H1). Power = 1 - Beta.
In your case, rejecting a correct H1 means the test says the data is a "good sample" while it isn't.
 

ondansetron

TS Contributor
#13
The significance level (alpha) is the maximum allowed probability of a type 1 error (rejecting a correct H0).
Beta is the maximum allowed probability of a type 2 error (rejecting a correct H1). Power = 1 - Beta.
In your case, rejecting a correct H1 means the test says the data is a "good sample" while it isn't.
A Type II error is in reference to incorrectly accepting the null hypothesis, not something to do with the alternative (H1/Ha).

Rejecting the null hypothesis does not mean the data are “good” or that the sample is “good”. This is a common misconception that leads to p-hacking (my data aren’t “good”, let me try again...and again, and again.)
 
#14
When you incorrectly accept H0, you don't accept the correct H1 ...
Of course, H0 is the default, so it is more accurate to say "not accept" than "reject", since you only reject the default.
But I used "reject" to emphasize that you need to look at both sides, both types of error.

I didn't write that the goal is to reject H0 ... the goal is to build a test with the correct power to reject H0 when H0 is not correct (based on the required effect size).
Many people treat only the significance level but forget to treat the other side, the power of the test.