Okay, warning: this post is going to nerd out, majorly. I have studied this in great detail, aided by years of math research experience.
@Aaron91, you bring up a great discussion, but I think you may be missing a few things. This post is broken into varying degrees of technical analysis. I will often refer to the attached Figure A. The vertical line is placed at the baseline 90% mark (45/50). Only data to the left of the line are considered.
Less Technical
[Figure A (attachment 44083): baseline vs. Day 90 word scores, with the y=x line, the 90% baseline cutoff, and the Thornton & Raffin 95% bounds]
Firstly, in the published paper, they mention that of the n=23 ears, a total of 10 ears (<=90% baseline) were eligible (Section: Human Hearing Assessment and Outcomes). These 10 then split further into 6 FX-322 ears and 4 placebo ears. So the nontechnical takeaway is that 4/6 FX-322 ears showed clinically meaningful improvements (two-sided test at the 5% significance level, per Thornton and Raffin, 1978). None of the 4 placebo ears reached this mark.
The bear thesis points out that the 6 eligible FX-322 ears sit lower than the 4 eligible placebo ears. This is imbalanced data, and it occurs by chance: the <=90% requirement was set pre-study and everything was selected at random. It's just sort of unfortunate. I won't comment on the group-level comparisons because the fine details are not disclosed (that I can see) in the patent submission. I do understand the individual statistics very well, even down to the nitty-gritty of how the Thornton and Raffin 95% confidence intervals are derived and why they are justified (discussed below).
As we can see from the figure, 2/4 of the placebo ears retested below the y=x line (same performance at baseline and Day 90) and 2/4 retested above. Of note, the 2 that worsened started with lower baselines than the 2 that improved. So the theory that the only reason the 4 responders came from FX-322 is "low starting baselines" is at least somewhat tenuous, since the placebo improvers started with higher scores.
In summary, 4/6 eligible FX-322 ears saw clinically meaningful and two-sided statistically significant improvement (at the alpha = .05 level) per Thornton and Raffin. For the placebo ears, the count was 0/4.
More Technical
Without any derivations, what is a 2-sided 95% confidence interval and what does it mean here? Why is a 2-sided more conservative than a 1-sided confidence interval?
In layman's terms, a 95% confidence interval for an unknown population parameter (in this case, the retest percentage score) is an interval of the form (L, U), where L is the lower bound and U is the upper bound, such that if the experiment were repeated indefinitely under truly identical conditions (i.e., no drug, same testing conditions, etc.), the retest score would land in that interval 95% of the time. Essentially, "statistical significance" means that the retest score landed outside that interval: although there's a chance it happened at random, there's statistical reason to believe that the experiments really were different (i.e., the drug did something).
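To make the repeated-experiment idea concrete, here is a toy simulation of my own (not Thornton and Raffin's actual construction; the 50-word list size matches the paper, but the 70% true score and the crude normal-approximation interval are just illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 50, 100_000
p = 0.70  # hypothetical true word score; any value works

# Under the null (no drug effect), baseline and retest are two
# draws from the same binomial distribution.
baseline = rng.binomial(n, p, reps) / n
retest = rng.binomial(n, p, reps) / n

# Crude two-sided 95% interval around each baseline score, using the
# normal approximation for the difference of two scores.
se = np.sqrt(2 * baseline * (1 - baseline) / n)
inside = (retest >= baseline - 1.96 * se) & (retest <= baseline + 1.96 * se)
print(f"retest landed inside the interval {inside.mean():.1%} of the time")
```

Run it and the printed coverage sits right around 95%, which is exactly the "repeated indefinitely" guarantee described above.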
Okay, so what's with this 2-sided vs. 1-sided business? Note that in Figure A, the dotted lines above and below the dashed line represent the L and U I described in the paragraph above. This is 2-sided because one can fall outside of the confidence interval from above or below.
A 2-sided test is more conservative because it allows for the possibility of worsening. In other words, clearing the upper mark U (and obviously the goal is to see noteworthy improvement) requires more evidence. On the other hand, a 1-sided test would only look for improvement. In that case, in Figure A, the lower dotted curve (L) would be gone and the upper dotted curve (U) would be shifted down. In other words, it would be easier for participants to clear the line, indicating statistical significance.
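A quick way to see the "shifted down" part, using the standard normal approximation (the z-values here are textbook constants; everything else is just illustration):

```python
from scipy.stats import norm

# Two-sided at alpha = .05 splits the error across both tails,
# while one-sided puts all of it in the improvement tail.
z_two_sided = norm.ppf(1 - 0.05 / 2)  # ~1.96
z_one_sided = norm.ppf(1 - 0.05)      # ~1.645
print(z_two_sided, z_one_sided)       # the one-sided bar is lower
```

The margin scales with z, so 1.645 vs. 1.96 is precisely why the one-sided upper curve would sit below the two-sided one.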
Anyways, this is what they mean when they show the 4 FX-322 ears above the dotted line.
Some natural thoughts may come to mind. We can say someone saw statistically significant improvement or that they didn't. But what's actually the case? Obviously, data is random and there's no way of knowing their true word score due to limitations in sample size. Hence, we could be right or wrong.
Being wrong in the "false positive" sense is called a Type I error. This is worse because it's saying someone improved when they didn't. Imagine approving an ineffective or even unsafe drug. Big problems.
A "false negative" is called a Type II error. This is when we say the null hypothesis could be true (i.e., no improvement) when actual improvement did occur. Though still undesirable, this error is the preferred one. Both errors are possible in any study, and that has nothing at all to do with incompetence; managing the two is kind of the point of statistical inference.
I won't get into it, but there are all kinds of analyses on sample size, what's called "statistical power", and setting up the experiment to control Type I and Type II errors. Perhaps unsurprisingly, the two are inversely related because they are opposite ideas. If one lowers the significance level, it becomes harder to prove efficacy, which reduces Type I errors but raises the chances of a Type II error.
To understand this, think of an extreme example. Say they were 99.9% confidence intervals (alpha = .001, very small). It would be very hard to prove statistical significance, and we would essentially never reject the null hypothesis. The whole process is a balancing act: they want the drug approved and have a vision going in as to what the results will be, but they have to test as if they don't know and as if even a worsening could occur.
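Here's a little simulation of that balancing act. The 50%-vs-70% "true" scores are made up, and I'm using a plain normal-approximation test rather than Thornton and Raffin's exact procedure, so treat it as a sketch of the tradeoff only:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 50, 200_000
p0, p1 = 0.50, 0.70                  # "no change" vs. "real improvement"
se = np.sqrt(2 * p0 * (1 - p0) / n)  # SE of a score difference under the null

base = rng.binomial(n, p0, reps) / n
same = rng.binomial(n, p0, reps) / n    # retest with no real change
better = rng.binomial(n, p1, reps) / n  # retest with a real improvement

for alpha, z in [(0.05, 1.96), (0.001, 3.29)]:
    type1 = np.mean(np.abs(same - base) / se > z)     # false positives
    type2 = np.mean(np.abs(better - base) / se <= z)  # missed improvements
    print(f"alpha={alpha}: Type I ~ {type1:.3f}, Type II ~ {type2:.3f}")
```

Tightening alpha from .05 to .001 crushes the Type I rate, but the Type II rate roughly doubles: the inverse relationship in action.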
Much More Technical
What is it that Thornton and Raffin do to come up with the 95% WR confidence intervals for test-retest?
Think of it like this: say I take the test twice. Let's assume my performance is modeled by a binomial distribution (i.e., trials (words) are independent and I have the same probability of getting each word right, which depends on my hearing ability).
The problem with test-retesting is that we don't know my true success rate, p. Say I take the test at baseline and score 50% (an estimate of p, not p itself). Then at 90 days, I take it again and score 52%. What is my true p? If I took the test hundreds of times, what would my observed percentage converge to?
Okay, so by the Law of Large Numbers, as I answer more and more words, my percentage score should get closer and closer to my true score. Let's assume that after I have answered 50 words on each of 2 separate tests, my true percentage is almost known.
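A quick sketch of that convergence (the 64% true score and the list sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.64  # hypothetical true word score

# The more words I answer, the closer my observed percentage
# should sit to the true one (Law of Large Numbers).
for n_words in [50, 500, 5_000, 50_000]:
    score = rng.binomial(n_words, p_true) / n_words
    print(f"{n_words:>6} words: observed {score:.3f} vs true {p_true}")
```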
Then, through some fancy math, we can justify a "variance stabilizing" angular transform (Freeman and Tukey, 1950). To make a long story short, this is done because not knowing the true score produces vast differences in the size of the 95% error margin depending on which p we pick: the binomial variance p(1-p)/n changes a lot as p moves, so the transform is chosen to make the variance roughly constant no matter what p is.
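You can see the stabilization numerically. The raw score has variance p(1-p)/n, which moves around with p, while arcsin(sqrt(p-hat)) has variance of roughly 1/(4n) for any p. (I'm using the plain arcsine version here; Freeman and Tukey's refinement tweaks it for small counts.)

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 50, 200_000

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    phat = rng.binomial(n, p, reps) / n
    raw = phat.var()                      # depends strongly on p
    ang = np.arcsin(np.sqrt(phat)).var()  # hovers near 1/(4n) = 0.005
    print(f"p={p}: var(score)={raw:.4f}, var(transformed)={ang:.4f}")
```

The raw variance swings by a factor of about 3 across that range of p, while the transformed variance barely moves.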
This gives us the confidence that it doesn't matter if we didn't quite capture the true p. We can safely perform hypothesis testing with the 95% confidence interval and it's pretty reliable.
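Putting it together, here is a minimal sketch of how such an interval can be built in transformed space. This is my simplified stand-in using the plain arcsine transform, not Thornton and Raffin's exact published tables; the 25/50 baseline is just for illustration:

```python
import numpy as np

def retest_interval(correct, n=50, z=1.96):
    """Rough two-sided 95% retest interval for a baseline score of
    correct/n: transform, add/subtract the margin, transform back."""
    theta = np.arcsin(np.sqrt(correct / n))
    # Each transformed score has variance ~1/(4n), so the difference
    # of two independent scores has variance ~1/(2n).
    half = z * np.sqrt(1 / (2 * n))
    lo = np.sin(max(theta - half, 0.0)) ** 2
    hi = np.sin(min(theta + half, np.pi / 2)) ** 2
    return lo, hi

lo, hi = retest_interval(25)  # baseline 25/50 = 50%
print(f"retest outside ~({lo:.0%}, {hi:.0%}) suggests a real change")
```

For a 50% baseline on a 50-word list this gives roughly (31%, 69%), which is in the same ballpark as the published critical differences.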
If the WR scores do not follow a binomial distribution (and they don't perfectly, since some words are harder than others), this method is somewhat flawed. However, it's a pretty reasonable assumption, and Thornton and Raffin actually examined it in the first part of their paper.
For the group-level stuff, I really want to know more, but they don't disclose which test they performed other than saying it's an MMRM (Mixed Model for Repeated Measures) approach. I hope to learn more about this at some point.