We are often asked what we think of new ideas and unproven causes. We think that in the right circumstances the charity sector can be substantially improved through testing out new ideas, and we highly respect work being done in this area by organizations like Innovations for Poverty Action and Animal Charity Evaluators. However, we do not support all new and unproven ideas equally, so we thought it would be worth expanding on how we decide what to support as an experiment, and what makes a model suitable for a good test.
How to Choose What to Test
The first question is determining what is worth testing. We mainly look for four things when considering whether to test an idea.
1. Decision Relevance
The first thing we look for is decision relevance: whether the outcome of an experiment might change choices we would otherwise make. We, like many others, often spend far too much time thinking about things that are not decision relevant, issues that would not affect our actions. These questions may be fun to think about, but they do not carry the same importance as ones that would majorly change our lives, for example, a consideration that would change the course of our career in the long term or affect where we would move thousands of dollars. If a consideration is not relevant to important choices we are making, it is often not worth the time it would take to test adequately.
2. Promising Evidence
The second thing we look for is promising evidence. The most promising areas to experiment in have some prior evidence suggesting they are important areas. This evidence is often not strong enough to answer the important questions being asked, but it is enough to make the area worth exploring.
This evidence should be weighed against the resources required to perform the test. For example, if something would only take an hour to test, we would require much weaker evidence than if the test would take six months of work.
3. Base Rates
The third thing we look at is base rates, that is, the success rates of those who have tried similar things. Everyone tends to think they are better than average. We try to temper this assumption with real data on how successful past efforts similar to ours have been, not just the most successful ones.
The example we often use to illustrate this point is technology startups. Most people, looking at a given technology startup, would assume it will not do as well as Microsoft has. Yet in other areas, people often assume they will not only do as well as the best in that area, but several times better.
In the absence of strong evidence to the contrary, we tend to think this perspective is naive and try instead to think of ourselves as average. Although we could come up with dozens of ways in which we are not average, the truth is almost anyone could do the same, or come up with reasons that their idea is more likely to succeed than others.
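One way to picture this tempering is as a weighted average between a project's self-assessed chance of success and the observed base rate of similar past efforts. The sketch below is purely illustrative; the function name, the weighting scheme, and all numbers are hypothetical, not anything The Greatest Good actually computes.

```python
def tempered_estimate(self_assessed: float, base_rate: float,
                      weight_on_self: float = 0.2) -> float:
    """Shrink an optimistic self-assessment toward the base rate.

    weight_on_self reflects how much independent evidence we have
    that this project truly differs from the average past effort;
    absent strong evidence, it should stay small.
    """
    return weight_on_self * self_assessed + (1 - weight_on_self) * base_rate


# A founder believes their project has a 60% chance of success,
# but only 10% of comparable past projects succeeded.
# The tempered estimate lands much closer to the base rate (about 0.20).
print(tempered_estimate(0.60, 0.10))
```

With little independent evidence, the base rate dominates; the self-assessment only moves the estimate substantially once `weight_on_self` is justified by strong outside reasons to think the project is unusual.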
4. What Do Unbiased People Think?
We look at whether other unbiased people, whose values and epistemology we respect, think the idea is worth testing. We tend to find that many people think their own ideas and projects are substantially better than an outside observer would judge them to be. For this reason we are often wary of projects that only seem good to the people invested in them, and we try to run our ideas past a large number of people to get a sense of what others think. People tend to be overly optimistic, so if many people think something is a bad idea, that is usually a bad sign.
Characteristics of a Good Test
If we determine an idea is worth testing, how do we make sure our test is as strong as possible while taking minimal time and resources?
The first thing we aim to do is to make our test quickly falsifiable. We want to gain evidence of impact as fast as possible. If the project is not worth doing, we do not want to spend excessive time or resources performing an overly long test. There has to be a clear way that the test could fail.
This may seem like an obvious claim, but many tests we have seen could only provide positive evidence or very weak negative results, never enough to end the project or prove something ineffective. For example, the argument that a certain activity, say a protest, “plants a seed” of change is nearly impossible to falsify.
We have written before on the interesting fact that so few charities fail based on impact. In fact, we are not sure what it would take for most other organizations to fail, or whether it would even be possible for them to fail based on impact rather than for lack of funding.
We try to ensure that every project The Greatest Good attempts is falsifiable, and that we will be held accountable for its results. If an organization does not have a clear benchmark or point at which it would fail, it’s likely the staff or board will move the goalposts to accommodate whatever happens, even if it is a sign they should shut down.
The second thing we aim to do is to be transparent in our testing strategies and methodology. This is part of what makes our model both accountable and falsifiable. If we only published the most flattering information, our project could look high impact even if it is not. We want it to be clear to our supporters if the project is not working. This will help ensure we personally do not fall victim to a sunk cost situation.
Test on a Small Scale
Another thing we attempt to do is test on a small scale. We see this as beneficial for a number of psychological and practical reasons. Psychologically, when testing on a small scale, people are less likely to get overly attached to the project. They are also less likely to tie the project’s success to their perceived personal success. Viewing the project as an experiment also helps with this.
From a more practical perspective, testing on a smaller scale takes less time and fewer resources. It minimizes the investment in ideas that prove not to work - as many if not most new and untested ideas do. Although it is more fun and confers more social status to accept large donations and grow staff quickly, it’s not the right thing for an experiment. The money could likely be better spent unless we have strong evidence that our idea causes a lot of good for the world. This is particularly true in the case of evidence-based philanthropy, as new experiments are competing with highly efficient charities.
We really wish more organizations would only expand after testing on a small scale and finding strong evidence of impact. We think this would result in fewer but much stronger organizations.
Measuring the Right Metrics
Another very important thing is not to take success at fundraising to mean having an impact (unless that is your charity's goal). It is easy to confuse the two, as a charity's size or funding level is often seen as a metric of success, and if you simply assume your charity is doing good, then growing it looks like a positive thing.
We would argue this is measuring the wrong metric: getting funding is comparatively easy; being a truly high-impact charity is much harder. There are hundreds of thousands of charities that receive funding, yet top charity recommenders only recommend a few as the highest impact, and those are not the best-funded charities.
The right metric is the ultimate goal - reducing suffering and increasing happiness in the broadest sense of those words. The way The Greatest Good does this is by measuring how many lives are saved, how many poor individuals have their income raised, and how many people become happier. These are some of the ultimate outcomes of money moved to evidence-based charities, as demonstrated by the rigorous research of GiveWell, an external charity evaluator.
One more thing we attempt to do to improve our testing is to get an unbiased external review from critical and informed sources that we then publish on our website. Once again this holds us to a higher standard than we could hold ourselves to, and ensures that the public can see any problems or questionable claims we are making.
Have a Plan B
When looking at a project as a test rather than a definite plan for long-term execution, it's advisable to have multiple contingency plans. As you do not yet know the result of your test, you have to prepare for both positive and negative outcomes. It's good for an organization to have tentative plans both for expanding the project and for shutting it down. Once an organization sets up safety measures so that the project's failure would not majorly harm its members, employees become much more willing to accept the possibility of failure.
In conclusion we wish organizations were more self-skeptical, transparent, and tested things on a smaller scale. We also wish they were genuinely ready to shut themselves down if their impact was not high enough. We hope to provide an example of this in the experiments we run.
You also might be interested in our operations blog, which details our month-to-month organizational progress, our more technical ideas, and our board meeting minutes.