Wrong way to do it. We know from psychometrics that forced binaries like this just create junk (people disagree with the question, so just choose a forced answer based on some heuristic for each such question like "closest to my mouse / finger" or "most socially desirable" or "same as last time"). So you aren't measuring what you think when you force choice like this.
If you're going to go with linguistic self-report and a single item, you really want something like an 11-point Likert scale. A smart design might get e.g. a person's rating of "blue-ness vs. green-ness" on an 11-point scale, then determine the optimal cutpoint via e.g. clustering, logistic regression, or some other method, to really get something meaningful.
Is it really junk though? There are several comments in this thread like “people tell me I call stuff blue that they think is green and this quiz confirms that.”