Galaxy Zoo Talk

How to deal with 'blending' and 'shredding'?

  • JeanTate by JeanTate

    The last Live Hangout has, as usual, its own GZ blog post: UPDATE: Next Live Hangout: Tuesday, 19th of November, 7 pm GMT.

    To get some deeper discussion going than is possible in comments on blog posts (and as Kyle Willett recommends), I've copied the three substantive comments here, as separate posts.

    Posted

  • JeanTate by JeanTate

    On November 18, 2013 at 7:45 pm, I wrote:

    I’ve been reading about aperture effects, specifically how conclusions from an analysis of the SDSS spectrum of an extragalactic object – ‘galaxy’ – can be wrong or misleading, especially when the fraction of the galaxy’s light that falls in the fiber which feeds the spectrograph is small (say ~20% or less; for interested readers of my comment, Kewley et al. 2005 is a cool paper on this).
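
    As a rough illustration of what I mean by a small covered fraction: assuming the standard SDSS fiberMag and modelMag columns, the fraction can be estimated along these lines (the magnitudes below are invented, purely for illustration):

    ```python
    # Rough estimate of the fraction of a galaxy's light that falls within the
    # 3-arcsec SDSS spectroscopic fiber, from the standard fiberMag and modelMag
    # photometry (illustrative sketch only; made-up magnitudes).

    def fiber_light_fraction(fiber_mag, total_mag):
        """Fraction of the total flux covered by the fiber, from magnitudes."""
        return 10.0 ** (-0.4 * (fiber_mag - total_mag))

    # Example: fiberMag_r = 18.9, modelMag_r = 17.2 gives ~0.21, i.e. only
    # ~20% of the galaxy's light actually feeds the spectrograph.
    print(fiber_light_fraction(18.9, 17.2))
    ```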

    That led me to another cool paper, Salim et al. (2007); here’s something in this paper which really struck me:

    A problem with matching in general is that what is considered to be a single object in one catalog can be resolved into multiple objects in another catalog, whether they are indeed separate objects (blending), or actually belong to the same system (shredding).

    This matching problem – in its blending and shredding guises, and more – has come up quite a few times in GZ projects: from Keel et al. (2013) trying to get reliable estimates of the magnitudes of overlapping galaxies, to Land et al. (2008) adopting a (flawed) method to identify shredded galaxies, to the difficulty of matching "Stripe 82" objects (in any of the three 'sets') in GZ2 with corresponding galaxies in the GZ1DR, to take just three examples.

    Now that Galaxy Zoo is using UKIDSS data, I guess the matching problem will become even worse, when it comes time to write up the classifications. Could you guys please talk about how you plan to address this? In particular, what have you learned from past iterations of GZ, which involved only the one primary data source (SDSS)?
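
    To be clear about the kind of matching I mean, here is a bare-bones positional cross-match (using astropy; the coordinates and the 1 arcsec tolerance are invented for illustration), which is exactly the sort of procedure that blending and shredding break:

    ```python
    import astropy.units as u
    from astropy.coordinates import SkyCoord

    # Toy SDSS-like and UKIDSS-like positions (RA, Dec in degrees); all values invented.
    sdss = SkyCoord(ra=[150.0010, 150.0200] * u.deg, dec=[2.0010, 2.0200] * u.deg)
    ukidss = SkyCoord(ra=[150.0012, 150.0195, 150.0201] * u.deg,
                      dec=[2.0011, 2.0198, 2.0203] * u.deg)

    # Nearest-neighbour match: for each SDSS object, find the closest UKIDSS object.
    idx, sep2d, _ = sdss.match_to_catalog_sky(ukidss)

    # A blend in one catalog, or a shredded galaxy in the other, shows up as
    # several objects all matching the same counterpart (or none within tolerance).
    for i, (j, sep) in enumerate(zip(idx, sep2d)):
        ok = sep < 1.0 * u.arcsec  # arbitrary tolerance, for illustration only
        print(f"SDSS {i} -> UKIDSS {j}: sep = {sep.to(u.arcsec):.2f}, matched = {ok}")
    ```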

    My second question is triggered by the GZ2DR (Willett et al. 2013). The method described treats (GZ2) classification classes as discrete, and parent choices as binary and absolute. So, for example, in the gz2_class an absolute decision is made to class a galaxy as either ‘E’ (‘smooth’ in Task 01) or ‘S’ (‘features or disk’); within E, as either ‘r’ (‘completely round’), ‘i’ (‘in-between’), or ‘c’ (‘cigar-shaped’).

    It seems to me that a lot of valuable information – coded in the raw zooites’ clicks – is lost by doing this.

    For example, the degree of roundedness is surely a cline, and a measure which captures the (perhaps weighted) distribution of zooites’ clicks – e.g. mean and standard deviation, with ‘r’ = 0, ‘i’ = 1, ‘c’ = 2 – would be richer than simply a weighted vote for a single discrete class.
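
    A sketch of the kind of measure I have in mind (the vote counts are invented):

    ```python
    import numpy as np

    # Invented vote counts for one 'smooth' galaxy on the roundedness question:
    # 'r' = completely round, 'i' = in-between, 'c' = cigar-shaped.
    # Map the three answers onto a 0/1/2 scale and summarise the distribution.
    votes = {"r": 14, "i": 22, "c": 4}
    scale = {"r": 0.0, "i": 1.0, "c": 2.0}

    counts = np.array([votes[k] for k in ("r", "i", "c")])
    values = np.array([scale[k] for k in ("r", "i", "c")])
    fractions = counts / counts.sum()

    mean = np.average(values, weights=fractions)
    std = np.sqrt(np.average((values - mean) ** 2, weights=fractions))

    # A continuous roundedness of ~0.75 +/- 0.6 carries more information than
    # the single discrete label 'i' that wins the vote.
    print(f"roundedness = {mean:.2f} +/- {std:.2f}")
    ```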

    Another example: as I discovered when I analyzed some of the Quench project objects, ‘cigar-shaped’ objects are largely disk (‘spiral’) galaxies, not ellipticals (many, perhaps most, are more elongated than E7). This means that there could be valuable information in the classifications of (a minority of) zooites who made the choice ‘features or disk’. Perhaps, as a ‘group’ (as Simpson et al. 2013 call such distinct subsets of a community), they have sharper eyes (or better computer monitors!) and have spotted features which may be obvious in UKIDSS images, making comparisons of the classifications particularly interesting?

    Would you guys please talk a bit about this? In particular, why did you choose to go down the ‘absolute, discrete classes’ route rather than a ‘sampling of a smooth continuum’ one?

    Posted

  • JeanTate by JeanTate

    On the day of the Hangout (November 19, 2013 at 4:27 am), Kyle replied:

    I’ll weigh in briefly on one of your points, Jean: the absolute classes were included at the specific request of the referee. I agree with you that one can potentially lose a lot of information based on that; the range of probabilities that can be inferred from Galaxy Zoo is one of the strengths of the project. They’re in the catalog, but I almost always recommend that people use the vote fractions themselves (either as weights or by setting a threshold), rather than those classes.

    Posted

  • JeanTate by JeanTate

    The last comment (up to now anyway) is mine (November 19, 2013 at 4:52 pm):

    Thanks Kyle. Especially for explaining why you started using gz2_class.

    I know there’s not much time left before the Hangout, but there are some aspects of what I wrote which go beyond absolute discrete classes vs vote fractions, and I’d be interested in hearing what you guys think.

    One: In GZ2, you can get to the question ‘Is there anything odd?’ via two different routes, starting with ‘features or disk’ or starting with ‘smooth’. The seven choices are not mutually exclusive – a zooite can select both ‘ring’ and ‘dust lane’, say – but there’s no way to tell which choices cluster (e.g. ‘ring’, ‘dust lane’, and ‘merger’ may each have totals of 12 votes, say, but you can’t tell whether every ‘ring’ vote was also a ‘merger’ vote, while all ‘dust lane’ votes were solos; a contrived example, of course).

    Similarly, you can’t work out if the voting on ‘anything odd’ is different, depending on whether the zooites came to the question from ‘smooth’ or from ‘features or disk’.
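
    What I'm imagining is something like the following, which would only be possible with the individual classification records rather than the per-choice totals (the records here are invented):

    ```python
    from collections import Counter
    from itertools import combinations

    # Invented per-classification selections for the 'anything odd?' question;
    # each inner list is one zooite's set of choices for the same galaxy.
    classifications = [
        ["ring", "merger"],
        ["ring", "merger"],
        ["dust lane"],
        ["ring"],
        ["dust lane"],
    ]

    solo = Counter()
    pairs = Counter()
    for choices in classifications:
        for c in choices:
            solo[c] += 1
        for a, b in combinations(sorted(choices), 2):
            pairs[(a, b)] += 1

    # The per-choice totals alone ('ring': 3, 'merger': 2, 'dust lane': 2) can't
    # tell you that every 'merger' vote was also a 'ring' vote, while every
    # 'dust lane' vote was a solo; the pair counts can.
    print(solo)
    print(pairs)
    ```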

    Two: As Simpson et al. (2013) report, in SN Zoo zooites could be divided quite cleanly into five distinct groups, in terms of their classification behavior. To what extent have you considered attempting to find out whether something similar is true of the zooites who took part in GZ2?
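
    Even a very crude version of that kind of grouping would be interesting to see; something along these lines, say, where the per-zooite behavior vectors are entirely made up and the clustering is far simpler than anything Simpson et al. actually did:

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    # Made-up behavior vectors: for each zooite, the fraction of their GZ2
    # Task 01 answers that were 'smooth', 'features or disk', 'star/artifact'.
    behavior = np.array([
        [0.70, 0.25, 0.05],
        [0.68, 0.28, 0.04],
        [0.30, 0.65, 0.05],
        [0.28, 0.67, 0.05],
        [0.10, 0.20, 0.70],  # e.g. a zooite who mostly flags artifacts
        [0.12, 0.18, 0.70],
    ])

    # Look for a small number of distinct 'groups' of classification behavior,
    # in the spirit of the five groups Simpson et al. found for SN Zoo.
    groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(behavior)
    print(groups)  # e.g. [0 0 1 1 2 2] (cluster labels are arbitrary)
    ```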

    Posted

  • JeanTate by JeanTate

    As those of you who've viewed the Hangout - either in real time or later - know, none of my questions were mentioned.

    So, now there's a thread here in GZ Talk on this, let's get a discussion going! 😃

    Posted

  • JeanTate by JeanTate in response to JeanTate's comment.

    Kyle added a comment, in the GZ blog, in response to my questions (November 27, 2013 at 6:24 am):

    1. That depends strongly on the version of Galaxy Zoo, actually. In GZ2 and GZ Hubble, you could only select one option from the “anything odd” category. In that case, all the votes for odd features are definite solos. In all of the images since the relaunch one year ago, it’s been possible to select multiple options (if desired) for the “anything odd” question. We do record all the votes individually, so that can be analyzed to see what the relationship is between these various options. I think that has the potential to be an interesting project.

    2. We’re working with several external groups on that question, including ones in the UK and Switzerland. We don’t have results to report at the moment; it’s a much trickier problem than the paper you referenced on GZ: Supernovae, due to the fact that GZ2 has dozens of potentially intersecting choices that can be made about a galaxy. GZ: SNe only had three choices per object, and so characterizing “behavior classes” is easier. It’s a big priority to model this behavior for all Zooniverse projects in the near future, though.

    Posted

  • JeanTate by JeanTate in response to JeanTate's comment.

    We’re working with several external groups on that question, [...] It’s a big priority to model this behavior for all Zooniverse projects in the near future, though.

    That's really cool, Kyle! 😄

    From my own experience, I've noticed that there are likely fairly big variations in my classifications, not least due to what machine I'm using and the environment. For example, on a laptop running on batteries during the day in a sunny location, I see far fewer features than if I'm using my 'extreme contrast' monitor at night, with little ambient light. Also, sometimes using 'invert image' I see features that are all but invisible in the direct image (but I don't always click on that button), but what I can see in the 'invert image' seems to vary even more by monitor/environment. Do you guys have plans to try to address this effect?

    Posted