Galaxy Zoo Talk

How is the data integrity of the Galaxy Zoo 'clicks databases' assured?

  • JeanTate

    The various iterations/versions/whatever of Galaxy Zoo get their 'raw astronomical' data from such surveys as SDSS, UKIDSS, CANDELS, and ATLAS.

    Arfon, who now works at GitHub, briefly but nicely summarizes how this raw data is transformed into "Subjects", served up to "Users", and how the resulting "Classifications" are handled, in a series of Zooniverse blog posts: How the Zooniverse Works: The Domain Model, How the Zooniverse Works: Tools and Technologies, and How the Zooniverse Works: Keeping It Personal¹. In Zoo Tools: A New Way to Analyze, View and Share Data, ttfnrob introduces Tools, which involves manipulating some subset of the Subjects and Classifications data.
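
    For readers who haven't followed those posts, here is a minimal sketch of how that domain model might be represented; the field names are my own guesses from the blog-post summaries, not the actual Zooniverse schema:

    ```python
    # Hedged sketch of the Subjects / Users / Classifications domain model
    # outlined in Arfon's posts; all field names are illustrative guesses.

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class Subject:
        """One unit of 'raw astronomical' data served to volunteers,
        e.g. a cutout image built from SDSS/UKIDSS/CANDELS imaging."""
        subject_id: str
        survey: str
        metadata: dict = field(default_factory=dict)

    @dataclass
    class User:
        user_id: str
        registered: bool = True

    @dataclass
    class Classification:
        """One volunteer's completed pass through the decision tree
        for one Subject."""
        user_id: str
        subject_id: str
        annotations: list          # the sequence of answers ('clicks')
        created_at: datetime
    ```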

    As far as I can tell, there is nothing public on how the integrity of the Subjects and Classifications data is assured².

    However, there is at least one recent, public example of what seems like a serious problem with data integrity, at least in Tools (PM me if you'd like details, if you don't already know what I'm referring to).

    Hence this thread: how is the data integrity of the various Galaxy Zoo 'clicks databases' assured? Specifically, the Subjects and Classifications databases, and whatever databases Tools uses?

    This post is a copy of one in the GZ forum, with the same title; later today I may copy it into SW Talk, Quench Talk, and RGZ Talk as well.

    ¹ It's important to keep in mind that what Arfon describes does not refer to the way the original Galaxy Zoo worked, nor GZ2; see this post for more.
    ² For example, no published GZ paper reports results from any version of GZ that used the current set-up (Willett et al. 2013, for instance, describes GZ2).


  • JeanTate

    I added a second content post to the GZ forum thread; here it is (with various formatting edits):

    However, there is at least one recent, public example of what seems like a serious problem with data integrity, at least in Tools (PM me if you'd like details, if you don't already know what I'm referring to).

    I've just finished writing about another one, albeit perhaps not a serious one, and one not to do with the current Zooniverse setup.

    It concerns GZ2, and what Kyle Willett describes as 'incomplete classifications'.

    Here's a copy of what I wrote in the Questions about the GZ2DR paper (Willett et al. 2013) thread, over in the GZ forum:

    Four days ago, in announcing the Kaggle Galaxy Challenge reboot, Kyle Willett wrote:

    Over the first couple of weeks of the contest, several participants (notably @sedielem and others) found that some of the data was not behaving in the way expected from the Galaxy Zoo decision tree. The administrators and scientists have been looking into this in detail over the last week or so, and have confirmed that it's indeed a genuine error on our part.

    The cause is that, for a fraction of the original classifications, the total number of votes didn't completely carry through each step of the decision tree. The root cause isn't completely known; possibly the recording of incomplete classifications, or the method by which we removed duplicate classifications. The effect on the data that you received, though, meant that for many of the lower nodes in the decision tree (the values of which are expressed as normalized, cumulative fractions), it was possible to record a zero when the value should have been higher than that.

    Shortly afterwards, I asked: "Does this mean that the Galaxy Zoo 2 catalog contains erroneous classification data too?" to which Kyle replied:

    Yes, although I'm not sure "erroneous" is the correct description (potentially incomplete?). There's a small fraction of the classifications in which the total number of subsequent votes doesn't equal the sum above it. That being said, they're only ever off by ~+/- 1 vote, which doesn't have a significant effect on the vote fractions if there are sufficient counts. Our data and debiasing process did take this into account, and it's one of the reasons we emphasize that using the catalog intelligently should incorporate BOTH vote fractions and total numbers.
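
    As a rough illustration of the kind of consistency check involved, here is a minimal sketch that flags subjects where the votes entering a lower node don't match the votes recorded at it; the column and node names are hypothetical, not the actual GZ2 catalog columns:

    ```python
    # Hedged sketch: for each subject, compare the number of votes for the
    # parent answer that routes volunteers to a child task with the total
    # number of votes recorded at that child task. Kyle describes
    # discrepancies of roughly +/- 1 vote. All names are illustrative.

    import csv

    TREE_EDGES = {
        # parent-answer vote column -> child-task total-vote column
        "t01_features_votes": "t02_edgeon_total_votes",
        "t02_notedgeon_votes": "t03_bar_total_votes",
    }

    def find_inconsistent(rows):
        """Yield (subject_id, parent, child, parent_votes, child_votes)
        wherever the two counts disagree."""
        for row in rows:
            for parent, child in TREE_EDGES.items():
                p, c = int(row[parent]), int(row[child])
                if p != c:
                    yield row["subject_id"], parent, child, p, c

    if __name__ == "__main__":
        with open("gz2_vote_counts.csv") as f:   # hypothetical export
            bad = list(find_inconsistent(csv.DictReader(f)))
        print(f"{len(bad)} subject/node pairs with mismatched vote totals")
    ```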

    As far as I can tell, Willett+ 2013 - the paper this thread is about - says only this about what might be the root cause:

    In a small percentage of cases, individuals classified the same image more than once. In order to treat each vote as an independent measurement, classifications repeated by the same user were removed from the data, keeping only their votes from the last submission.

    There's nothing about 'incomplete classifications' (that I could find). Given that the "data and debiasing process did take this into account", how?

    Until I read this, I did not know that it was possible for the Galaxy Zoo clicks databases - GZ1, bias study, GZ2, and all later (as yet unpublished) ones - to contain incomplete classifications. Does any reader know if any of the earlier GZ catalogs - the original Galaxy Zoo and the bias studies - also contain clicks from incomplete classifications?


  • JeanTate

    Another GZ forum post:

    Until I read this, I did not know that it was possible for the Galaxy Zoo clicks databases - GZ1, bias study, GZ2, and all later (as yet unpublished) ones - to contain incomplete classifications. Does any reader know if any of the earlier GZ catalogs - the original Galaxy Zoo and the bias studies - also contain clicks from incomplete classifications?

    Of course, GZ1 and the bias studies could not, even in principle, contain "clicks from incomplete classifications", because these projects were '1-Click classify' (your single click was the only classification, for a particular object). Thanks to zutopian for reminding me of this.

    In Lintott+ 2008, two different kinds of 'data cleaning' are described:

    As the Galaxy Zoo website gathers data, these are stored into a live Structured Query Language (SQL) database. For each entry, we store the timestamp, user identification, galaxy identification and the classification chosen by the user. Classifications by unregistered visitors are discarded and the user requested to register and complete the tutorial described above.

    The first step in data reduction involves removing obviously bogus classifications. A small number of users seem to have recorded a number of these classifications, either using some sort of automated mechanism or due to some unknown problem with their browser. They are easy to discern by the fact that they have multiple classifications for a small number of galaxies. We find all users which have classified two or more galaxies more than five times each. This is extremely unlikely by Poisson distribution and hence all data points from such users are discarded. There are 36 such potentially malicious users, amounting to less than 0.05 per cent of the total number of participants. Furthermore, in order to account for accidental double clicks, if a user has classified the same galaxy more than once, we take into account only the first classification from each user. This latter stage ensures that no single user can unduly influence the classification assigned to a single galaxy. The two steps of this cleaning process together remove about 4 per cent of our data set.
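
    For what it's worth, here is a minimal sketch of what that two-step cleaning might look like, assuming each click is stored as a (user_id, galaxy_id, classification, timestamp) row; this is only my reading of the paragraph above, not the actual GZ1 pipeline:

    ```python
    # Hedged sketch of the two GZ1 cleaning steps described in Lintott+ 2008,
    # as I read them; not the actual GZ1 code. `rows` is a list of dicts with
    # keys user_id, galaxy_id, classification, timestamp.

    from collections import Counter, defaultdict

    def clean_gz1_clicks(rows):
        # Step 1: discard every classification from users who classified
        # two or more galaxies more than five times each.
        per_user_galaxy = Counter((r["user_id"], r["galaxy_id"]) for r in rows)
        repeats_per_user = defaultdict(int)
        for (user, _galaxy), n in per_user_galaxy.items():
            if n > 5:
                repeats_per_user[user] += 1
        suspect_users = {u for u, n_gal in repeats_per_user.items() if n_gal >= 2}
        rows = [r for r in rows if r["user_id"] not in suspect_users]

        # Step 2: to handle accidental double clicks, keep only the *first*
        # classification of each galaxy by each user (GZ2, by contrast,
        # kept the *last* submission; see below).
        rows.sort(key=lambda r: r["timestamp"])
        seen, cleaned = set(), []
        for r in rows:
            key = (r["user_id"], r["galaxy_id"])
            if key not in seen:
                seen.add(key)
                cleaned.append(r)
        return cleaned
    ```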

    There is, as far as I know, no way to independently verify the extent to which the 'cleaning' of the GZ1 (and the bias studies) clicks followed the steps described.

    While Land+ 2008 does not say so explicitly, it seems that the bias studies used the same two-step cleaning as GZ1.

    As described in Willett+ 2013, however, this kind of cleaning was done in a rather different way in GZ2:

    In a small percentage of cases, individuals classified the same image more than once. In order to treat each vote as an independent measurement, classifications repeated by the same user were removed from the data, keeping only their votes from the last submission.

    [...]

    The next step is to reduce the influence of potentially unreliable classifiers (whose classifications are consistent with random selection). To do so, an iterative weighting scheme (similar to that used for GZ1) is applied.
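
    To make the contrast with GZ1 concrete, here is a minimal, hedged sketch of the GZ2-style de-duplication (keep the last submission) together with a generic consistency-based iterative weighting loop; the weighting formula below is my own illustration of the general idea, not the actual scheme described in Willett+ 2013:

    ```python
    # Hedged sketch only: GZ2-style de-duplication plus a toy iterative
    # weighting loop. The weighting here is illustrative, not the
    # published GZ2 scheme.

    from collections import defaultdict

    def dedupe_keep_last(rows):
        """Keep only each user's last-submitted classification of each image."""
        last = {}
        for r in sorted(rows, key=lambda r: r["timestamp"]):
            last[(r["user_id"], r["image_id"])] = r
        return list(last.values())

    def iterate_weights(rows, n_iter=10):
        """Toy consistency weighting: a user's weight is their rate of
        agreement with the current weighted plurality answer per image."""
        weights = defaultdict(lambda: 1.0)
        for _ in range(n_iter):
            votes = defaultdict(lambda: defaultdict(float))
            for r in rows:
                votes[r["image_id"]][r["answer"]] += weights[r["user_id"]]
            plurality = {img: max(v, key=v.get) for img, v in votes.items()}
            agree, total = defaultdict(float), defaultdict(int)
            for r in rows:
                total[r["user_id"]] += 1
                if r["answer"] == plurality[r["image_id"]]:
                    agree[r["user_id"]] += 1
            weights = defaultdict(lambda: 1.0,
                                  {u: agree[u] / total[u] for u in total})
        return weights
    ```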
