Galaxy Zoo Talk

Shared learning: Statistics, Data Mining, and Machine Learning in Astronomy

  • JeanTate by JeanTate

    Shared learning: Statistics, Data Mining, and Machine Learning in Astronomy is a thread in the GZ forum, started by me, on May 17, 2014:

    Statistics, Data Mining, and Machine Learning in Astronomy, by Ivezic, Connolly, VanderPlas and Gray (2014, Princeton Univ. Press, ISBN 978-0-691-15168-7).

    My aim in starting this thread is to have a single place where those interested in this book can share their:

    • experiences going through it
    • learning using it
    • comments on it
    • suggestions for further reading or learning
    • etc.

    JohnF introduced GZ zooites to it, here in this forum: Review of "Statistics, Data Mining and Machine Learning in Astronomy": [...]

    Posted

  • vrooje by vrooje admin, scientist

    This is the book that includes Python code, right? I haven't tried it yet but I very likely will at some point.

    If anyone is more proficient in R (or just finds it easier to install), there's also Modern Statistical Methods for Astronomy by Feigelson & Babu. I suspect a lot of the methods overlap so if there is any discussion of the statistics (as opposed to the code), people working from either book could probably join in. 😃

    Posted

  • JeanTate by JeanTate in response to vrooje's comment.

    This is the book that includes Python code, right?

    Yes, the associated website - AstroML - contains not only all the Python code used in the book, but also a whole lot more useful resources and background.

    If anyone is more proficient in R (or just finds it easier to install), there's also Modern Statistical Methods for Astronomy by Feigelson & Babu.

    Indeed; I discovered Ivezic+ and Feigelson & Babu at about the same time (see this Quench Talk post); mlpeck - who is very familiar with R - wrote (on April 22 2014):

    I've only looked at the table of contents of Feigelson & Babu, but I am seriously considering buying it. Just from perusing the TOC the book covers the same ground as Ivezic et al. with perhaps a bit more emphasis on statistics and less on data mining and machine learning. If I had the budget for just one I'd choose based on my preferred data analysis environment -- R or Python.

    Posted

  • vrooje by vrooje admin, scientist in response to JeanTate's comment.

    R is a fantastic statistics package with a lot of analysis routines built in that are still being written for Python. But they are being written for Python, and Python is a more general language, so it's not an easy choice!

    By the way, there is an Astrostatistics workshop every year at Penn State University; this year was its tenth, and the syllabus, course notes, and labs are here (mostly the labs are in R, but there were a few in other languages such as C, Python and Julia). I was there this year; if you want to get a feel for what being there was like, I kept a storify of the tweets from the workshop (which admittedly came mostly from me). The storify feed is just for fun; the notes, on the other hand, are complete (I think) and very valuable.

    Posted

  • Capella05 by Capella05 moderator

    I am currently learning Python - there are a lot of free tools available, and it is easy to use - but I very much doubt I am at the same level as either of you!

    A more entry-level book was recommended to me by one of the scientists over at Spacewarps.

    Posted

  • mlpeck by mlpeck in response to vrooje's comment.

    @vrooje: Thanks for the links!

    I still haven't actually purchased the book by Feigelson & Babu, but they do have online data at http://astrostatistics.psu.edu/MSMA/datasets/index.html and scripts at http://astrostatistics.psu.edu/MSMA/MSMA_R_scripts.html. One big difference between their book and Ivezic+ is that their code consists of short scripts that use built-in R functions or third-party packages, while the latter built a large code base of their own in Python. Either book is going to require additional resources to actually learn R or Python.

    Now that the GZ forum is dead I guess I will have to check Talk now and then, even though I haven't contributed clicks in some time.

    Posted

  • JeanTate by JeanTate

    Here's an example of what I've been able to produce, using Python (source is RGZ Talk thread ARG0003f5p - close-in triple; far-out doublelobe):

    [Image: SDSS cutout overlaid with FIRST and NVSS contours]

    Boilerplate: SDSS image per http://skyservice.pha.jhu.edu/DR10/ImgCutout/getjpeg.aspx, FIRST (red) and NVSS (cyan) contours derived from FITS files produced using SkyView with Python code described in this RGZ Talk thread. Image center per the ARG image (ARG0003f5p; J2000.0). "z_sp" is the SDSS spectroscopic redshift of the galaxy in the center.

    Posted

  • Peter_Dzwig by Peter_Dzwig

    Impressive, Jean.

    Peter

    Posted

  • JeanTate by JeanTate in response to vrooje's comment.

    Thanks for these, vrooje. They are a useful resource.

    Posted

  • JeanTate by JeanTate in response to Peter Dzwig's comment.

    Thanks Peter.

    Most of the code was fairly easy to write; however, Matplotlib has soooo many quirks it took forever to get the plots looking nice. For example, sometimes the kwarg is 'color', others 'colors', and it seems - to complete newbie me - totally arbitrary which to use! 😮
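    For anyone hitting the same 'color' versus 'colors' puzzle, here is a minimal sketch of the difference. A rough rule of thumb (not a guarantee across all of Matplotlib) is that calls which draw a single artist, such as plot, take the singular 'color', while calls which draw a collection of things, such as the level curves from contour, take the plural 'colors'. The data below are made up purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: a 1-D curve and a 2-D Gaussian bump to contour.
x = np.linspace(-2.0, 2.0, 50)
X, Y = np.meshgrid(x, x)
Z = np.exp(-(X**2 + Y**2))

fig, ax = plt.subplots()

# plot() draws one line, so it takes the singular keyword 'color'.
line, = ax.plot(x, np.sin(x), color="red")

# contour() draws several level curves at once, so it takes the plural
# keyword 'colors' (a single colour is applied to every level).
contours = ax.contour(X, Y, Z, levels=[0.2, 0.5, 0.8], colors="cyan")
```

    Passing a list instead, for example colors=["cyan", "blue", "navy"], gives each contour level its own colour.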

    Posted

  • vrooje by vrooje admin, scientist

    If you're interested in statistics and other computer tools for astronomy, you might be interested in this:
    JPL-Caltech Virtual Summer School: Big Data Analytics

    It's a free online course (not for credit, just for knowledge) about using the tools needed to analyze big datasets. Apparently everyone is welcome to attend, but I don't know what the level of support will be.

    I'm not involved in organizing it and have no idea what the coursework is going to be like; just wanted to post it in case anyone is interested.

    Posted

  • JeanTate by JeanTate

    I'm working my way through the book, slowly; I started Chapter 5 not long ago. It's about Bayesian Statistical Inference, and is wonderful! 😃 Not that I hadn't encountered Bayesian statistics before (I have, many times), but somehow this time it really clicked. I guess one reason is that mlpeck gave his fellow Quench zooites a taste, introducing terms like conjugate priors, the flat prior being an improper prior, and the beta distribution. Although the principle of maximum entropy wasn't/isn't new to me (far from it), and I don't remember if it came up in Quench or not, when it was introduced in the book (section 5.2.2) it just seemed so ... natural (haven't quite got my head around Lagrange multipliers yet though).
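    To make "conjugate prior" concrete, here is a minimal sketch (my own illustration, not code from the book or from AstroML). For a binomial likelihood, a Beta(a, b) prior on the success fraction gives a Beta(a + k, b + n - k) posterior after k successes in n trials. The function names and the 7-out-of-10 example are made up. Note that on the bounded interval [0, 1] a flat prior is just Beta(1, 1) and is proper; it is flat priors on unbounded parameters that are improper.

```python
def beta_binomial_update(k, n, a=1.0, b=1.0):
    """Posterior Beta parameters after k successes in n trials.

    The prior is Beta(a, b); the default a = b = 1 is the flat prior on [0, 1].
    """
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    # Conjugacy: successes add to a, failures add to b.
    return a + k, b + (n - k)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical example: 7 of 10 galaxies in a sample show some feature.
a_post, b_post = beta_binomial_update(k=7, n=10)
print(a_post, b_post)                        # 8.0 4.0
print(round(beta_mean(a_post, b_post), 3))   # 0.667
```

    The posterior mean (8/12) is pulled slightly from the raw fraction 7/10 towards the prior mean of 1/2, which is the usual behaviour with a small sample.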

    Although the book is anything but cheap, it's one of the best investments I've ever made!

    Is any other zooite on a similar journey?

    Posted

  • mlpeck by mlpeck in response to JeanTate's comment.

    I've read every chapter except the one on time series -- I'm saving that one for the day I decide I'm actually interested in time series. When you get to section 5.8 I might be able to point you to some resources for doing MCMC.

    A suggestion or two for further reading:

    I posted this link on Quench talk but I'm sure more people will read it here. I think this is the reference KLMasters was trying to recall upthread in that discussion. Cameron, E. 2011: On the Estimation of Confidence Intervals for Binomial Population Proportions in Astronomy: The Simplicity and Superiority of the Bayesian Approach.
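    A minimal sketch of the general idea, assuming a uniform prior on the fraction and an equal-tailed interval: the posterior for k successes in n trials is then Beta(k + 1, n - k + 1), and the interval comes from its quantiles. This is my own pure-Python illustration with hypothetical function names, not code from the paper; the quantiles are found by crude numerical integration, where in practice a library routine for beta quantiles would be the better choice.

```python
import math


def beta_pdf(p, a, b):
    """Density of a Beta(a, b) distribution at p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def beta_quantile(q, a, b, steps=20_000):
    """Approximate the q-quantile by accumulating the pdf (midpoint rule)."""
    dp = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * dp
        total += beta_pdf(p, a, b) * dp
        if total >= q:
            return p
    return 1.0


def binomial_credible_interval(k, n, level=0.683):
    """Equal-tailed credible interval for a fraction, uniform prior assumed."""
    a, b = k + 1, n - k + 1          # posterior is Beta(k + 1, n - k + 1)
    tail = (1.0 - level) / 2.0
    return beta_quantile(tail, a, b), beta_quantile(1.0 - tail, a, b)


# Hypothetical example: 7 successes in 10 trials.
lo, hi = binomial_credible_interval(7, 10)
print(f"fraction = 0.70, 68.3% interval = ({lo:.2f}, {hi:.2f})")
```

    Unlike the simple sqrt(p(1 - p)/n) error bar, an interval built this way always stays inside [0, 1], even when k is 0 or n.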

    Cameron is also a coauthor of a couple of recent papers submitted to arxiv that are basically introductions to Generalized Linear Models. And he has a blog that's worth a look.

    de Souza, R.S. et al. 2014: The Overlooked Potential of Generalized Linear Models in Astronomy - I: Binomial Regression and Numerical Simulations.

    Elliott, J. et al. 2014: The Overlooked Potential of Generalized Linear Models in Astronomy - II: Gamma regression and photometric redshifts.

    You've mentioned David Hogg several times in different contexts. He often posts pedagogical papers to arxiv, sometimes intended for publication elsewhere and sometimes arxiv "exclusives." For example, Hogg, Bovy and Lang 2010: Data analysis recipes: Fitting a model to data. The section on Bayesian regression with bad data (section 3) forms the basis for section 8.9.1 in Ivezic et al.

    Posted