Photometric redshift - empirical methods (machine learning)

I am currently learning about estimation of photometric redshifts with machine learning methods (or empirical methods in general). These methods use the knowledge about the photometry and the spectroscopic redshift of many galaxies in order to infer a mapping between the photometry and the redshift. Then, based on this mapping, redshifts can be estimated for the photometry of other galaxies.

I've read that for empirical methods it is crucial that the training data (i.e. the data from which the mapping of the photometry to the redshift is inferred) represents the galaxies for which estimated redshifts are desired in the future. I understand that this is crucial, but in what sense does the training data represent a certain distribution and in what sense are other galaxies represented or not represented by this training data? How do I know if a galaxy is well represented by the training data so that I can estimate a redshift for the galaxy?

Would the galaxy have to be from the same region? Does it have to have the same mass? What are the factors to look at, if I want to know whether a galaxy is from the same distribution/is well represented by the training data?

This was too long for a comment, but is not a real answer since I'm not completely sure, but:

My guess is that "representing the galaxies" refers to the "type" of galaxy that you intend to observe, where by "type" I mean e.g. Lyman-break galaxies (Steidel et al. 1996), Lyman $\alpha$ emitters (Partridge & Peebles 1967), sub-millimeter galaxies (Blain et al. 2002), (U)LIRGs, etc.

These terms all refer to selection methods (i.e. observational techniques), and hence also to physical differences. The closer the training data are to the observed sample, the better your algorithm will be at linking their redshifts to their photometric properties.

Mass is just one property; there's also e.g. dustiness, star-burstiness, stellar population, age, and others. "Region" is probably less important, but since the clustering of galaxies also affects their properties (e.g. their morphology, Dressler 1980), it could potentially influence the result.

Determining the photometric redshift means looking at the light from the galaxy through a limited number of color filters (or bands), and inferring the redshift from that data. For instance, the light coming from the galaxy can be measured in the visible light band, the infrared band, and so on. This constitutes the photometry. Then the redshift is determined either by fitting a physical model to the light in each band, or by using machine learning.

With machine learning, the training features consist of the amount of light in each available band, and the training labels are the spectroscopic redshifts. If the training data is representative of the real data, that means that for each sample in the real data, there are galaxies in the training data that have similar amounts of light through the different bands.

In short, to know if the data is representative, you have to look at where it is in the feature space.
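One way to make "where it is in the feature space" concrete is to compare each target galaxy's distance to its nearest training galaxies with the typical neighbour spacing inside the training set. The sketch below is only an illustration under assumptions: the function name `coverage_score`, the choice of k = 5, and the mock Gaussian "colours" are all hypothetical, and real features would usually be scaled by their photometric errors first.

```python
import numpy as np
from scipy.spatial import cKDTree

def coverage_score(train_features, target_features, k=5):
    """Distance to the k-th nearest training galaxy, in units of the
    typical k-th neighbour distance inside the training set (k >= 2).
    Scores near 1 mean 'as well covered as a typical training galaxy';
    large scores mean the target sits in an empty part of feature space."""
    tree = cKDTree(train_features)
    # Distance from each target to its k-th nearest training galaxy.
    d_target, _ = tree.query(target_features, k=k)
    d_target = d_target[:, -1]
    # Reference scale: k-th neighbour distance among training galaxies
    # (k + 1 because each training point's nearest neighbour is itself).
    d_train, _ = tree.query(train_features, k=k + 1)
    scale = np.median(d_train[:, -1])
    return d_target / scale

# Mock data (hypothetical): 2000 training galaxies with 4 colours each.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(2000, 4))
inside = np.zeros((1, 4))        # in the dense core of the training set
outside = np.full((1, 4), 6.0)   # far from any training galaxy
scores = coverage_score(train, np.vstack([inside, outside]))
print(scores)
```

A target with a large score would be flagged as poorly represented, whatever its position on the sky or its mass.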

Would the galaxy have to be from the same region?

From a mathematical perspective, yes, it has to be in a region of the feature space where there are several training samples. However, this does not mean that it is in the same region of the sky! Quite the contrary, there is no reason to believe that galaxies that are nearby in the plane of the sky will have similar colors.

Does it have to have the same mass?

There is a correlation between galaxy mass and galaxy type. And there is a strong correlation between galaxy type and light emitted. So looking at the mass to know if the dataset is representative shouldn't be awful, but still is no replacement for simply looking at the photometric data.

Data-driven, Interpretable Photometric Redshifts Trained on Heterogeneous and Unrepresentative Data

We present a new method for inferring photometric redshifts in deep galaxy and quasar surveys, based on a data-driven model of latent spectral energy distributions (SEDs) and a physical model of photometric fluxes as a function of redshift. This conceptually novel approach combines the advantages of both machine learning methods and template fitting methods by building template SEDs directly from the spectroscopic training data. This is made computationally tractable with Gaussian processes operating in flux-redshift space, encoding the physics of redshifts and the projection of galaxy SEDs onto photometric bandpasses. This method alleviates the need to acquire representative training data or to construct detailed galaxy SED models; it requires only that the photometric bandpasses and calibrations be known or have parameterized unknowns. The training data can consist of a combination of spectroscopic and deep many-band photometric data with reliable redshifts, which do not need to entirely spatially overlap with the target survey of interest or even involve the same photometric bands. We showcase the method on the i-magnitude-selected, spectroscopically confirmed galaxies in the COSMOS field. The model is trained on the deepest bands (from SUBARU and HST) and photometric redshifts are derived using the shallower SDSS optical bands only. We demonstrate that we obtain accurate redshift point estimates and probability distributions despite the training and target sets having very different redshift distributions, noise properties, and even photometric bands. Our model can also be used to predict missing photometric fluxes or to simulate populations of galaxies with realistic fluxes and redshifts, for example.

Methods overview

The estimation method is the same as the one used in Data Release 10; following the name used in Csabai et al. (2007), we refer to it as a kd-tree nearest neighbor fit (KF). The KF estimates are stored in the table Photoz.

The method is empirical in the sense that it uses a training set as a reference, then applies a machine learning technique to estimate redshifts. The training set contains photometric and spectroscopic observations for galaxies. We have chosen this approach – as opposed to template fitting methods – because of the machine learning techniques’ higher overall precision. The second estimation method was dropped because we have found that the main limiting factor in the accuracy of the results is the composition and photometric errors of the training set, not the choice of machine learning technique.
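To illustrate what a kd-tree nearest neighbor fit can look like, here is a minimal sketch. The text above does not spell out the fitting step, so this assumes a local linear regression on the k nearest training galaxies in colour space; the function name `knn_local_linear_photoz`, the value k = 50, and the mock data are hypothetical and are not taken from the survey pipeline.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_local_linear_photoz(train_x, train_z, query_x, k=50):
    """Estimate a redshift for each row of query_x by fitting a linear
    function z = c0 + c . x to its k nearest training galaxies."""
    tree = cKDTree(train_x)               # kd-tree over training features
    _, idx = tree.query(query_x, k=k)     # indices of the k nearest neighbours
    z_est = np.empty(len(query_x))
    for i, neighbours in enumerate(idx):
        # Design matrix: a constant term plus the neighbours' features.
        A = np.column_stack([np.ones(k), train_x[neighbours]])
        coeffs, *_ = np.linalg.lstsq(A, train_z[neighbours], rcond=None)
        # Evaluate the local fit at the query galaxy's own features.
        z_est[i] = coeffs[0] + query_x[i] @ coeffs[1:]
    return z_est

# Mock training set (hypothetical): redshift is a noisy linear function
# of four features standing in for colours.
rng = np.random.default_rng(1)
true_slopes = np.array([0.10, -0.05, 0.02, 0.0])
train_x = rng.uniform(-1.0, 1.0, size=(5000, 4))
train_z = 0.5 + train_x @ true_slopes + rng.normal(0.0, 0.005, size=5000)
query_x = rng.uniform(-0.8, 0.8, size=(20, 4))
z_true = 0.5 + query_x @ true_slopes
z_est = knn_local_linear_photoz(train_x, train_z, query_x)
print(np.max(np.abs(z_est - z_true)))
```

Because each fit uses only nearby training galaxies, the quality of the estimate depends directly on how densely the training set samples that part of colour space, which matches the point made above about the training set being the limiting factor.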

To infer values of physical parameters of galaxies, such as k-corrections, spectral type, and rest frame colors, we extend the KF method with a conservative method of template fitting. We determined the best-fitting template via a minimum chi-square fit to the photometric magnitudes, using the composite spectral template atlas of Dobos et al. (2012). The photometric errors were calculated using the prescriptions of Scranton et al. (2005).
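A minimum chi-square template selection in magnitude space can be sketched as follows. This is an illustration under assumptions rather than the survey's implementation: `template_mags` is a hypothetical array of synthetic magnitudes, assumed to be precomputed for each template at the already estimated redshift, and the only free parameter per template is an additive magnitude offset (the overall brightness).

```python
import numpy as np

def best_template(obs_mag, obs_err, template_mags):
    """Return (index, offset, chi2) of the template that best matches the
    observed magnitudes, allowing a free magnitude offset per template."""
    w = 1.0 / obs_err**2                          # inverse-variance weights
    resid = obs_mag - template_mags               # shape (n_templates, n_bands)
    # The chi-square-minimising offset is the weighted mean residual.
    offset = (resid * w).sum(axis=1) / w.sum()
    chi2 = ((resid - offset[:, None])**2 * w).sum(axis=1)
    best = int(np.argmin(chi2))
    return best, offset[best], chi2[best]

# Hypothetical template magnitudes in five bands (rows are templates).
template_mags = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],   # flat, blue template
    [1.5, 0.8, 0.3, 0.1, 0.0],   # intermediate template
    [3.0, 1.5, 0.6, 0.2, 0.0],   # red template
])
# Mock observation: the intermediate template, 2 mag fainter, with small errors.
obs_mag = template_mags[1] + 2.0 + np.array([0.01, -0.02, 0.01, 0.0, -0.01])
obs_err = np.full(5, 0.05)
best, offset, chi2 = best_template(obs_mag, obs_err, template_mags)
print(best, offset, chi2)
```

Only the colours (differences between bands) discriminate between templates here, since the offset absorbs the overall brightness.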

The previous method used in Data Release 10 calculated a non-negative linear combination (NNLS) of spectral model templates. While this method is more sophisticated, it is prone to overfitting, and it also allows non-physical spectral solutions, which is especially a problem in cases where the photometric errors are underestimated. The current method is limited by the number and coverage of templates used, but it avoids the aforementioned issues.
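For comparison, a non-negative linear combination of templates can be sketched with `scipy.optimize.nnls`. The template fluxes below are made-up numbers, and the example works in flux space with noise-free data, so it shows only the mechanics of the fit, not the overfitting behaviour described above.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical template fluxes: rows are bands, columns are templates.
template_fluxes = np.array([
    [1.0, 5.0, 1.0],
    [2.0, 4.0, 3.0],
    [3.0, 3.0, 5.0],
    [4.0, 2.0, 3.0],
    [5.0, 1.0, 1.0],
])
# Mock observation: a mixture of the first and third templates.
true_coeffs = np.array([0.7, 0.0, 0.3])
observed_flux = template_fluxes @ true_coeffs

# Solve min ||A x - b|| subject to x >= 0.
coeffs, residual_norm = nnls(template_fluxes, observed_flux)
print(coeffs, residual_norm)
```

With noisy photometry and many templates, the extra freedom of the mixture coefficients is what lets such a fit chase the noise, whereas picking a single best template has far fewer degrees of freedom.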

Machine Learning in Astronomy

With the development and application of space- and ground-based telescopes, astronomical data are growing rapidly in size and complexity. They are characterized by large volume, high dimensionality, multi-wavelength coverage, missing values, time series, high velocity, varied sources, and so on. Astronomy has entered the era of Big Data. How to collect, store, transfer, process, mine, and analyze data volumes measured in terabytes, petabytes, or more is a pressing question whose answer depends on newly developed technologies (databases, cloud storage, cloud computation, high-performance computation, machine learning, deep learning, artificial intelligence, etc.). Extracting useful information and knowledge from Big Data is a major challenge. In response, the new disciplines of astrostatistics and astroinformatics have emerged to tackle big data problems. Astronomers continue to develop automated and effective tools to meet the demands of Big Data. In recent years, machine learning has become popular among astronomers and is now used for a variety of tasks, for example classification, regression, clustering, outlier detection, time series analysis, and association rule mining.

One of the aims of this Research Topic is to discuss recent developments of astrostatistics. We also aim to critically review the most promising research advances in machine learning technologies, which may have a significant impact on the scientific output of future ground and space projects.

This Research Topic invites both Review and Original Research articles which address different aspects of Machine Learning in Astronomy such as:
• Data integration from different databases
• Machine learning
• Deep learning
• Algorithms
• Classification and regression.

Keywords: Machine learning, Deep learning, Photometric redshift, Classification, Regression, Data mining, Big Data

Important Note: All contributions to this Research Topic must be within the scope of the section and journal to which they are submitted, as defined in their mission statements. Frontiers reserves the right to guide an out-of-scope manuscript to a more suitable section or journal at any stage of peer review.


Max-Planck-Institut für extraterrestrische Physik, Garching, Germany

Laboratoire d’Astrophysique de Marseille, Marseille, France

University Sternwarte, Munich, Germany

The authors contributed equally to the writing of this Review Article.

The neural net solutions for photometric redshifts and their errors (listed as Photoz2 in the CAS, and described in Oyaizu et al. 2008) have not changed since DR6, and do not use the ubercalibrated magnitudes. However, we now provide a value-added catalog containing the redshift probability distribution for each galaxy, p(z), calculated using the weights method presented in Cunha et al. (2008). The p(z) for each galaxy in the catalog is the weighted distribution of the spectroscopic redshifts of the 100 nearest training-set galaxies in the space of dereddened model colors and r magnitude. For the p(z) calculation we also added the zCOSMOS (Lilly et al. 2007) and DEEP2-EGS (Davis et al. 2007) galaxies to the spectroscopic training set used for the Photoz2 solution.

Cunha et al. (2008) showed that summing the p(z) for a sample of galaxies yields a better estimation of their true redshift distribution than that of the individual photometric redshifts. Mandelbaum et al. (2008) found that this gives significantly smaller photometric lensing calibration bias than the use of a single photometric redshift estimate for each galaxy.
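The nearest-neighbour p(z) construction and the stacking step can be sketched as follows. This is a simplified illustration: the per-galaxy training weights of the weights method are not computed here and are simply passed in as `train_w` (set to one in the mock example), and the function name `knn_pz`, the mock features, and the redshift binning are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_pz(train_x, train_z, train_w, query_x, z_edges, k=100):
    """For each query galaxy, build p(z) as the weighted histogram of the
    spectroscopic redshifts of its k nearest training galaxies.
    Assumes the neighbours' redshifts fall inside z_edges."""
    tree = cKDTree(train_x)
    _, idx = tree.query(query_x, k=k)
    pz = np.empty((len(query_x), len(z_edges) - 1))
    for i, nb in enumerate(idx):
        hist, _ = np.histogram(train_z[nb], bins=z_edges, weights=train_w[nb])
        pz[i] = hist / hist.sum()     # normalise so the bins sum to one
    return pz

# Mock training set (hypothetical): redshift tracks the first feature.
rng = np.random.default_rng(2)
train_x = rng.uniform(0.0, 1.0, size=(5000, 3))
train_z = np.clip(train_x[:, 0] + rng.normal(0.0, 0.05, size=5000), 0.0, 0.999)
train_w = np.ones(5000)               # placeholder for the training weights
query_x = rng.uniform(0.0, 1.0, size=(200, 3))
z_edges = np.linspace(0.0, 1.0, 21)

pz = knn_pz(train_x, train_z, train_w, query_x, z_edges)
n_of_z = pz.sum(axis=0)               # stacked p(z): estimate of N(z)
print(n_of_z)
```

Summing the rows of `pz` gives the estimated redshift distribution of the whole sample, which is the quantity compared above with the histogram of single photometric redshift estimates.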


We demonstrate that the design of the Sloan Digital Sky Survey (SDSS) filter system and the quality of the SDSS imaging data are sufficient for determining accurate and precise photometric redshifts of quasars. Using a sample of 2625 quasars, we show that "photo-z" determination is even possible for z ≤ 2.2 despite the lack of a strong continuum break, which robust photo-z techniques normally require. We find that, using our empirical method on our sample of objects known to be quasars, approximately 70% of the photometric redshifts are correct to within Δz = 0.2; the fraction of correct photometric redshifts is even better for z > 3. The accuracy of quasar photometric redshifts does not appear to be dependent upon magnitude to nearly 21st magnitude in i′. Careful calibration of the color-redshift relation to 21st magnitude may allow for the discovery of ∼10^6 quasar candidates in addition to the 10^5 quasars that the SDSS will confirm spectroscopically. We discuss the efficient selection of quasar candidates from imaging data for use with the photometric redshift technique and the potential scientific uses of a large sample of quasar candidates with photometric redshifts.

Photometric redshift

I was asked if you can obtain the redshift of objects from just the ratio between their magnitudes in different bands. I didn't think it's that simple, since there are different methods for calculating photometric redshifts, like machine learning etc. But I realized I don't actually know how it's done, so I said I wasn't sure.

Pardon my lack of a brain, but I'm here to bathe in the knowledge of greater people once again.

Is it possible? I'm not even talking about narrow band filters, just regular SDSS ugriz.

Yes, it's perfectly possible even with SDSS filters. It's even possible with just two filters, but more filters mean better results. Photometric redshifts (photo-z) are hugely important because not everything requires the precision of a spectroscopic redshift, and for a lot of objects spectroscopy will not be available and may not even be possible.

There are two main types, spectral templates and training methods (which you mentioned). The basic idea behind both is that with photo-z's you're not measuring emission and absorption lines as would be done to measure a spectroscopic redshift; what is being measured are broader spectral features. For example, most galaxies have a break at 400 nanometers; if you have two filters that measure above and below the break, you can estimate at what wavelength the break is. See this figure here showing the model spectrum of a galaxy that is being redshifted. Overplotted are the response functions of the different filters that DES has (like SDSS but no u band). You can see the galaxy gets fainter as the redshift increases, but also the differences between magnitudes in different bands change. The two techniques have different ways of converting these magnitudes and differences into a photometric redshift. The complicating factor is that, unlike in this figure, not all galaxies look the same; they vary in brightness and spectral shape.
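The break argument can be put into numbers: a feature at rest wavelength 400 nm is observed at 400 × (1 + z) nm, so it moves through the filter set as the redshift grows. The small sketch below finds which pair of adjacent SDSS bands brackets the break at a given redshift. The effective wavelengths are approximate values for the ugriz filters, and the helper names are hypothetical.

```python
BREAK_REST_NM = 400.0
# Approximate effective wavelengths of the SDSS filters, in nanometers.
SDSS_EFF_NM = {"u": 355.1, "g": 468.6, "r": 616.5, "i": 748.1, "z": 893.1}

def observed_break_nm(z):
    """Observed wavelength of the 400 nm break for a galaxy at redshift z."""
    return BREAK_REST_NM * (1.0 + z)

def bracketing_bands(z):
    """Return the adjacent (blue, red) band pair whose effective wavelengths
    straddle the redshifted break, or None if it lies outside the filter set."""
    lam = observed_break_nm(z)
    names = list(SDSS_EFF_NM)
    for blue, red in zip(names, names[1:]):
        if SDSS_EFF_NM[blue] <= lam < SDSS_EFF_NM[red]:
            return blue, red
    return None

for z in (0.0, 0.3, 0.7, 1.5):
    print(z, observed_break_nm(z), bracketing_bands(z))
# 0.0 -> 400 nm, ('u', 'g')
# 0.3 -> 520 nm, ('g', 'r')
# 0.7 -> 680 nm, ('r', 'i')
# 1.5 -> 1000 nm, None (the break has left the optical filter set)
```

The colour measured by the bracketing pair is the one that changes most strongly with redshift, which is why the colour-redshift relation carries information even without spectral lines.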

The first method I mentioned is fitting spectral templates. The basic idea is that you have a big set of model galaxy spectra and you fit to each galaxy the best fitting model and redshift. The difficulty here is that you have to be sure your templates represent the galaxy population well. But the benefit is you can calculate model spectra for galaxies you have never detected before. The second method often leads to better results. The idea is that you obtain spectroscopic redshifts for a sub-sample of the galaxies, then you train a neural network on this training sample to predict redshift from magnitudes, and then you apply it to all the other galaxies you have without spectroscopic redshifts. With this method, too, it's important that the training sample be unbiased and cover a wide range of galaxy types.

The downside of photometric redshifts is that the uncertainty in the redshift is much larger than for a spectroscopic survey; however, they're much easier to obtain. Typically, SDSS-like data can reach a precision of about 3%. Aside from low precision, they also suffer from catastrophic failures, which is where the photo-z is hugely wrong. Narrower filters can improve the precision, as can adding many more filters sampling different parts of the EM spectrum.
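Precision and catastrophic failures are usually quantified against a spectroscopic test sample. The sketch below uses two common conventions that are assumptions here, not something stated above: the scatter is measured on Δz / (1 + z_spec) with the normalized median absolute deviation, and a "catastrophic" outlier is any galaxy with |Δz| / (1 + z_spec) above 0.15. The mock data are invented for the example.

```python
import numpy as np

def photoz_metrics(z_phot, z_spec, outlier_cut=0.15):
    """Return (sigma_nmad, outlier_fraction) for a photo-z test sample."""
    dz = (z_phot - z_spec) / (1.0 + z_spec)
    # Robust scatter: 1.4826 * MAD equals the standard deviation for a Gaussian.
    sigma_nmad = 1.4826 * np.median(np.abs(dz - np.median(dz)))
    # Fraction of galaxies whose photo-z is badly wrong.
    outlier_frac = np.mean(np.abs(dz) > outlier_cut)
    return sigma_nmad, outlier_frac

# Mock sample (hypothetical): 3% scatter plus 2% catastrophic failures.
rng = np.random.default_rng(3)
z_spec = rng.uniform(0.1, 0.8, size=1000)
z_phot = z_spec + 0.03 * (1.0 + z_spec) * rng.normal(size=1000)
z_phot[:20] = z_spec[:20] + 1.0       # 20 hugely wrong estimates
sigma_nmad, outlier_frac = photoz_metrics(z_phot, z_spec)
print(sigma_nmad, outlier_frac)
```

A robust scatter estimate is used so that the handful of catastrophic failures does not dominate the quoted precision; they are reported separately as the outlier fraction.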

A Peek Into the Future

Within the next few years, image analysis and machine learning systems that can process terabytes of data in near real-time with high accuracy will be essential.

There are great opportunities for making novel discoveries, even in databases that have been available for decades. The volunteers of Galaxy Zoo have demonstrated this multiple times by discovering structures in SDSS images that have later been confirmed to be new types of objects. These volunteers are not trained scientists, yet they make new scientific discoveries.

Even today, only a fraction of the images of SDSS have been inspected by humans. Without doubt, the data still hold many surprises, and upcoming surveys, such as LSST, are bound to image previously unknown objects. It will not be possible to manually inspect all images produced by these surveys, making advanced image analysis and machine learning algorithms of vital importance.

One may use such systems to answer questions like how many types of galaxies there are, what distinguishes the different classes, whether the current classification scheme is good enough, and whether there are important sub-classes or undiscovered classes. These questions require data science knowledge rather than astrophysical knowledge, yet the discoveries will still help astrophysics tremendously.

In this new data-rich era, astronomy and computer science can benefit greatly from each other. There are new problems to be tackled, novel discoveries to be made, and above all, new knowledge to be gained in both fields.

Title: Robust Machine Learning Applied to Astronomical Datasets III: Probabilistic Photometric Redshifts for Galaxies and Quasars in the SDSS and GALEX

19.2 mag, and sigma = 0.343 +/- 0.005 for quasars to i < 20.3 mag. The PDFs allow the selection of subsets with improved statistics. For quasars, the improvement is dramatic: for those with a single peak in their probability distribution, the dispersion is reduced from 0.343 to sigma = 0.117 +/- 0.010, and the photometric redshift is within 0.3 of the spectroscopic redshift for 99.3 +/- 0.1% of the objects. Thus, for this optical quasar sample, we can virtually eliminate 'catastrophic' photometric redshift estimates. In addition to the SDSS sample, we incorporate ultraviolet photometry from the Third Data Release of the Galaxy Evolution Explorer All-Sky Imaging Survey (GALEX AIS GR3) to create PDFs for objects seen in both surveys. For quasars, the increased coverage of the observed frame UV of the SED results in significant improvement over the full SDSS sample, with sigma = 0.234 +/- 0.010. We demonstrate that this improvement is genuine. [Abridged]