Often there are multiple reasonable ways to organize science data within files, and files within an HLSP collection. This article provides advice and some best practices to make the process of incorporating your data in MAST go smoothly. 

File Organization

The best organization for files delivered to MAST for an HLSP collection depends mostly upon the number of files, and secondarily on the nature of the products. For small collections consisting of perhaps a few dozen files, it is acceptable to put all files in a single directory. For larger collections it may be better to organize the files in a directory tree, with subfolders named (for instance) by target or field identifier. If you use a directory structure, please keep files that apply to the full collection (e.g., the README and the project summary for the Web home page) in the root directory so that MAST staff can locate them easily.

If you are uploading products for a new data release (new and/or updated products), please place them in a subfolder of the delivery area, with a name like "/dr2" to indicate the data release ID associated with those products.

The arrangement of files into a directory tree is mostly for the convenience of the contributing team in preparing the collection, and the MAST team in validating and moving the products to our mass storage devices. The presentation of the collection products in MAST interfaces (e.g., the Portal) does not depend upon the submitted file structure.
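
As an illustration of the layout described above, the following minimal sketch computes where a file would sit in a delivery area. The file names, the target identifier, and the `delivery_path` helper are hypothetical, chosen only for this example; they are not MAST requirements.

```python
from pathlib import PurePosixPath

# Hypothetical names for files that apply to the whole collection.
COLLECTION_LEVEL = {"README.txt", "project_summary.txt"}


def delivery_path(filename, target=None, release=None):
    """Return a relative path for a file within the delivery area.

    Collection-wide files stay in the root directory. Other files go
    into an optional data-release folder, then an optional per-target
    subfolder.
    """
    if filename in COLLECTION_LEVEL:
        return filename
    path = PurePosixPath(release) if release else PurePosixPath()
    if target:
        path = path / target
    return str(path / filename)


print(delivery_path("README.txt"))
print(delivery_path("ngc1234_image.fits", target="ngc1234"))
print(delivery_path("ngc1234_image.fits", target="ngc1234", release="dr2"))
```

A small collection could skip the `target` argument entirely and keep every file in one directory.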

Data Organization

There is typically more than one reasonable way to organize data within or among files. In the absence of community standards, the following guidelines will help to ensure that users can:

  • retrieve data from MAST without technical problems
  • use the products with widely available community tools
  • identify the various components of data products (e.g., science vs. error arrays) easily

While many of the guidelines for science data in the following subsections are described in the context of the FITS format, most apply to other formats as well.

Images

For organizing data within images, consider the following strategies:

  • Concomitant data: Put pixel-level arrays of error/uncertainty, data quality flags, exposure maps, etc. in the same file with the science arrays. 
    • For image data, put the science array in a FITS image extension that includes the keyword EXTNAME = SCI, and put concomitant data in additional extensions with appropriate extension names like "ERR", "DQ", "EXP_MAP", and "WEIGHT".
    • If data are placed in FITS extensions, do not place any data in the Primary header-data unit (PHDU). 
  • File size: While there is no hard upper limit, it is often best to keep the size of individual files under about 1 GB. This will facilitate downloads for users with poor internet connectivity. This advice may be at odds with storing concomitant data in the same file as the science pixels; if so, consider storing concomitant data in separate files, provided that doing so does not unduly increase the complexity of the file organization.
    • Be sure that the data types of your arrays are consistent with the required precision. For example, 64-bit floating point precision is rarely needed for any quantity other than values of coordinates or timestamps. Similarly, data quality masks may require only 16-bit (short integer) precision. 
  • Null data: Try to avoid creating arrays with large numbers of missing or null pixels. For combined images, this may be as simple as choosing an orientation for the combined array that naturally captures the footprints of all contributing images with minimal dead area; the world coordinate system (WCS) keywords will let downstream applications know the physical orientation without wasting memory. 
  • Image maps: If you have created multiple spatial maps of physical quantities (e.g., reddening, temperature, star formation rate) for a given target, consider putting them in image extensions within a single file. This will keep information about each target together, and also make it easier to follow the file naming requirements.
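
The image strategies above can be sketched with astropy, which is an assumed tool choice rather than a requirement. The arrays are synthetic and the 64 x 64 shape is arbitrary. The sketch builds a file with an empty primary HDU, a SCI extension, and concomitant ERR and DQ extensions, using 32-bit floats and 16-bit integers.

```python
import numpy as np
from astropy.io import fits

# Synthetic arrays standing in for real data (illustrative assumption).
shape = (64, 64)
rng = np.random.default_rng(0)
sci = rng.normal(size=shape).astype(np.float32)  # 32-bit floats for science pixels
err = np.full(shape, 0.1, dtype=np.float32)      # 32-bit floats for uncertainties
dq = np.zeros(shape, dtype=np.int16)             # 16-bit integers for quality flags

hdul = fits.HDUList([
    fits.PrimaryHDU(),                    # header only: no data in the PHDU
    fits.ImageHDU(data=sci, name="SCI"),  # name sets the EXTNAME keyword
    fits.ImageHDU(data=err, name="ERR"),
    fits.ImageHDU(data=dq, name="DQ"),
])

# List each extension name with the data type of its array.
for hdu in hdul:
    dtype = None if hdu.data is None else hdu.data.dtype
    print(hdu.name, dtype)
```

Calling `hdul.writeto(...)` with a file name of your choice would then write the structure to disk.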

Spectra

There are two main approaches to storing spectra in files: in images or in tables. Here, spectral data include pixel-level science and concomitant data, including arrays of flux (density), uncertainty, data quality (DQ) flags, weights, and wavelength (if tabulated).

Science-ready spectra come in a variety of types, including:

  • Spectral image cubes, such as those generated with IFUs, stored as cubes with one dispersion axis and two spatial axes
  • Long-slit spectra, with one dispersion and one spatial axis
  • Time-series spectra, with one dispersion and one temporal axis

HLSP contributors may wish to provide more than one type of spectrum, e.g., long-slit and a reference 1-D extraction. It is generally better to provide separate types of products in separate files.

The strategies for arranging data are summarized below.

Spectra in tables

Two of the most common community conventions for storing one-dimensional extracted spectra in FITS files are: 

  1. One spectrum per BINTABLE extension, such that 1-D arrays are stored in separate fields, one (wavelength, flux, err, dq) tuple per row.
  2. Multiple spectra per BINTABLE extension, with one spectrum per table row. In this case each cell of (wavelength, flux, err, dq) contains an array of the same length. 

A variation on the above options is to express the wavelength array with a function in FITS keywords. If every spectrum in an extension has the same wavelength array, you can use single-valued WCS keywords to describe the function. If the WCS function changes from row to row in a BINTABLE, you can expand these WCS keywords into BINTABLE columns. This strategy works well for simple 1-D spectra, separate orders of echelle spectra, and multi-object spectra (from MOS spectrograms of separate targets in a small field of view). Multi-dimensional spectra can also be stored in tables, but it becomes more complicated to describe the WCS in a compact way.

Spectra in images

Many spectra derive from spectrograms that are multi-dimensional, where the other dimension(s) may be spatial or temporal. These data are sometimes represented as images with two or more dimensions; the dispersion is most commonly expressed as a function, rather than tabulated in a separate array. Examples include: 

  • Long-slit spectra are stored as images with one dispersion and one spatial (cross-dispersion) axis. Note that it is possible in this case to use an additional, degenerate spatial axis to provide equatorial coordinates (RA, Dec) at all spatial positions in a long-slit spectrum. Consult MAST staff for details.
  • Spectral image cubes (sometimes called hyperspectral cubes), where the arrays have one spectral and two spatial axes. In this case the WCS is commonly characterized with a function rather than a separate, tabulated array (in a separate extension). The concomitant data are stored in separate extensions. 
  • Spectral time series, with a dispersion axis and a temporal axis. The spectral coordinate is commonly characterized with a function; the time coordinate may also be included in the WCS if the temporal sampling is regular. 
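
When the dispersion is expressed as a function, downstream tools rebuild the wavelength array from WCS keywords. The sketch below assumes the simplest case: a linear dispersion described by CRPIX1, CRVAL1, and CDELT1, with the header represented as a plain dictionary of made-up values. Real products may use CDi_j terms or non-linear solutions instead.

```python
def linear_wavelengths(header, n_pix, axis=1):
    """Evaluate a linear FITS WCS along one axis.

    world = CRVAL + CDELT * (pixel - CRPIX), with 1-based pixel indices.
    """
    crval = header[f"CRVAL{axis}"]
    crpix = header[f"CRPIX{axis}"]
    cdelt = header[f"CDELT{axis}"]
    return [crval + cdelt * (p - crpix) for p in range(1, n_pix + 1)]


# Hypothetical keyword values for a 5-pixel dispersion axis.
header = {
    "CTYPE1": "WAVE",
    "CUNIT1": "Angstrom",
    "CRPIX1": 1.0,
    "CRVAL1": 4000.0,
    "CDELT1": 2.0,
}

print(linear_wavelengths(header, 5))  # [4000.0, 4002.0, 4004.0, 4006.0, 4008.0]
```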

Descriptions of spectra

Consider the following organizational strategies:

  • Concomitant data: Put pixel-level arrays of error/uncertainty, data quality flags, weights, etc. in separate columns within the same extension.
    • Use suggestive column names, e.g., FLUX, WAVELENGTH, ERR, DQ, WEIGHT.
  • Constant data: Scalar, date, or categorical data that vary among spectra should be stored in separate columns. Scalar/categorical data that are common to all spectra in the extension may instead be stored in the extension header. Consult MAST staff for details. 

Catalogs

Source catalogs are commonly stored as binary tables (e.g., FITS BINTABLE extensions), with one row per source and columns to contain various quantities (source name, world coordinates, brightness measurements, errors, etc.). It is critical for users that the fields (columns) be properly annotated with units (where applicable), and also with the Virtual Observatory Unified Content Descriptor (UCD) designation for each column of quantities provided. Metadata that apply to the full catalog should be provided in the FITS primary or extension header (see Required Metadata: Catalog Metadata).

In some cases the catalog data are complex, and can be best expressed as relationships between data in multiple tables. The FITS format does not capture such complicated data well; a better choice is SQLite, which is a serverless database. There are community tools for creating and operating on these data, including DB Browser for SQLite, and Python libraries support access to data in this format. Consult MAST staff for details.
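
As a rough illustration of a multi-table catalog, the sketch below uses Python's built-in `sqlite3` module. The table names, column names, and values are hypothetical. It stores sources and their per-band photometry in separate tables linked by a source identifier, then joins them.

```python
import sqlite3

# An in-memory database keeps the example self-contained; pass a file
# name instead to create a deliverable .db file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sources (
    source_id INTEGER PRIMARY KEY,
    name      TEXT,
    ra_deg    REAL,
    dec_deg   REAL
);
CREATE TABLE photometry (
    source_id INTEGER REFERENCES sources(source_id),
    band      TEXT,
    mag       REAL,
    mag_err   REAL
);
""")

conn.executemany(
    "INSERT INTO sources VALUES (?, ?, ?, ?)",
    [(1, "src-001", 10.50, -20.25), (2, "src-002", 10.60, -20.30)],
)
conn.executemany(
    "INSERT INTO photometry VALUES (?, ?, ?, ?)",
    [(1, "F606W", 21.3, 0.02), (1, "F814W", 20.9, 0.03), (2, "F606W", 22.1, 0.05)],
)

# Join the two tables: one output row per (source, band) measurement.
rows = conn.execute("""
    SELECT s.name, p.band, p.mag
    FROM sources AS s
    JOIN photometry AS p ON p.source_id = s.source_id
    ORDER BY s.source_id, p.band
""").fetchall()

for row in rows:
    print(row)
```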

Metadata within files

In order for MAST to provide search interfaces for HLSP data, metadata within files needs to specify the spatial, spectral, temporal, and energy coverage of the data product. Metadata must also specify enough provenance and other information for a user to understand the data product. See Required Metadata for details. 

Where to store metadata

For data products stored in FITS files, metadata take the form of header keywords. But which keywords go in which FITS extension? The following advice will help users and applications discover and use important metadata in your products: 

  • Store metadata that are applicable to every extension in the primary header (PHDU). 
    • DOI, HLSPID, HLSPLEAD, HLSPNAME, HLSPVER, LICENSE, LICENURL, OBSERVAT, TELESCOP, etc.
  • If you have metadata that are required to interpret the data inside extensions, store these metadata within each such extension; one cannot assume that FITS readers will associate metadata in the primary header with metadata in extension headers.
    • WCS keywords CDi_j, CRPIXj, CRVALi, RADESYS, WCSAXES, etc.
    • Coordinate reference systems: RADESYS, TIMESYS
  • Store metadata that document the various products that were combined to make the final product in a separate BINTABLE extension, with EXTNAME= 'PROVENANCE'. See Provenance Metadata for details. 
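
The placement advice above can be sketched with astropy (an assumed tool choice). All keyword values are invented placeholders. Collection-wide keywords go in the primary header, while the WCS and reference-system keywords needed to interpret the pixels go in the extension that holds them.

```python
import numpy as np
from astropy.io import fits

# Keywords that apply to every extension live in the primary header.
primary = fits.PrimaryHDU()
primary.header["TELESCOP"] = "HST"                  # placeholder value
primary.header["HLSPNAME"] = "Example Collection"   # placeholder value
primary.header["HLSPVER"] = "1.0"                   # placeholder value

# Keywords needed to interpret the data live in the extension itself.
sci = fits.ImageHDU(data=np.zeros((10, 10), dtype=np.float32), name="SCI")
sci.header["WCSAXES"] = 2
sci.header["CTYPE1"] = "RA---TAN"
sci.header["CTYPE2"] = "DEC--TAN"
sci.header["CRPIX1"] = 5.5
sci.header["CRPIX2"] = 5.5
sci.header["CRVAL1"] = 10.5      # placeholder coordinates in degrees
sci.header["CRVAL2"] = -20.25
sci.header["CD1_1"] = -1.0e-5
sci.header["CD1_2"] = 0.0
sci.header["CD2_1"] = 0.0
sci.header["CD2_2"] = 1.0e-5
sci.header["RADESYS"] = "ICRS"

hdul = fits.HDUList([primary, sci])
print(repr(hdul["SCI"].header["RADESYS"]))
```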

The required metadata should also appear in data files that are not in FITS format (such as ASCII or ASDF), but the form that they take may differ.

It is important to update metadata for combined products, and to delete metadata that are no longer applicable. For example, keywords such as DATE-OBS may be inherited from files that contribute to a product, in which case the value (if retained) should reflect the date of the first observation.
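
For instance, the DATE-OBS value of a combined product could be derived from its contributing files as sketched below. The `earliest_date_obs` helper and the timestamps are hypothetical, and the sketch assumes the values are ISO 8601 strings.

```python
from datetime import datetime


def earliest_date_obs(date_obs_values):
    """Return the earliest of several ISO 8601 DATE-OBS strings."""
    return min(date_obs_values, key=datetime.fromisoformat)


# DATE-OBS values inherited from three hypothetical contributing files.
contributing = [
    "2019-03-14T08:15:00",
    "2018-11-02T23:59:30",
    "2020-01-01T00:00:00",
]

combined_date_obs = earliest_date_obs(contributing)
print(combined_date_obs)  # 2018-11-02T23:59:30
```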

Units

Units are specified in data files with ASCII strings, and appear in FITS header keywords such as BUNIT (in image extensions) and TUNITn (for columns in table extensions). They are composed of a set of unit substrings; the concept of unit substrings is defined in the FITS v4 Standard (see Sect. 4.3, Tables 3, 4, and 5). The Standard allows for valid unit substrings to be combined in multiple ways, but it is best to use simpler syntax when possible, e.g., use "erg/cm^2/s" or "erg cm-2 s-1" rather than "erg*cm**(-2)*s**(-1)". Group substrings with parentheses where necessary to clarify the meaning. A few common unit strings and our recommended FITS-style expressions are given in the table below; for additional examples of allowed units see Sect. 2.4 of Units in the Virtual Observatory.

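| Quantity | Unit String | Meaning |
|---|---|---|
| plane angle | deg | degree of arc |
| plane angle | arcsec | second of arc, 1/3600 deg |
| plane angle | mas | milli-second of arc, 1/3600000 deg |
| flux density | erg/cm^2/s/Angstrom | erg cm-2 s-1 Å-1 |
| flux density | Jy | jansky |
| flux density | mag | (stellar) magnitude |
| event | adu | analog-to-digital unit |
| event | electron | count of electrons† |
| event | ct or count | count |
| event | ph or photon | photon |
| length | AU | astronomical unit |
| length | pc | parsec |
| mass rate | solMass/yr | solar mass per year |
| surface brightness | MJy/sr | mega-jansky per steradian |
| surface brightness | mag/arcsec^2 | magnitude per square arcsecond |
| time | d | day |
| time | s | second |
| time | yr | Julian year |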

†The electron as a unit of counts does not appear in standards documents, but is nevertheless widely used.

Use standard scientific prefixes for (sub)multiples of quantities, e.g., kpc (kilo-parsec), Mpc (mega-parsec), mmag (milli-magnitude), and uJy (micro-Jansky). 



Send comments & corrections on this MAST document to: archive@stsci.edu