Checklist for Building a Data / Feature / Event Catalog

Revision : 0.9.5, 2009 Dec 10

Have questions or suggestions?  Send them to :
    Joe Hourcle, joseph.a.hourcle@nasa.gov

Latest version available at:
    http://sdac.virtualsolar.org/catalogs/catalog_checklist

Written with assistance from:
    Meredith Wills-Davey, Alisdair Davey, Igor Suarez-Sola

--

Glossary: (as used in this document) :

    Catalog : a list of metadata for items of interest
    Record  : metadata about a single item of interest
    Field   : an attribute of the items being cataloged
    Value   : a specific string / number / etc. of an attribute in a
              given record
    File    : the transfer format used for the catalog
              [may be generated dynamically on demand]

---

Have I ...

... described the catalog?

    ___ chosen a name that clearly and unambiguously describes my catalog?
    ___ described how my catalog is different from similar catalogs, and
        given reference to those other catalogs?
    ___ declared when the catalog was last updated?
        [or checked to determine no update was necessary]
    ___ provided an authoritative URL where people can check for updates
        to this catalog?
    ___ mentioned the frequency of possible updates?
        [or that there are no planned updates]
    ___ provided contact information in case someone has questions about
        the catalog?
        [or wants to ask me to co-author their paper, etc.]
    ___ explained any changes in methodology / maintainers over time?

... described the data used as the basis for the catalog?

    ___ described specifically which detector(s) in which operating
        mode(s)?
        [eg, if using LASCO/C2 H-alpha images, don't just say 'LASCO']
    ___ explained the processing that was applied to the data for the
        analysis?
        [did you calibrate each image, or use difference images of the
         low level products?; did you use 5min averages or the full
         resolution time series?; 2x2 binned, or the full res. images?]
    ___ mentioned where I obtained the data from?
        [in case of processing differences between archives]
    ___ mentioned any gaps in that data that might affect my ability to
        detect an event / feature?
    ___ specified the temporal extent of the data used?
        [note -- especially important for sparse events that may only
         occur a few times per year; ie, even though the first event is
         on June 12, I analyzed data back to January 1]
    ___ mentioned if the data was subset before analysis?
        [eg, only looked at the first image each hour]

... described the records in my catalog?

    ___ described what the records represent or describe?
    ___ described what my criteria were for including a record
        (event/feature) in my catalog?
    ___ specified a primary key that uniquely describes each record?
        [often, it's the start time, unless there is a risk of two
         events starting at the same time ... but time _is_not_ a good
         value, as further analysis may change the time, and with it the
         record's identifier, creating confusion over whether it is
         still the same event]
        [primary keys may be composite (multiple fields), eg,
         (day & active region #) in NOAA SRS section I]

... described the fields within the records?

    ___ described each of the individual fields in the record:
        [may be done on a per-column basis depending on catalog format]

        ___ labeled the field?
        ___ explained the field in language that would unambiguously
            explain how it was measured / calculated to someone from my
            specific (sub) discipline?
        ___ explained what the general concept of the field is in
            language that would be understood by the greater physics
            community?
        ___ provided a machine-readable description of the field?
            [note : requires us to have a controlled vocabulary first ...
             may not be possible right now; see UCD+, SPASE, PACS,
             SWEET, SESDI]
        ___ labeled the units for the field, or defined which field
            contains the units for this field?
            [or that it is unitless, if a ratio]
        ___ explained the precision of the values in the field?
        ___ explained any markings / other fields describing abnormal
            precision for values in this field?
        ___ explained any information conveyed with formatting?
            [eg, special colors used in MS Excel or HTML tables,
             italicized fonts]
            [note: only using color may violate Section 508]
        ___ explained the possible extents of the field if applicable?
            [ie, min/max for numeric, max length for strings, possible
             enumerations]
        ___ explained the reference scale or coordinate system?
            [eg, if time: UTC? spacecraft time? adjusted to earth/sun
             time?]
            [if pressure: absolute pressure, or gauge pressure?]
            [if file paths: given the URL prefix?]
        ___ explained the significance of the field being empty?
            [or a value used to signify the field being unknown]
        ___ explained the data type used to store values in the field:
            Special cases:
                Boolean : How true, false and null values are recorded:
                    [eg, 'T' is true, all others are false]
                    [1 is true, -1 is false, 0 is unknown]
                Enumerations and Flags : What the possible values are
                    and what they signify.  Whether there is a natural
                    sorting order to the values.
                Foreign Keys : What table / catalog is this a foreign
                    key to?
                Dates : How is the date formatted?
                URLs (or URL parts) : What the URL links to.

    ___ ensured that values within a given field are consistent?
        [if free text, consistent wording for notes; if numeric,
         measured / derived the same way]

... planned for use of the catalog?

    ___ provided text to use for attribution?
        [eg, 'this catalog was funded by (x)', etc.]
    ___ used a well documented, easily accessed format to store the
        catalog?
    ___ stored the documentation for the catalog in a well documented,
        easily accessed and used format?
    ___ chosen a format that can be freely used?
        [ie, doesn't require specific proprietary software (eg, IDL) to
         be purchased]
        [note: you can distribute the catalog in more than one form]
    ___ chosen a format that is easily used and available?
        [eg, FITS and CDF are not used by all science disciplines ...
         XML (VOTable) or CSV may be better; PDF is difficult to extract
         back to tables]
    ___ documented how to extract the individual fields from the file?
        [eg, if fixed-width ASCII, given the columns for each field]
    ___ included header fields, if appropriate?
    ___ provided linkages to where to find the documentation from within
        the file?
    ___ used document conventions that translate easily between formats?
        [eg, don't merge cells in MS Excel & HTML tables]
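
Example (illustrative only):

Several of the items above (header fields, a link to the documentation
from within the file, a primary key that is not the start time, a
documented meaning for empty values) can be shown in one small CSV file.
The sketch below is one possible layout, not a recommended standard: the
'#' comment convention, the catalog name, the field names, the example.org
URL and the read_catalog() helper are all assumptions made up for this
illustration.

```python
import csv
import io

# Hypothetical catalog file.  Lines starting with '#' carry the
# documentation, the first other line is the column header, and the
# remaining lines are the records.
CATALOG_TEXT = """\
# catalog: Example Flare List (hypothetical)
# updated: 2009-12-10
# docs: http://example.org/flare_list/README
# field event_id: primary key, assigned once and never reused
# field start_time: event start, UTC, ISO 8601
# field peak_flux: peak flux, W/m^2; empty means not measured
event_id,start_time,peak_flux
EX0001,2009-06-12T01:02:03Z,1.2e-06
EX0002,2009-06-12T01:02:03Z,
"""


def read_catalog(text):
    """Split a commented CSV into (file metadata, field docs, records)."""
    meta, fields, data_lines = {}, {}, []
    for line in io.StringIO(text):
        if line.startswith("#"):
            # '# key: value' ; split on the first colon only, so URLs survive
            key, _, value = line[1:].partition(":")
            key, value = key.strip(), value.strip()
            if key.startswith("field "):
                fields[key[len("field "):]] = value
            else:
                meta[key] = value
        elif line.strip():
            data_lines.append(line)
    records = list(csv.DictReader(data_lines))
    for rec in records:
        # The header documents an empty peak_flux as "not measured",
        # so it is mapped to None rather than to zero.
        rec["peak_flux"] = float(rec["peak_flux"]) if rec["peak_flux"] else None
    return meta, fields, records


meta, fields, records = read_catalog(CATALOG_TEXT)

# Every column in the data is documented in the header.
assert set(records[0]) == set(fields)
# The primary key stays unique even though both events share a start time.
assert len({r["event_id"] for r in records}) == len(records)
```

The two sample records deliberately share a start time, which is the case
where using the start time as the primary key would fail.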