COMPASS Database System Requirements

Last Updated Jan 2001

Copyright © 2001 The Association of Universities for Research in Astronomy, Inc. All Rights Reserved.

I. Astronomical Overview and Database Generalities

The GSC-II is an all-sky catalog of positions, magnitudes, proper motions, and colors. The minimal specifications for GSC-II are given in Appendix A. These specifications were created to formulate a minimum deliverable, but actual ambitions for the GSC-II (as well as the requirements of some of the Patrons) go somewhat beyond these. Specifically, the system design should accommodate the following important extensions:

  1. While the limiting magnitude is nominally V=18 (and while that may well be the limit of the early published versions of the GSC-II), the internal version of the catalog should extend, resources permitting, to V=20 or even to the plate limit.
  2. The core catalog is defined by the processing of a minimal set of plates needed to provide colors and proper motions. However, resources permitting, all available plates will be included.
The COMPASS database is an integral part of the GSC-II program. Its primary purposes are to
  1. Provide access to the plate-processing data during catalog construction, i.e., to provide a rapid, random-access, and general environment for calibration and quality control activities.
  2. Provide internal access (meaning ST ScI and OATo, but possibly extended to a carefully limited subset of Patrons) to the GSC-II during its construction, i.e., for use in telescope operations and research support.
The standard export format of the GSC-II for telescope operations will be that of the ESO SkyCat, which will be populated by appropriate COMPASS-based dump procedures.

It is conceivable that, as the catalog construction activity approaches completion, export or publication of the COMPASS database (not just the calibrated data but the measures too) may be desirable. While this is presently out of scope, the initial design should be extensible to a massive publication (e.g., on DVD) or to network access.


II. Top-Level COMPASS Database Requirements

Database Size
Object Naming Requirements
Database Loading
Database Access
Reference Catalogs
Use of Other Catalogs
Auxiliary Data for Calibrations
Meta-Data
Backup
Procedures and Administration
Interfaces
Operations Performed in the Database Environment
QA and Statistics
Single Plate Calibrations and Recalibrations
Interplate Operations

1.0 Database Size

1.1 Number of Objects

A population may be extrapolated to fainter magnitude cutoffs as

                    N(m) = N(m0) 10^(a(m - m0)).

Obviously this is a tremendous oversimplification, and in practice a may be expected to fluctuate with population and location. For a uniform distribution, a=0.6, while for stars near b=90, a=0.3, and the counts turn up at around magnitude 20 as the extragalactic population comes in. For estimation purposes, we adopt a=0.4.

For extrapolation from the GSC-I, m0=15 and N(m0)=1.9e7. Then for various m, we find

    m     N        Comments
    18    1.7e8    Nominal GSC-II specification
    20    1.1e9
    21    2.7e9    Reasonable completeness at m=20
    22    6.9e9

These estimates are probably not trustworthy to better than a factor of three. However, we may note that another estimate (USNO A0) comes up with 4.9e9, at an unspecified completeness that I assume to be about 21. For sizing purposes we should adopt the m=18 and m=21 values given above.
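As a sanity check, the extrapolation law is easy to script. The sketch below uses the adopted values N(m0)=1.9e7, m0=15, and a=0.4; the function name is illustrative, and exact agreement with the table is not expected, since the tabulated entries also fold in completeness judgments.

```python
def extrapolate_counts(m, n0=1.9e7, m0=15.0, a=0.4):
    """Extrapolated object count N(m) = N(m0) * 10**(a * (m - m0))."""
    return n0 * 10 ** (a * (m - m0))

# Order-of-magnitude counts at the cutoffs considered in the table above.
for m in (18, 20, 21, 22):
    print(f"m = {m}: N ~ {extrapolate_counts(m):.1e}")
```

Varying a between 0.3 and 0.6 in this sketch shows directly why the estimates are not trustworthy to better than a factor of a few.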

1.2  Data-requirements per Image on Plate

The storage requirements per detected plate image depend on the detailed database design. This value (in bytes) is designated Bo and may be estimated at 300 for sizing purposes.

It is expected that image cutouts will not be stored as database objects but that they may be archived in the same mass storage as the database. For sizing purposes, the average cutout size, Lc, is estimated (from an inspection of five recent pipeline outputs) at 17 pixels.

1.3  Number of Plates per Object per Survey (Np)

The number of plates per object per survey is used to adjust the database size for the plate overlaps. This slightly messy calculation (not reproduced here) gives 1.74 for the 5-degree grids and 1.28 for the 6-degree grids.

Summary: Adopt Np=1.6 for initial estimation, from a 2:1 weighted mean in favor of the 5-degree grid: (2 x 1.74 + 1 x 1.28)/3 = 1.59.

1.4  Number of Surveys Used (Ns)

North, minimum: POSS-I (E), POSS-II (J), and POSS-II (F). The POSS-I photometry is too poor to rely on for GSC-II colors.
North, additional: POSS-II (N), POSS-I (O). The POSS-QV will also be used, but it is not deep enough to affect the database size.

South, minimum: SERC J/EJ, AAO SES, and POSS-I (E) in the equatorial zone. There also are selected short exposure plates, but again none of them are deep enough to affect the database size.

South, additional: UK Schmidt IV-N (nomenclature unclear)

Summary: Ns=3-5. It is reasonable to use 3 as the initial value.

1.5  Database Size Estimate

For the database size in bytes:

                    N(m) Bo Np Ns = N(m) x 300 x 1.6 x 3 = 1440 N(m)
                                  = 2.6e11 bytes to 18 mag
                                  = 3.9e12 bytes to 21 mag

Caution: this estimate does not include any database overhead.

For the cutout size in bytes:

                    N(m) Lc^2 Np Ns = N(m) x 17^2 x 1.6 x 3 = 1387 N(m)
                                    = 2.4e11 bytes to 18 mag
                                    = 3.8e12 bytes to 21 mag
Note: The values in this section will fluctuate as our understanding of the value of N(m) improves and as we become more ambitious regarding the use of additional surveys (Ns). The associated growth could be as large as a factor of 5.

Also note: The initial hardware purchase for this application is 3.6e12 bytes, assuming a factor of 2 hardware compression. We do not know how actual data will compress. Also, the initial software license is for 1.0e12 bytes. Additional capabilities will need to be acquired in 1998-99.
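The sizing arithmetic of this section can be collected into a short script. This is a sketch using the adopted working values (Bo=300 bytes, Lc=17 pixels, Np=1.6, Ns=3) and, like the text, it includes no database overhead; the function names are illustrative.

```python
def catalog_bytes(n_objects, b0=300, n_plates=1.6, n_surveys=3):
    """Catalog storage estimate: N(m) * Bo * Np * Ns (no DB overhead)."""
    return n_objects * b0 * n_plates * n_surveys

def cutout_bytes(n_objects, lc=17, n_plates=1.6, n_surveys=3):
    """Cutout archive estimate: N(m) * Lc**2 * Np * Ns."""
    return n_objects * lc ** 2 * n_plates * n_surveys

# The two sizing cases adopted in Section 1.1.
for label, n in (("m=18", 1.7e8), ("m=21", 2.7e9)):
    print(f"{label}: catalog ~ {catalog_bytes(n):.1e} B, "
          f"cutouts ~ {cutout_bytes(n):.1e} B")
```

Rerunning with larger Ns (up to 5) and the fainter N(m) values reproduces the factor-of-5 growth risk noted above.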

2.0  Object Naming Requirements

2.1  Unique GSC-II Names

Each astronomical object will have a unique GSC-II name. This implies that all instances of the same astronomical object, regardless of the number of plates or other data sources on which they are detected, will be named identically. Furthermore, the name may not change even if a recalibration after the initial database entry changes the position of the object.

The names of deleted objects will not be reused.
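The naming rules above (one stable name per astronomical object, unchanged by recalibration, never reused after deletion) can be sketched as a minimal registry. The class, method names, and `GSC2-` name format here are purely illustrative assumptions, not part of the requirements.

```python
class NameRegistry:
    """Illustrative sketch of stable, unique object naming."""

    def __init__(self):
        self._next = 0
        self._retired = set()   # names of deleted objects, never reused
        self._positions = {}    # name -> latest (ra, dec)

    def new_name(self, ra, dec):
        name = f"GSC2-{self._next:09d}"   # hypothetical name format
        self._next += 1                   # counter never rewinds
        self._positions[name] = (ra, dec)
        return name

    def recalibrate(self, name, ra, dec):
        # Recalibration may move the object; the name is unchanged.
        self._positions[name] = (ra, dec)

    def delete(self, name):
        del self._positions[name]
        self._retired.add(name)   # retired names are never reissued
```

Because the counter only advances, a deleted object's name can never collide with a later one, which is the property the requirement demands.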

2.2  Use of GSC-I Names

GSC 1.2 will be preloaded into the database and used to build a unique and stable relation between GSC-I and GSC-II names.

3.0  Database Loading

3.1  Data Sources

The primary data source is OOP files from the plate-processing pipeline. Other sources (cf. Section 6.0) may be accommodated on a research basis, subject to resource availability.

Index information for related external data, e.g., image cutouts, will be included in the database; the data themselves will not be put into the database (even though they may be loaded into the mass-storage system).

3.2  Ingest Rate

The database will be capable of supporting sustained ingest rates of twice the pipeline plate processing rate of 10 plates per day, i.e., an ingest rate of 20 plates per day.

3.3  Provisions for Hardware Contingency

The design will include a plan for responding to the contingency that the hardware capabilities, while anticipated to be adequate for the database size in the long run, are temporarily inadequate. In such a contingency, it will be necessary to temporarily delete or delay the loading of a subset of the data based either on sky area or magnitude.
Such contingency procedures will preserve object names.

3.4  QA for Database Loading

QA utilities will be provided to assure that the statistics of the loaded objects are reasonable and that the naming process is not generating false doubles. It may also be useful to flag calibration problems (particularly with classification) at this time. The QA will be as automated as possible, but it is accepted that there may be some need for visual inspection.

4.0  Database Access

4.1  By plate

The calibration procedures are intrinsically plate-based, e.g., systematic astrometric residuals are primarily a function of plate position, colors and proper motions depend on operations between plates, and the plate-overlap tests are a primary validation of the integrity of single-plate photometry and astrometry. Therefore, a fundamental access requirement is to provide access to instrumental and reduced data for a specified plate and, optionally, to provide the same information for other plates that overlap the specified one.

4.2  By polygon on sky

The operational and astrophysical use of the catalog, as well as problem-solving during catalog construction, will involve accessing small areas of the sky. These areas are to be defined by the vertices of a polygon. It is reasonable to adopt preferred values of 4 vertices and a typical polygon area of 0.1 to 1 square degree, although full generality, at reduced efficiency, is expected.
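For areas this small, a tangent-plane (flat-sky) approximation is adequate, and polygon containment reduces to the standard ray-casting test. The sketch below makes that flat-sky assumption and is therefore suitable only for small polygons away from the poles; it is not part of the requirements themselves.

```python
def point_in_polygon(x, y, vertices):
    """Ray-casting test: is (x, y) inside the polygon `vertices`?

    Coordinates are tangent-plane degrees, fine for the ~0.1-1
    square-degree query areas discussed above, not for large fields.
    """
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x where the edge crosses the horizontal ray through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

A full implementation would first convert the query polygon and candidate objects to a common tangent plane before applying this test.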

4.3  By GSC-II object name

Access is to be provided to GSC-II objects by their names.

4.4  By the Whole

An efficient mechanism is required to access the entire database, i.e., for producing export data. It is clear that there will be sorting requirements on export data, e.g., by coordinates (equatorial or galactic). However, it is debatable whether these have to be included within the database access requirements.

4.5  By Astrophysical Characteristics

Once the catalog begins approaching an interesting level of completeness, scientific activities that fall within the general topic of "data mining" will become important. Find, for example, all objects with low proper motions and with J-F and F-N colors within specified ranges. While brute-force approaches to this kind of task (i.e., read the whole database) are acceptable, and while optimizing this functionality is of lower priority than requirements 4.1 - 4.3, the design should, whenever possible, make use of data-mining techniques developed for other projects.

4.6  Engineering Considerations on the Access Tools
It is expected that for engineering reasons related both to database architecture and to computer memory availability, the database will be subdivided into about 10^4 regions. The database access subroutines and programs provided to the internal scientific users will minimize the visibility of these regions for ordinary operations.

As the data volumes of the larger entities (e.g., all objects on a plate) will probably stress or exceed the hardware capabilities available, it will be necessary to handle such entities on a region-by-region basis. The utilities provided will not require the users to handle these regions explicitly.
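The region-hiding requirement amounts to a thin layer that maps positions to regions and streams results region by region, so callers never handle region identifiers. The sketch below is illustrative only: the equal-width declination/right-ascension cell scheme and all names are assumptions, chosen simply to give about 10^4 regions.

```python
N_DEC_BANDS = 100   # 100 x 100 cells ~ 10**4 regions (Section 4.6)
N_RA_CELLS = 100

def region_id(ra, dec):
    """Map a position in degrees to one of ~10**4 region identifiers."""
    band = min(int((dec + 90.0) / 180.0 * N_DEC_BANDS), N_DEC_BANDS - 1)
    cell = int(ra / 360.0 * N_RA_CELLS) % N_RA_CELLS
    return band * N_RA_CELLS + cell

def query_objects(positions):
    """Group a query's positions by region so each region is visited
    once; the caller sees only positions, never region ids."""
    by_region = {}
    for ra, dec in positions:
        by_region.setdefault(region_id(ra, dec), []).append((ra, dec))
    for rid in sorted(by_region):   # one pass per region
        yield from by_region[rid]
```

In this pattern, large entities such as "all objects on a plate" are processed one region at a time, keeping the per-pass working set within memory limits without exposing the regions to users.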

5.0  Reference Catalogs

Simple relationships and cross-reference methods need to be provided to associate GSC-II objects with their counterparts in various reference catalogs and to provide complete reference-catalog information for calibrations.

The minimum set of reference catalogs will include GSPC-I, GSPC-II, PPM, TYCHO, HIPPARCOS, and CMC.

6.0  Use of Other Catalogs

Other catalogs of size comparable to the GSC-II are under development, e.g., 2MASS, other Schmidt programs. Provisions for including their data (calibrated, not instrumental) in COMPASS need to be made, although implementation obviously will depend on additional mass storage.

7.0  Auxiliary Data for Calibrations

The calibration/recalibration procedures done in-plate within the database require (in addition to the database elements pertaining to individual inventory objects and, as applicable, the reference objects matched to them) a number of supporting entities:

  1. GSH files
  2. Photometric color tables, vignetting functions, etc.
  3. Astrometric masks and filter specifications
  4. Classifier decision trees

It is expected that these data will be produced on the VMS side and will be made accessible via NFS-mounted disks.

Note: This requirement is somewhat vague and does not reflect that eventually a large part of the calibration data will be based on database operations.

8.0  Meta-Data

This is a place-holder for descriptive material about the scans, the pipeline processing, and the calibrations. It is to be debated whether these data need to be part of COMPASS or a separate database.

9.0  Backup

The design will include backup procedures that protect against catastrophic loss in the case of hardware or media failure.
The backup procedures will:

    1.  be consistent with the backup requirements of the hierarchical storage system;

    2.  use no more than 10% of the up-time of the database system;

    3.  include off-site storage of back-up media;

    4.  always protect against the loss of more than one month of production work, even in the presence of multiple failures;

    5.  subject to item 2 above, be permitted to stop other database operations during parts of the backup process.

It is likely that there is a tradeoff between labor and hardware requirements for the backup operations. The design should respond to this with options to be selected in the light of operational needs and available resources.

10.0  Procedures and Administration

As the database will extend over a large disk and DLT system, provisions are required to migrate its elements from one volume to another without reloading the data-objects.

As database restoration from the last full backup (Section 9) will be a disruptive and costly operation, provisions are required to ensure the security of the file system. This should include graceful recovery in the presence of a procedural error or a fatal error in the database ingest task, i.e., it should not be necessary to do a full restore if one corrupts a few plates (or regions). This requirement is non-trivial because of the way object names are defined across plate boundaries.

All operational activities needed to load, backup, and administer the database will be documented as written procedures. To the largest extent possible, these procedures are to be implemented by the GS Operations Group.

It is conceivable that, once or twice after database loading begins, it will be necessary to destroy the database (at the minimum, the astronomical objects, but perhaps extending to the database infrastructure, e.g., regions, federated database, etc.). The recreation of these will also be covered by written procedures.

11.0  Interfaces

The interfaces to the database system are as follows:

  1. The GSC 1.2 catalog. For consistency with previous work, the GSC 1.2 will be loaded into the database.
  2. The products of pipeline processing (OOP files). The database catalog objects will be built from the output of the pipeline processing, i.e., from the OOP records generated as the output of the VMS plate-processing pipeline. Note that the GSC-II names of the cataloged objects will not be available until the time of object matching and database load. The possibility of writing GSC-II object names back into the OOP file will be addressed in the detailed design.
  3. The image cutouts. The data objects may include index information for the image cutouts made during pipeline processing. If so, access to the corresponding cutouts, which are external to the database, will be regarded as an interface. If FPA (Fractional Pixel Allocations) are included in the pipeline processing, their indices will also be regarded as an interface.
  4. The SkyCat loading files. Such files are made from the contents of the database and read by a utility that loads the SkyCat export catalog. It may (TBN) be acceptable that any required sorting be done on the SkyCat side.
Files such as those in Section 7, while shared, are regarded as internal to the database, not as interfaces.

12.0  Operations Performed in the Database Environment

This section is TBD. What follows is an outline.

12.1  QA and Statistics

12.2  Single Plate Calibrations and Recalibrations

12.3  Interplate Operations


III. Top-Level Derived Requirements

 
APPENDIX A

 
Top-level GSC-II Specification, as presented at 1997 Torino Meeting
