Last Updated Jan 2001 Copyright © 2001 The Association of Universities for Research in Astronomy, Inc. All Rights Reserved.
I. Astronomical Overview and Database Generalities
The GSC-II is an all-sky catalog of positions, magnitudes, proper motions, and colors. The minimal specifications for GSC-II are given in Appendix A. These specifications were created to define a minimum deliverable, but the actual ambitions for the GSC-II (as well as the requirements of some of the Patrons) go somewhat beyond them. Specifically, the system design should accommodate the following important extensions:
It is conceivable that, as the catalog construction activity approaches completion, export or publication of the COMPASS database (not just the calibrated data but the measures too) may be desirable. While this is presently out of scope, the initial design should be extensible to a massive publication (e.g., on DVD) or to network access.
II. Top-Level COMPASS Database Requirements
Database Size
m (limiting magnitude) | N (number of objects) | Comments |
18 | 1.7e8 | Nominal GSC-II specification |
20 | 1.1e9 | |
21 | 2.7e9 | Reasonable completeness at m=20 |
22 | 6.9e9 | |
These estimates are probably not trustworthy to better than a factor of three. However, another estimate (USNO A0) gives 4.9e9 at an unspecified completeness limit, which I assume to be about m=21. For sizing purposes we should adopt the m=18 and m=21 values given above.
1.2 Data-requirements per Image on Plate
The storage requirements per detected plate image depend on the detailed database design. This value (in bytes) is designated Bo and may be estimated at 300 for sizing purposes.
It is expected that image cutouts will not be stored as database objects but that they may be archived in the same mass storage as the database. For sizing purposes, the average cutout size, Lc, is estimated (from an inspection of five recent pipeline outputs) at 17 pixels.
1.3 Number of Plates per Object per Survey (Np)
The number of plates per object per survey is used to adjust the database size for the plate overlaps. This slightly messy calculation (not reproduced here) gives 1.74 for the 5-degree grids and 1.28 for the 6-degree grids.
Summary: Adopt Np=1.6 (2:1 weighting in favor of 5 degree grid) for initial estimation.
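The adopted value can be checked with a short calculation. The following is a minimal sketch that takes the two overlap factors quoted above and applies the 2:1 weighting; the function and constant names are illustrative only.

```python
# Overlap factors quoted in the text for the two plate grids.
NP_5_DEG = 1.74  # plates per object, 5-degree grid
NP_6_DEG = 1.28  # plates per object, 6-degree grid


def weighted_np(w5=2.0, w6=1.0):
    """Weighted mean of the two overlap factors, with weights w5:w6."""
    return (w5 * NP_5_DEG + w6 * NP_6_DEG) / (w5 + w6)


np_adopted = weighted_np()
print(f"Np = {np_adopted:.2f}")  # rounds to the adopted value of 1.6
```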
1.4 Number of Surveys Used (Ns)
North, minimum: POSS-I (E), POSS-II (J), and POSS-II (F). The POSS-I photometry is too poor to rely on for GSC-II colors.
North, additional: POSS-II (N), POSS-I (O). The POSS-QV will also be used, but it is not deep enough to affect the database size.
South, minimum: SERC J/EJ, AAO SES, and POSS-I (E) in the equatorial zone. There also are selected short exposure plates, but again none of them are deep enough to affect the database size.
South, additional: UK Schmidt IV-N (nomenclature unclear).
Summary: Ns=3-5. It’s reasonable to use 3 as the initial value.
1.5 Database Size Estimate
For the database size in bytes:
N(m) x Bo x Np x Ns = N(m) x 300 x 1.6 x 3 = 1440 N(m)
= 2.4e11 to 18 mag
= 3.9e12 to 21 mag
Caution: this estimate does not include any database overhead.
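For re-running the sizing with other inputs, the product above can be evaluated directly. This is a minimal sketch using the values of Bo, Np, and Ns adopted in sections 1.2 to 1.4 and the N(m) values from the table; the names are illustrative, and database overhead is not included.

```python
B_O = 300  # bytes per detected plate image (section 1.2)
N_P = 1.6  # plates per object per survey (section 1.3)
N_S = 3    # number of surveys, initial value (section 1.4)

# Object counts N(m) from the table, keyed by limiting magnitude.
N_OF_M = {18: 1.7e8, 20: 1.1e9, 21: 2.7e9, 22: 6.9e9}


def database_bytes(n_objects, b_o=B_O, n_p=N_P, n_s=N_S):
    """Raw database size in bytes, excluding any database overhead."""
    return n_objects * b_o * n_p * n_s


for m in (18, 21):
    print(f"m={m}: {database_bytes(N_OF_M[m]):.1e} bytes")
```

Raising Ns to 5 (the upper end of the range in section 1.4) scales both results by 5/3.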
For the cutout size in bytes:
Also note: The initial hardware purchase for this application is 3.6e12 bytes, assuming a factor of 2 hardware compression. We do not know how actual data will compress. Also, the initial software license is for 1.0e12 bytes. Additional capabilities will need to be acquired in 1998-99.
2.0 Object Naming Requirements
2.1 Unique GSC-II Names
Each astronomical object will have a unique GSC-II name. This implies that all instances of the same astronomical object, regardless of the number of plates or other data sources on which they are detected, will be named identically. Furthermore, the name may not change even if a recalibration after the initial database entry changes the position of the object.
The names of deleted objects will not be reused.
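The naming rules above (stable under recalibration, never reused after deletion) can be illustrated with a small sketch. Everything here is hypothetical: the class, its methods, and the placeholder name format are assumptions for illustration and are not the GSC-II naming scheme.

```python
import itertools


class NameRegistry:
    """Hypothetical registry in which names are issued once and never reused."""

    def __init__(self):
        self._counter = itertools.count(1)  # monotonically increasing serial
        self._positions = {}                # name -> (ra_deg, dec_deg), live objects
        self._retired = set()               # names of deleted objects

    def create(self, ra_deg, dec_deg):
        # Placeholder format; the name does not encode the position.
        name = f"OBJ{next(self._counter):010d}"
        self._positions[name] = (ra_deg, dec_deg)
        return name

    def recalibrate(self, name, ra_deg, dec_deg):
        # The position is updated; the name is left untouched.
        self._positions[name] = (ra_deg, dec_deg)

    def delete(self, name):
        del self._positions[name]
        self._retired.add(name)  # the serial is never issued again

    def is_live(self, name):
        return name in self._positions


registry = NameRegistry()
first = registry.create(10.0, 20.0)
registry.recalibrate(first, 10.001, 20.0)
registry.delete(first)
second = registry.create(10.0, 20.0)
print(first, second)
```

Because the name is drawn from a counter rather than derived from coordinates, a recalibrated position cannot change it, and a deleted name cannot reappear.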
2.2 Use of GSC-I Names
GSC 1.2 will be preloaded into the database and used to build a unique and stable relation between GSC-I and GSC-II names.
3.1 Data Sources
The primary data source consists of OOP files from the plate-processing pipeline. Other sources (cf., e.g., Section 6.0) may be accommodated on a research basis, subject to resource availability.
Index information for related external data, e.g., image cutouts, will be included in the database; the data themselves will not be put into the database (even though they may be loaded into the mass-storage system).
3.2 Ingest Rate
The database will be capable of supporting sustained ingest rates of twice the pipeline plate processing rate of 10 plates per day, i.e., an ingest rate of 20 plates per day.
3.2 Provisions for Hardware Contingency
The design will include a plan for responding to the contingency that the hardware capabilities, while anticipated to be adequate for the database size in the long run, are temporarily inadequate. In such a contingency, it will be necessary to temporarily delete or delay the loading of a subset of the data, selected either by sky area or by magnitude. Such contingency procedures will preserve object names.
3.3 QA for Database loading
QA utilities will be provided to verify that the statistics of the loaded objects are reasonable and that the naming process is not generating false doubles. It may also be useful to flag calibration problems (particularly with classification) at this time. The QA will be as automated as possible, but it is accepted that there may be some need for visual inspection.
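One possible form of a false-double check is to list pairs of differently named objects that lie closer together than some tolerance. The sketch below is a brute-force illustration for a small sample; the 1-arcsecond default tolerance and the record layout are assumptions, not values from this document.

```python
import math
from itertools import combinations


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation in arcseconds between two positions given in degrees."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which is well behaved for small separations.
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def suspected_doubles(objects, max_sep_arcsec=1.0):
    """Return name pairs closer than max_sep_arcsec.

    objects: list of (name, ra_deg, dec_deg) tuples (assumed layout).
    """
    pairs = []
    for (n1, ra1, dec1), (n2, ra2, dec2) in combinations(objects, 2):
        if angular_sep_arcsec(ra1, dec1, ra2, dec2) < max_sep_arcsec:
            pairs.append((n1, n2))
    return pairs


sample = [
    ("A", 10.0, 20.0),
    ("B", 10.0, 20.0 + 0.5 / 3600.0),  # 0.5 arcsec from A
    ("C", 11.0, 20.0),
]
print(suspected_doubles(sample))
```

The pairwise loop scales as the square of the sample size, so a production check would need to work region by region or use a spatial index.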
4.1 By plate
The calibration procedures are intrinsically plate-based: systematic astrometric residuals are primarily a function of plate position, colors and proper motions depend on operations between plates, and the plate-overlap tests are a primary validation of the integrity of single-plate photometry and astrometry. Therefore, a fundamental access requirement is to provide access to instrumental and reduced data for a specified plate and, optionally, to provide the same information for other plates that overlap the specified one.
4.2 By polygon on sky
The operational and astrophysical use of the catalog, as well as problem-solving during catalog construction, will involve accessing small areas of the sky. These areas are to be defined by the vertices of a polygon. It is reasonable to take 4 vertices and an area of 0.1 to 1 square degree as the preferred (typical) case for the polygons, although fully general functionality at reduced efficiency is expected.
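A membership test for the preferred case can be sketched as follows. This is a minimal illustration that assumes a convex polygon, much smaller than a hemisphere, whose edges are great-circle arcs and whose vertices are listed in order around the boundary; the function names are illustrative, and the fully general (non-convex) case is not covered.

```python
import math


def unit_vector(ra_deg, dec_deg):
    """Cartesian unit vector for a position given in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def in_convex_polygon(ra_deg, dec_deg, vertices):
    """True if the position lies inside the convex polygon with the given vertices.

    vertices: list of (ra_deg, dec_deg) in order around the polygon (either sense).
    """
    p = unit_vector(ra_deg, dec_deg)
    verts = [unit_vector(ra, dec) for ra, dec in vertices]
    # For a small convex polygon the sum of the vertex vectors points to its interior.
    centre = tuple(sum(v[i] for v in verts) for i in range(3))
    for a, b in zip(verts, verts[1:] + verts[:1]):
        normal = cross(a, b)  # normal to the great circle containing this edge
        # The point must lie on the same side of every edge as the interior.
        if dot(normal, p) * dot(normal, centre) < 0:
            return False
    return True


square = [(10.0, -0.5), (11.0, -0.5), (11.0, 0.5), (10.0, 0.5)]  # about 1 square degree
print(in_convex_polygon(10.5, 0.0, square), in_convex_polygon(12.0, 0.0, square))
```

Comparing each edge test against the interior direction, rather than against a fixed sign, makes the result independent of the vertex ordering sense and rejects the antipodal region.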
4.3 By GSC-II object name
Access is to be provided to GSC-II objects by their names.
4.4 By the Whole
An efficient mechanism is required to access the entire database, i.e., for producing export data. It is clear that there will be sorting requirements on export data, e.g., by coordinates (equatorial or galactic). However, it is debatable whether these have to be included within the database access requirements.
4.5 By Astrophysical Characteristics
Once the catalog begins approaching an interesting level of completeness, scientific activities that fall within the general topic of "data mining" will become important. An example is finding all objects with low proper motions and with J-F and F-N colors in a certain band. Brute-force approaches to this kind of task (i.e., reading the whole database) are acceptable, and optimizing this functionality is of lower priority than requirements 4.1 - 4.3; nevertheless, the design should, whenever possible, make use of data mining techniques developed for other projects.
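The brute-force form of the example query can be sketched as a single pass over the object records. The record layout, field names, and units below are assumptions made for illustration; they are not the COMPASS schema.

```python
from dataclasses import dataclass


@dataclass
class CatalogObject:
    """Hypothetical record layout for one catalog object."""
    name: str
    pm_mas_yr: float  # total proper motion, assumed to be in mas/yr
    j_mag: float
    f_mag: float
    n_mag: float


def select(objects, max_pm, jf_range, fn_range):
    """Yield objects with low proper motion and colors inside the given bands."""
    for obj in objects:  # reads every object: the brute-force approach
        jf = obj.j_mag - obj.f_mag
        fn = obj.f_mag - obj.n_mag
        if (obj.pm_mas_yr <= max_pm
                and jf_range[0] <= jf <= jf_range[1]
                and fn_range[0] <= fn <= fn_range[1]):
            yield obj


catalog = [
    CatalogObject("A", 2.0, 18.0, 17.0, 16.5),   # low motion, colors in band
    CatalogObject("B", 50.0, 18.0, 17.0, 16.5),  # proper motion too high
    CatalogObject("C", 1.0, 17.0, 17.0, 16.5),   # J-F outside the band
]
hits = [o.name for o in select(catalog, 5.0, (0.8, 1.2), (0.3, 0.7))]
print(hits)
```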
4.6 Engineering Considerations on the Access Tools
It is expected that, for engineering reasons related both to database architecture and to computer memory availability, the database will be subdivided into about 10^4 regions. The database access subroutines and programs provided to the internal scientific users will minimize the visibility of these regions for ordinary operations.
As the data volumes of the larger entities (e.g., all objects on a plate) will probably stress or exceed the hardware capabilities available, it will be necessary to handle such entities on a region-by-region basis. The utilities provided will not require the users to handle these regions explicitly.
Simple relationships and cross-reference methods need to be provided to associate GSC-II objects with their counterparts in various reference catalogs and to provide complete reference-catalog information for calibrations.
The minimum set of reference catalogs will include GSPC-I, GSPC-II, PPM, TYCHO, HIPPARCOS, and CMC.
Other catalogs of size comparable to the GSC-II are under development, e.g., 2MASS, other Schmidt programs. Provisions for including their data (calibrated, not instrumental) in COMPASS need to be made, although implementation obviously will depend on additional mass storage.
7.0 Auxiliary Data for Calibrations
The calibration/recalibration procedures done in-plate within the database require (in addition to the database elements pertaining to individual inventory objects and, as applicable, the reference objects matched to them) a number of supporting entities:
Note: This requirement is somewhat vague and does not reflect that eventually a large part of the calibration data will be based on database operations.
This is a place-holder for descriptive material about the scans, the pipeline processing, and the calibrations. It is to be debated whether these data need to be part of COMPASS or a separate database.
The design will include backup procedures that protect against catastrophic loss in the case of hardware or media failure.
The backup procedures will:
1. be consistent with the backup requirements of the hierarchical storage system
2. use no more than 10% of the up-time of the database system
3. include off-site storage of back-up media
4. always protect against the loss of more than one month of production work, even in the presence of multiple failures.
5. Subject to item 2 above, it will be permissible to stop other database operations during parts of the backup process.
It is likely that there is a tradeoff between labor and hardware requirements for the backup operations. The design should respond to this with options to be selected in the light of operational needs and available resources.
10.0 Procedures and Administration
As the database will extend over a large disk and DLT system, provisions are required to migrate its elements from one volume to another without reloading the data-objects.
As database restoration from the last full backup (Section 9) will be a disruptive and costly operation, provisions are required to ensure the security of the file system. This should include graceful recovery in the presence of a procedural error or a fatal error in the database ingest task, i.e., it should not be necessary to do a full restore if one corrupts a few plates (or regions). This requirement is non-trivial because of the way object names are defined across plate boundaries.
All operational activities needed to load, back up, and administer the database will be documented as written procedures. To the largest extent possible, these procedures are to be implemented by the GS Operations Group.
It is conceivable that, once or twice after database loading begins, it will be necessary to destroy the database (at the minimum, the astronomical objects, but perhaps extending to the database infrastructure, e.g., regions, federated database, etc.). The recreation of these will also be covered by written procedures.
The interfaces to the database system are as follows:
12.0 Operations Performed in the Database Environment
This section is TBD. What follows is an outline.
12.2 Single Plate Calibrations and Recalibrations
III. Top-Level Derived Requirements