Project URL: http://casjobs.sdss.org/casjobs
Project is in year: 1 of 2
Highlights:
The Sloan Digital Sky Survey (SDSS) Catalog Archive Server (CAS) has been serving data online from the multi-TB SDSS Science Archive since 2003. Our analysis of queries submitted to the SDSS CAS over the past two years reveals a consistent pattern, as we will show with plots of our logging data. Most queries execute in seconds or minutes. However, some queries can take hours or days, because they require non-index scans of the largest tables, perform very complex aggregations of the data, or request very large result sets. These "monster queries" not only take a long time, they also degrade response times for everyone else: one or more of them can clog up the entire system. The statistics also suggest that much of the data transfer is unnecessary; users would prefer to keep results on the server, next to the data, so that they can apply further joins and filtering.
To address these problems, we developed a multi-server, multi-queue batch job submission and tracking system for the CAS called CasJobs. The batch system ensures that the number of resource-intensive queries is well regulated, and such queries can be directed to different servers if needed. To allow analysis close to the data, we give users their own personal databases (MyDB) on the server. Users may transfer data to their MyDB and then perform further analysis before extracting it to their own machine. MyDB tables also provide a convenient way to share query results with collaborators without downloading them. CasJobs is built using SOAP XML Web services and has been in operation for over a year.
We are now studying ways to extend CasJobs to the distributed NVO environment. We envisage a distributed system in which multiple nodes form a trusted network (Virtual Organization) and users with an account can access their workbench at any node without needing to register separately. We foresee users having datasets in multiple "spaces" on multiple machines. We have begun to upgrade CasJobs to include replication of user accounts between multiple systems. The next step will be to implement WS-Security within CasJobs. This ties in well with the VO objective of getting WS-Security working for VOSpace.
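The multi-queue idea described above can be illustrated with a minimal Python sketch. It is not the CasJobs implementation: the class names, queue names, server names, time limits, and the `est_seconds` cost estimate are all hypothetical, chosen only to show how per-queue concurrency caps keep long-running queries from crowding out short ones.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    user: str
    sql: str
    est_seconds: float  # hypothetical cost estimate used to pick a queue


@dataclass
class JobQueue:
    name: str
    max_seconds: float  # jobs estimated at or below this limit may use the queue
    max_running: int    # cap on concurrently running jobs in this queue
    server: str         # hypothetical name of the server this queue feeds
    pending: deque = field(default_factory=deque)
    running: list = field(default_factory=list)

    def start_next(self):
        """Start pending jobs in FIFO order until the concurrency cap is hit."""
        started = []
        while self.pending and len(self.running) < self.max_running:
            job = self.pending.popleft()
            self.running.append(job)
            started.append(job)
        return started


class Scheduler:
    def __init__(self, queues):
        # Sorted by limit so a job lands in the shortest queue that fits it.
        self.queues = sorted(queues, key=lambda q: q.max_seconds)

    def submit(self, job):
        """Place the job in the first queue whose limit covers its estimate."""
        for q in self.queues:
            if job.est_seconds <= q.max_seconds:
                q.pending.append(job)
                return q.name
        raise ValueError("job exceeds every queue limit")

    def dispatch(self):
        """Let every queue start as many jobs as its cap allows."""
        return {q.name: q.start_next() for q in self.queues}


# Illustrative configuration: many short jobs at once, one long job at a time.
scheduler = Scheduler([
    JobQueue("quick", max_seconds=60, max_running=8, server="server-a"),
    JobQueue("long", max_seconds=8 * 3600, max_running=1, server="server-b"),
])
scheduler.submit(Job("alice", "SELECT TOP 10 objID FROM PhotoObj", est_seconds=2))
scheduler.submit(Job("bob", "SELECT COUNT(*) FROM PhotoObj", est_seconds=5400))
scheduler.submit(Job("carol", "SELECT COUNT(*) FROM PhotoObj", est_seconds=7200))
for name, jobs in scheduler.dispatch().items():
    print(name, [j.user for j in jobs])
```

With a cap of one on the hypothetical "long" queue, the second expensive job waits in `pending` while short jobs continue to start on the other server.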
Relevance to NASA:
The NVO is partly funded by NASA, and several NASA mission archives and data centers contribute to the NVO effort. The distributed batch query system we describe is useful for any distributed archive network. WS-Security and VOSpace will also be crucial technologies for archives supported by NASA. CasJobs/MyDB is also the technology used to enable anonymous importing of user data (DataUpload) into OpenSkyQuery (openskyquery.net), an NVO federated query and cross-matching service.
Application to NASA Missions and Programs:
Beyond SDSS, CasJobs will soon be adapted for the GALEX science archive to handle its intensive queries.
Tracking:
We maintain detailed web logs and service logs for CasJobs that record every query submitted to the system.
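As a rough illustration of how such logs can be summarized, the Python sketch below tallies queries by elapsed time. The record layout `(query_id, elapsed_seconds)`, the bucket boundaries, and the sample values are assumptions made up for the example; they are not taken from the CasJobs logs.

```python
from collections import Counter

# Hypothetical bucket edges in seconds. The report only distinguishes
# queries taking "seconds or minutes" from those taking "hours or days".
BUCKETS = [(60, "seconds"), (3600, "minutes"), (86400, "hours")]


def bucket(elapsed_seconds):
    """Return the label of the first bucket whose upper edge exceeds the time."""
    for limit, label in BUCKETS:
        if elapsed_seconds < limit:
            return label
    return "days"


def summarize(records):
    """Count log records per duration bucket.

    Each record is assumed to be a (query_id, elapsed_seconds) pair.
    """
    return Counter(bucket(elapsed) for _, elapsed in records)


# Invented sample records, for illustration only.
sample = [("q1", 1.5), ("q2", 42.0), ("q3", 900.0), ("q4", 7200.0), ("q5", 200000.0)]
print(summarize(sample))
```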
Upcoming Plans:
We have begun to upgrade CasJobs to include replication of user accounts between multiple systems. The next step will be to implement WS-Security within CasJobs. This ties in well with the VO objective of getting WS-Security working for VOSpace. In the long run, we plan to integrate MyDB with MySpace and possibly SRB (Storage Resource Broker). This will require integrating WS-Security with VOStore, which is a major hurdle.