.. _administrate_oracle:

========================================
LHCb Bookkeeping database administration
========================================

This document contains all the information needed to manage the Bookkeeping Oracle database.

Login to the database
=====================

How-To in the `lbDevOps doc `_.

Ticketing, contacts for IT-DB Oracle support
============================================

You can send a mail to the ``_ mailing list; it will be converted to a ticket.
Otherwise, open a ticket directly in `service now `_.
There is also a `mattermost channel `_.

Compile oracle stored procedure
===============================

In order to compile the stored procedures you need the `sql file `_.
Make sure that you are using the correct version.

#. Log in to the database (e.g. using ``sqlplus``, but also with **sqldeveloper** or **sqlcl**)
#. In the terminal execute ``@/path/to/oracle_schema_storedprocedures.sql``
#. ``commit;``

In case of an error, use the ``show errors`` command.

In case of a schema change, you can find the commands that need to be executed in the sql file
`oracle_schema_commands.sql `_.

Discover slow queries in the db
===============================

Note: if you are not familiar with Oracle, it is better to send a mail to the ``_ mailing list.
Write that we have a problem with the database and that the queries are very slow.
An IT/DB expert will find the slow queries, will probably tell you what the problem is, and will try to solve it.

The ``_ is a portal provided by IT/DB where you can log on and find the running queries.
You can find a query which has been running for a very long time, get its execution plan, and also take the query
and run it in ``sqlplus``, so you can compare the execution plan shown on the web with the one from ``sqlplus``.
When you log in to the session manager, you have a reader and a writer account.
All the select queries run with the reader account. ::

    Login form:
    UserName: LHCB_DIRACBOOKKEEPING_users
    Password: pass
    Database: LHCBR

You can also check for slow queries in ``_.
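If your account has access to the dynamic performance views, long-running queries can also be located directly from
``sqlplus``. A minimal sketch (``V$SESSION`` and ``V$SQL`` are standard Oracle views, but whether the reader account
can see them depends on its grants):

.. code-block:: sql

   -- Active sessions whose current call has been running for more than
   -- 60 seconds, together with the text of the statement they execute.
   SELECT s.sid, s.username, s.last_call_et, q.sql_text
     FROM v$session s, v$sql q
    WHERE s.sql_id = q.sql_id
      AND s.status = 'ACTIVE'
      AND s.username IS NOT NULL
      AND s.last_call_et > 60
    ORDER BY s.last_call_et DESC;

The ``sql_id`` returned by this query can then be fed to the session manager or to ``DBMS_XPLAN`` to get the
execution plan of the offending statement.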
You can check the execution plan in the following way, after having logged in with the main owner account:

#. ``set autotrace traceonly``
#. ``set timing on``
#. ``set linesize 30000``
#. ``set pagesize 3000``
#. execute the query

When the query finishes you will have the execution plan as well as the real execution time.
Look at the following parameters::

    Cost (%CPU), consistent gets, physical reads

For example::

    Elapsed: 00:00:00.12

    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 3340191443

    ---------------------------------------------------------------------------------------------------------------------
    | Id  | Operation                            | Name         | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
    ---------------------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                     |              |  4960 |  2232K| 21217   (1)| 00:00:01 |       |       |
    |   1 |  NESTED LOOPS                        |              |       |       |            |          |       |       |
    |   2 |   NESTED LOOPS                       |              |  4960 |  2232K| 21217   (1)| 00:00:01 |       |       |
    |   3 |    PARTITION RANGE ALL               |              |  4897 |  1219K|  1619   (1)| 00:00:01 |     1 |    20 |
    |   4 |     TABLE ACCESS BY LOCAL INDEX ROWID| JOBS         |  4897 |  1219K|  1619   (1)| 00:00:01 |     1 |    20 |
    |*  5 |      INDEX RANGE SCAN                | PROD_CONFIG  |  4897 |       |    88   (0)| 00:00:01 |     1 |    20 |
    |   6 |    PARTITION RANGE ITERATOR          |              |     1 |       |     3   (0)| 00:00:01 |   KEY |   KEY |
    |*  7 |     INDEX RANGE SCAN                 | JOBS_REP_VIS |     1 |       |     3   (0)| 00:00:01 |   KEY |   KEY |
    |   8 |   TABLE ACCESS BY LOCAL INDEX ROWID  | FILES        |     1 |   206 |     4   (0)| 00:00:01 |     1 |     1 |
    ---------------------------------------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------

       5 - access("J"."PRODUCTION"=51073)
       7 - access("J"."JOBID"="F"."JOBID" AND "F"."GOTREPLICA"='Yes')

    Statistics
    ----------------------------------------------------------
             46  recursive calls
              0  db block gets
            508  consistent gets
             46  physical reads
           1452  redo size
          56603  bytes sent via SQL*Net to client
            640  bytes received via SQL*Net from client
             10  SQL*Net roundtrips to/from client
              1  sorts (memory)
              0  sorts (disk)
            131  rows processed

Problems:

* the cost is a big number
* the consistent gets is very high
* the physical reads are very high

Notes:

* You may have queries which genuinely need to read a lot of data; in that case high consistent gets and physical
  reads are expected. In the example above, however, the query only returned 131 rows, so if the consistent gets and
  physical reads were very high (for example more than 10k) we would have a problem.
* TABLE ACCESS FULL is not good: you have to make sure that the query uses an index. This is not always true.
* Parallel execution: you have to make sure that, when the query runs in parallel, the processes do not send too
  much data between each other. If you run a query in parallel and the consistent gets is very high, then you have
  a problem. Contact the Oracle IT/DB support if you do not know what to do.
* CARTESIAN join: if you see that word in the execution plan, the query is wrong.

=================================
Steps in the Bookkeeping database
=================================

Steps are used to process/produce data. The steps are used by the Production Management system and workflow.
The steps are stored in the steps table, which has the following columns::

    STEPID
    STEPNAME
    APPLICATIONNAME
    APPLICATIONVERSION
    OPTIONFILES
    DDDB
    CONDDB
    EXTRAPACKAGES
    INSERTTIMESTAMPS
    VISIBLE
    INPUTFILETYPES
    OUTPUTFILETYPES
    PROCESSINGPASS
    USABLE
    DQTAG
    OPTIONSFORMAT
    ISMULTICORE
    SYSTEMCONFIG
    MCTCK

The steps table has 3 triggers::

    STEP_INSERT: This trigger is used to replace NULL, None by an empty string.
    steps_before_insert: It checks that the processing pass contains a '/'.
    step_update: Steps which are already used can not be modified.

Modifying steps
===============

We may want to modify an already used step. A step can be modified only if the trigger is disabled.
The following commands have to be performed in order to modify a step:

.. code-block:: sql

   ALTER TRIGGER step_update disable;
   UPDATE steps SET stepname='Reco16Smog for 2015 pA', processingpass='Reco16Smog' WHERE stepid=129609; -- an alternative is to use the StepManager page
   ALTER TRIGGER step_update enable;

==================================
Processing pass in the Bookkeeping
==================================

The processing pass is a collection of steps. The processing pass is stored in the processing table::

    ID
    ParentID
    Name

The following example illustrates how to create a processing pass:

.. code-block:: sql

   SELECT max(id)+1 FROM processing;
   SELECT * FROM processing WHERE name='Real Data';
   INSERT INTO processing(id, parentid, name) VALUES (1915, 12, 'Reco16Smog');

In this example we have created the following processing pass: ``/Real Data/Reco16Smog``

The following query can be used to check the processing pass:

.. code-block:: sql

   SELECT * FROM (SELECT DISTINCT SYS_CONNECT_BY_PATH(name, '/') Path, id ID
                    FROM processing v
                   START WITH id IN (SELECT DISTINCT id FROM processing WHERE name='Real Data')
                 CONNECT BY NOCYCLE PRIOR id=parentid) v
    WHERE v.path='/Real Data/Reco16Smog';

If we know the processing id, we can use the following query to find out the processing pass:

.. code-block:: sql

   SELECT v.id, v.path
     FROM (SELECT DISTINCT LEVEL-1 Pathlen, SYS_CONNECT_BY_PATH(name, '/') Path, id
             FROM processing
            WHERE LEVEL > 0 AND id=1915
          CONNECT BY PRIOR id=parentid
            ORDER BY Pathlen DESC) v
    WHERE rownum<=1;

=====================
Bookkeeping down time
=====================

The following services/agents need to be stopped before a deep down time (the SystemAdministrator can be used in
order to manage the services)::

    RMS:
        RequestExecutingAgent - check it really stops (may take a long time)
    TS:
        BookkeepingWatchAgent
        TransformationAgent - Reco, DM, MergePlus (this is to be checked; it was not stopped during the latest deep downtime)
        TransformationCleaningAgent
        MCSimulationTestingAgent
    PMS:
        ProductionStatusAgent
        RequestTrackingAgent
    DMS:
        PopularityAgent
        StorageHistoryAgent(s)

Just before the intervention, stop all Bookkeeping services.

===============================================
Automatic updating of the productionoutputfiles
===============================================

Create an Oracle periodic job:

.. code-block:: sql

   BEGIN
     DBMS_SCHEDULER.CREATE_JOB (
       job_name        => 'produpdatejob',
       job_type        => 'PLSQL_BLOCK',
       job_action      => 'BEGIN BKUTILITIES.updateProdOutputFiles(); END;',
       repeat_interval => 'FREQ=MINUTELY; interval=10',
       start_date      => systimestamp,
       enabled         => TRUE);
   END;
   /

For monitoring:

.. code-block:: sql

   SELECT JOB_NAME, STATE, LAST_START_DATE, LAST_RUN_DURATION, NEXT_RUN_DATE, RUN_COUNT, FAILURE_COUNT
     FROM USER_SCHEDULER_JOBS;

Debugging the produpdatejob in case of failure:

- ``sqlplus LHCB_DIRACBOOKKEEPING/xxxxx@LHCB_DIRACBOOKKEEPING``
- ``set serveroutput on``
- ``exec BKUTILITIES.updateProdOutputFiles();``

You will see the problematic production, which you will need to fix.
For example, if the production is 22719, you can use the following queries to debug:

.. code-block:: sql

   SELECT j.production, j.stepid, f.eventtypeid, f.filetypeid, f.gotreplica, f.visibilityflag
     FROM jobs j, files f
    WHERE j.jobid = f.jobid AND j.production=22719
      AND f.gotreplica IS NOT NULL AND f.filetypeid NOT IN (9,17)
    GROUP BY j.production, j.stepid, f.eventtypeid, f.filetypeid, f.gotreplica, f.visibilityflag
    ORDER BY f.gotreplica, f.visibilityflag ASC;

   SELECT * FROM files f, jobs j
    WHERE j.jobid = f.jobid AND j.production=22719
      AND f.gotreplica IS NOT NULL AND f.eventtypeid IS NULL AND f.filetypeid NOT IN (9,17);

   UPDATE files SET eventtypeid=90000000
    WHERE fileid IN (SELECT f.fileid FROM files f, jobs j
                      WHERE j.jobid = f.jobid AND j.production=22719
                        AND f.gotreplica IS NOT NULL AND f.eventtypeid IS NULL AND f.filetypeid NOT IN (9,17));
   commit;

=====================================
Automatic updating of the prodrunview
=====================================

Create an Oracle periodic job:

.. code-block:: sql

   BEGIN
     DBMS_SCHEDULER.CREATE_JOB (
       job_name        => 'prodrunupdatejob',
       job_type        => 'PLSQL_BLOCK',
       job_action      => 'BEGIN BKUTILITIES.updateprodrunview(); END;',
       repeat_interval => 'FREQ=MINUTELY; interval=20',
       start_date      => systimestamp,
       enabled         => TRUE);
   END;
   /

For monitoring:

.. code-block:: sql

   SELECT JOB_NAME, STATE, LAST_START_DATE, LAST_RUN_DURATION, NEXT_RUN_DATE, RUN_COUNT, FAILURE_COUNT
     FROM USER_SCHEDULER_JOBS;

Debugging the prodrunupdatejob in case of failure:

- ``sqlplus LHCB_DIRACBOOKKEEPING/xxxxx@LHCB_DIRACBOOKKEEPING``
- ``set serveroutput on``
- ``exec BKUTILITIES.updateprodrunview();``

You will see the problematic production, which you will need to fix.

===================
Managing partitions
===================

The `jobs` and `files` tables are partitioned.
The `jobs` table is RANGE-HASH partitioned by `production` and `configurationid`.
The `files` table is RANGE partitioned by `jobid`.

jobs table partitions
=====================

The last partitions are called `prodlast` and `runlast`. The maximum value of the `prodlast` partition is MAXVALUE.
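Before deciding whether a split is needed, it can help to inspect the current partition layout and row counts. A
sketch using the standard ``USER_TAB_PARTITIONS`` dictionary view (note that ``num_rows`` is only as fresh as the
last gathering of optimizer statistics):

.. code-block:: sql

   -- Current partitions of the jobs and files tables, in partition order.
   SELECT table_name, partition_name, high_value, num_rows
     FROM user_tab_partitions
    WHERE table_name IN ('JOBS', 'FILES')
    ORDER BY table_name, partition_position;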
This may require a split if we see performance degradation, which can happen if too many rows (`jobs`) belong to
this partition. The recommended way to split the partition is to declare a down time, because when the partition is
split the non-partitioned indexes become invalid. The non-partitioned indexes need to be recreated, which will block
writing to the DB.

The procedure for splitting the `prodlast` partition:

.. code-block:: sql

   SELECT max(production) FROM jobs PARTITION(prodlast) WHERE production!=99998;
   ALTER TABLE jobs SPLIT PARTITION prodlast AT (xxxxx) INTO (PARTITION prodXXX, PARTITION prodlast);

One possibility is to split `prodlast` at the last production, which can be retrieved using the first query above:
`xxxxx` is the result of the query and `prodXXX` is the last partition+1. For example:

.. code-block:: sql

   SELECT max(production) FROM jobs PARTITION(prodlast) WHERE production!=99998;

which returns 83013. Then:

.. code-block:: sql

   ALTER TABLE jobs SPLIT PARTITION prodlast AT (83013) INTO (PARTITION prod4, PARTITION prodlast);

Rebuild the non-partitioned indexes:

.. code-block:: sql

   ALTER INDEX SYS_C00302478 REBUILD;
   ALTER INDEX JOB_NAME_UNIQUE REBUILD;

files table partitions
======================

This table is RANGE partitioned by `jobid`, which can reach the maximum value of the existing partitions.
If this happens, the following error will appear::

    2018-07-30 01:12:00 UTC dirac-jobexec/UploadOutputData ERROR: Could not send Bookkeeping XML file to server:
    Unable to create file /lhcb/MC/2015/SIM/00075280/0000/00075280_00009971_1.sim ! ERROR: Execution failed.: (
    ORA-14400: inserted partition key does not map to any partition
    ORA-06512: at "LHCB_DIRACBOOKKEEPING.BOOKKEEPINGORACLEDB", line 976
    ORA-06512: at line 1

In order to fix the issue a new partition has to be created:

.. code-block:: sql

   ALTER TABLE files ADD PARTITION SECT_0620M VALUES LESS THAN (620000000);

===================
Database monitoring
===================

Various queries are available in the ``BookkeepingSystem/DB/monitoring.sql`` file. They can be used for discovering
problems such as database locks, broken Oracle jobs, sessions, used indexes, etc.
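As an illustration of the kind of checks collected in that file, the following sketch uses standard Oracle views to
spot blocked sessions and failing scheduler jobs (the actual queries in ``monitoring.sql`` may differ):

.. code-block:: sql

   -- Sessions currently blocked by another session (lock contention).
   SELECT sid, serial#, username, blocking_session, seconds_in_wait
     FROM v$session
    WHERE blocking_session IS NOT NULL;

   -- Scheduler jobs (e.g. produpdatejob, prodrunupdatejob) that are
   -- broken or have recorded failures.
   SELECT job_name, state, failure_count, last_start_date
     FROM user_scheduler_jobs
    WHERE state = 'BROKEN' OR failure_count > 0;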