LHCb Bookkeeping database administration

This document contains all the information needed to manage the Bookkeeping Oracle database.

Login to the database

How-To in lbDevOps doc.

Ticketing, contacts for IT-DB Oracle support

You can send a mail to the phydb.support@cern.ch mailing list. It will be converted to a ticket.

Otherwise, open a ticket directly in ServiceNow.

There is also a Mattermost channel.

Compile oracle stored procedure

In order to compile a stored procedure you need the corresponding SQL file. Make sure that you are using the correct version.

  1. Log in to the database (e.g. using sqlplus, but sqldeveloper or sqlcl also work)
  2. In the terminal execute @/path/to/oracle_schema_storedprocedures.sql
  3. Run commit;

In case of an error, use the 'show errors' command to see the details.

In case of a schema change, the commands that need to be executed can be found in the oracle_schema_commands.sql file.
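Put together, a full recompilation session could look like the sketch below. The connect string and file path are placeholders, not the real ones; adapt them to the actual account and checkout location.

```sql
-- Sketch of a stored-procedure recompilation session in sqlplus.
-- Connect string and path are placeholders:
--   sqlplus LHCB_DIRACBOOKKEEPING@LHCBR

-- Load and compile the stored procedures from the SQL file:
@/path/to/oracle_schema_storedprocedures.sql

-- If a procedure fails to compile, inspect the compilation errors:
SHOW ERRORS

-- Persist the changes:
COMMIT;
```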

Discover slow queries in the db

Note: If you are not familiar with Oracle, it is better to send a mail to the phydb.support@cern.ch mailing list. You can write that we have a problem with the database and that the queries are very slow. An IT/DB expert will find the slow queries, will probably tell you what the problem is, and will try to solve it.

https://cern.ch/session-manager is a portal provided by IT/DB where you can log on and see the running queries. You can find a query which has been running for a long time, get its execution plan, and also take the query and run it in sqlplus, so you can compare the execution plan shown in the web interface with the one obtained from sqlplus.

When you log in to the session manager, you have a reader and a writer account. All the SELECT queries run with the reader account.

Login form:
UserName: LHCB_DIRACBOOKKEEPING_users
Password: pass
Database: LHCBR

You can also check for slow queries in oem.cern.ch.

You can check the execution plan in the following way, after having logged in with the main owner account:

  1. set autotrace traceonly (the abbreviation set autot traceo also works)
  2. set timing on
  3. set linesize 30000
  4. set pagesize 3000
  5. execute the query
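Put together, a typical sqlplus session for getting an execution plan looks like the sketch below. The query itself is only an illustration built from the tables mentioned later in this document; substitute the slow query you actually want to analyse.

```sql
-- Typical autotrace session in sqlplus.
SET AUTOTRACE TRACEONLY   -- run the query, show plan + statistics, suppress rows
SET TIMING ON             -- show the elapsed time
SET LINESIZE 30000        -- avoid wrapping the plan output
SET PAGESIZE 3000

-- Illustrative query; replace it with the query under investigation:
SELECT j.jobid, f.filename
  FROM jobs j, files f
 WHERE j.jobid = f.jobid
   AND j.production = 51073
   AND f.gotreplica = 'Yes';
```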

When the query finishes you will get the execution plan as well as the real execution time. Look in particular at the following parameters:

Cost (%CPU), consistent gets, physical reads

For example:

Elapsed: 00:00:00.12
Execution Plan
----------------------------------------------------------
Plan hash value: 3340191443
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name         | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |              |  4960 |  2232K| 21217   (1)| 00:00:01 |       |       |
|   1 |  NESTED LOOPS                          |              |       |       |            |          |       |       |
|   2 |   NESTED LOOPS                         |              |  4960 |  2232K| 21217   (1)| 00:00:01 |       |       |
|   3 |    PARTITION RANGE ALL                 |              |  4897 |  1219K|  1619   (1)| 00:00:01 |     1 |    20 |
|   4 |     TABLE ACCESS BY LOCAL INDEX ROWID  | JOBS         |  4897 |  1219K|  1619   (1)| 00:00:01 |     1 |    20 |
|*  5 |      INDEX RANGE SCAN                  | PROD_CONFIG  |  4897 |       |    88   (0)| 00:00:01 |     1 |    20 |
|   6 |    PARTITION RANGE ITERATOR            |              |     1 |       |     3   (0)| 00:00:01 |   KEY |   KEY |
|*  7 |     INDEX RANGE SCAN                   | JOBS_REP_VIS |     1 |       |     3   (0)| 00:00:01 |   KEY |   KEY |
|   8 |   TABLE ACCESS BY LOCAL INDEX ROWID    | FILES        |     1 |   206 |     4   (0)| 00:00:01 |     1 |     1 |
---------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
 5 - access("J"."PRODUCTION"=51073)
 7 - access("J"."JOBID"="F"."JOBID" AND "F"."GOTREPLICA"='Yes')
Statistics
----------------------------------------------------------
         46  recursive calls
          0  db block gets
        508  consistent gets
         46  physical reads
       1452  redo size
      56603  bytes sent via SQL*Net to client
        640  bytes received via SQL*Net from client
         10  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
        131  rows processed

Problems:

  • the cost is a big number
  • the consistent gets are very high
  • the physical reads are very high

Notes:

  • You may have queries which genuinely need to read a lot of data; in that case high consistent gets and physical reads are expected. In the example above, however, very high consistent gets and physical reads (for example more than 10k) would indicate a problem, because the query only returned 131 rows.
  • TABLE ACCESS FULL is usually bad. You have to make sure that the query uses an index, although a full scan can be acceptable for a query that really needs to read most of a table.
  • Parallel execution: make sure that the query really runs in parallel and that the processes do not send too much data between each other. If you run a query in parallel and the consistent gets are still very high, you have a problem. Contact IT/DB Oracle support if you do not know what to do.
  • CARTESIAN join: if you see that word in the execution plan, the query is wrong.

Steps in the Bookkeeping database

Steps are used to process/produce data. They are used by the Production Management system and the workflow. The steps are stored in the steps table, which has the following columns:

STEPID
STEPNAME
APPLICATIONNAME
APPLICATIONVERSION
OPTIONFILES
DDDB
CONDDB
EXTRAPACKAGES
INSERTTIMESTAMPS
VISIBLE
INPUTFILETYPES
OUTPUTFILETYPES
PROCESSINGPASS
USABLE
DQTAG
OPTIONSFORMAT
ISMULTICORE
SYSTEMCONFIG
MCTCK

The steps table has 3 triggers:

STEP_INSERT: replaces NULL and 'None' values with an empty string.
steps_before_insert: checks that the processing pass contains a '/'.
step_update: prevents modification of steps which are already used.
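Before touching a step it is worth checking its current content and whether it is already in use. The sketch below assumes that "used" means referenced by the jobs table via jobs.stepid; verify this assumption against the step_update trigger body.

```sql
-- Inspect a step before modifying it (stepid 129609 is the example
-- used in the next section).
SELECT stepid, stepname, processingpass, usable
  FROM steps
 WHERE stepid = 129609;

-- Check whether the step is already used. The jobs.stepid reference
-- is an assumption; confirm it against the step_update trigger.
SELECT count(*)
  FROM jobs
 WHERE stepid = 129609;
```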

Modifying steps

We may want to modify an already used step. Such a step can only be modified while the step_update trigger is disabled. The following commands have to be performed in order to modify a step:

ALTER TRIGGER step_update disable;
UPDATE steps SET stepname='Reco16Smog for 2015 pA', processingpass='Reco16Smog' WHERE stepid=129609; -- alternatively, use the StepManager page
ALTER TRIGGER step_update enable;

Processing pass in the Bookkeeping

The processing pass is a collection of steps. Processing passes are stored in the processing table, which has the following columns:

ID
ParentID
Name

The following example illustrates how to create a processing pass:

SELECT max(id)+1 FROM processing; -- choose the next free id
SELECT * FROM processing WHERE name='Real Data'; -- find the parent id
INSERT INTO processing(id, parentid, name) VALUES (1915, 12, 'Reco16Smog');

In this example we have created the following processing pass: /Real Data/Reco16Smog

The following query can be used to check the newly created processing pass:

SELECT * FROM (SELECT distinct SYS_CONNECT_BY_PATH(name, '/') Path, id ID
     FROM processing v   START WITH id in (SELECT distinct id FROM processing where name='Real Data')
CONNECT BY NOCYCLE PRIOR  id=parentid) v   where v.path='/Real Data/Reco16Smog';

If we know the processing id, we can use the following query to find out the processing pass:

SELECT v.id,v.path FROM (SELECT distinct  LEVEL-1 Pathlen, SYS_CONNECT_BY_PATH(name, '/') Path, id
  FROM processing
  WHERE LEVEL > 0 and id=1915
  CONNECT BY PRIOR id=parentid order by Pathlen desc) v where rownum<=1;

Bookkeeping down time

The following services/agents need to be stopped before a deep downtime (the SystemAdministrator can be used to manage the services):

RMS:
  RequestExecutingAgent
    check it really stops (may take long time)
TS:
  BookkeepingWatchAgent
  TransformationAgent - Reco, DM, MergePlus (this is to be checked; these were not stopped during the latest deep downtime)
  TransformationCleaningAgent
  MCSimulationTestingAgent
PMS:
  ProductionStatusAgent
  RequestTrackingAgent
DMS:
  PopularityAgent
StorageHistoryAgent(s)

Just before the intervention stop all Bookkeeping services.
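As a hedged sketch, the agents above can be stopped with the DIRAC system administrator CLI. The host name below is a placeholder, and the exact system/agent names must be checked against the actual configuration before the intervention.

```shell
# Hedged sketch: stopping agents with the DIRAC sysadmin CLI.
# The host name is a placeholder; system/agent names are assumptions
# to be verified against the real setup.
dirac-admin-sysadmin-cli --host lbvoboxXX.cern.ch
> stop Transformation BookkeepingWatchAgent
> stop Transformation TransformationCleaningAgent
> quit
# Afterwards, verify on the host that the processes really stopped
# (the RequestExecutingAgent in particular may take a long time).
```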

Automatic updating of the productionoutputfiles

Create an oracle periodic job:

BEGIN
  DBMS_SCHEDULER.CREATE_JOB (
     job_name             => 'produpdatejob',
     job_type             => 'PLSQL_BLOCK',
     job_action           => 'BEGIN BKUTILITIES.updateProdOutputFiles(); END;',
     repeat_interval      => 'FREQ=MINUTELY; interval=10',
     start_date           => systimestamp,
     enabled              =>  TRUE
     );
  END;
  /

For monitoring:

select JOB_NAME, STATE, LAST_START_DATE, LAST_RUN_DURATION, NEXT_RUN_DATE, RUN_COUNT, FAILURE_COUNT from USER_SCHEDULER_JOBS;

Debugging the produpdatejob in case of failure:

When the job fails you will see the problematic production, which you will need to fix. For example, if the production is 22719, you can use the following queries to debug:
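The error text of a failed run can be retrieved from the standard Oracle scheduler run log:

```sql
-- Last runs of the job, with error details of the failed ones.
SELECT log_date, status, error#, additional_info
  FROM user_scheduler_job_run_log
 WHERE job_name = 'PRODUPDATEJOB'
 ORDER BY log_date DESC;
```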

SELECT j.production,J.STEPID, f.eventtypeid, f.filetypeid, f.gotreplica, f.visibilityflag
  FROM jobs j, files f WHERE
    j.jobid = f.jobid AND
    j.production=22719 and
    f.gotreplica IS NOT NULL and
    f.filetypeid NOT IN(9,17) GROUP BY j.production, J.STEPID, f.eventtypeid, f.filetypeid, f.gotreplica, f.visibilityflag Order by f.gotreplica,f.visibilityflag asc;

select * from files f, jobs j where
    j.jobid = f.jobid AND
    j.production=22719 and
    f.gotreplica IS NOT NULL and
          f.eventtypeid is NULL and
    f.filetypeid NOT IN(9,17);

update files set eventtypeid=90000000 where fileid in (select f.fileid from files f, jobs j where  j.jobid = f.jobid AND
j.production=22719 and
f.gotreplica IS NOT NULL and
f.eventtypeid is NULL and
f.filetypeid NOT IN(9,17));

commit;

Automatic updating of the prodrunview

Create an oracle periodic job:

BEGIN
DBMS_SCHEDULER.CREATE_JOB (
   job_name             => 'prodrunupdatejob',
   job_type             => 'PLSQL_BLOCK',
   job_action           => 'BEGIN BKUTILITIES.updateprodrunview(); END;',
   repeat_interval      => 'FREQ=MINUTELY; interval=20',
   start_date           => systimestamp,
   enabled              =>  TRUE
   );
END;
/

For monitoring:

select JOB_NAME, STATE, LAST_START_DATE, LAST_RUN_DURATION, NEXT_RUN_DATE, RUN_COUNT, FAILURE_COUNT from USER_SCHEDULER_JOBS;

Debugging the prodrunupdatejob in case of failure:

You will see the problematic production, which you will need to fix.

Managing partitions

The jobs and files tables are partitioned: the jobs table is RANGE-HASH partitioned by production and configurationid, and the files table is RANGE partitioned by jobid.

jobs table partitions

The last partitions are called prodlast and runlast. The upper bound of the prodlast partition is MAXVALUE. It may have to be split if we see performance degradation, which can happen if too many rows (jobs) belong to this partition. The recommended way to split the partition is to declare a downtime, because when the partition is split the non-partitioned indexes become invalid. The non-partitioned indexes need to be rebuilt, which blocks writing to the DB. The procedure for splitting the prodlast partition:

select max(production) from jobs PARTITION(prodlast) where production!=99998;
ALTER TABLE jobs SPLIT PARTITION prodlast AT (xxxxx) INTO (PARTITION prodXXX, PARTITION prodlast);

One possibility is to split prodlast at the last production, which can be retrieved using the query above: xxxxx is the result of the query, and prodXXX is the last partition number plus one. For example:

select max(production) from jobs PARTITION(prodlast) where production!=99998;

which result is 83013

ALTER TABLE jobs SPLIT PARTITION prodlast AT (83013) INTO (PARTITION prod4, PARTITION prodlast);

Rebuild the non partitioned indexes:

ALTER INDEX SYS_C00302478 REBUILD;
ALTER INDEX JOB_NAME_UNIQUE REBUILD;
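To find out which indexes were invalidated by the split (and therefore need rebuilding), a standard query on the user_indexes view can be used:

```sql
-- List indexes of the jobs table that became unusable after the
-- partition split.
SELECT index_name, status
  FROM user_indexes
 WHERE table_name = 'JOBS'
   AND status = 'UNUSABLE';
```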

files table partitions

This table is RANGE partitioned by jobid, which can reach the upper bound of the last existing partition. If this happens, the following error will appear:

2018-07-30 01:12:00 UTC dirac-jobexec/UploadOutputData ERROR: Could not send Bookkeeping XML file to server:
Unable to create file /lhcb/MC/2015/SIM/00075280/0000/00075280_00009971_1.sim ! ERROR: Execution failed.: (
ORA-14400: inserted partition key does not map to any partition
ORA-06512: at "LHCB_DIRACBOOKKEEPING.BOOKKEEPINGORACLEDB", line 976
ORA-06512: at line 1

In order to fix the issue a new partition has to be created:

alter table files add PARTITION SECT_0620M  VALUES LESS THAN (620000000);
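To choose the bound for the new partition, check the existing partitions and their upper bounds, together with the current maximum jobid:

```sql
-- Existing partitions of the files table with their upper bounds.
SELECT partition_name, high_value
  FROM user_tab_partitions
 WHERE table_name = 'FILES'
 ORDER BY partition_position;

-- Current maximum jobid, to verify that the new bound is high enough.
SELECT max(jobid) FROM jobs;
```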

Database monitoring

Various queries are available in the BookkeepingSystem/DB/monitoring.sql file. They can be used for discovering problems such as database locks, broken oracle jobs, sessions, used indexes, etc.