.. _administrate_oracle:

========================================
LHCb Bookkeeping database administration
========================================

This document contains all the information needed to manage the Bookkeeping Oracle database.

Login to the database
=====================

How-To in the `lbDevOps doc `_.

Ticketing, contacts for IT-DB Oracle support
============================================

You can send a mail to the ``_ mailing list; it will be converted to a ticket.
Otherwise, open a ticket directly in `service now `_.
There is also a `mattermost channel `_.

Compile oracle stored procedure
===============================

In order to compile the stored procedures you need the `sql file `_.
Make sure that you are using the correct version.

#. Log in to the database (e.g. using ``sqlplus``, but also with **sqldeveloper** or **sqlcl**)
#. In the terminal execute ``@/path/to/oracle_schema_storedprocedures.sql``
#. ``commit;``

In case of an error, use the ``show errors`` command.

In case of a schema change, you can find the commands that need to be executed in the sql file
`oracle_schema_commands.sql `_.

Discover slow queries in the db
===============================

Note: if you are not familiar with Oracle, it is better to send a mail to the ``_ mailing list.
Write that we have a problem with the database and that the queries are very slow.
An IT/DB expert will find the slow queries, will probably tell you what the problem is, and will try to solve it.

The ``_ is a portal provided by IT/DB where you can log on and find the running queries.
You can find a query which has been running for a very long time, get its execution plan, and also take the query
and run it in ``sqlplus``, so you can compare the execution plan shown on the web with the one from ``sqlplus``.
When you log in to the session manager, you have a reader and a writer account.
All the select queries run with the reader account. ::

    Login form:
    UserName: LHCB_DIRACBOOKKEEPING_users
    Password: pass
    Database: LHCBR

You can also check for slow queries in ``_.
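If your account has access to the dynamic performance views, long-running queries can also be located directly from
``sqlplus``. A minimal sketch (``V$SESSION`` and ``V$SQL`` are standard Oracle views, but whether the reader account
can see them depends on its grants):

.. code-block:: sql

   -- Active sessions whose current call has been running for more than
   -- 60 seconds, together with the text of the statement they execute.
   SELECT s.sid, s.username, s.last_call_et, q.sql_text
     FROM v$session s, v$sql q
    WHERE s.sql_id = q.sql_id
      AND s.status = 'ACTIVE'
      AND s.username IS NOT NULL
      AND s.last_call_et > 60
    ORDER BY s.last_call_et DESC;

The ``sql_id`` returned by this query can then be fed to the session manager or to ``DBMS_XPLAN`` to get the
execution plan of the offending statement.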
You can check the execution plan in the following way, after having logged in with the main owner account:

#. ``set autotrace traceonly``
#. ``set timing on``
#. ``set linesize 30000``
#. ``set pagesize 3000``
#. execute the query

When the query finishes you will have the execution plan as well as the real execution time.
Look at the following parameters::

    Cost (%CPU), consistent gets, physical reads

For example::

    Elapsed: 00:00:00.12

    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 3340191443

    ---------------------------------------------------------------------------------------------------------------------
    | Id  | Operation                            | Name         | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
    ---------------------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                     |              |  4960 |  2232K| 21217   (1)| 00:00:01 |       |       |
    |   1 |  NESTED LOOPS                        |              |       |       |            |          |       |       |
    |   2 |   NESTED LOOPS                       |              |  4960 |  2232K| 21217   (1)| 00:00:01 |       |       |
    |   3 |    PARTITION RANGE ALL               |              |  4897 |  1219K|  1619   (1)| 00:00:01 |     1 |    20 |
    |   4 |     TABLE ACCESS BY LOCAL INDEX ROWID| JOBS         |  4897 |  1219K|  1619   (1)| 00:00:01 |     1 |    20 |
    |*  5 |      INDEX RANGE SCAN                | PROD_CONFIG  |  4897 |       |    88   (0)| 00:00:01 |     1 |    20 |
    |   6 |    PARTITION RANGE ITERATOR          |              |     1 |       |     3   (0)| 00:00:01 |   KEY |   KEY |
    |*  7 |     INDEX RANGE SCAN                 | JOBS_REP_VIS |     1 |       |     3   (0)| 00:00:01 |   KEY |   KEY |
    |   8 |   TABLE ACCESS BY LOCAL INDEX ROWID  | FILES        |     1 |   206 |     4   (0)| 00:00:01 |     1 |     1 |
    ---------------------------------------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------

       5 - access("J"."PRODUCTION"=51073)
       7 - access("J"."JOBID"="F"."JOBID" AND "F"."GOTREPLICA"='Yes')

    Statistics
    ----------------------------------------------------------
             46  recursive calls
              0  db block gets
            508  consistent gets
             46  physical reads
           1452  redo size
          56603  bytes sent via SQL*Net to client
            640  bytes received via SQL*Net from client
             10  SQL*Net roundtrips to/from client
              1  sorts (memory)
              0  sorts (disk)
            131  rows processed

Problems:

* the cost is a big number
* the consistent gets is very high
* the physical reads are very high

Notes:

* You may have queries which genuinely need to read a lot of data; in that case high consistent gets and physical
  reads are expected. In the example above, however, the query only returned 131 rows, so if the consistent gets and
  physical reads were very high (for example more than 10k) we would have a problem.
* TABLE ACCESS FULL is not good: you have to make sure that the query uses an index. This is not always true.
* Parallel execution: you have to make sure that, when the query runs in parallel, the processes do not send too
  much data between each other. If you run a query in parallel and the consistent gets is very high, then you have
  a problem. Contact the Oracle IT/DB support if you do not know what to do.
* CARTESIAN join: if you see that word in the execution plan, the query is wrong.

=================================
Steps in the Bookkeeping database
=================================

Steps are used to process/produce data. The steps are used by the Production Management system and workflow.
The steps are stored in the steps table, which has the following columns::

    STEPID
    STEPNAME
    APPLICATIONNAME
    APPLICATIONVERSION
    OPTIONFILES
    DDDB
    CONDDB
    EXTRAPACKAGES
    INSERTTIMESTAMPS
    VISIBLE
    INPUTFILETYPES
    OUTPUTFILETYPES
    PROCESSINGPASS
    USABLE
    DQTAG
    OPTIONSFORMAT
    ISMULTICORE
    SYSTEMCONFIG
    MCTCK

The steps table has 3 triggers::

    STEP_INSERT: This trigger is used to replace NULL, None by an empty string.
    steps_before_insert: It checks that the processing pass contains a '/'.
    step_update: Steps which are already used can not be modified.

Modifying steps
===============

We may want to modify an already used step. A step can be modified only if the trigger is disabled.
The following commands have to be performed in order to modify a step:

.. code-block:: sql

   ALTER TRIGGER step_update disable;
   UPDATE steps SET stepname='Reco16Smog for 2015 pA', processingpass='Reco16Smog' WHERE stepid=129609; -- an alternative is to use the StepManager page
   ALTER TRIGGER step_update enable;

==================================
Processing pass in the Bookkeeping
==================================

The processing pass is a collection of steps. The processing pass is stored in the processing table::

    ID
    ParentID
    Name

The following example illustrates how to create a processing pass:

.. code-block:: sql

   SELECT max(id)+1 FROM processing;
   SELECT * FROM processing WHERE name='Real Data';
   INSERT INTO processing(id, parentid, name) VALUES (1915, 12, 'Reco16Smog');

In this example we have created the following processing pass: ``/Real Data/Reco16Smog``

The following query can be used to check the processing pass:

.. code-block:: sql

   SELECT * FROM (SELECT DISTINCT SYS_CONNECT_BY_PATH(name, '/') Path, id ID
                    FROM processing v
                   START WITH id IN (SELECT DISTINCT id FROM processing WHERE name='Real Data')
                 CONNECT BY NOCYCLE PRIOR id=parentid) v
    WHERE v.path='/Real Data/Reco16Smog';

If we know the processing id, we can use the following query to find out the processing pass:

.. code-block:: sql

   SELECT v.id, v.path
     FROM (SELECT DISTINCT LEVEL-1 Pathlen, SYS_CONNECT_BY_PATH(name, '/') Path, id
             FROM processing
            WHERE LEVEL > 0 AND id=1915
          CONNECT BY PRIOR id=parentid
            ORDER BY Pathlen DESC) v
    WHERE rownum<=1;

=====================
Bookkeeping down time
=====================

The following services/agents need to be stopped before a deep down time (the SystemAdministrator can be used in
order to manage the services)::

    RMS:
        RequestExecutingAgent - check it really stops (may take a long time)
    TS:
        BookkeepingWatchAgent
        TransformationAgent - Reco, DM, MergePlus (this is to be checked; it was not stopped during the latest deep downtime)
        TransformationCleaningAgent
        MCSimulationTestingAgent
    PMS:
        ProductionStatusAgent
        RequestTrackingAgent
    DMS:
        PopularityAgent
        StorageHistoryAgent(s)

Just before the intervention, stop all Bookkeeping services.

===============================================
Automatic updating of the productionoutputfiles
===============================================

Create an Oracle periodic job:

.. code-block:: sql

   BEGIN
     DBMS_SCHEDULER.CREATE_JOB (
       job_name        => 'produpdatejob',
       job_type        => 'PLSQL_BLOCK',
       job_action      => 'BEGIN BKUTILITIES.updateProdOutputFiles(); END;',
       repeat_interval => 'FREQ=MINUTELY; interval=10',
       start_date      => systimestamp,
       enabled         => TRUE);
   END;
   /

For monitoring:

.. code-block:: sql

   SELECT JOB_NAME, STATE, LAST_START_DATE, LAST_RUN_DURATION, NEXT_RUN_DATE, RUN_COUNT, FAILURE_COUNT
     FROM USER_SCHEDULER_JOBS;

Debugging the produpdatejob in case of failure:

- ``sqlplus LHCB_DIRACBOOKKEEPING/xxxxx@LHCB_DIRACBOOKKEEPING``
- ``set serveroutput on``
- ``exec BKUTILITIES.updateProdOutputFiles();``

You will see the problematic production, which you will need to fix.
For example, if the production is 22719, you can use the following queries to debug:

.. code-block:: sql

   SELECT j.production, j.stepid, f.eventtypeid, f.filetypeid, f.gotreplica, f.visibilityflag
     FROM jobs j, files f
    WHERE j.jobid = f.jobid AND j.production=22719
      AND f.gotreplica IS NOT NULL AND f.filetypeid NOT IN (9,17)
    GROUP BY j.production, j.stepid, f.eventtypeid, f.filetypeid, f.gotreplica, f.visibilityflag
    ORDER BY f.gotreplica, f.visibilityflag ASC;

   SELECT * FROM files f, jobs j
    WHERE j.jobid = f.jobid AND j.production=22719
      AND f.gotreplica IS NOT NULL AND f.eventtypeid IS NULL AND f.filetypeid NOT IN (9,17);

   UPDATE files SET eventtypeid=90000000
    WHERE fileid IN (SELECT f.fileid FROM files f, jobs j
                      WHERE j.jobid = f.jobid AND j.production=22719
                        AND f.gotreplica IS NOT NULL AND f.eventtypeid IS NULL AND f.filetypeid NOT IN (9,17));
   commit;

=====================================
Automatic updating of the prodrunview
=====================================

Create an Oracle periodic job:

.. code-block:: sql

   BEGIN
     DBMS_SCHEDULER.CREATE_JOB (
       job_name        => 'prodrunupdatejob',
       job_type        => 'PLSQL_BLOCK',
       job_action      => 'BEGIN BKUTILITIES.updateprodrunview(); END;',
       repeat_interval => 'FREQ=MINUTELY; interval=20',
       start_date      => systimestamp,
       enabled         => TRUE);
   END;
   /

For monitoring:

.. code-block:: sql

   SELECT JOB_NAME, STATE, LAST_START_DATE, LAST_RUN_DURATION, NEXT_RUN_DATE, RUN_COUNT, FAILURE_COUNT
     FROM USER_SCHEDULER_JOBS;

Debugging the prodrunupdatejob in case of failure:

- ``sqlplus LHCB_DIRACBOOKKEEPING/xxxxx@LHCB_DIRACBOOKKEEPING``
- ``set serveroutput on``
- ``exec BKUTILITIES.updateprodrunview();``

You will see the problematic production, which you will need to fix.

===================
Managing partitions
===================

The `jobs` and `files` tables are partitioned.
The `jobs` table is RANGE-HASH partitioned by `production` and `configurationid`.
The `files` table is RANGE partitioned by `jobid`.

jobs table partitions
=====================

The last partitions are called `prodlast` and `runlast`. The maximum value of the `prodlast` partition is MAXVALUE.
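Before deciding whether a split is needed, it can help to inspect the current partition layout and row counts. A
sketch using the standard ``USER_TAB_PARTITIONS`` dictionary view (note that ``num_rows`` is only as fresh as the
last gathering of optimizer statistics):

.. code-block:: sql

   -- Current partitions of the jobs and files tables, in partition order.
   SELECT table_name, partition_name, high_value, num_rows
     FROM user_tab_partitions
    WHERE table_name IN ('JOBS', 'FILES')
    ORDER BY table_name, partition_position;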
This may require a split if we see performance degradation, which can happen if too many rows (`jobs`) belong to
this partition. The recommended way to split the partition is to declare a down time, because when the partition is
split the non-partitioned indexes become invalid. The non-partitioned indexes need to be recreated, which will block
writing to the DB.

The procedure for splitting the `prodlast` partition:

.. code-block:: sql

   SELECT max(production) FROM jobs PARTITION(prodlast) WHERE production!=99998;
   ALTER TABLE jobs SPLIT PARTITION prodlast AT (xxxxx) INTO (PARTITION prodXXX, PARTITION prodlast);

One possibility is to split `prodlast` at the last production, which can be retrieved using the first query above:
`xxxxx` is the result of the query and `prodXXX` is the last partition+1. For example:

.. code-block:: sql

   SELECT max(production) FROM jobs PARTITION(prodlast) WHERE production!=99998;

which returns 83013. Then:

.. code-block:: sql

   ALTER TABLE jobs SPLIT PARTITION prodlast AT (83013) INTO (PARTITION prod4, PARTITION prodlast);

Rebuild the non-partitioned indexes:

.. code-block:: sql

   ALTER INDEX SYS_C00302478 REBUILD;
   ALTER INDEX JOB_NAME_UNIQUE REBUILD;

files table partitions
======================

This table is RANGE partitioned by `jobid`, which can reach the maximum value of the existing partitions.
If this happens, the following error will appear::

    2018-07-30 01:12:00 UTC dirac-jobexec/UploadOutputData ERROR: Could not send Bookkeeping XML file to server:
    Unable to create file /lhcb/MC/2015/SIM/00075280/0000/00075280_00009971_1.sim ! ERROR: Execution failed.: (
    ORA-14400: inserted partition key does not map to any partition
    ORA-06512: at "LHCB_DIRACBOOKKEEPING.BOOKKEEPINGORACLEDB", line 976
    ORA-06512: at line 1

In order to fix the issue a new partition has to be created:

.. code-block:: sql

   ALTER TABLE files ADD PARTITION SECT_0620M VALUES LESS THAN (620000000);

===================
Database monitoring
===================

Various queries are available in the ``BookkeepingSystem/DB/monitoring.sql`` file. They can be used for discovering
problems such as database locks, broken Oracle jobs, sessions, used indexes, etc.
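As an illustration of the kind of checks collected in that file, the following sketch uses standard Oracle views to
spot blocked sessions and failing scheduler jobs (the actual queries in ``monitoring.sql`` may differ):

.. code-block:: sql

   -- Sessions currently blocked by another session (lock contention).
   SELECT sid, serial#, username, blocking_session, seconds_in_wait
     FROM v$session
    WHERE blocking_session IS NOT NULL;

   -- Scheduler jobs (e.g. produpdatejob, prodrunupdatejob) that are
   -- broken or have recorded failures.
   SELECT job_name, state, failure_count, last_start_date
     FROM user_scheduler_jobs
    WHERE state = 'BROKEN' OR failure_count > 0;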