Productions

These advices can be applied to all sort of productions.

Hospital

When?

When all files have been processed except some that cannot make it due to either excessive CPU or memory (e.g. at IN2P3 jobs stalled)

How?

In the CS: /Operations/LHCb-Production/Hospital set 2 or 3 options:

Transformations: list of productions to be processed at the hospital queue
HospitalSite: for example CLOUD.CERN.cern (or any other site without strict limitations)
HospitalCE: if needed, define a specific CE (e.g. one with more memory or CPU)
It is also possible to define some "Clinics" for specific purposes:

Within /Operations/LHCb-Production/Hospital/Clinics define one section per clinic, with a dummy name (e.g. Stripping)
For each clinic, define a list of transformations, a site and possibly a CE. If Site or CE is not defines the site / CE set for hospital is used.
Transformations: list of productions to be processed at this clinic
ClinicSite: for example CLOUD.CERN.cern (or any other site without strict limitations)
ClinicCE: if needed, define a specific CE (e.g. one with more memory or CPU)

Then:

reset the files Unused. They will be brokered to the designated hospital site, wherever the input data is.

Example:

[localhost] ~ $ dirac-transformation-debug 71500 --Status MaxReset --Info jobs
 Transformation 71500 (Active) of type DataStripping (plugin ByRunWithFlush, GroupSize: 1) in Real Data/Reco17/Stripping29r2
BKQuery: {'StartRun': 199386L, 'ConfigName': 'LHCb', 'EndRun': 200350L, 'EventType': 90000000L, 'FileType': 'RDST', 'ProcessingPass': 'Real Data/Reco17', 'Visible': 'Yes', 'DataQualityFlag': ['OK', 'UNCHECKED'], 'ConfigVersion':
'Collision17', 'DataTakingConditions': 'Beam6500GeV-VeloClosed-MagDown'}

2 files found with status ['MaxReset']


 1 LFNs: ['/lhcb/LHCb/Collision17/RDST/00066581/0004/00066581_00043347_1.rdst'] : Status of corresponding 5 jobs (sorted):
Jobs: 200059983, 200171070, 200415532, 201337455, 201397489
Sites (CPU): LCG.IN2P3.fr (3609 s), LCG.IN2P3.fr (3635 s), LCG.IN2P3.fr (3626 s), LCG.IN2P3.fr (3617 s), LCG.IN2P3.fr (3649 s)
  5 jobs terminated with status: Failed; Job stalled: pilot not running; DaVinci step 1
      1 jobs stalled with last line: (LCG.IN2P3.fr) EventSelector     SUCCESS Reading Event record 3001. Record number within stream 1: 3001
      1 jobs stalled with last line: (LCG.IN2P3.fr) EventSelector     SUCCESS Reading Event record 4001. Record number within stream 1: 4001
      2 jobs stalled with last line: (LCG.IN2P3.fr) EventSelector     SUCCESS Reading Event record 5001. Record number within stream 1: 5001
      1 jobs stalled with last line: (LCG.IN2P3.fr) EventSelector     SUCCESS Reading Event record 3001. Record number within stream 1: 3001

 1 LFNs: ['/lhcb/LHCb/Collision17/RDST/00066581/0007/00066581_00074330_1.rdst'] : Status of corresponding 7 jobs (sorted):
Jobs: 200339756, 200636702, 200762354, 200913317, 200945856, 201337457, 201397490
Sites (CPU): LCG.IN2P3.fr (32 s), LCG.IN2P3.fr (44 s), LCG.IN2P3.fr (46 s), LCG.IN2P3.fr (45 s), LCG.IN2P3.fr (44 s), LCG.IN2P3.fr (3362 s), LCG.IN2P3.fr (38 s)
  7 jobs terminated with status: Failed; Job stalled: pilot not running; DaVinci step 1
      1 jobs stalled with last line: (LCG.IN2P3.fr) dirac-jobexec INFO: DIRAC JobID 200339756 is running at site LCG.IN2P3.fr
      1 jobs stalled with last line: (LCG.IN2P3.fr) dirac-jobexec/Subprocess VERBOSE: systemCall: ['lb-run', '--use-grid', '-c', 'best', '--use=AppConfig v3r [...]
      1 jobs stalled with last line: (LCG.IN2P3.fr) dirac-jobexec/Subprocess VERBOSE: systemCall: ['lb-run', '--use-grid', '-c', 'best', '--use=AppConfig v3r [...]
      1 jobs stalled with last line: (LCG.IN2P3.fr) dirac-jobexec/Subprocess VERBOSE: systemCall: ['lb-run', '--use-grid', '-c', 'best', '--use=AppConfig v3r [...]
      1 jobs stalled with last line: (LCG.IN2P3.fr) dirac-jobexec/Subprocess VERBOSE: systemCall: ['lb-run', '--use-grid', '-c', 'best', '--use=AppConfig v3r [...]
      1 jobs stalled with last line: (LCG.IN2P3.fr) EventSelector     SUCCESS Reading Event record 2001. Record number within stream 1: 2001
      1 jobs stalled with last line: (LCG.IN2P3.fr) dirac-jobexec/Subprocess VERBOSE: systemCall: ['lb-run', '--use-grid', '-c', 'best', '--use=AppConfig v3r [...]

So 2 files have to be hospitalised. Reset them Unused:

[localhost] ~ $ dirac-transformation-reset-files 71500 --Status MaxReset
2 files were set Unused in transformation 71500