Merging

DM checks at the end of Merging

When the Merging productions for a given Stripping are over, three situations may arise:

  1. Output file is registered in the FC but is not registered in the BKK
  2. Output file is registered in the FC but has replica flag = No in the BKK
  3. Output file is registered in the BKK but not in the FC

First case

Nothing can be done: we do not have all the information, so the file should simply be removed. If it is a debug file, it will be cleaned up anyway when the DEBUG storage is cleaned.

Second case

Either the file is at its final destination, in which case the replica flag can simply be set (--FixBK in dirac-dms-check-fc2bkk), or the file is in a failover storage. In the latter case, it is enough to replicate the file to its run destination (dirac-dms-replicate-to-run-destination) and remove the replica on the failover storage.

Third case

If the file physically exists, it can simply be registered in the FC.
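
The three cases can be summarised as a small lookup. The sketch below is only an illustration: the function name suggest_action and its argument values are hypothetical, and it prints the recommended step rather than running anything.

```shell
# suggest_action IN_FC IN_BK [WHERE]   (hypothetical helper, not part of DIRAC)
#   IN_FC: yes|no            file registered in the FC
#   IN_BK: yes|noflag|no     in the BKK with replica flag / without it / absent
#   WHERE: final|failover    only used for the second case (default: final)
suggest_action() {
    case "$1/$2" in
        yes/no)
            # First case: not enough information, remove the file
            echo "remove the file (dirac-dms-remove-files)" ;;
        yes/noflag)
            # Second case: depends on where the replica currently is
            if [ "${3:-final}" = "failover" ]; then
                echo "dirac-dms-replicate-to-run-destination, then remove the failover replica"
            else
                echo "dirac-dms-check-fc2bkk --FixBK"
            fi ;;
        no/yes|no/noflag)
            # Third case: only possible if the file physically exists
            echo "register the file in the FC if it physically exists" ;;
        yes/yes)
            echo "nothing to do" ;;
        *)
            echo "unknown state: $1/$2" >&2
            return 1 ;;
    esac
}
```

For instance, `suggest_action yes noflag failover` prints the replicate-then-remove step of the second case.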

Examples

The third case shows up only in the output replication transformation, which marks these files as MissingInFC. For the first two cases, the best tool is dirac-dms-check-fc2bkk.

For example:

[LHCbDirac prod] diracos $ dirac-dms-check-fc2bkk --Prod 69080,69076,68774,68772
Processing production 69080
Getting files from 18 directories  : found 5348 files with replicas and 0 without in 13.9 seconds
Getting 5348 files metadata from BK : completed in 1.8 seconds
>>>>
1 files are in the FC but have replica = NO in BK
====== Now checking 1 files from FC to SE ======
Checking replicas for 1 files : found 1 files with replicas and 0 without in 1.1 seconds
Get FC metadata for 1 files to be checked:  : completed in 0.1 seconds
Check existence and compare checksum file by file...
Getting checksum of 1 replicas in 1 SEs
0. At RAL-DST (1 files) : completed in 1.7 seconds
Verifying checksum of 1 files
No files in FC not in BK -> OK!
No missing replicas at sites -> OK!
No replicas have a bad checksum -> OK!
All files exist and have a correct checksum -> OK!
====== Completed, 1 files are in the FC and SE but have replica = NO in BK ======
1 files are visible, 0 files are invisible
/lhcb/LHCb/Collision15/BHADRONCOMPLETEEVENT.DST/00069080/0000/00069080_00003151_1.bhadroncompleteevent.dst :
Visi Y
Full list of files:    grep InFCButBKNo CheckFC2BK-2.txt
Use --FixBK to fix it (set the replica flag) or --FixFC (for removing from FC and storage)
<<<<
No files in FC not in BK -> OK!
Processed production 69080
Processing production 69076
Getting files from 18 directories  : found 5789 files with replicas and 0 without in 10.3 seconds
Getting 5789 files metadata from BK : completed in 2.9 seconds
No files in FC with replica = NO in BK -> OK!
No files in FC not in BK -> OK!
Processed production 69076
Processing production 68774
Getting files from 18 directories  : found 7510 files with replicas and 0 without in 12.7 seconds
Getting 7510 files metadata from BK : completed in 2.8 seconds
No files in FC with replica = NO in BK -> OK!
No files in FC not in BK -> OK!
Processed production 68774
Processing production 68772
Getting files from 18 directories  : found 10702 files with replicas and 0 without in 14.8 seconds
Getting 10702 files metadata from BK : completed in 4.2 seconds
No files in FC with replica = NO in BK -> OK!
>>>>
1 files are in the FC but are NOT in BK:
/lhcb/debug/Collision15/LEPTONIC.MDST/00068772/0000/00068772_00003806_1.leptonic.mdst
Full list of files:    grep InFCNotInBK CheckFC2BK-3.txt
Use --FixFC to fix it (remove from FC and storage)
<<<<
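
Both anomalies are then fixed: the replica flag is set with --FixBK for the file in production 69080, and the debug file from production 68772, which is not in the BK, is removed.
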
[LHCbDirac prod] diracos $ dirac-dms-check-fc2bkk --LFN=/lhcb/LHCb/Collision15/BHADRONCOMPLETEEVENT.DST/00069080/0000/00069080_00003151_1.bhadroncompleteevent.dst --FixBK
Checking replicas for 1 files : found 1 files with replicas and 0 without in 0.3 seconds
Getting 1 files metadata from BK : completed in 0.0 seconds
>>>>
1 files are in the FC but have replica = NO in BK
====== Now checking 1 files from FC to SE ======
Checking replicas for 1 files : found 1 files with replicas and 0 without in 4.8 seconds
Get FC metadata for 1 files to be checked:  : completed in 0.4 seconds
Check existence and compare checksum file by file...
Getting checksum of 1 replicas in 1 SEs
0. At RAL-DST (1 files) : completed in 1.0 seconds
Verifying checksum of 1 files
No files in FC not in BK -> OK!
No missing replicas at sites -> OK!
No replicas have a bad checksum -> OK!
All files exist and have a correct checksum -> OK!
====== Completed, 1 files are in the FC and SE but have replica = NO in BK ======
1 files are visible, 0 files are invisible
/lhcb/LHCb/Collision15/BHADRONCOMPLETEEVENT.DST/00069080/0000/00069080_00003151_1.bhadroncompleteevent.dst :
Visi Y
Full list of files:    grep InFCButBKNo CheckFC2BK-4.txt
Going to fix them, setting the replica flag
       Successfully added replica flag to 1 files
<<<<
No files in FC not in BK -> OK!

[LHCbDirac prod] diracos $ dirac-dms-remove-files /lhcb/debug/Collision15/LEPTONIC.MDST/00068772/0000/00068772_00003806_1.leptonic.mdst
Removing 1 files : completed in 8.1 seconds
Successfully removed 1 files
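
When several productions are checked in one command, the anomalies are scattered through the output. A minimal sketch, assuming the output has been saved to a file (the name check.log is hypothetical) and that the line formats match the first transcript above, prints one line per anomaly prefixed with the production it belongs to. The function name summarise_fc2bkk is an illustration, not part of DIRAC.

```shell
# summarise_fc2bkk [FILE...]   (hypothetical helper; reads stdin if no file is given)
# Assumes the output format of dirac-dms-check-fc2bkk shown in the transcript:
#   "Processing production <id>" starts a block,
#   "<n> files are in the FC but ..." reports an anomaly in that block.
summarise_fc2bkk() {
    awk '
        /^Processing production /             { prod = $3 }
        /^[0-9]+ files are in the FC but /    { print prod ": " $0 }
    ' "$@"
}
```

With the first transcript saved as check.log, `summarise_fc2bkk check.log` prints:

69080: 1 files are in the FC but have replica = NO in BK
68772: 1 files are in the FC but are NOT in BK: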

Jobs failing during finalize

Problem:

If a Merge job fails during finalisation, its input files may not be removed. In addition, its output files may be incorrectly uploaded or registered. Therefore, starting from the remaining non-merged files, one can find the anomalies and fix them. This requires getting the invisible files in the DataStripping production and checking their descendants in the Merge production.

Examples:

Get the invisible files of the DataStripping production (here 69528) that still have replicas, and check their descendants in the Merging production (here 69529):

[localhost] ~ $ dirac-bookkeeping-get-files --Production 69528 --Visi No | dirac-production-check-descendants 69529
Got 59 LFNs
Processing Merge production 69529
Looking for descendants of type ['EW.DST', 'BHADRON.MDST', 'SEMILEPTONIC.DST', 'DIMUON.DST', 'CALIBRATION.DST', 'FTAG.DST', 'CHARMCOMPLETEEVENT.DST', 'BHADRONCOMPLETEEVENT.DST', 'CHARM.MDST', 'LEPTONIC.MDST']
Getting files from the TransformationSystem...
Found 59 processed files and 0 non processed files (1.2 seconds)
Now getting daughters for 59 processed mothers in production 69529 (depth 1) : completed in 5.9 seconds
Checking replicas for 2 files : found 2 files with replicas and 0 without in 0.4 seconds
Checking FC for 2 file found in FC and not in BK : found 2 in Failover in 0.5 seconds
Checking replicas for 2 files (not in Failover) : found 0 files with replicas and 0 without in 0.5 seconds

Results:
2 descendants were found in Failover and have no replica flag in BK
All files:
/lhcb/LHCb/Collision16/DIMUON.DST/00069529/0001/00069529_00012853_1.dimuon.dst
/lhcb/LHCb/Collision16/BHADRONCOMPLETEEVENT.DST/00069529/0001/00069529_00012813_1.bhadroncompleteevent.dst
You should check whether they are in a failover request by looking at their job status and in the RMS...
To list them:     grep InFailover CheckDescendantsResults_69529-1.txt
2 unique daughters found with real descendants
No processed LFNs with multiple descendants found -> OK!
No processed LFNs without descendants found -> OK!
No non processed LFNs with multiple descendants found -> OK!
No non processed LFNs with descendants found -> OK!
Complete list of files is in CheckDescendantsResults_69529-1.txt
Processed production 69529 in 9.4 seconds

After checking in the RMS whether they have matching Requests and, if so, what happened to those Requests, we can replicate the files to their final destination and then remove them from Failover:

[localhost] ~ $ grep InFailover CheckDescendantsResults_69529-1.txt | dirac-dms-replicate-to-run-destination --RemoveSource --SE Tier1-DST
Got 2 LFNs
Replicating 2 files to CERN-DST-EOS
Successful :
    CERN-DST-EOS :
        /lhcb/LHCb/Collision16/BHADRONCOMPLETEEVENT.DST/00069529/0001/00069529_00012813_1.bhadroncompleteevent.dst :
             register : 0.757441997528
            replicate : 655.287761927
        /lhcb/LHCb/Collision16/DIMUON.DST/00069529/0001/00069529_00012853_1.dimuon.dst :
             register : 0.632274866104
            replicate : 46.3780457973

Finally, check again and remove the non-merged files:

[localhost] ~ $ dirac-dms-remove-files --Last
Got 59 LFNs
Removing 59 files : completed in 103.1 seconds
59 files in status Processed in transformation 69529: status unchanged
Successfully removed 59 files

Flushing runs

When a file is problematic in the Stripping production, or if a RAW file was not processed in the Reco, the run cannot be flushed automatically (number of ancestors != number of RAW files in the run). We list the runs in the Stripping production (here 71498) that have problematic files, and we flush them in the Merging production (here 71499):

[localhost] ~ $ dirac-transformation-debug 71498 --Status Problematic --Info files | dirac-bookkeeping-file-path --GroupBy RunNumber --Summary --List
Got 29 LFNs
Successful :
    RunNumber 201413 : 1 files
    RunNumber 201423 : 1 files
    RunNumber 201467 : 1 files
    RunNumber 201602 : 1 files
    RunNumber 201643 : 1 files
    RunNumber 201647 : 1 files
    RunNumber 201664 : 1 files
    RunNumber 201719 : 1 files
    RunNumber 201745 : 2 files
    RunNumber 201749 : 1 files
    RunNumber 201822 : 1 files
    RunNumber 201833 : 1 files
    RunNumber 201864 : 1 files
    RunNumber 201873 : 1 files
    RunNumber 201983 : 1 files
    RunNumber 202031 : 1 files
    RunNumber 202717 : 1 files
    RunNumber 202722 : 1 files
    RunNumber 202752 : 1 files
    RunNumber 202773 : 1 files
    RunNumber 202809 : 1 files
    RunNumber 202825 : 1 files
    RunNumber 202835 : 2 files
    RunNumber 202860 : 1 files
    RunNumber 202869 : 1 files
    RunNumber 202873 : 1 files
    RunNumber 202887 : 1 files

List of RunNumber values
201413,201423,201467,201602,201643,201647,201664,201719,201745,201749,201822,201833,201864,201873,201983,202031,202717,202722,202752,202773,202809,202825,202835,202860,202869,202873,202887

Then flush the runs in the Merging production:

[localhost] ~ $ dirac-transformation-flush-runs 71499 --Runs 201413,201423,201467,201602,201643,201647,201664,201719,201745,201749,201822,201833,201864,201873,201983,202031,202717,202722,202752,202773,202809,202825,202835,202860,202869,202873,202887
Runs being flushed in transformation 71499:
201413,201423,201467,201602,201643,201647,201664,201719,201745,201749,201822,201833,201864,201873,201983,202031,202717,202722,202752,202773,202809,202825,202835,202860,202869,202873,202887
27 runs set to Flush in transformation 71499
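
Long run lists are easy to truncate or break when copied from a terminal, so it can be worth checking the list before passing it to --Runs. The sketch below is an illustration only: join_runs and count_runs are hypothetical helper names, and they do nothing more than strip whitespace and count the comma-separated entries.

```shell
# join_runs LIST    (hypothetical helper)
# Remove spaces, tabs and line breaks from a comma-separated run list.
join_runs() {
    printf '%s' "$1" | tr -d ' \t\n'
}

# count_runs LIST   (hypothetical helper)
# Print the number of non-empty entries in a comma-separated run list.
count_runs() {
    join_runs "$1" | tr ',' '\n' | awk 'NF { n++ } END { print n + 0 }'
}
```

Applied to the list above, count_runs prints 27, which matches the "27 runs set to Flush" line reported by dirac-transformation-flush-runs.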

Then, starting from the runs that are not flushed in the Merging production, we can check whether some RAW files do not have descendants:

dirac-bookkeeping-run-files <runNumber> | grep FULL | dirac-bookkeeping-get-file-descendants

The files that are marked as NotProcessed or NoDescendants are in runs that will need to be flushed by hand.
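
When several runs remain unflushed, the same pipeline can be repeated for each of them. The following is a minimal sketch: the function name check_raw_descendants and the DRY_RUN variable are hypothetical, and the only DIRAC commands used are the ones shown above. With DRY_RUN=1 the pipelines are printed instead of executed, which allows reviewing them first.

```shell
# check_raw_descendants RUN [RUN...]   (hypothetical helper)
# Runs the descendants check shown above for each run number.
# Set DRY_RUN=1 to print the pipelines without executing them.
check_raw_descendants() {
    for run in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "dirac-bookkeeping-run-files $run | grep FULL | dirac-bookkeeping-get-file-descendants"
        else
            echo "=== Run $run ==="
            dirac-bookkeeping-run-files "$run" | grep FULL | dirac-bookkeeping-get-file-descendants
        fi
    done
}
```

For example, `DRY_RUN=1; check_raw_descendants 201745 202835` prints the two pipelines that would be run.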

Another way of understanding why a run is not flushed is to use dirac-transformation-debug, but this takes a very long time:

dirac-transformation-debug --Status=Unused --Info=flush <mergingProd>