The popularity analysis relies on several components:


Despite its name, the StorageUsageDB is where both the StorageUsageAgent and the PopularityAgent store their data. It is exposed via the StorageUsageHandler and the DataUsageHandler.


This agent scans the DFC (DIRAC File Catalog) and stores the size and number of files per directory and per StorageElement in the StorageUsageDB.
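The aggregation described above can be illustrated with a minimal sketch. It is not the agent's actual code: the function name `summarize_replicas`, the input record layout `(lfn, storage_element, size)` and the sample paths are all assumptions made for illustration.

```python
import posixpath
from collections import defaultdict


def summarize_replicas(replicas):
    """Aggregate replica records per (directory, storage element).

    `replicas` is an iterable of (lfn, storage_element, size_in_bytes)
    tuples; this layout is hypothetical, not the real catalogue schema.
    """
    summary = defaultdict(lambda: {"files": 0, "size": 0})
    for lfn, storage_element, size in replicas:
        # The directory is the LFN without its final path component.
        directory = posixpath.dirname(lfn)
        entry = summary[(directory, storage_element)]
        entry["files"] += 1
        entry["size"] += size
    return dict(summary)


# Made-up catalogue content, for illustration only.
replicas = [
    ("/some/dir/run1/file1.dst", "SE-A", 100),
    ("/some/dir/run1/file2.dst", "SE-A", 250),
    ("/some/dir/run1/file1.dst", "SE-B", 100),
]
summary = summarize_replicas(replicas)
```

Keying the summary on the (directory, StorageElement) pair keeps one row per combination, which matches the "per directory and per StorageElement" granularity mentioned above.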


This agent crawls the StorageUsageDB, converts each directory into a bookkeeping path and fills in the following accounting types:
  • Storage: space used/free per storage and/or directory
  • Data storage: space used per bookkeeping path
  • User storage: like Storage, but for user directories
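As a rough illustration of the "space used per bookkeeping path" accounting, the sketch below sums per-directory sizes into per-path totals. The function `space_per_bk_path`, the mapping `directory_to_bk_path` and the sample values are hypothetical; how the real agent resolves a directory to a bookkeeping path is not shown here.

```python
from collections import defaultdict


def space_per_bk_path(directory_sizes, directory_to_bk_path):
    """Sum directory sizes (bytes) per bookkeeping path.

    Both arguments are plain dicts in this sketch; the directory to
    bookkeeping path mapping is assumed to be available already.
    """
    totals = defaultdict(int)
    for directory, size in directory_sizes.items():
        bk_path = directory_to_bk_path.get(directory)
        if bk_path is None:
            # No known bookkeeping path: skipped in this sketch.
            continue
        totals[bk_path] += size
    return dict(totals)


# Made-up values, for illustration only.
directory_sizes = {"/some/dir/a": 10, "/some/dir/b": 5, "/some/dir/c": 7}
directory_to_bk_path = {"/some/dir/a": "/BK/path/1", "/some/dir/b": "/BK/path/1"}
bk_totals = space_per_bk_path(directory_sizes, directory_to_bk_path)
```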


This service is called by the jobs to declare their use of a given directory. The usage is stored per directory and per day.
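The per-directory, per-day bookkeeping can be sketched as a simple in-memory counter. This is only an assumption about the shape of the data; the class `UsageLog` and its methods are hypothetical and are not the service's real interface, which stores its data in a database.

```python
from collections import Counter
from datetime import date


class UsageLog:
    """In-memory stand-in for per-directory, per-day usage storage."""

    def __init__(self):
        self._counts = Counter()

    def declare_usage(self, directory, day=None, count=1):
        # Default to today when the caller does not give a day.
        day = day or date.today()
        self._counts[(directory, day)] += count

    def usage(self, directory, day):
        # Counter returns 0 for keys that were never declared.
        return self._counts[(directory, day)]


log = UsageLog()
log.declare_usage("/some/dir/a", date(2020, 1, 1))
log.declare_usage("/some/dir/a", date(2020, 1, 1))
log.declare_usage("/some/dir/a", date(2020, 1, 2))
```

Using the (directory, day) pair as the key means repeated declarations from many jobs on the same day collapse into a single counter.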


This agent goes through the StorageUsageDB and creates accounting entries for the popularity. It also caches the BK dictionary for each directory in the StorageUsageDB.

DataPop server

A Yandex-provided service that consumes our popularity CSV and predicts which datasets to remove. It runs on our Mesos cluster:


This agent creates two files:
  • one CSV containing a summary of the popularity (see the popularity.csv file).
  • one CSV, generated from the first one through the DataPop server.
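To give an idea of what producing a summary CSV might look like, here is a minimal sketch that collapses per-day usage counts into one row per directory. The column names `directory` and `total_usage`, the function `write_popularity_csv` and the input layout are assumptions; the real popularity.csv format is not described here and is likely different.

```python
import csv
import io


def write_popularity_csv(usage_per_day, out):
    """Write one row per directory with its summed usage.

    `usage_per_day` maps (directory, day_string) to a usage count;
    this layout and the column names are hypothetical.
    """
    totals = {}
    for (directory, _day), count in usage_per_day.items():
        totals[directory] = totals.get(directory, 0) + count

    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["directory", "total_usage"])
    # Sort for a stable, reproducible row order.
    for directory in sorted(totals):
        writer.writerow([directory, totals[directory]])


# Made-up usage data, for illustration only.
usage_per_day = {
    ("/some/dir/a", "2020-01-01"): 3,
    ("/some/dir/a", "2020-01-02"): 2,
    ("/some/dir/b", "2020-01-01"): 1,
}
buffer = io.StringIO()
write_popularity_csv(usage_per_day, buffer)
csv_text = buffer.getvalue()
```

Writing to a file-like object rather than a path keeps the sketch testable; passing an open file instead of `io.StringIO` would write the same content to disk.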