Kosice Technical meeting day 1:

We went through the reports; see those collected in Indico under the Tuesday morning session "Tuesday morning: Reports and Presentations":
https://indico.lucas.lu.se/sessionDisplay.py?sessionId=0&confId=279#20160531

Kosice Technical meeting day 2 meeting notes:
---------------------------------------

1) Performance data: discussion
ACTION: server-side performance data depot: Copenhagen and Ulf will deliver it.
ACTION: the submit perf data file will be world-writable (the initscript will do it).

2) A-REX behavior under heavy load
A-REX crashes on heavily loaded clusters (heavy data staging): typical lifetime is 1 day! Several sites now have cron jobs to restart A-REX daily. Could be a memory leak, since a restart helps. NDGF is working on a test system.
ACTION: Aleksandr will prepare a "crash-data collector" script with some instructions on a wiki that will help sysadmins provide debugging data.
ACTION: Alarik is one of the most unstable clusters. Save this cluster for DEBUGGING!!!!

3) Towards Python (in the LRMS modules)
LSF and SCEAPI (Chinese) Python scripts are used. Slurm is also there but not tested. The infoprovider part is unclear. The info part calls Python from Perl. The Python version might have limitations for RTE handling.
ACTION: test the available Python-SLURM, compare it to the 5.1.1 non-Python version. A stable 5.1.1-based branch with Python is needed (Martin).
Related: Python versions status: 2.6 is the required version.

4) VOs & ARC: short status update
See Florido's slides; we publish VO-dependent job statistics. Development is done, code is on the trunk.
Discussion: default VO set by sysadmins on a CE? Some problems with the infosys startup scripts were brought up.

5) A-REX and HTCondor
- US Globus CEs to be shut down by end of 2016, install Condor CE on them.
- Condor CE shortcomings: no infosys, manual/hackish accounting...
ACTION: investigate the Condor CE interface, possible job solution/management to CCE using simplified ClassAds.
Check how the CCLI does it.
Not yet decided whether interfacing to CondorCE is to be done inside ACT or inside ARClib.
Ugly(?) solution: ARClib wrapper (???) around condor-cli.
The missing link between ACT and CondorCE is a must.
APF: limited in scope, not extensible. ACT could be the replacement.
- Other direction: Condorcli to A-REX via EMI-ES.
Prerequisites:
-- Working EMI-ES for pilot jobs.
ACTION: have EMI-ES based clusters in Ljubljana and CERN (Boinc) with submission via ACT; if needed, prepare a quick bugfix release for EMI-ES fixes. Check scalability too. Then, if everything works, Balazs will go and contact Miron again.

6) Top priority bugs
Checked the P1, P2 and major bugs. Two bugs got fixed.

7) Protection against hanging/timeout external processes in A-REX, including the infosys special use-case
http://bugzilla.nordugrid.org/show_bug.cgi?id=3468
No general rules, everything depends on the script(s). E.g. for authorization scripts there is a timeout.
An example of a misbehaving script: ceinfo.pl takes too long, and while it is still running an A-REX restart will launch a new ceinfo.pl instance, causing problems.
Also discussed ceinfo.pl scalability; various ideas but no concrete action yet.

8) 10 years of RTEs: time for replacement?
CSC: keeps both the RTE scripts and software on CVMFS. On every CE they run a cron job to copy the RTE scripts to a local dir set in arc.conf.
ENV/PROXY: let's integrate it into core A-REX: won't do it yet.
Andrii came up with an idea for how to change the RTE scripts so that they can be distributed & installed under a standard location, removing the requirement for a shared RTE dir.
ACTION: Andrii is responsible for the RTE change.

9) Server logs
Feedback: some logs are not easily machine readable. Ulf showed some extensive logging examples (jobparsing.. will be fixed asap).
ACTION: Ulf sends a collection of the most annoying log examples.

10) Config parsers

perl (assigned to Florido):
------
ConfigCentral.pm
IniParser.pm: candidate for rewrite. Use the python parser.
XMLparser.pm: drop
ConfigParser.pm: drop
arc-config-check: plan to use the common parser
Perl scripts will call arcconf_parser.py and consume its output (could be JSON).
nordugridmap: separate thing, don't touch

bash (Christian, Aleksandr):
-----
gridftpd.init
arex.in
config_parser_compat.sh: will rely on arcconfig_parser.py
config_parser.sh: drop

c++ (Aleksandr):
-----
class configsection

python:
--------
acix parser: will use the common python parser
python-ssh scripts also have their own parser, rather simple, in addition to getting their data from perl.
full-scale config parser, for all info in arc.conf: arcconf_parser.py (Martin & Christian)

IMPORTANT: all arc.conf changes are to be done with the new parsers!!!!

11) Planning for the next major/minor release
- 5.1.2: quick bugfix release(s) for the EMI-ES roll-out. The scan-slurm-job bug fix will be part of it too. Time: before end of June?
- 6.0.0: the next major, backward-incompatible release
-- new arc.conf
-- event-driven A-REX
-- python backends
-- new RTE stuff
-- VO job numbers
-- per-queue authorization
To be deprecated, removed from 6.0.0:
-- to be collected later
Time planning: first release candidate: beginning of December 2016
- 5.2.0: minor release half way. Some of the major items will be released as tech preview already in 5.2.0. Date: autumn 2016.

12) There was no time left for the following topics:
- When can we retire the GIISes?
- Resource discovery revisited
- Simple ARC CE configurations (to be done already on the current config, in September)
- crazy brainstorming ....

13) Next meeting: a developer week/camp sometime in September.
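
Sketch for item 7 (hanging external processes), not something agreed at the meeting: one generic way to keep a slow helper script from running forever is to start it with a hard timeout and kill it when the timeout expires. The snippet below is only an illustration of that idea. It assumes Python 3.5+ (for subprocess.run), the function name run_with_timeout is made up for this example, and a sleeping child process stands in for a slow script such as ceinfo.pl. It does not reflect how A-REX actually launches its helpers.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (exit_code, timed_out).

    Hypothetical helper for illustration only. On timeout the child is
    killed and (None, True) is returned.
    """
    try:
        # subprocess.run kills the child and waits for it if the timeout
        # expires, then raises TimeoutExpired.
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return None, True
    return result.returncode, False


if __name__ == "__main__":
    # Stand-in for a slow helper script: a child that sleeps for 30 s.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow, timeout_s=1))

    # A well-behaved helper finishes long before its timeout.
    fast = [sys.executable, "-c", "pass"]
    print(run_with_timeout(fast, timeout_s=10))
```

With a timeout shorter than the restart interval, a helper started this way cannot still be running when a restart launches the next instance. Choosing the timeout value per script remains the open question noted under item 7.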