21 February 2017, 14:04 - 15:02
Present: Maiken, David, Mattias, Anders (after 14:17), Oxana
Apologies: Balazs, Aleksandr
= News
The usual problems on NDGF clusters have two causes: jobs with 1000+ files and the gridftp problem w.r.t. some US sites.
Dima implemented a JURA-to-ES converter as a separate service, but it could actually be implemented directly in JURA, since all the hooks are there.
Maiken is working on ganglia in a-rex; Jens requested histograms of job state changes. Some problems were encountered but are being fixed by Aleksandr.
The ARC update in EPEL (5.2.2) can be pushed to stable, and this should be done (it is better than 5.2.1 anyway).
Oxana is busy writing the Swedish Tier1 funding application.
Anders committed an S3 fix for rawhide, and Mattias started preparing an update.
= Release
There is not much point in having a new release before the a-rex crashes are resolved. David stopped the crashes by removing the dataset from the offending site, but nobody has figured out yet what causes them.
There are quite a few fixes since 5.2.1, though not all of them are relevant for a bugfix release; some rather belong in a minor release. The target for a minor release can be a month or so from now; it will include the new ganglia feature, JURA->ES if it is done, and hopefully a fix for the mutex problem.
The S3 fixes need to be backported. The minor release must also involve the OpenSSL change. If 5.3 introduces a lot of new code for older systems (EL6 and such), it will need some serious testing. Non-RFC proxies should not be supported, even though some services still use only them; however, dropping support for non-RFC proxies amounts to backwards incompatibility.
Bug fixes are now applied to 5.2.x but not to 5.3, so there is a risk of divergence. Everybody agrees that 5.3 needs a lot of testing.
Decision: merge bugfixes into 5.3, run testing and move over to 5.3. For proper testing, a production site needs to try the 5.3 release candidate, something like EGI Early Adopters.
Code freeze is scheduled for 4 weeks from now (March 21).
= Bugs
Florido removed the heartbeat code from everywhere, which probably fixes a couple of bugs.
* 3633: not fixed yet, though it might not be a blocker; superuser processes should be careful when writing into a directory that other users can write into. Aleksandr says it is easier to drop the entire performance logging code than to fix the bug properly. Demoted to critical.
* 3637: Anders might still have a core dump; a backtrace might be useful
* 3636: no progress; access via gridftp to the offending site can be arranged
* 3635: Anders may have fixed it accidentally
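
The general caution noted under 3633 (a privileged process writing into a directory that other users can write into) can be illustrated with a minimal Python sketch. This is not ARC code and not the proposed fix; the function name `create_file_exclusively` and the file names are hypothetical. The idea shown is only that the file is created with `O_CREAT | O_EXCL`, so the call fails if the name already exists, including when another user has pre-planted a symlink under that name.

```python
import os

def create_file_exclusively(directory, name, data):
    """Create a NEW file in `directory`; fail if `name` already exists.

    Hypothetical helper for illustration only (not part of ARC).
    """
    path = os.path.join(directory, name)
    # O_EXCL with O_CREAT fails if the name exists, even as a dangling or
    # pre-planted symlink, so the link target is never opened or truncated.
    # O_NOFOLLOW is added where the platform provides it.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL | getattr(os, "O_NOFOLLOW", 0)
    fd = os.open(path, flags, 0o600)  # readable/writable by the owner only
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
    return path
```

A caller would treat the resulting `OSError` as a reason to skip writing rather than retry under the same name.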