Technical Coordination Weekly

Europe/Stockholm
Skype


Balazs Konya (Lunds universitet)
Description
Technical coordination group members and invited persons
1 July 2015, 14:02 - 14:53

Present: Mattias, Jon, Balazs
Apologies: Oxana, David

=News
No news from Mattias. 
Nothing from Aleksandr. 
Jon: first package submitted to EPEL. Balazs asked whether the SLURM comparison testing (Python vs. old script) had started yet; the answer is no.
Balazs: reported strange GGUS behaviour and a workaround involving VO-views.


=Cutting through the "Mess Everywhere"

Motivation: the new major release at the end of the year will allow us to do a proper cleanup of the messy ARC areas. In this meeting we started to identify the messiest topics in ARC. Here is the initial list:

- server-side logging: the "state of the art" is almost collected; Jon got lost with respect to the infosys logs
- interfaces: any change will be more difficult for compatibility reasons
- treatment of VOs: publishing, authorizing, discovering and accounting of VO information are all done differently
- arc.conf: non-intuitive, naming inconsistencies all over the place
- naming of ARC components, modules, functionality


=Release status
Jon: We are celebrating a new record for producing a minor release: 23 h 50 min.
Dyma deployed 5.0.2 right after the problem with 5.0.1 was discovered.
No plan yet for new minor release :)
Jon proposed both 5.0.1 and 5.0.2 for the UMD inclusion process ;)
Long discussion about the EPEL testing phase. No conclusion.


=Bugs
-3468    arex excessive logging when infoproviders timeout expires:
Balazs and Florido might propose a radical solution: remove a large part of the code that causes the problem. More info to be posted later.

-3210    CPU time isn't measured correctly for some jobs (e.g. ALICE)
no progress

-3470    Watchdog did not restart arched after segfault
Aleksandr and Jon think more investigation is needed; the problem is unclear.

-3163    Infosystem showing incorrect info on multicore jobs with condor backend
no progress

-2036    infosys not scalable for ~100k jobs
no progress

-3384    Support for per-queue authorisation configuration and publishing
no progress

-3486    External helper log file location is hardcoded to controldir/job.helper.errors
Aleksandr agrees with the proposed change

-3432    bdii-update.log fills up with complaints about dn suffix (REOPENED)
Mattias promises to look at it next week

-3457    Accounting problem with PBS/torque for multi-core jobs (REOPENED)
no progress

-3482    ARC cache service failed to stage data for job submitted via EMI-ES due to proxy issues (REOPENED)
cache-service welcome back :)

=AOB 
David posts the following towards the end of the meeting:
this page was mentioned today, not sure if you have seen it https://twiki.cern.ch/twiki/bin/view/LCG/BatchSystemComparison
The vast majority of European WLCG sites are using Torque
i would like someone to comment on the APEL support row
