I’m going to do something very dangerous – talk about specific raw, unanalyzed, and likely inaccurate statistics.
But I don’t know how else to combine radical transparency with my work of tracking and analyzing community health.
Does that mean someone is likely to read my posts and cherry-pick information that serves their own agenda? Perhaps. But I don’t choose to be open and transparent because it’s the risk-free position. I do it because it’s the better way to do things.
Do I expect some healthy criticism of my methods and results? Yes, I encourage it, and figure criticism will be more accurate and reasonable when the critic can view my methods transparently. I know I’m doing things incorrectly and getting things wrong. I need the eyes and help of others who care as much as (or more than) I do.
Part of the better way is that I’m doing work on tracking community health directly in the communities themselves. The tools and data already belong there, so will the gathered statistics and (at least) base analyses. There is no opportunity or reason for me to keep all this a secret, but it is going to be a while until a coherent story unfolds. In the meanwhile, I’m making the proverbial sausage but without a recipe.
A website we’ve been working on at Red Hat is going live next week – and I’ll announce it when it goes live, but give us our rare Big Reveal without tittering, thankyouverymuch – and one part of that work is going to be based on these statistics I cooked up about Fedora Packaging today. At a minimum, this post is a reference for how those statistics came about.
Thankfully Toshio still had the script he wrote with Max a few years ago (2007!), and with some tweaks to the interaction with the Fedora account system (FAS), I got these raw, unanalyzed, and likely inaccurate statistics about who owns and co-maintains Fedora 17 packages.
Why am I so certain they are inaccurate? Simply, folks at Red Hat who work on Fedora don’t always use their @redhat.com email address. Unfortunately, this tool creates statistics based solely on the email address of the packager. We don’t have anything more than a wild guess about how many Red Hat people use a different address (most likely user@fedoraproject.org but others from personal domains and mail hosting services.)
Is there anything I can do to make them more accurate? I need to make a mapping of the different email addresses Red Hat folks use, mapped to their main Red Hat account. This will help with sorting out this sort of detail. (Currently 659 accounts to check, not a terrible manual research job, just tedious.)
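A minimal sketch of what such a mapping could look like, in Python. The alias entries are invented placeholders rather than real accounts, and the helper names `canonical_email` and `is_red_hat` are hypothetical; none of this is taken from rhpkgers.py.

```python
# Hypothetical alias map: alternate address -> main Red Hat address.
# These entries are made-up placeholders, not real accounts.
ALIASES = {
    "alice@fedoraproject.org": "alice@redhat.com",
    "bob@example.org": "bob@redhat.com",
}


def canonical_email(email, aliases=ALIASES):
    """Return the main address for a packager, ignoring case and padding."""
    email = email.strip().lower()
    return aliases.get(email, email)


def is_red_hat(email, aliases=ALIASES):
    """True if the address, after alias mapping, is in the redhat.com domain."""
    return canonical_email(email, aliases).endswith("@redhat.com")
```

With a map like this, an address such as `alice@fedoraproject.org` would be counted on the Red Hat side, while any address missing from the map is still classified by its own domain.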
Why am I so concerned with who owns or maintains what packages? Aren’t Red Hat people community contributors, too? Darn tootin’ they are, but that’s not the point. I know people look at such statistics as a competition, but that’s not my goal. These statistics are all out there in public; I’m just doing the job of gathering them together. I’m doing that job because it’s one useful way of knowing if projects are being successful. We all want to see a steady, sustainable growth in packages in Fedora, with a healthy balance of package ownership so that one person or one organization isn’t taking on too much for itself. I’d like to be able to give a more accurate account of who at Red Hat contributes to Fedora – I’m sure even Red Hat doesn’t know exactly how much effort goes into Fedora (and other upstreams) from Red Hat folks.
What is my future plan for these specific statistics? I feel responsible for reporting these, now that I’m starting. I recall Simon Phipps once wisely saying, “Don’t start reporting on any statistics that you don’t want to report on forever.” I accept that, and that any changes to the reporting – the methods, sources, results, etc. – need to be highlighted and explained. (I also hope that by putting tools and methodologies directly into the projects, others can be involved in the creation and delivery of statistics and reporting.) I’ll link out to everything I do from the canonical Fedora statistics page, and I’ll host tools, configurations, and documentation on Fedora Hosted and the Fedora Project Wiki. Anything that is generic will also get contributed to the Metrics Working Group (metrics-wg) of TheOpenSourceWay.org.
And now, the stats and how I got them:
    ./rhpkgers.py f17 maint.list users.list

    Total Packages: 12157
    Total RH Maintainers: 386
    Total NonRH Maintainers: 659

    These stats are for people able to commit to the package.
    The first set disregards packages which are open for anyone to commit:

    Packages which have at least one Red Hat maintainer and packages which have at least one non-Red Hat maintainer:
       @redhat.com: 5330
      !@redhat.com: 8575

    Packages which have solely Red Hat maintainers, solely non Red Hat maintainers, and a mixture of both:
      solely  @redhat.com: 3490
      solely !@redhat.com: 6735
      mixed redhat+!redhat: 1840
         orphaned packages: 92

    This set factors in the possible effects of open acls (ie: anyone in cvsextras can commit:

    Packages to which only Red Hat packagers can commit, only non Red Hat packagers can commit, or both:
      solely  @redhat.com: 0
      solely !@redhat.com: 0
      mixed redhat+!redhat: 12157

    Total Packages to which anyone can commit: 12154
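The first set of figures above can be cross-checked with a little arithmetic. The sketch below is in Python; the `classify` function is only an illustrative guess at the kind of logic involved, not the actual code of rhpkgers.py, and treating a package with no committers as orphaned is an assumption.

```python
def classify(committer_emails):
    """Place one package in a category based on its committers' addresses.

    Assumption: a package with no committers at all counts as orphaned.
    """
    if not committer_emails:
        return "orphaned"
    is_rh = [e.strip().lower().endswith("@redhat.com") for e in committer_emails]
    if all(is_rh):
        return "solely_rh"
    if not any(is_rh):
        return "solely_non_rh"
    return "mixed"


# Figures copied from the script output above.
TOTAL_PACKAGES = 12157
SOLELY_RH = 3490
SOLELY_NON_RH = 6735
MIXED = 1840
ORPHANED = 92
AT_LEAST_ONE_RH = 5330
AT_LEAST_ONE_NON_RH = 8575

# The four exclusive categories should add up to the package total.
assert SOLELY_RH + SOLELY_NON_RH + MIXED + ORPHANED == TOTAL_PACKAGES
# Each "at least one" count should equal the matching "solely" count plus "mixed".
assert SOLELY_RH + MIXED == AT_LEAST_ONE_RH
assert SOLELY_NON_RH + MIXED == AT_LEAST_ONE_NON_RH
```

All three checks hold for the numbers reported, so the categories are at least internally consistent, whatever their accuracy with respect to who actually works at Red Hat.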
Steps:
- Get a list of all maintainers for all versions of Fedora (8.2 MB file in the end):

      curl 'https://admin.fedoraproject.org/pkgdb/lists/vcs?tg_format=plain' \
        > maint.list

- Get a list of all active user accounts (61201 accounts, 3 MB file); substitute the USERNAME and PASSWORD with those of any active FAS user (I used my own):

      curl -d 'user_name=USERNAME&password=PASSWORD&login=Login' \
        'https://admin.fedoraproject.org/accounts/group/dump' > users.list

- Run the script, specifying the package version:

      ./rhpkgers.py f17 maint.list users.list

I will be getting a git repo started on Fedora Hosted soon to put relevant bits in, and I’ll update this post with a link to that repo when ready.
Very cool. In a couple months, you might want to check out ianweller’s up and coming work on a super mega stats DB for fedora — https://fedoraproject.org/wiki/User:Ianweller/statistics_plus_plus
I think there are 2 small issues (both of which apply to me):
– people whose main job at RH is not working on Fedora
– people who started to help Fedora before being hired
on the other hand, that may be statistical noise 🙂
Awesome stuff, quaid. Never apologize for science.
Fantastic, thanks for the pointer. I’ve worked with Ian on early versions of datanommer et al, we’re definitely moving in the same direction. I’ll go get reacquainted with statistics++.
Right, those are exactly the situations I’m familiar with. I worked on Fedora until 2007 before it became any part of my job. Ironically, what I spent most of my time on in Fedora was almost never related to my job-work-for-Fedora, unless sometimes it was! It all gets so muddled …
I think the nuances you mention would matter statistically if I were trying to track how Red Hat spends payroll down to small percentages of a full-time person. That is something that really is between a manager, a team, and an individual. Even where you may have some Fedora duties as part of your job role, it can be very fluid how much time it takes, and you may get interested and pulled into other parts of Fedora outside of that, again on a fluid basis. I think that overall flow is extremely important to the value and success of Red Hat in both producing free/open source software, and in making a business out of it.
Ultimately, my goal is to create something that gives people who take it on themselves to care about the health of their community – people like you and me – a dashboard-like view into the kind of output of stats++ https://fedoraproject.org/wiki/User:Ianweller/statistics_plus_plus, with associated content to help analyze and act on what the metrics show.
Word, brother. Just battling the inner “what-if” demon.
Awesome!!!!!!!!!!!!!!!!
I’m really excited by your effort here.
I would love to be able to mature the methodology and trending to the point where we can start to identify bottlenecks and some outreach goals.
“… identify bottlenecks and some outreach goals.”
Exactly! I’m pretty sure we can do all sorts of automagic to help flag potential bottlenecks (or downright trouble), and it’s ultimately about giving us humans a chance to reach out to each other.
It’s not really like a system monitoring setup, humans aren’t like servers. But human interaction – especially creating free and open source software on the Internet globally – is a social network with a lot of programmatic interfaces. Now we’re starting to get to something we can monitor and analyze, a network with API calls flying across it. 🙂
Unfortunately, I’m not sure it is possible to tell whether an @redhat.com maintainer does this as part of his job or on his own time.
Another hard-to-count factor is package importance, since not all packages are equal: it is different when one person maintains the *whole* of LibreOffice and another a set of TTF files (for example).
And yet another thing which can’t be aggregated into the official statistics are packages available elsewhere – and I am talking here about things like RPM Fusion, practically but not officially they *are* Fedora.
> “I need to make a mapping of the different email addresses Red Hat folks use, mapped to their main Red Hat account. This will help with sorting out this sort of detail.(Currently 659 accounts to check, not a terrible manual research job, just tedious.)”
Can’t you just look at the groups in FAS?
For example, for Red Hat employees contributing to Fedora there is:
https://admin.fedoraproject.org/accounts/group/view/cla_redhat
(there are equivalent groups for Dell, Intel,…)
That seems much less tedious, and much more future-proof (in case somebody gets hired for example).
I’m not sure it matters if an @redhat.com maintainer is doing something as part of a job role. Check my other comment on this – I think it’s essential for the business that we not care but encourage people to participate when they can, as they see fit, regardless of job role.
Relative importance is relative. To some folks, a set of TTF files is more valuable than all of LibreOffice. I’d rather not try to track that, either.
I’ll leave RPM Fusion up to someone else, for now. 🙂
Kevin Fenzi just confirmed this for me on IRC the other day – the ‘cla_redhat’ group was never universally used, and it’s now deprecated. Everyone is required to agree to the new FPCA, regardless of employment contract. It’s actually been a few years since ‘cla_redhat’ was used, so it’s probably very wildly out of date.
No matter how you try to peel back more info about paid versus volunteer time, the important thing is to always try to draw conclusion statements from the raw data that are either strong upper or lower bounds. The current inability to get a good handle on the split of volunteer hours versus work hours will affect how you represent the boundary conditions for your conclusions.
You can then cycle back with something like a tack-on survey of Red Hat-employed contributors and try to develop aggregate information about how often people are doing off-the-clock versus on-the-clock work. Obviously it will differ from person to person, but you might be able to stand up an average to apply in aggregate to the subgroup, as a layered approach that takes your strict upper/lower bound statement and turns it into a statistical statement with a quantified average and stddev.
-jef
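The bounding idea can be illustrated with the figures from the post. This is a minimal Python sketch, assuming that every @redhat.com address belongs to a Red Hat employee, so that misclassification only runs one way (Red Hat people hidden behind other addresses). The variable names are made up for illustration.

```python
# Figures copied from the script output in the post.
TOTAL_PACKAGES = 12157
ORPHANED = 92
AT_LEAST_ONE_RH_ADDRESS = 5330   # packages with at least one @redhat.com committer
SOLELY_NON_RH_ADDRESS = 6735     # packages with only non-@redhat.com committers

# Lower bound: under the assumption above, every package already counted
# as having a Red Hat committer really does have one.
lower_bound = AT_LEAST_ONE_RH_ADDRESS

# Upper bound: in the extreme case, every "solely non-Red Hat" package turns
# out to have a Red Hat person behind a different address.
upper_bound = AT_LEAST_ONE_RH_ADDRESS + SOLELY_NON_RH_ADDRESS

print(f"Packages with a Red Hat maintainer: between {lower_bound} and {upper_bound}")
print(f"Share of all packages: {lower_bound / TOTAL_PACKAGES:.1%} "
      f"to {upper_bound / TOTAL_PACKAGES:.1%}")
```

The upper bound here is simply every non-orphaned package, which is strict but not very informative; a survey or an email alias map would be what narrows the range.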