I just saw an infographic comparing the size of various programs in lines of code, the largest one of those being the healthcare.gov site at 500 million lines of code. The source they used seems to be this NY Times article that states

According to one specialist, the Web site contains about 500 million lines of software code. By comparison, a large bank’s computer system is typically about one-fifth that size.

While I have no problems believing that the whole thing is rather large, this number seems wildly implausible to me. Is there any hard information (i.e. not from unnamed sources) on the size of the codebase?

Mad Scientist
  • [Check for yourself.](https://www.healthcare.gov/developers/) (I know the official GitHub repo was pulled because of a flood of political false 'bug reports', but I'm pretty certain a mirror was put up. I'm unable to locate it ATM from my phone though.) – LessPop_MoreFizz Nov 11 '13 at 21:14
  • https://github.com/Conservatory/healthcare.gov-2013-10-01 – ChrisW Nov 11 '13 at 21:15
  • Can you say whether you're asking about "the web site", or might you also be asking about its "back end"? Given that it ties together data from multiple insurers, might you be asking about the aggregate size of the software of those several insurers? – ChrisW Nov 11 '13 at 21:18
  • @ChrisW My own view of the best metric would be the size of all the code that was written specifically for the Affordable Care Act, including the back end, but not e.g. the systems themselves that it integrates with. – Mad Scientist Nov 11 '13 at 21:23
  • @ChrisW that does not look like a full set of sources, nor is it official :-) – Sklivvz Nov 11 '13 at 22:04
  • @Sklivvz I took it that that is (or might be) the [web site front end](http://en.wikipedia.org/wiki/Healthcare.gov#Development_and_history): "The front-end of the website was developed by the startup Development Seed, and the website source code had previously been officially posted on GitHub, although this was later taken down.[3] The back-end work was contracted out to CGI Federal Inc., a subsidiary of the Canadian IT multinational CGI Group, which subcontracted the work to other companies as is common on large government contracts.[6]" – ChrisW Nov 11 '13 at 22:12
  • Anecdote: I worked on a project which simulated some other software. Each release of the target software was imported into version control, patched in specific (and routine) ways, and re-built, retaining all previous versions too. The developers were very cynical about Lines of Code being used as a productivity metric, and would happily demonstrate that one junior developer had added a million lines of code to the source base that month, while the tech lead had (quite productively) reduced the code base size by a couple of thousand. My point? LoC is gameable to be whatever you want it to be. – Oddthinking Nov 13 '13 at 01:54
  • @Oddthinking As a productivity metric they are lousy, but at the large scale they do serve some fairly useful purposes such as being able to quantify the size of a large project and at the large scale, LoCs can be used to make reasonably accurate estimates about how long a project might take. – rjzii Nov 13 '13 at 12:28
  • @Oddthinking - you're just saying that to get onto Joel's good side :))) – user5341 Nov 15 '13 at 02:16

1 Answer

("Yeah, but to be fair, 400M lines is just the JSON file with all our names and addresses...").

The 500MLoC number almost certainly originated in the NY Times on 10/21:

According to one specialist, the Web site contains about 500 million lines of software code. By comparison, a large bank’s computer system is typically about one-fifth that size.

I say "almost certainly" because prior to that date I was researching the software development process for my column, and that number never came up and is not in any of the major primary sources (the most important of which is this GAO report from June). I can't prove that it was never used prior to that NYT quote, but the NYT article was where it gained traction.

CGI has not released the source code for their work. The GitHub repository that was briefly visible, and of which there are a few forks, was extremely minimal and contained, at most, front-end aspects.

The OP is right to be extremely skeptical about the 500MLoC number, which would represent:

  • Complexity vastly greater than reasonable (as the comparison in the question shows),
  • Productivity vastly greater than reasonable, since the contracts were only awarded over the past few years. (Source lines of code per month is a terrible measure of software productivity, but since this site puts a premium on references, this is one of many references that speak to a rate in the low hundreds of lines per staff month for larger projects.)
  • Coordination vastly greater than reasonable (per-developer software productivity goes down with complexity and team size: yet somehow this extremely troubled development process led to 500MLoC of production?)
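
To make the productivity point concrete, here is a rough back-of-the-envelope sketch in Python. The rate of 250 lines per staff month and the 36-month schedule are illustrative assumptions of mine (stand-ins for "low hundreds" and "the past few years"), not figures taken from any source.

```python
# Back-of-the-envelope check of the 500MLoC claim.
# ASSUMPTIONS (illustrative only, not from any source):
#   - 250 LoC per staff month, standing in for "low hundreds"
#   - 36 calendar months, standing in for "the past few years"

CLAIMED_LOC = 500_000_000
ASSUMED_LOC_PER_STAFF_MONTH = 250
ASSUMED_CALENDAR_MONTHS = 36


def staff_months_needed(total_loc: int, loc_per_staff_month: int) -> float:
    """Total effort, in staff months, to write total_loc at the given rate."""
    return total_loc / loc_per_staff_month


def headcount_needed(total_loc: int, loc_per_staff_month: int,
                     calendar_months: int) -> float:
    """Average number of full-time people needed to finish on schedule."""
    effort = staff_months_needed(total_loc, loc_per_staff_month)
    return effort / calendar_months


if __name__ == "__main__":
    effort = staff_months_needed(CLAIMED_LOC, ASSUMED_LOC_PER_STAFF_MONTH)
    people = headcount_needed(CLAIMED_LOC, ASSUMED_LOC_PER_STAFF_MONTH,
                              ASSUMED_CALENDAR_MONTHS)
    print(f"Effort: {effort:,.0f} staff months")
    print(f"Average headcount over {ASSUMED_CALENDAR_MONTHS} months: {people:,.0f}")
```

Under those assumptions, 500MLoC works out to 2,000,000 staff months, or more than 55,000 people writing code full time for three years. Different assumed rates change the details but not the conclusion.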

What may be more reasonable is that there might be 500MLoC that was generated, i.e., data initialization. Or it may be that this was simply a quote that was more about adding color than being the basis for analysis:

  • In the NYT article, you'll see that it's the last paragraph and is not followed up.
  • The statement, "the Web site contains about 500 million lines of software code" is, at best, sloppy (the site itself is more on the order of a few tens of thousands of lines, but clearly the back end was where the vast majority of effort and cost was expended).
  • "By comparison, a large bank's computer system is typically about one-fifth that size" is, although not outrageous, also a fairly dubious quote. Large banks don't have a single "computer system," accumulate their codebases over decades, and don't open their codebases for analysis of what is and is not "typical."

UPDATE 2014-05-22: Reddit user agenaille claims to have worked on healthcare.gov and says that an automated tool produced the following counts:

Language     files    blank    comment       code
Java        13,481  419,643    847,982  2,399,683
HTML         1,635   50,124     16,845    515,494
Javascript   1,631   56,298    102,140    322,192
XSD          5,227    1,238     20,945    156,696
XML            659    6,436     13,073    136,827
CSS            205   14,000      9,420    109,815
Maven          275      737      1,421     47,449
XSLT           383    2,357      1,476     21,624
Bourne Shell   248    2,305      1,446      8,830
SQL             28      860        139      8,487
JavaServer Face 35      766          0      3,770
DOS Batch       48      235        118        849
Ant              8       77         45        810
Perl            18      161         45        646
Visualforce Com 39        0          0        626
Groovy           4       68         15        361
Python           5       55         90        263
Visual Basic     1        3          0         25
DTD              1        8          0         17
JSP              3        0          0         13
ASP.Net          1        0          0         11
SUM         23,935  555,371  1,015,200  3,734,488

These numbers are still quite high (as @ChrisW mentioned in the comments, http://www.compaid.com/caiinternet/ezine/reifer-benchmarks.pdf suggests that software development costs at least $20/LOC), and the variety of languages suggests that this codebase might be a catch-all rather than a snapshot of what is in production (48 DOS batch files? Fascinating!). Nonetheless, at more than two orders of magnitude less than the 500MLoC number, it's a far more believable volume.
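
For anyone who wants to check the arithmetic, here is a small Python sketch. It takes the $20/LOC figure and the SUM row of the table at face value; the function names are just mine for illustration.

```python
import math

# Figures taken from the text above; treated here as given, not verified.
CLAIMED_LOC = 500_000_000   # the NYT "500 million lines" number
COUNTED_LOC = 3_734_488     # "code" column of the SUM row in the table
COST_PER_LOC_USD = 20       # lower bound cited from the benchmarks PDF


def implied_cost_usd(loc: int, cost_per_loc: int = COST_PER_LOC_USD) -> int:
    """Development cost implied by a line count at a flat per-line rate."""
    return loc * cost_per_loc


def orders_of_magnitude_apart(larger: int, smaller: int) -> float:
    """Base-10 logarithm of the ratio between two line counts."""
    return math.log10(larger / smaller)


if __name__ == "__main__":
    print(f"Implied cost of claimed size: ${implied_cost_usd(CLAIMED_LOC):,}")
    print(f"Implied cost of counted size: ${implied_cost_usd(COUNTED_LOC):,}")
    print(f"Ratio claimed/counted: {CLAIMED_LOC / COUNTED_LOC:.1f}")
    print(f"Orders of magnitude: "
          f"{orders_of_magnitude_apart(CLAIMED_LOC, COUNTED_LOC):.2f}")
```

At $20/LOC the claimed 500MLoC implies $10 billion of development, against roughly $75 million for the counted 3.7MLoC. The two counts differ by a factor of about 134, which is just over two orders of magnitude.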

Larry OBrien
  • Just speculating but that 500MLoC could also include all of the additional library code and what not that is required to run the site, but even then that still seems high. Also, automatically generated code (i.e. JSON objects) might be factored into that number and those can easily get to be quite large even if the "hand-written" code is more manageable. – rjzii Nov 13 '13 at 01:02
  • Yes, but typically when assessing the size of systems, it would be considered misleading, or even deceptive, to include libraries and automatically-generated code. Ultimately, the number is wildly unreasonable, and the context in which it appears makes me think the quote was added late in the editing and was not vetted for reasonability much less fact-checked. – Larry OBrien Nov 13 '13 at 01:12
  • True, but if we are talking about politicians then it is entirely possible that they might have creatively inflated the number. – rjzii Nov 13 '13 at 02:06
  • Can you justify your "productivity and coordination" assertion, by quantifying in any way how much effort was put into the development? I'm unsatisfied with this answer because it says no more than we (software developers) had already guessed: it doesn't add any facts/data for consideration. – ChrisW Nov 13 '13 at 14:14
  • I agree that s/w devs will intuitively know the number to be highly suspect, but perhaps other readers won't. The one real thing that I think is important is my assertion that the 500MLoC number started with that quote and there hasn't been any validation of it, which I feel quite sure about, but which I can't prove. As far as actual effort levels, that's extremely opaque, which is hardly surprising, given Fed procurement processes. – Larry OBrien Nov 13 '13 at 18:25
  • ... We don't even know back-end tech stacks, dev processes, etc.. Ultimately, I'd argue that chasing down a precise LoC number is a red herring: again, s/w devs know LoC tells little. The 500MLoC number supports stereotype of inept, bloated bureaucracies, as does the "Periodic Table of Build Streams" (http://cdn.washingtonexaminer.biz/cache/r620-648499f36a67595d26f67cf181fe1345.jpg). Worthwhile reaction would be push towards procurement reform, process transparency, Open Source, etc., – Larry OBrien Nov 13 '13 at 18:39
  • Could you quote http://www.compaid.com/caiinternet/ezine/reifer-benchmarks.pdf saying that software development costs at least $20/LOC, that 500MLOC would therefore cost at least $10 billion, however they've only spent $300 million? – ChrisW Nov 14 '13 at 00:22
  • Maybe some of the LoC were comments and nonbillable? :) – Some Freemason Nov 14 '13 at 04:16
  • 'don't open their codebases for analysis of what is and not "typical"' - and yet, that's easy enough to check for a newspaper. Track down someone who worked in a senior IT role (preferably software engineering/architecture team) in several banks over their career, based on resumes/LinkedIn, and ask them for estimated LOCs at an average bank they worked at as a consulting gig. – user5341 Nov 15 '13 at 02:19
  • Data such as JSON files should not be counted as "programming code"! – Deepak Kamat May 22 '14 at 16:27
  • Hey. I'm not really familiar with this site so I wouldn't know if this should be an answer by itself, but it looks like someone on the team has done an actual count and it's more like 3MLOC: http://www.reddit.com/r/dataisbeautiful/comments/265yns/million_lines_of_code/cho3xhz – Jeremy May 22 '14 at 21:41
  • @JeremyBanks Thanks for that! Incorporated that data into an update (it's still just "some guy on the Internet" making the claim but the quirkiness of the data "smells" believable to me. 1 VB and ASP.NET page, a couple dozen .BAT files...) – Larry OBrien May 22 '14 at 23:48
  • I'd believe 500M lines of assembly generated by the compiler – slebetman May 23 '14 at 04:21
  • On the other hand, most compilers I use compile down a 1M line C code to 800k executable which equates roughly to 200k lines of assembly (assuming 4 byte average instruction length) – slebetman May 23 '14 at 04:22
  • "the variety of languages suggest that this codebase might be a catch-all rather than a snapshot of what is in production" There's no "might" about it; Apache Maven and Apache Ant, for example, are project management/build tools (and they use XML, so I'm not really sure why they're listed separately...) – Brian S May 23 '14 at 06:26
  • @BrianS It was posted as "the results of a little tool", which most likely just counted files by file extension. So *.xml -> XML File, *.pom -> Maven. I don't think anyone did advanced categorization; they just translated file endings to names. – Falco Feb 12 '15 at 16:33