No doubt. But can you please not post unsubstantive comments to Hacker News? Especially not ones that violate this site guideline: "Please don't post shallow dismissals, especially of other people's work."
For those who didn't realise above, I was being flippant.
On a slightly serious note though, I wonder how much productivity is lost in the scientific community due to poorly written and documented code?
I've heard stories of 40-year-old Fortran code, written by long-deceased professors to crunch physics numbers or whatever, and when it comes time to modify or add to it, nobody can make head nor tail of it and they have to rewrite it from scratch.
There's a reason why in the non-academic world we have coding standards and code review. Code isn't written in a bubble, other people will look at it and work on it.
That's not to belittle or criticise the work done in the slightest. Cleanliness of code is orthogonal to functionality. You can have beautifully written, clean and documented code that doesn't do what it's meant to, and likewise you can have a complete mess of code that performs some genius function perfectly.
It's a toss-up. On the one hand, there's a loss due to dirty code; on the other, a gain from a smaller group of people being able to do multidisciplinary work. In my own case, I'm a physicist outside academia, and in addition to code, I also do electronics and a variety of other things.
When you're doing exploratory R&D, as I am, there are downsides to getting things done by domain specialists. First, you have to find people with quantitative skills, and they tend to be in the greatest demand due to scarcity. Second, you have to manage the politics of getting them assigned and engaged. Third, you have to manage the interface between specialties. It becomes a project management exercise. And given the way code and project files tend to be structured, it may be possible to read isolated sections of code, but it's very hard for a non-expert to find their way around the myriad files that form a modern code base.
In my own case, I do what I can to write good code. I try to keep up to date on good practices, and so forth. Could we do better? Sure. The quest to improve my coding is how I accidentally bumped into HN in the first place.
Don't worry about these comments. The worst thing in science is usually that the code is not published (and these comments on code quality don't help).
As long as the code is published, somebody who wants to reuse it at least avoids the hardest part: reimplementing it from the paper alone.
I agree with you that the scientific community is way behind industry standards, but the reason is that much less of their code is actually designed for reuse. The overwhelming majority of their work is just "let me try writing this code and see what results I get."
Industry professionals are forced to take the approach of "I need to write this code to be as maintainable and flexible as possible" because they have no idea what the business is going to want next and generally have no set timeframe for how long they may have to maintain any particular project.
A lot of industry code is also glue logic that doesn't express any original idea, which makes it inherently easier to document. Code expressing a novel algorithm is never going to be as easy to document and maintain as code plugging standard libraries together. Notably, code in "industry" that does express novel algorithms is often not so easy to read either; there just isn't much of it on most projects.
There are efforts to integrate more industry-standard software engineering practices into research (the phrase "research software engineering", or RSE, is growing in popularity).
A lot of academic code is also written by students. For instance, I'm working on a project that ends up with code written by 6 master's students. I'm trying desperately to get them to use Git or some other kind of version control rather than emailing me files, but it's only been partially successful. My last CS class per se was 18 years ago, but at least I've been paid to program in a production context, in a company that has to make money to justify its existence -- these students don't know (&(^ about programming. And since they learned C++ first but we're programming in R or Python, there are some ridiculous and unnecessary maneuvers and lots of for loops. I try to work through the code with them, but I don't have time for all of it, since I'm also teaching several classes, etc. Sometimes it's easier to go with the crap I've got (which I've tested for correctness) than to rewrite things.
If people have good resources I could pass to students about standards for Python code, for instance, let me know.
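To make the "C++ accent" concrete, the pattern usually looks something like this (a contrived sketch; the data is made up):

    import numpy as np

    values = np.random.rand(1_000_000)

    # What a student who learned C++ first tends to write in Python:
    total = 0.0
    for i in range(len(values)):
        total += values[i] * values[i]

    # The idiomatic NumPy equivalent: vectorized, shorter, and much faster.
    total = np.sum(values * values)

The loop version is correct; it's just fighting the language.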
This is an issue in my field (Engineering) as well.
Most people in my field (materials engineering) are not programmers either; they are lucky if they've done one intro course 10 years ago (probably in a language like Java or Visual Basic).
Even then, what gets taught in an intro course at university is not the type of code that is written "on the job". I did two semesters of programming courses when I was at uni (as electives); they were taught in Java and focused on things like object-oriented programming and memorizing facts about "the waterfall model".
There is a pretty big gap between this and my first experience, which was being sat down in front of some 30-year-old Fortran code with no objects, no classes, etc.
The go-to in my org, at least when people are trying to understand scientific code or write their own algorithms, is the 30-year-old "Numerical Recipes" (https://en.wikipedia.org/wiki/Numerical_Recipes) textbook. The explanations in it are the best and simplest I have come across by far.
I know I personally referenced this book heavily when I was writing code in C to do spline interpolation/smoothing. I am unaware of any other reference that covers as many of these algorithms/techniques.
The only other thing I am aware of is the GNU GSL library, which in my experience is harder for beginners to understand - even its example code is "for loop based".
If I had to convert this code to R (which I do know) or Python (which I've never written), I'd probably write it in this loop-based style as well; it's what I know and what makes sense to me and the people in my org who I'd expect to be interacting with my code. (The "engineers can write Fortran in any language" meme is a real issue.)
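(I've never written Python, but I gather that much of what I hand-rolled from Numerical Recipes in C collapses to a library call there - something like this sketch using SciPy, with made-up data:)

    import numpy as np
    from scipy.interpolate import CubicSpline, UnivariateSpline

    x = np.linspace(0.0, 10.0, 11)              # sample points (made-up data)
    y = np.sin(x) + 0.1 * np.random.randn(11)   # noisy measurements

    interp = CubicSpline(x, y)              # exact interpolation through the points
    smooth = UnivariateSpline(x, y, s=0.5)  # smoothing spline; s trades fit vs. smoothness

    x_fine = np.linspace(0.0, 10.0, 200)
    y_interp = interp(x_fine)               # evaluate the interpolant on a fine grid
    y_smooth = smooth(x_fine)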
Maybe someone should write a new textbook on the "modern" way to solve these sorts of problems. If such a thing exists, I am unaware of it, but it would certainly be welcome.
This makes me wonder if universities could employ a bootcamp-like curriculum, with lots of feedback, collaboration and unit tests, and make it available for students in these disciplines. Like how many schools have everyone take writing classes.
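Even a minimal unit-testing habit would go a long way - e.g., in Python with pytest, something like this (a contrived example; the function under test is made up):

    # test_stats.py -- run with `pytest`
    import math

    def sample_mean(values):
        return sum(values) / len(values)

    def test_simple_mean():
        assert sample_mean([1.0, 2.0, 3.0]) == 2.0

    def test_mean_with_floats():
        assert math.isclose(sample_mean([0.1, 0.2, 0.3]), 0.2)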
I think this would be very useful. So useful. I personally haven't been able to get anything code-related through the curriculum committee though (I'm not in CS).
There's a lot of Fortran code in underlying math libraries that is highly optimized, including the Fortran compilers themselves (mainly due to age and the demand to eke out performance).
I worked with an old Fortran codebase at one point, and throughout the documentation (a scan of a typewritten document) there were comments about switching "cards" and "decks"... it took me a moment to realize it was referring to punch cards (and I thought I was old). That also explained why the program was fragmented into several smaller subprograms (so the card reader could handle it), something that would be trivial to handle now. Maybe they were just ready for the SOA and microservices trend.
In academia, the pressure is often on publishing and pulling in funding through grants and contracts. I've done a lot of rapid prototyping in academic research environments, and while writing clean software is always on my mind, sitting down to refactor for cleverer efficiency, or taking time to focus on structure, long-term maintainability, etc., often isn't a priority: it redirects needed cognitive load from the high-level research goal the software has to achieve toward producing production-quality software.
I'm not concerned if it takes twice as long as it needs to, or O(n log n) vs. O(n) time, if I know the target scale is small. I'm not concerned with cleverly avoiding an extra data structure (to reduce space complexity) by doing the operation in place on an existing one with some reasonably complex algorithm. Chances are I might remove this functionality entirely tomorrow, or some student may have to figure it out later on, and I don't want to implement, or explain to the student, the Boyer-Moore majority vote algorithm when brute force O(n^2) time is just fine here and a lot easier for a passer-by scientist/student to adjust/maintain.
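To illustrate that trade-off (a contrived sketch, not code from any of my projects):

    def majority_brute_force(items):
        # O(n^2) time, but obvious: recount occurrences for each element.
        n = len(items)
        for candidate in items:
            if sum(1 for x in items if x == candidate) > n // 2:
                return candidate
        return None

    def majority_boyer_moore(items):
        # O(n) time, O(1) extra space: non-majority votes cancel in pairs.
        candidate, count = None, 0
        for x in items:
            if count == 0:
                candidate = x
            count += 1 if x == candidate else -1
        # A verification pass is still needed; the voting phase is only
        # guaranteed correct when a true majority actually exists.
        occurrences = sum(1 for x in items if x == candidate)
        if candidate is not None and occurrences > len(items) // 2:
            return candidate
        return None

The brute-force version is the one a passing student can pick up and modify without a lecture first.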
I'm aware there are a lot of problems, and maybe my abstraction hierarchies aren't the best; I could probably make something better with more time.
You have some high-level complex process you're trying to represent and translate into a program (maybe a simulation, maybe a complex model or set of models, etc.). You're not always concerned with whether there's a better way to write it, or with making extensive use of all the features of whatever language you need to work in (which you may or may not have experience with, since time is tight and you often have to start from existing codebases). You simply want whatever requires the least time and cognitive load to produce results, so you can keep your eyes on the target of what you're developing.
Later on, when prototypes work (or if you hit performance bottlenecks that stop progress), then and only then do you start refactoring and looking at performance optimization--targeting the biggest bottlenecks first.
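In Python, that first pass is usually just the standard-library profiler - a sketch (the workload function here is a stand-in):

    import cProfile
    import pstats

    def simulate():
        # Stand-in for the real workload (hypothetical).
        return sum(i * i for i in range(10**6))

    cProfile.run("simulate()", "profile.out")
    stats = pstats.Stats("profile.out")
    stats.sort_stats("cumulative").print_stats(10)  # biggest offenders first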
If everything works, then you can focus on overall refactoring and optimization and turning your Frankenstein into a supermodel (if you have resources/money to do that with--good luck), but you typically need a functional proof of concept to even have a chance of securing funding for that step.
If there's no money in the effort moving forward and you decide "well, maybe someone can use this, so let's release it," that typically has to get approval through a technology transfer office, which is always up in arms about protecting potential IP, so the code ends up rotting away on some disk, never to be seen or used again.
If you're permitted to release the IP, you begin wondering how the development quality will reflect on you and your group, especially to those who see it with no context for the constraints under which you produced that miracle of a functional Frankenstein. It's ugly as sin, but it fulfilled the goal of delivering the core research results, and did so as quickly and cheaply as possible.