Whenever I can, I strongly recommend not using Jupyter for anything more than the most transient tasks.
I don’t know whether it’s the Data Science culture or Jupyter but there is a big lack of discipline in writing maintainable code in DS and non-existent git support is part of that.
I've always strongly discouraged developing models in notebooks, instead advocating for .py files and keeping notebooks for sanity-checking data.
I don’t have any clever ideas for how we can move past Jupyter but the sooner we do the better.
Jupyter is not great for collaboration with multiple people editing, but with a little bit of order it's perfect for in-person working and presenting that work.
Notebooks can be clean if you follow some rules:
1. Code flow always goes down: holding Option+Enter should execute all cells without any errors. Don't do `x += 1` if `x` is defined further down.
2. All blocks are idempotent: running any block five times should produce the same result as running it once. Don't do `x += 1` unless `x` is defined in that block.
3. Keep block-local variable names short and block-global variable names long. Don't do `x += 1` if you use `x` anywhere else. (A small example cell follows below.)
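For what it's worth, a minimal sketch of a cell that follows these rules (the CSV file and column names here are made up):

```
import pandas as pd

# Block-global input: long, capitalized name, defined once and never mutated below.
RAW_SALES = pd.read_csv("sales.csv")  # hypothetical file

# Idempotent block: recomputed from RAW_SALES on every run, so running it five
# times gives the same result as running it once. Short names stay block-local.
monthly = (
    RAW_SALES
    .assign(month=pd.to_datetime(RAW_SALES["date"]).dt.to_period("M"))
    .groupby("month")["amount"]
    .sum()
)
monthly.head()
```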
Also, the Table of Contents extension [1] is a life-saver for making long analyses workable.
I have a rule that helps with hidden state and out-of-order execution. Once in a while I do a "restart kernel and run all cells." If doing that breaks anything, then I have to fix it. But it also ensures that a notebook is reproducible later on. Of course I don't have things that take hours to run.
It would be nice if there were something that would make out-of-order problems light up, the way that code editors can highlight errors while you're editing. A limitation of "browser as editor" is that it misses out on some of the powerful things that code editors do today.
Another thing is to put things in functions, so temporary variables are disposed of. That's a halfway step to putting things in .py files. The benefit of .py files is not always that Jupyter is bad, but that variable scoping is good hygiene.
> It would be nice if there were something that would make out-of-order problems light up, the way that code editors can highlight errors while you're editing.
Agreed on all points. Notebooks really aren't that hard to maintain. They just require some slightly different rules from standard scripts.
Personally, I like to label block-global variables in capital case (like PEP8 constants) to make them easy to spot. Being formatted like constants also makes me think twice about altering them after instantiation.
I try to follow these ideas, and I test them by occasionally running "restart kernel and run all cells" from the menu, which tends to point out anything I've accidentally moved or run out of order.
For a table of contents, I usually write some markdown up top with links to HTML anchors elsewhere in the page, which themselves also have a link back to the TOC. Works pretty well. Will have to check out that extension.
4. Use a function in each block: define it, then call it with appropriate arguments, and return the block's values. This prevents proliferation of global state, and the functions are really easy to move out to .py modules once things have solidified a bit. (A sketch follows below.)
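A sketch of what that looks like (the file name and columns are hypothetical):

```
import pandas as pd

def summarise_sales(csv_path):
    """Everything temporary stays inside the function's scope."""
    sales = pd.read_csv(csv_path)
    by_region = sales.groupby("region")["amount"].sum()
    return by_region

# The call is the only thing that touches the notebook's global namespace,
# and the function can later be lifted into a .py module unchanged.
sales_by_region = summarise_sales("sales.csv")
```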
> with a little bit of order it's perfect for in-person working
It's not perfect for in-person working because a single person should always keep their work under version control, and they should be able to view meaningful diffs to understand the history.
One of my most upvoted comments says something to the effect of "notebooks bad", so you're preaching to the choir here -- however:
- I work with several people who are purely data scientists, and I lean on "culture" rather than "Jupyter". In some circles, probably influenced by academia, programming is considered to be low status work. You are not going to solve the problem by switching to .py files, even though for most tasks literally anything is better than Jupyter.
That they don't use git, or that git wasn't originally even a concern, is a consequence of that low-status perception. If you pitched something to the developer community and told them "oh, by the way, you can't use git", they'd synthesize tomatoes out of thin air to throw at you.
- I'd carve an exception for stuff that satisfies ALL of the following: (a) is self-contained in a single notebook, (b) has no dependencies on anything non-standard, (c) is demonstrative in nature, or a personal exercise rather than production software. For example, I wrote my solutions to the Advent of Code problems in a notebook and I liked the experience, especially how you could mix math and code.
I would also lean more towards culture and the environment that Jupyter provides only perpetuates it. I think what made me leave Data Science in the end was that I wasn’t driven by work outside of the notebooks (i.e. coming up with a mathematically superior solution) but driven by building ML driven systems as a whole.
I think notebooks are a great way of presenting findings and showing your workings at the same time. If that were their main use, I wouldn't have any issues.
The incentives in academia are different. The objective is to publish, code is important only insofar as it allows you to achieve that goal. This is not to say that academics can't code, but even if you are a professor who cares passionately about making high-quality software, you're fighting uphill, because that's not what you're being evaluated on.
If you want to make the argument about "two kinds of people", I think it's more about A-type/B-type data scientists in the industry. I'm really mostly a developer and not a data scientist, but when I assist in DS tasks I wear a distinct B-type hat, and that informs my perspective. A-type people have different priorities and that's fine; my gripe is when you try to import A-type practices in a B-type scenario.
I think of Jupyter Notebooks as scratch paper on my desk. It's not to archive things, it's for developing ideas. Once ideas are developed, I transfer them to a long-term medium (LaTeX or Markdown document, Python source file, etc).
Yep. I worked in scientific applications, and when developing some new data cleaning and processing pipelines for our hydrology data, Jupyter was phenomenal.
It was easy to use as a presentation, with figures and plots embedded. With controls enabled, you could demonstrate what varying certain parameters would do and pitch proposed cleaning profiles.
I was rather easily able to send a directory and its notebooks/data sources to colleagues in the water sciences team so they could validate my results on their own (they were luckily also familiar with Python and Jupyter), and they caught some minor bugs in the pipeline.
This was all much more collaborative and concise, and I feel Jupyter played a huge part in it.
Once it was done, it and a "final draft" PDF were added to the docs in the repo, and the pipeline was written out into a full application of its own.
Add in `# %%` and it makes the code executable in isolation, like a cell in Jupyter. This means I can write a .py file, execute bits independently, and when I'm ready to check it in, just remove the cell markers.
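For anyone who hasn't seen the pattern, a tiny sketch (VS Code and PyCharm both treat `# %%` as a cell delimiter; the file and columns are made up):

```
# analysis.py -- a normal script, but editable cell-by-cell in the editor

# %% slow setup, run once
import pandas as pd
df = pd.read_csv("big_dump.csv")

# %% fast iteration, re-run as often as needed
top_users = df.groupby("user_id")["spend"].sum().nlargest(10)
print(top_users)
```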
It depends how you use it. When you're still new to the data, using a freeflow text-and-codeblock workflow like jupyter or org-mode really speeds up the exploration phase.
Once you have a consistent set of questions, and methods to answer them, then yes, copy off the relevant chunks into their own scripts and source these when using similar data to bring you up to speed, and modify them to your tastes.
The issue with starting off with an external script initially is the distracting temptation to refine your code so it can be better used with future data, despite not yet having seen or not knowing what that future data is like. The initial "play and explore" phase of an analysis is very important imo, and notebooks really facilitate that.
I agree, Jupyter has its place in supporting exploration and learning. A problem Data Science faces is that the majority of courses don't show Data Scientists how to progress from notebooks to writing robust training pipelines that are reproducible and safe.
i strongly agree with what you are saying about Jupyter, however i strongly disagree about notebooks in general (literate programming)
one of the key things that a good notebook system must allow you to do is to mix something like markup format + LaTeX + source code. writing math-heavy documentation and explanations is simply impractical and limited (readability suffers) if done in comments. jupyter however is severely limited as it is unreadable in its raw format and therefore does not play well with a version control system such as git
instead there is a solution that allows one to do everything jupyter does well, with the additional benefit that it plays with version control really well, i.e. org-mode [1]. the only difference is that instead of using a browser to interact with it, you use emacs. the added benefit to this is that you can also use full-featured key bindings (emacs / vim) and even integrate a language server for auto-completion [2]
EDIT: moreover the list of supported languages in org-mode far exceeds that of jupyter [3] (or did the last time i made this comparison)
There is always a lot of org-mode promotion on here when the topic is interactive notebooks. And I get it, people love it and it solves many of the problems other systems have. But org-mode users need to understand that the one thing holding org-mode back is simply emacs. I know you probably all love it, but everyone else is not interested in breaking their fingers by learning obscure key command chains just to use org-mode. Sorry, but that is just the reality.
If someone can implement the majority of org-mode in a better editor, there might be more users interested. But as it stands, it's just too much of a hassle.
I expect one of the main reasons someone might evangelise Emacs is that defaults don't mean much when it's all configurable. So if you don't like the keys, just bind them to whatever you like. That's like the fundamental ethos of everything in Emacs. Also, CUA-mode exists.
If org-mode wasn't backed by Emacs, it would merely be a markdown substitute hence much less useful. There are many org-mode clones for modern editors like neovim or VSCode, except all they offer is front-end features (highlighting, folding, node manipulation etc). There is simply no reason to use those over a decent markdown editor. So I think you have this backwards; Emacs isn't holding back org-mode, rather much of advanced org-mode features are made possible and distinguished by the fact that it builds on Emacs.
> I expect one of the main reasons someone might evangelise Emacs is that defaults don't mean much when it's all configurable. So if you don't like the keys, just bind them to whatever you like.
Configurability is a strength of a system, but it is not an answer to a difficult learning curve. A user must first understand the system in order to configure it appropriately.
Even at the level of key bindings, the user needs to understand the relative frequency and importance of an operation to choose an appropriate key combination. Universal reconfiguration may even make the system less learnable, if documentation and tutorials can't assume a reasonable default configuration.
In my opinion, configuration is great as one of the final steps of a user's journey, taking the system from something that works to something that sings. It's just the wrong level to sell benefits to beginners.
> Even at the level of key bindings, the user needs to understand the relative frequency and importance of an operation to choose an appropriate key combination. Universal reconfiguration may even make the system less learnable, if documentation and tutorials can't assume a reasonable default configuration.
i have a feeling that people who write these things have never really tried emacs beyond opening it and getting annoyed that ctrl-c/v/x don't work (at first) the way they are used to
emacs is not key-binding-based, it is command-based. if you change a key binding it's not like you can wreck anything, as you can always call the command prompt with M-x and search for the command that you wanted some key binding to perform. key-bindings are just shortcuts to commands, so i think it's best to listen to your fingers, form muscle memory, and then assign them
what are your most basic commands? copy, paste, select, start/end of line/function/class/paragraph/etc, move by word/sentence/etc, save, exit? these are not that many to set to whatever key combinations you want. i wish my browser had at least this level of extensibility
> I know you probably all love it, but everyone else is not interested in breaking their fingers by learning obscure key command chains just to use org-mode. Sorry, but that is just the reality
i'm sorry to burst your strongly held convictions but you can choose any of the following
a) use any key-bindings you like including emacs, vim, cua, or combination of
b) use org-mode without any knowledge of more advanced emacs commands (except basic knowledge of using an editor)
c) drink some milk (gotta have strong bones) and learn how to use the emacs system including emacs lisp and have one of the most advanced computing environments in existence at your service
About 1000 times per day, someone says emacs is too big an ask for org-mode, and someone replies, it's configurable to feel like whatever you're used to.
The latter needs to accept that most users, particularly scientists, reject out-of-hand anything requiring configuration or compilation no matter how trivial.
But it's all moot since org-mode is largely promoted by non-scientists (computer science is not a science), and should a WYSIWYG-inclined scientist ever get past the emacs obstacle, he'll balk at the awkward BEGIN_SRC incantations.
Congratulations for being special, 99% of academia uses Matlab, a simple GUI text-editor or something like Anaconda.
Emacs is such a fundamentally different paradigm from all other IT tools/editors that it just doesn't make sense to recommend a specialized tool with a steep learning curve and non-transferable skills when it's not ubiquitous and more standard alternatives exist without it which do OK. It doesn't matter that emacs was historically the first and everyone else decided to go in different directions, that's just the reality of today.
much like saying that linux has a steep learning curve
Most people do say that. What programmers cannot grasp is that progress is about giving people what they want, not what is rationally best. Most scientists want to knock out a paper or presentation, and make it home on time for dinner.
> I know you probably all love it, but everyone else is not interested in breaking their fingers by learning obscure key command chains just to use org-mode. Sorry, but that is just the reality. If someone can implement the majority of org-mode in a better editor, there might be more users interested.
> If someone can implement the majority of org-mode in a better editor,
If that were easy, then that other editor would be an emacs. There's a reason emacs gets all these modes, and it's because writing extensions is easy. If I want to extend something like VSCode or PyCharm, or whatever, it is a massive undertaking.
> But org-mode users need to understand that the one thing holding org-mode back is simply emacs. I know you probably all love it, but everyone else is not interested in breaking their fingers by learning obscure key command chains just to use org-mode.
We get it - what makes you think we don't? We are merely pointing out a superior solution.
Like back in 2004 I would tell people how many of their problems would be resolved if they switched to Linux. Fast forward two decades later, the statement is still true, and most people still don't use Linux. But it wasn't a problematic thing to point it out to them - be it in 2004 or now.
(It's a lot easier to use Emacs than switch to Linux.)
? i never said that it's a one-size-fits-all solution. certainly you would not write a software package in a notebook. but you might write a tutorial, textbook, academic paper, homework, personal notes, etc.
My point was that a comment against notebooks being overused - where a different structure would make more sense - is not necessarily a comment against literate programming.
The issues with notebooks - in general - are unrelated to literate programming. The notebook format is convenient to have some kind of “interactive” programming though, rather than “literate”.
Have you considered that the notebook is an evolution of a REPL, with improved visualization and feedback, for analysis-heavy work? The problem starts when notebooks are used for development and production.
Have you checked the availability of people with Lisp skills vs most other languages used in research and DS? Or the availability of libraries and pace of evolution? If your solution is to teach Lisp massively and have everyone just build their own tooling, can you explain why it hasn't happened yet, despite Lisp being around much longer than most languages in use in this field, and despite many attempts at what you're suggesting?
I have a sense that there are quite a lot more people with Lisp skills than there are jobs. It should be easy to find Lisp people. And I mean reasonably young people; not like pulling Cobol people out of retirement. Lisps continue to attract new people.
i only said the lisp repl is superior. far superior, actually. whether it's popular is a different matter. there are people perfectly content with subpar technology and that's ok. however it doesn't hurt to know that there are alternatives
Of course you don’t! The notebooks are glorified repls and you can also have literate programming without interactive notebooks. What notebooks get you compared to alternatives is both things at the same time.
my point is similar but restricted to jupyter. i think that org-mode can offer a much more advanced and complete literate programming environment than jupyter, far beyond just markdown + repl
Note how babel is presented, by the way (last point in particular):
Babel augments Org code blocks by providing:
interactive and programmatic execution of code blocks;
code blocks as functions that accept parameters, refer to other code blocks, and can be called remotely; and
export to files for literate programming.
Low quality Data Science code is not a fault of Jupyter.
Jupyter allows you to load a big chunk of data or some large model only once, and then use it for experiments in other cells. It is hard to replicate this feature with a plain `*.py` file. For me, this is the killer feature.
The big benefit of Jupyter in the context of machine learning, is that you are often dealing with models that take quite a few seconds to load. You can put big, slow loading, things into memory in the top cells, then try a bunch of logic with them below. Whereas when working with just '.py' scripts, you'd have to reload the model every time, which can make for slow and uncomfortable iteration.
The way I get around this is to start an IPython interpreter, and run .py files with `run -i file1.py`. This loads things into memory in the interpreter, and then I can run file2.py with the actual analysis and iterate on file2.py until I'm happy. In the end, you can keep the files separate, or combine them into one file that runs your whole analysis top to bottom. As long as you keep the IPython session open, everything remains in memory, just like in a notebook. The autoreload magic also works if you set it to the correct option, so if you are working on a library/package it will automatically reload them if necessary.
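Roughly, the session looks like this (file names invented; `%run -i` is the same magic as the bare `run -i` above, and keeps results in the interpreter's namespace):

```
In [1]: %load_ext autoreload
In [2]: %autoreload 2
In [3]: %run -i load_data.py    # the slow part: loads `df`/`model` once
In [4]: %run -i analysis.py     # the fast part: edit and re-run until happy
```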
Whoops sorry for the late response - somewhat yes. You can configure a Jupyter workflow to work like this, but I don't find that I, or many other people do, as it takes more discipline to not hop around cells.
One of the main differences for me is that the .py file is run in its entirety (outside of if/else blocks for loading data). That usually corresponds to multiple cells of a Jupyter notebook that one would need to Ctrl+Enter through, where missing one would cause a problem.
The second is just how you can decouple the code and the terminal - it's a personal pet peeve that Jupyter notebooks jump around when running through cells - I don't want to be scrolling all around just to reset some variables to their original values, and it's really nice to run a whole .py script and see an output side by side, where the script is much longer than my screen. I can keep it open at the important part in VSCode, and change some intermediate process, and let all of the ad hoc plotting code remain at the bottom.
Finally, the biggest difference for me is how figures behave - the way I have it set up is that they open in their own window and remain interactive (can zoom/pan). I know you can do it in Jupyter as well, but the workflow really emphasizes inline plotting with non-interactive plots, especially when it comes to sharing them. But with the .py script and IPython command line, I can open up 5 figures, tile them however I'd like, and then refer to them by name/number in my script, so I can clear and overwrite them however I'd like, and they don't close or move around. This makes comparing things very easy, like how changing a parameter changes the rest of my analysis.
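Concretely, the figure side of that setup looks roughly like this (assuming an interactive backend such as Qt; the data here is dummy):

```
import matplotlib
matplotlib.use("QtAgg")            # any interactive backend; this choice is an assumption
import matplotlib.pyplot as plt

plt.ion()                          # figures open in their own windows and stay interactive

raw, filtered = [1, 3, 2, 5, 4], [1, 2, 2, 4, 4]   # dummy data

plt.figure("comparison")           # refer to the figure by name...
plt.clf()                          # ...and clear/overwrite the same window on each re-run
plt.plot(raw, label="raw")
plt.plot(filtered, label="filtered")
plt.legend()
```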
Lastly - the way it is set up is more like Matlab... whoops, but I think their workflow is much more ergonomic than a notebook. However, for sharing with other people, I usually just copy and paste the various parts of my scripts into a notebook, as that is the de facto standard.
not a complete solution, but PyCharm and VSCode both support using `# %%` to split a python script into 'cells' (stolen from matlab?), which can then be executed individually/repeatedly
Yes, that's a big plus of notebooks. Hopefully a solution can be found for .py files in the future where you can earmark the top part of the script to be cached so the interpreter skips over it.
Yep, I don't know enough about the interpreter under the hood, but an interactive mode like a debugger, where you can go back to a previous line, might be the solution. I doubt that's high on the priorities of the Python team though.
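In the meantime, one way to approximate that earmarked-and-cached top section is to cache the expensive step on disk, e.g. with joblib (a sketch; the file and preprocessing are made up):

```
from joblib import Memory

memory = Memory("./.cache", verbose=0)   # on-disk cache next to the script

@memory.cache                            # slow on the first run, near-instant afterwards
def load_and_prepare(path):
    import pandas as pd
    df = pd.read_csv(path)
    return df.dropna().sort_values("timestamp")

df = load_and_prepare("events.csv")      # effectively "skipped over" on re-runs
# ...the rest of the script iterates freely below...
```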
1. Experiments in notebooks. Notebooks are saved under git but mostly as a backup, I don't care how nicely they play together. I don't get why you would discourage notebooks for running experiments, doing it with .py files sounds kind of miserable.
2. Services and library code in .py files, under version control, just like any other software we write.
Experiments using notebooks are fine as long as they are well documented.
Having your services and library code as .py files you can import in is great.
The issue comes with how to move from experimentation to deployment. If you already have services/library code as .py files you make your life a lot easier. The issue comes when everything is spread across multiple, poorly documented notebooks. If you're working with an MLOps team it makes their life a nightmare to take those notebooks and conform them into something usable.
Jupyter is great when it is used in the right way.
People use a great tool in a poor way and then broadly condemn the tool.
And any tool that is sufficiently flexible to be broadly useful can be used in very poor ways.
Jupyter is great, it gets me over the barrier potential for starting a task every time. I build and prove out an algorithm/task piece by piece. Once I'm happy, I move the meat of it to a function in a .py file, and move the code I used to test the algorithm to a unit test function. Delete the duplicated bits and replace with imports, and then what remains is a tutorial/demonstrator notebook using the function I wrote and maybe some nice plots to go along with that, that I wouldn't put in a unit test (nor that show up in docstrings). This can be converted to sphinx docs if the code gets big enough.
What a great tool for incrementally building software! In my world, I build brick by brick, not all at once. Jupyter is a key to that process.
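The end state of that brick-by-brick move tends to look something like this (names invented; the notebook keeps only a demo call plus plots):

```
# mylib/smoothing.py -- the "meat" lifted out of the notebook
def rolling_mean(xs, window=3):
    """Simple moving average over a list of numbers."""
    return [sum(xs[i:i + window]) / window
            for i in range(len(xs) - window + 1)]

# tests/test_smoothing.py -- the notebook's sanity checks, turned into a unit test
# (in the real layout this would `from mylib.smoothing import rolling_mean`)
def test_rolling_mean_basic():
    assert rolling_mean([1, 2, 3, 4], window=2) == [1.5, 2.5, 3.5]
```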
> not using Jupyter for anything more than the most transient tasks
While most programmers have reached this conclusion, they're generally not day-in day-out jupyter users. They need to understand *everything* is transient for scientists who optimize for proof-of-concept and publish-and-forget-it paper writing.
> *everything* is transient for scientists who optimize for proof-of-concept and publish-and-forget-it paper writing.
Which itself is a huge problem.
Happily this mindset is changing, at least in some scientific fields. For example, in particle physics proposals a document (a "data management plan") must be written describing how that unconscionable attitude will not be taken with the experiment's data and software. That said, this transient mindset and derision of real software skills is still fairly prevalent in this field.
More "nature of the beast" in my opinion. Science measures itself by how many alluring women it can date; engineering, by how long it can keep the wife happy.
Except for the fact that some experiments are taking decades themselves or are one part of a long progression of related experiments. So continuity of software and data through generations of students, postdocs and even professor types is needed.
Even for short-lived experiments reproducibility is important. So much of today's experiments ultimately rely on complex stacks of software to get their results and on data which humans can afford to acquire only once. Preserving both is necessary for future re-validation or reuse.
Others have mentioned the usefulness of literate programming so I won't reiterate that.
Partially the lack of discipline comes from the implicit data dependencies between cells. Variables are all globally scoped, and unless you ensure the notebook can be run top to bottom it's easy to introduce subtle bugs. I believe Julia's https://github.com/fonsp/Pluto.jl solves this issue quite well.
Another part comes from cells that should really be functions. In my opinion this is because functions are second-class citizens compared to cells, and that could be improved with UI (function cells? node-based programming?).
Programming is more than just manipulating text, so why should tools stop at being fancy text editors?
It all depends on the context. In academia it is a great tool. I can set up a couple of notebooks on our GPU server and give many students access to powerful GPUs without having to worry about shell access etc.
Additionally, they are ready to go and can do interesting things immediately, and they don't have to install the environment on their laptops (which might be Windows/Linux/Mac; that's easier these days, but it's still extra work for them).
I also use it a lot for experimenting, parameter tuning etc. It's not too bad to have it explicitly distinct from production level code. Run/tune/experiment in notebook, once you're happy with the model -> code it up in .py file(s).
Also great for quick presentations :)
However, the fast.ai team is actually doing a pretty solid job running everything off notebooks. So if I wanted to go that direction (and skip the .py files) it's that project I'd look at for how to do it.
Writing maintainable code is basically what defines the role of software developer. Quite a lot of technical roles --engineers, scientists, hell, even economists-- pick up enough programming experience to hack stuff together. And those hacked-together solutions are universally a nightmare to work on afterwards.
Yes, the Data Science culture around maintainable code does seem to be reaching a critical level of toxicity (in some environments at least).
In line with a nephew comment of mine, I feel that bringing the immediate interactivity or iteration cycle of notebooks to the development experience would help a lot, and not be too bad a thing for common development either.
I've heard of the related nbdev project, which seems like an interesting and compelling idea. But it'd be nice to see the reverse: something that makes ordinary python development more immediate than using a debugger/vanilla REPL.
I think that improving the shell experience and allowing e.g. multimedia content to be displayed and manipulated directly into the shell, would help a lot with interactivity. Maybe some specific terminal emulator (like kitty) with ipython would constitute a good starting point...
You're thinking about it the wrong way. Notebooks don't do well in software development, but they are extremely useful for exploratory data analysis and quick iteration when searching for a suitable modeling approach.
These two tasks use code, but for completely different purposes. A DS is working on the data, understanding it and trying to identify what information it may have. Then they try to find a model that will leverage that information to deliver whatever inference solves the business need. This is extremely interactive and iterative, and everything from the actual business problem to the ML approach may change at each iteration. Imposing software development practices at this point is disruptive to the train of thought, which is very burdened already by the level of uncertainty and all the mathematics required to understand the data results. The goal is to find a viable approach, not write production code.
Once this approach is found, a good clean-up/refactor is strongly recommended, to then start a proper software development that will create a live product from the found approach. I call this the switch between research mode and development mode, and it has strong parallels to the way R&D is done in many industries. I believe a lack of understanding of this dual nature of ML is what causes many of the problems in MLOps: plans that don't take into account the research time and risk, mixed teams where engineers don't understand the initial nature of DS work, attempts to put notebooks containing research code in production, etc.
Even planning for the refactor doesn't solve it all - what will happen when the next generation of a model has to be created? Will the refactored code be forced on the DS and ruin their research productivity? Will they start from scratch again and not only lose all the refactor/dev cost but also make this a recurring cost? I have been looking for answers to this for years now, and have found none so far.
Source: I've been working with data for 27 years, as a data engineer, data architect and data scientist. When I do DE, my code is considered high quality by my peers, but when I'm doing DS research, I know I write bad code - and I won't change that. It's more productive to work this way and do the big refactor (possibly leaving the notebook env behind along the way) than the alternative.
I'm always surprised when people advocate for .py files over notebooks because of poor software practice. (Genuine question) have you found that it improves the situation at all?
I've found varied success. In general, I've encouraged the move across alongside teaching source control. That has been in contexts where notebooks are being used for critical outputs rather than exploration.
When you get into MLOps as well, having .py templates actually makes the Data Scientist’s job easier as they can plug and play their models into a system that tracks inputs, outputs and changes for them
If you work with something visual and interactive, then this workflow is so super awkward that I never end up doing it. For a data-driven workflow you have to analyse the data, note down your thoughts, analyse a bit more, and then come to a conclusion. Your conclusion might be code living in .py files, or another type of data then consumed by something else. But this will result in a significant part of the "thought process" and relevant code living in those notebooks, with all their problems. I can't just switch to some .py files because I want to change the axis for some plot, or look at it in log scale. But then where do you draw the line? A .py file for only 10 lines of code generating the resulting .csv? That's also a pain to maintain because you have all those disconnected files. We need those notebooks; they have to get better.
From the beginning of the article: "With nbdev2, the Jupyter+git problem has been totally solved. It provides a set of hooks which provide clean git diffs, solve most git conflicts automatically, and ensure that any remaining conflicts can be resolved entirely within the standard Jupyter notebook environment."
This is similar to my approach with all logic in a .py module and then the notebook is only for presentation, comments, and formatting. This works out fine with everything in git and it’s pretty rare to actually have a conflict in the .ipynb.
dude - more than 1 million undergraduate computer science students worldwide will learn Jupyter this Fall, and you are getting contrarian votes among a bunch of average-of-masters+industry CS people here
"we" have to learn and teach the next sets of people new to computer science
It looks like this tool, nbdev2, solves a real-world problem for Jupyter users, including me, with zero effort required to use it every day. It relies on clever hooks to get git to treat cells as first-class citizens (as opposed to lines of text, the default). Nice! Based on that alone, I would expect nbdev2 to be widely adopted over time. In fact, if it works as well as advertised, it should be incorporated into Jupyter. I, for one, will be giving it a try!
If you use Jupyter to solve problems in your domain of expertise, feel free to ignore all the smart-sounding software engineers who will pooh-pooh this tool only because they don't like notebooks and don't want anyone to use them. No matter what you do, there will always be people who look down on easy-to-use tools that enable scientists and practitioners from other disciplines to write, run, and explore ad-hoc code on-the-fly.
EDIT: nbdev2's authors are on this page, answering questions. Thank you again!
I've used this library for a number of projects and it's a joy to use. I don't think it's an overstatement to say it's paradigm-shifting – to the extent that once you have your environment set up, you are free to code, think, iterate, deploy and document your projects all at 99% of the speed of thought.
There seems to be a lot of discussion in here around the pitfalls of jupyter, and notebooks, and the poor coding practices of data scientists. If you haven’t read the article or used the software I’d like to highlight that all of these (legitimate) complaints are exactly what nbdev2 was created to address, and in my opinion very successfully solves.
The way it works is that everything runs off a master notebook, and then with one command: libraries are built, git diffs are magically fixed, tests are run, documentation is automatically created. It doesn’t fundamentally change your workflow in any way, it just abstracts and automates away all of these pain points.
There’s a reason that everyone uses jupyter notebooks. They are fun to use, they are great for exploring and developing ideas. And (minus the aforementioned git collaboration issues) they are great for sharing with others, which is a huge part of the wider data science ecosystem. We don’t need to recommend avoiding notebooks, and allege they are just for beginners. We need to use tooling which addresses some of these final issues with writing mature software. And I'd like to thank the authors of nbdev for this.
The people who look down their noses at notebooks can continue to do so – but what they will find is that nbdev quite effortlessly leap-frogs over these sneered complaints, and allows you to write better software more productively.
I used to use Jupytext a lot for this problem, and think it does a decent job. The main problem with Jupytext is its reliance on BYOD (D for discipline), which is a poor (but possibly best available) solution for human systems.
IMHO the Jupyter+Git problem stems from the ipynb format. Jupytext does it "right" in the sense that you can work in .ipynb, and diff in .md. But as long as the base format is diff-unfriendly, all tools are methods of indirection of format complexity to tool complexity.
That's not to take away from the tool -- it looks great. It also takes the D out of BYOD, which is a win. But I think "solving" it means that anybody who receives an ipynb is able to just look at it out of the box, like plain text, so we're still a ways off.
I feel like the BYODiscipline problem with jupytext could be solved with relatively rudimentary text-editor plugins.
I've started rolling my own little plugin utilities and, so far, I have a (very) rudimentary notebook-like interface in plain text.
Combine a proper attempt at such a thing with a good interface to a background ipython kernel system (for which ipython could do with some minor enhancements AFAICT), and you'd basically have the best of all worlds (all plain text editor features including version control and personalisation and customisation, and the iterative advantages of notebook code-cell-based runtimes).
Hopefully, with such a combination functioning well, there'd be an emergent feature that allows one to more easily get interactive with a code-base for the purposes of understanding, debugging or developing it.
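For what it's worth, the kernel side of that is already reasonably accessible from plain Python via jupyter_client, which is the same machinery Jupyter's own frontends use (a rough sketch; output routing and error handling omitted):

```
from jupyter_client.manager import start_new_kernel

# Start a background IPython kernel and get a blocking client for it.
km, kc = start_new_kernel(kernel_name="python3")

# An editor plugin could send whatever "cell" of text is under the cursor.
kc.execute_interactive("import math; x = math.sqrt(2)")
kc.execute_interactive("print(x)")   # state persists between executions

km.shutdown_kernel()
```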
Personally, my biggest gripe with Jupyter at the moment is that a few years ago they decided to try to create a quasi-IDE (where they'll probably be beaten by VSCode) rather than improve the general utility of the kernel (or kernel protocol/interface?) and/or the essential notebook UI.
It's a personal gripe, and there's clearly value in the web-first interface they've made with JupyterLab (despite the not insubstantial growing pains that project has faced). But watching ObservableHQ and Pluto (for Julia) focus just on the core notebook interface, while VSCode has focused on the IDE side and easily incorporated or recreated the now rather old/simple Jupyter Notebook interface, both with success, seems like some vindication of my gripe.
By "essential notebook UI", do you mean the old notebook or the lab interface?
The old notebook was painful to extend in JS compared to writing lab extensions in TS. In Jupyter Lab 3 they have taken questionable steps, but so far I have been able to work around issues.
I was referring to the notebook-like "part" of the interface, where JupyterLab is a notebook interface with IDE-like components wrapped around it (file explorer, terminal etc).
The ObservableHQ interface, for instance, I'd classify as just a notebook interface, i.e. individually manipulable code cells with a shared runtime.
And yeah, JupyterLab is better than the classic, IMO. But, until recently I'd say, the notebook part of the interface hasn't gotten much love at all, while there've been steps, due to popular demand it seems, to provide alternative UIs that strip away much of what they've added on top of the notebook (i.e. simple mode, and now JupyterLite).
I haven't really got experience writing extensions for the old Jupyter notebook, and hardly any with JupyterLab, but my experience with JupyterLab was frustrating because it felt like they really killed the ability to implement small and hacky plugins like you could with the old one. This always struck me as a shame. A necessary one perhaps, given what I presume is the increased power of their new framework. But it always felt like there was a mismatch between the complexity of the plugin framework (which is a full web-dev experience) and the base features of the "product", whereas customising my text editor is now much easier, AFAICT.
What issues and questionable steps were you thinking of?
> What issues and questionable steps were you thinking of?
2 things come to mind right now:
Starting with JupyterLab 3 (maybe 3.1), JupyterLab removes query arguments from the URL. Query arguments were the only way I know of to pass arguments from outside JupyterLab into JupyterLab. Any extension that relies on arguments given from the outside would break, just because JupyterLab removes query arguments, which had been there since the beginning and did no harm, at least none I could tell. But suddenly this was taken away, without a proper alternative. Now you have to hook into their "router" to quickly grab those arguments before they are gone. This seems silly to me. Why randomly delete query arguments? They are there for a reason, and since JupyterLab does not add any of its own, I cannot understand this decision. It simply seems to make the tool less powerful.
The constant nagging about posting in their community JS-only forum. ("You should post this in the forum.", "Have you seen this post in the forum? links to forum") Why can this community not handle issues in issues, which can be easily found using a search engine? Why hide everything behind a JS-only forum, which one has to create another account for or associate one's GitHub account with? Whenever anyone gives me a link to the forum, where supposedly the answer to my question is, I keep thinking: "Ah great, why did you have to hide it in there? If you had documented this in an issue, I would have found it via a search engine and the thing would not have wasted my time, nor would I have had to waste yours." -- something along those lines.
When I find an issue and its solution, I still post it as a GitHub issue, so that other people can easily find it without signing up to their forum.
> I haven't really got experience writing extensions in the old Jupyter notebook [...]
I have done that a few years ago, when JupyterLab was still alpha versions. It worked, but the typical JS mistakes plagued me. JupyterLab is of course using TypeScript, which helps a lot with avoiding silly mistakes. However, I do think there is something to what you say about no longer encouraging the quick hack. Some functionality took years to appear in JupyterLab, but was already available for Jupyter Notebook, before JupyterLab took off.
I agree re format vs tool complexity. I don't think Jupyter is a particularly difficult format though; it's mostly light JSON -- all human-readable.
We realised after working with Jupyter+Git for a while that the pain-points were actually with Jupyter editors (and/or their conventions) rather than the format, because they do things like store user-metadata in the file which pollutes diffs and leads to merge conflicts.
In fact, if Jupyter editors could handle merge conflicted files, we wouldn't need a custom merge driver either.
I have been using jupytext as well and it really makes handling notebooks much easier. I think the decision for Jupyter to save to json was not a good one and they should instead have looked at systems like org mode for inspiration.
I also don't understand what you mean by discipline. Yes, you need to make sure that everyone has the jupytext extension installed, but that just becomes part of the needed dev environment. After that, the whole experience becomes completely seamless.
Hi, I'm the author of the git merge driver and Jupyter save hook in nbdev2 :) I'd be happy to answer any questions you have about how we're handling using notebooks with git
Can this do three-way merge? If I have to resolve two conflicting code blocks, it is often useful to know how each of them change the code from the shared parent.
It does an ordinary three-way git merge (treating notebooks as plaintext), then a two-way merge on conflicted bits. We opted for that approach because it's incredibly simple and has worked perfectly for us (I think because we tend to work with small code cells). I think nbdime has a full-on three-way notebook merge if that's what you need, which can be used together with nbdev's Jupyter save hook to clean up unneeded metadata.
Not criticizing the author's approach, but the Jupyter+Git problem has been solved for a long time by Jupytext [1].
Jupytext will convert notebook (.ipynb) files to Markdown (.md) and Python (.py) 'on the fly' (while working in notebooks).
- Markdown files can be added to git
- Python and .ipynb files are added to .gitignore
- Python files allow 'chained' import of notebooks (*.py versions), which allows splitting larger notebooks into multiple smaller ones
This is my folder structure:
.
├── notebooks
| ├── notebook1.ipynb # automatically generated from md
| └── notebook2.ipynb # automatically generated from md
├── md
| ├── notebook1.md # versioned in git
| └── notebook2.md # versioned in git
├── py
| ├──modules
| | ├──__init__.py # empty
| | └──tools.py # use for cross-project base tools
| ├──__init__.py # empty
| ├── notebook1.py # automatically generated from md
| └── notebook2.py # automatically generated from md
├──jupytext.toml
├──.git
└── README.md
See an example here [2]
Jupytext is mentioned as a 'potential' alternative. Re the "save" cell output: I usually produce HTML files at the end of my notebooks (see the example), and add those either to git or auto-upload to an external webserver. The HTML is standalone and includes outputs, table of contents, and images (example [3]). I would advise against versioning all outputs (images) in git.
Very happy with this approach for a long time now. Jupytext increased my productivity by a hundred percent.
The pros and cons of Jupytext are discussed in the linked post. It's a great approach, but wasn't sufficient for our needs -- so for us, at least, it didn't fully solve the Jupyter+git problem.
Specifically, it doesn't handle the situation where you need cell outputs in version control -- since in that case, you still need the notebook, which results in all the usual problems occurring. With nbdev2, you don't need to think about anything or do anything special, and stuff like GitHub notebook rendering, nbviewer, ReviewNB, etc. all just work. You just run a single command (`nb_install_hooks`) and that's it.
Also, no-one has to install anything extra to view your notebooks, since they're stored in the regular notebook format.
I'm not sure what your cell outputs are, but if you are producing plots or images inside your notebook, then I agree with the OP that it is generally not a good idea. You now store binary data inside your git repository (which carries its own problems), but worse, that binary data is mixed into your text diff.
If you do a diff between two revisions where some figure changed, you will essentially be swamped by the diff in the figure, making it difficult to find what actually changed. Now tools like nbreview get around that, but then you're forcing everyone to use the same dev tools, and can't really look at diffs any other way.
It's been a while, but last time I needed to, GitHub at least had really great tooling for diffs between versions of image files.
> but now you're forcing everyone to use the same dev tools
No they're not. You can continue using whatever approach you're using. Attempting to shut down alternatives like this, though, could be seen as forcing everyone to accept the status quo, lowest-common-denominator solution, even if their dev tools could support something better.
It's pretty nice. At first I thought it was just a rebranding of R Markdown but it's been decently modernized/improved to the point where it makes sense that it is its own, separate thing.
Jupytext does a lot more than just fix Jupyter/git integration, which is great if you want to adopt its approach, but a bit too heavy IMO if you don't. The approach mentioned here is extremely lightweight and doesn't use too much more than built-in Jupyter/git functionality (and it all happens automatically behind the scenes)
Subversion used to say, 'CVS done right.' With that slogan there is nowhere you can go. There is no way to do CVS right. -- Linus T.
Jupyter's ipynb format is only slightly more amenable to git than, say, an MSWord doc. Nbdime and friends will never get you to a point where git+jupyter will be worth the ugly.
What are the outstanding problems you feel are there even with the new nbdev2 functionality? Since I've been using it (the prerelease version) over the last few months I haven't come across a single problem, personally, despite doing a very large amount of collaborative notebook work.
Thanks, but a hard pass from me. The original sin was using goofy JSON as the file format (and no! I don't care for your pretty 5MB PNGs polluting my git tree). This is the nth attempt at applying lipstick to the pig (n-1 being jupytext).
> Here at fast.ai we use Jupyter for everything. All our tests, documentation, and module source code for all of our many libraries is entirely developed in notebooks
That sounds like a nightmare. Why would you want to develop a library in a jupyter notebook?
> The solution presented here is the result of years of work by many people.
It's a bit depressing that it came to this. It's hard not to think that it was a mistake from the beginning and that the format should have been based on using special comment markers in valid code, together with an accompanying JSON metadata file. Or something like that. One way or another, we have a very strong tradition of storing code in plain text files, not embedded in strings in JSON or otherwise embedded in any opaque format. Maybe there'll come a day when it's appropriate to abandon that to get some advantages, but I don't think that day was the original creation of Jupyter. I know it was created by thoughtful and expert software engineers, but I feel that it was a mistake and it's actually made a lot of data science / academia-oriented people less qualified to participate in industry software engineering, because of the poor practices forced upon them by the inability to use git with Jupyter, and notions like developing library code in notebook cells.
Why is there a "Jupyter+Git" problem specifically? Why aren't we worrying about the "C+Git" problem and the "XML+Git" problem and the "Python+Git" problem? Because merge markers break, well, every file format.
Is it because Jupyter users in particular don't typically understand that there is a formatted text file behind the notebook, or how merge conflicts work?
It's because the primary editor of the notebook files barfs when presented with the file if it includes merge markers since it's no longer valid json. Imagine if one of your normal code-friendly text editors, or your ide, refused to open a .c or .py file and you had to open it in notepad to fix it.
That's what it feels like to be forced to drop into a normal text editor rather than using the normal notebook ui to fix the conflicts.
You know, is there anything like the Language Server Protocol for diff/merge resolution? Seems like there's an opportunity to build a system for semantic aware merge that's language/format aware and tool agnostic (and auto configurable to boot).
Hmm I'm a huge fan of pijul, looks like the future of change management from where I'm sitting, but no: they have not.
Semantic diffing needs something like pijul, but a system taking advantage of this doesn't yet exist. Pijul avoids some merge conflicts by design, won't do the wrong thing, and handles conflicts correctly: we still need tools with a fuller awareness of what strings mean to have rich semantic diffs.
So is jupytext, rmd and qmd. But what do you do about the output?
The nice thing about markdown-like notebooks is that they play well with git. The nice thing about jupyter style notebooks is that they contain all the content needed to actually _read_ the notebook.
I use this plugin for my jupyter notebook git integration. It has a git diff option that's useful but gets very slow for complex documents. Perhaps under the hood it's using one of the other tools mentioned in the postscript.
Streamlit has completely replaced my usage of Jupyter -- I find it to have the quick iteration speed and visual output of notebooks, but it's just normal python so all the normal tooling works (there is no "git problem") and you don't have the weird state problems of notebooks.
Definitely recommend checking it out if you haven't already!
I haven't used Jupyter but from what I can gather from this article they've built a simultaneous editing system on top of automatically committing to git in the background as multiple people edit things, and using that to share the changes between users.
Do I have that right? Because that sounds /insane/.
No it just uses normal git in the normal way. The simple trick is to use a jupyter-native git merge driver, so that merges are done at a cell level instead of a line level.
Also, unneeded metadata is removed from the notebook when saving, so there are fewer changes to merge.
Both of these things are done using standard hooks built into git and Jupyter respectively. That is: git is written in such a way that it can fully support non-line-oriented formats. We just took advantage of that capability.
Ahh right, so you still make manual git commits normally, it's just that the Jupyter UI used to fall over when it encountered merge conflict markers in source files. And now it doesn't fall over any more and can nicely represent them because the conflict markers are no longer done for individual lines of text?
If this doesn't work for you JetBrains' DataSpell might, it's oriented towards notebooks for teams. It has hiccups, things like ipywidgets don't always work as expected so I sometimes find myself falling back to Jupyterlab. But overall it's a very comfy chair.
I thought jupytext solved this long ago with the percent-formatted python file. Since it's a python text file you can run automated formatting, linting, static type checking and git version diffs. What new problem is being solved here?
Does this work for editing notebooks in VS Code? (Unclear to me where the save hooks reside, and whether you have to edit them through JupyterLab/Notebook.) Any issues if the notebooks reside on a remote server?
This actually works? Awesome - never really thought about how dysfunctional git is with jupyter - I always assumed that it just didn't work. Nice to have someone fix the problem that I just lived with :)
My dumb-as-a-brick solution to this is just to clear all outputs every time I commit. Having a smart diff/merge sounds like it would make things a lot easier.
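For what it's worth, that clearing step is easy to automate with a small pre-commit script using nbformat (a sketch; tools like nbstripout do essentially the same thing):

```
# strip_outputs.py -- call from a git pre-commit hook with the staged .ipynb paths
import sys
import nbformat

for path in sys.argv[1:]:
    nb = nbformat.read(path, as_version=4)
    for cell in nb.cells:
        if cell.cell_type == "code":
            cell.outputs = []
            cell.execution_count = None
    nbformat.write(nb, path)
```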
It's the wrong way around though, Jupyter notebooks break a git work flow. I think the fault here is completely with the design of the Jupyter notebook file format (and the way editors save to it).
I think it's quite unfortunate that they did not consider that the format would integrate well with version control systems when first designing ipython notebooks.
Nah man, you got it backwards. Git still works just fine while my notebooks are definitely broken. Not here to play the blame game, just trying to relate the practical results.
I'm suggesting that git only breaks Jupyter notebooks (or anything else) if you do not know what to expect from git.
But if you don't know that git modifies files when there are conflicts, then you're an interesting and rather unexpected audience, I assume.
Meaning that for the typical git user, i.e. one who knows about git diffs, the behavior is expected and hence not broken. The files end up in an expected broken state, but git does not break them per se.
If you still disagree, let's just settle that we disagree and be done with it.