Data Science in Julia for Hackers (datasciencejuliahackers.com)
240 points by dklend122 on March 19, 2021 | 51 comments



I really want to love the Julia language, but the overhead of package loading annoys me. Perhaps this is just me being lazy or spoiled after using R packages and Python libraries.


Yeah, it's still disappointing how many system resources it takes to get a plot on screen (even with all the improvements in 1.6; benchmark taken from https://www.oxinabox.net/2021/02/13/Julia-1.6-what-has-chang...).

    julia> @time (using Plots; display(plot(1:0.1:10, sin.(1:0.1:10))))
      9.694037 seconds (18.29 M allocations: 1.164 GiB, 4.17% gc time, 0.40% compilation time)
To be honest, this isn't really something the devs should be proud of. 10 seconds, 18.29 M allocations and 1.164 GiB of memory just to render a simple static image is unacceptable. For Julia to be a good-enough language for general scientific use, the current slow LLVM-based JIT compiler sadly isn't enough. I really want Julia to succeed, but this one problem overshadows every good language feature.

In an ideal world Julia would have two backends:

- A fast, hand-made JIT backend (something like LuaJIT) that mostly interprets code and only JIT-compiles frequently run code paths.

- The current slow, almost-AOT-like LLVM backend which is only used for precompilation of packages

It's just like Debug/Release builds in C++, except that the Debug builds are lightning fast to compile (but slower at runtime) and Release builds are slower to compile (but hyper-optimized for runtime speed). I imagine something like this would take an immense engineering effort, given how tightly Julia is coupled to the LLVM compiler/runtime today. But that's just my pipe dream.


Agreed, that would be ideal. But it's also a little unrealistic in the short term - most languages don't have both an interpreted and a compiled mode. I think there just isn't the dev manpower.

It's more useful to think of Julia's latency as a fundamental constraint of the compilation model. People don't complain that Python can't deliver static binaries, or that C++ is unsuitable for scripting. Julia's compiler can do both (although the "static binaries" haven't materialized, because no one is seriously working on them), but the tradeoff is that you suffer from latency. That's fundamental. Future versions of Julia may cut the compilation time down further, but it will never be instant or even close to instant.

So, like other languages, it means you have to look at the strengths and weaknesses of the language for your use case. For myself, I can't imagine a situation where I want to plot something right now and not in 15 seconds. Whenever I plot something, it's always after minutes or (usually) hours of data analysis. Most of my Julia use is either long, interactive sessions where ten seconds of startup doesn't matter much, or long-running pipelines where it doesn't, either.


Little-known fact: there is a Julia interpreter, if anyone really wants to go that way: https://github.com/JuliaDebug/JuliaInterpreter.jl
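If anyone wants to get a feel for it, a minimal sketch (just the package's exported `@interpret` macro on an arbitrary call):

    using JuliaInterpreter

    # Run the call through the interpreter instead of the JIT:
    # no compilation latency for this call, but slower execution.
    @interpret sum(abs2, 1:1_000)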


> I can't imagine a situation

If you have N scripts that produce N plots for a paper, you can't just drop them in a makefile or you will pay the 15-second penalty over and over and over every time you build. You can't reset the interactive session as a matter of habit to aggressively control state build-up. Even iterating on a dashboard becomes painful with a long startup time. You have to treat your Julia session like a child, not like cattle. That's limiting, and not everyone wants to, especially now that they're used to treating Python sessions like cattle and reaping the benefits.


Here we go again. Those allocations and that run time can be avoided entirely using sysimages. The latest VS Code extension gives a convenient build task for this.

It seems like every time Julia is mentioned, someone has to clarify this.
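For anyone who hasn't tried it, a rough sketch of the PackageCompiler.jl route (the package list and output path here are just examples):

    using PackageCompiler

    # Bake Plots into a custom system image. This step is slow, but afterwards
    # `using Plots` and the first plot are nearly instant.
    create_sysimage(["Plots"]; sysimage_path="sys_plots.so")

Then start Julia with `julia --sysimage sys_plots.so` (the VS Code build task does essentially this for you).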


The problem I see with sysimages is that the whole process is incredibly unergonomic for exploratory coding, which Julia always claims as a primary use case. When you just want to glue some packages together and quickly test out some results, the last thing you want to do is think beforehand about which functions from which libraries you're going to use, carefully write them into a precompilation file, and wait for the compiler to precompile and cache those code paths! And once those dependencies change even a bit you need to wait for precompilation again, which happens a lot when you're prototyping quickly or upgrading versions (for example, numerical library X doesn't have a particular feature I want, so I switch to library Y, and then I have to wait for everything to be precompiled again). The reality is that most people just want to fire up a virtualenv, do the occasional "pip install <my libraries>", and not get involved in any of this stuff. (As someone who uses C++ I have a lot more patience than most, but you shouldn't really expect that of others.)

I think Julia needs to throw away some of its unnecessary obsession with just-ahead-of-time (JAOT) compilation. Compiling everything to LLVM machine code beforehand is good for raw runtime speed, but that doesn't matter when it takes so much setup time just to glue some simple libraries together. Julia should seriously consider running most of the less computationally intensive code paths in a custom-made interpreter without LLVM, while precompiling the numerically intensive parts of library code to raw machine code with the current LLVM backend. Maybe there should be a flag to mark certain functions as unconditionally compiled AOT-style, leaving the rest to the interpreter. Interpreters can be made surprisingly fast with some effort - the interpreters in LuaJIT (https://luajit.org/) and HashLink (https://hashlink.haxe.org/) are what come to mind.


Regarding your first paragraph, I must ask: have you done significant coding in Julia? Because an outdated sysimage is not really that big of a problem in practice. I have a sysimage which includes the common big packages for plotting and numerical computing, plus some prettifying stuff I like, such as OhMyREPL. During dev time I add and remove packages; the compile time for those is noticeable but not a problem. Because of Julia's great composability, I'm guessing those packages internally use the same big libraries and thus find most things already precompiled. But I can't be sure about that.

Anyways, this is just my experience and my job involves a lot of waiting and stuff anyways so a couple seconds here and there doesn't bother me.


To me, separating the compiler and the interpreter looks like a new two-language problem (keeping the two consistent seems difficult).


It would be the same code running on two different backends, so there would not be a two-language problem. Just like how GCC and Clang don't give C a two-language problem.


It's a two-language problem for the Julia developers themselves. Also, the two implementations might drift apart over time.


They just explained to you that this idea has nothing to do with two different languages and that there are other examples of having multiple compilers for one language.


It's not a problem if the language spec is well defined. Scheme distributions often do this.


AFAIK Julia doesn’t even have a well-defined syntax (there’s no specification such as a context-free grammar, only the implementation of the parser).


If it worked by default out of the box, nobody would have to "clarify" that you can decrease Julia's computational slowness by increasing the human time and effort spent on cache management.

:/


One of the things I took away from trying Julia was that Unicode is incredibly dumb in some respects. Unicode lacks support for parts of the Latin alphabet in subscript and superscript form (and similarly for the Greek alphabet) [0].

As Julia relies on Unicode for its neatly formatted variable names, this is an annoying limitation that the developers can't do much about.

Unicode has over 100k symbols, including about 3,500 emoji. But a complete set of 26 small letters as subscripts or superscripts, in lowercase or uppercase (or both), is apparently too much bloat.

[0]: https://stackoverflow.com/questions/17908593/how-to-find-the...
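To make the gap concrete, here's roughly what you run into in the REPL (coverage as of current Unicode, to the best of my knowledge):

    # These subscript letters exist as Unicode codepoints, so they are valid identifiers:
    xₐ = 1.0       # typed as x\_a<TAB>
    vₘₐₓ = 9.81    # v\_m<TAB>\_a<TAB>\_x<TAB>

    # But there is no subscript "d" (nor b, c, f, g, q, w, y, z),
    # so something like x\_d has nothing to complete to.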


Julia core developers are at the forefront of pushing for a better Unicode standard:

https://github.com/stevengj/subsuper-proposal


I don't mind Julia's Unicode support, but I definitely share the gripe about the Unicode standard not yet supporting a complete a-z set of sub/superscripts!


What's the alternative? If you want Julia files to be editable by regular programs, I can't think what else you would use.


Which is why I am so disappointed about those gaps in Unicode. Apparently the Unicode consortium doesn't care because in their eyes such things have to be handled by higher level formatting software, but that's not a very convincing excuse imho.

It's just so bewildering that, on one hand, the lowercase superscripts are missing exactly one letter for full coverage, while at the same time Unicode has characters like "Grinning cat face with smiling eyes" (U+1F638). Just what I needed.

It's a bit disappointing because, together with the REPL integration, this feature in Julia works well enough to give a glimpse of the potential, but trying to go further reveals gaps that make it a bit of a trial-and-error affair. Better direct support in standardized encodings could make this much more usable, and maybe not just in one specific language.


IMO, the solution here from Unicode's end seems really clear. Just add superscript and subscript modifiers (the same way emoji have skin-tone and gender modifiers). That way, you don't need to add a special codepoint for every character people want different versions of.


How about super- and subscripts for the 35,000 or so Chinese characters? I think the Unicode committee made the right call.


How about a single "superscript" and "subscript" modifier that handles any of the other Unicode symbols?

The standard already has enough formatting options to make smartphones crash by sending them a text message (as has been seen multiple times on both Android and iOS).

If you think the Unicode committee made the right call, please explain why the decision to add smiling cat pictures and over 10 heart symbols was correct, but supporting everyday maths notation was not.


Julia doesn’t rely on any Unicode. It just gives you the option to use it. If it’s not useful to you, feel free to not use it.


How would one not use Unicode? Every character on my keyboard and screen is a Unicode character.


Most of them are ASCII as well.


You mean the time it takes at run time? I'm not quite sure what you mean - I think Julia's package management is generally better than Python's (which is a mess) - it doesn't play well with Nix, but I can't really blame it for that.


I think they're referring to the JIT compilation time. There are some ways to mitigate it, such as creating an image using PackageCompiler.jl, but it's definitely a noticeable issue IME. https://julialang.github.io/PackageCompiler.jl/dev/


Or the fact that once you define a struct, it cannot be redefined and you need to restart the REPL.


You can put it in a module; then you can change it and reload the module in the REPL without restarting.
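Something like this, in case anyone hasn't seen the trick (the module name is arbitrary):

    module Scratch
        struct Point
            x::Float64
            y::Float64
        end
    end

    Scratch.Point(1.0, 2.0)

    # Edit the struct's fields and re-evaluate the whole module: the module
    # (and the struct inside it) gets replaced instead of throwing a redefinition error.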


From a data science perspective, do you see any obvious advantages to Julia compared to R/Python?


I think it really depends on what subset of data science you have in mind. For me it solved my personal version of the "two-language" problem, but YMMV. Longer-term I think the most interesting thing about the Julia ecosystem is the composability that comes from dispatch-oriented programming [1]

[1] https://www.youtube.com/watch?v=kc9HwsxE1OY


Yes. And the warts in the language are pretty off-putting.

The original paper was awesome. The execution of the language, IMO, hasn't been good despite the great marketing.


This book is being written entirely in Pluto notebooks!


I had heard of Pluto notebooks but didn't know that they were truly different. The elevator pitch seems to be

Pluto: Jupyter without hidden state.

I can't wait to try it out. I did my thesis work in Julia ~5 years ago but abandoned it afterwards precisely because state management was so painful. At that time, Julia took 30 seconds to start (compiling the plotting library), so you basically had to use live sessions, and they had all the usual stateful foot-guns, which made for a very painful overall experience. Performance was still pretty rough last time I tried it (build caching for libraries didn't happen on install, and even post-install wasn't Just Working), but Pluto sounds like a strong enough mitigation to be worth a try!


Adding Plots (or for that matter, Pluto!) to your sysimg with PackageCompiler.jl is a potentially significant QoL improvement on that front. Parallel auto-precompile in 1.6 is also awesome.


    auto-precompile
Ok, with Pluto it had my interest, but now it has my attention.


You should check out https://www.oxinabox.net/2021/02/13/Julia-1.6-what-has-chang... if you've been away for a while. It's a pretty good list of what has been improved.


For the record, I'm currently seeing 2.8 seconds for `using Plots` and another 4 for the first plot (on 1.6).


There really should be a better way to map values. There are numerous functions for this in R, and I believe there should be similar functions in Julia. Is it that Julia has good performance, so people tend to write loops to deal with this kind of task? And call that hacking?

I tried to paste the code, but it was hard to read. You can search for "Histograms" on this page: https://datasciencejuliahackers.com/03_probability_intro.jl....


For most cases, Julia’s broadcast syntax [1] is the easiest way to map a function over some values. When you want to do something more complex, Julia provides an incredibly flexible and powerful iterator in CartesianIndices [2].

Did you have a more specific question about mapping or iteration in Julia?

[1] https://julia.guide/broadcasting

[2] https://julialang.org/blog/2016/02/iteration/
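A quick sketch of both, for anyone skimming:

    A = rand(3, 3)

    # Broadcasting: apply any function elementwise with the dot syntax.
    B = sqrt.(A .+ 1)

    # CartesianIndices: iterate over every index of an array of any dimension.
    for I in CartesianIndices(A)
        A[I] = 2 * A[I]
    end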


Using a loop and `if` statements as the book does is unidiomatic; the `replace` function (and its in-place equivalent `replace!`) works as you'd expect. https://docs.julialang.org/en/v1/base/collections/#Base.repl...

If you have a mapping dict defined, I think the syntax is just `replace!(rainData.month, monthMap...)` (splatting the dict into `old => new` pairs), not tested though.
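A tiny sketch of that idea on a plain vector (the `monthMap` dict here is made up):

    monthMap = Dict("enero" => "January", "febrero" => "February")
    months = ["enero", "febrero", "enero"]

    # Splatting the Dict passes its entries as old => new pairs to replace!
    replace!(months, monthMap...)   # ["January", "February", "January"]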


What do you mean, "map values"? There is

    [sin(x) for x in vec]
    map(sin, vec)
    sin.(vec)
which are roughly equivalent? Is that what you're referring to?


Don't forget the "lazy" version: (sin(x) for x in vec)


If you're referring to the `for` loop replacing Spanish month names with English ones, that could have been done like this [1]:

    rainData.month = map(mth -> myEsEnDict[mth], rainData.month)
[1] https://syl1.gitbook.io/julia-language-a-concise-tutorial/us...



Can you give a code example in R of what you would like to do in Julia?

There are quite a lot of different ways of mapping in Julia, including functions like `map`, `mapreduce`, `replace`, and various syntaxes for broadcasting and array comprehensions.

For maximum SIMD performance there are also things like `vmap` and `vmapreduce` from LoopVectorization.jl.


I see what you're saying regarding the histogram code. A cleaner approach would have been to store the translations in a named tuple or Dict and iterate through it.

One of the other commenters also mentioned the `replace` function, which could be used in place of a loop.


Really liked the introduction; it resonates a lot with what I've been thinking for a while.


Off topic, but this is a few spots below "Hacker's Guide to Numerical Analysis" on the front page right now. It makes the naming seem pretty silly.

Hacker News presents: X is all you need for Hackers considered harmful


"For hackers" is a trick so you don't discourage people in HN. If you omit "for hackers", they'll say it has too much math and therefore no practical use. These are the same kind of people who think that the reason people use math is so they can feel smart about themselves, because practical people who want to do practical things don't need math - they have computers.



