While data.table does have impressive benchmarks, the rise of dplyr in the years since with its significantly more readable and concise syntax makes life a lot easier for data sets that are not super large. (https://cran.rstudio.com/web/packages/dplyr/vignettes/introd...)
The vignette linked above describes some compatibility between dplyr syntax and data.table, but I admit I have not tested that.
Nowadays, if you are working with multigigabyte datasets, it may be worth looking into Spark/SparkR (https://spark.apache.org/docs/latest/sparkr.html) for manipulating big data in a scalable manner instead, as there is feature parity with data.table + analytical tools [GLM] as a bonus. (The syntax is still messy, to my annoyance)
You can rent an r3.8xlarge instance with 244GB RAM on AWS for $2.66/hr. If that isn't enough, an x1.32xlarge instance with ~2TB RAM is $13.34/hr. Assuming your data is already in AWS, of course. Those machines also have a crap ton of CPU power (I'm explicitly not saying cores because I'm mystified by what vCPU really means), which you will be hard pressed to take advantage of, but if it's RAM you need they're super cheap.
If you prefer Azure, a G5 instance with 448GB RAM is available for $9.65/hr.
There's not a lot of need for Spark/SparkR for the vast majority of data sets, given the cheapness of compute these days.
The CPU power would not be too hard to utilize, if one's task is amenable to fairly coarse-grained parallelization, which is rather easy to implement in R scripts with the help of foreach[1] and doFuture[2].
Such an approach worked to great effect for me recently, when I needed to perform a `zoo::rollapply`[3] across a time series with tens of millions of rows. The speedup from throwing more cores (7, on my laptop) at it was roughly linear. If I ever need to scale up the analysis to hundreds of millions of rows, the 128 vCPUs of an x1 EC2 instance would be well worth the $$/hour.
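A minimal sketch of that coarse-grained approach, assuming the foreach, doFuture, future and zoo packages are installed. `my_stat`, the chunk count and the toy series are hypothetical stand-ins, not the poster's actual function or data. The output positions are split into contiguous chunks, and each worker gets its slice plus `width - 1` extra points so that windows straddling a chunk boundary are still computed.

```r
library(foreach)
library(doFuture)
library(zoo)

registerDoFuture()                               # route %dopar% through the future framework
future::plan(future::multisession, workers = 2)  # 2 background R sessions; raise on bigger machines

# Hypothetical stand-in for a custom window statistic (range of the window)
my_stat <- function(w) max(w) - min(w)

set.seed(1)
x <- cumsum(rnorm(2000))        # toy series
width <- 25
n_out <- length(x) - width + 1  # number of complete windows
n_chunks <- 4

# Split the output positions into contiguous chunks
chunks <- split(seq_len(n_out), cut(seq_len(n_out), n_chunks, labels = FALSE))

parallel_res <- foreach(idx = chunks, .combine = c) %dopar% {
  # Each chunk needs width - 1 extra points past its last output position
  slice <- x[min(idx):(max(idx) + width - 1)]
  zoo::rollapply(slice, width, my_stat)
}

# Single-process reference for comparison
serial_res <- zoo::rollapply(x, width, my_stat)

future::plan(future::sequential)  # shut the workers down
```

Whether the speedup is close to linear will depend on how expensive the window function is relative to the cost of shipping each slice to a worker.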
If your function has a well-defined rolling form, e.g. rolling mean or standard deviation, you should use a pre-optimized function for that. The `caTools` package has a bunch of useful ones.
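For instance, a rolling mean with `caTools::runmean` might look like the sketch below (assuming caTools is installed; the data and window width are made up for illustration). The base-R line is only there as a slow reference to check against.

```r
library(caTools)

set.seed(1)
x <- rnorm(1000)
k <- 5  # odd window width

# endrule = "trim" drops the k %/% 2 edge positions on each side,
# leaving one value per complete window
fast <- caTools::runmean(x, k, endrule = "trim")

# Base-R reference: embed() lays out every length-k window as a row
reference <- rowMeans(embed(x, k))
```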
Thanks! As it happens, my function doesn't conform to any pre-optimized facilities, though I wish it did. At some point, I may look into writing an optimized variation with the help of Rcpp, or even using the (inspiring!) Fortran approach of quantmod/xts.
The syntax is very concise and cryptic, but what is messy about dt[i,j,k]?
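For readers who haven't seen it, here is a small sketch of the three slots (the third is normally written `by`), assuming data.table is installed. The table and column names are invented for illustration.

```r
library(data.table)

dt <- data.table(
  g = c("a", "a", "b", "b", "b"),
  v = c(1, 2, 3, 4, 5)
)

# i: keep rows with v > 1;  j: compute per-group summaries;  by: group on g
res <- dt[v > 1, .(total = sum(v), n = .N), by = g]

# := in the j slot adds a column by reference (dt itself is modified, not copied)
dt[, v2 := v * 2]
```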
data.table transformed the way I do analysis. Honestly, probably 75% of my lines are dt[] calls. It allows me to routinely analyze 500M-row data sets interactively on a powerful workstation.
While I like and use dplyr from time to time, I wish it could handle POSIXlt datetimes. Also, it seems to barf on some data frames just because they contain a column type dplyr doesn't support, even if that column is not being used in processing.
Just use lubridate; it's a lot less of a headache.
my beef is with the new tibble-ish format of dplyr. the format doesn't support lists as cells, unlike data frames. right now, `tbl_df` is a good in-between, but since dplyr is converging to the tibble/feather format, it's gonna be rough
Do you use data.table syntax, or do you use dplyr with the tbl_dt class to abstract that away?
I was a data.table acolyte for years, but after being forced to learn dplyr I can't imagine going back. I'm hoping if I need update-by-reference in the future I can use tbl_dt but I haven't played with it yet.
If I'm using them together, I'll start off a pipeline with a big DT expression, then pipe it through a few dplyr verbs as a kind of "post processing."
I find that some tasks are frustrating with one package but easy with the other, so more frequently I just have some "dplyr lines" and some "data.table lines" in my scripts.
And the compatibility between the two has improved a lot, so you don't get so many "invalid selfref" warnings and unexpected behavior when you combine them in pipelines.
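A rough sketch of that "big DT expression, then dplyr verbs" pattern, assuming both data.table and dplyr are installed. The toy table is invented, and this is only one way the two can be mixed.

```r
library(data.table)
library(dplyr)

dt <- data.table(
  g = c("a", "a", "b", "b", "c"),
  v = c(1, 2, 3, 4, 5)
)

# Heavy lifting in one data.table expression, then dplyr verbs as post-processing
out <- dt[, .(total = sum(v)), by = g] %>%
  mutate(share = total / sum(total)) %>%
  arrange(desc(total))
```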
Do any of these support out-of-memory data? IIRC dplyr can use a db as a backend, but I could be wrong. Anyway, I wonder why not just use a db & SQL for more demanding data munging?
Presumably that depends at least on the type of data and whether you have a suitable database to use. As another approach, for R there's an interface to parallel NetCDF in the pbdR system that you might use for distributed processing on an HPC system. There doesn't seem to be anything like pbdR for Python, for instance.
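On the db backend question, a minimal sketch of how that can look, assuming the dplyr, dbplyr, DBI and RSQLite packages are installed (in current releases the database translation lives in dbplyr). The in-memory SQLite database and the `sales` table are stand-ins for a real server.

```r
library(dplyr)
library(DBI)

# In-memory SQLite stands in for a real database server
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "sales", data.frame(g = c("a", "a", "b"), v = c(1, 2, 3)))

# The verbs are translated to SQL and run inside the database;
# collect() pulls only the (small) result into R
res <- tbl(con, "sales") %>%
  group_by(g) %>%
  summarise(total = sum(v, na.rm = TRUE)) %>%
  arrange(g) %>%
  collect()

dbDisconnect(con)
```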
I routinely do my data munging in SQL/Hive/Pig. I think R is more of a complement to that. Mainly for more data munging that needs some programmatic manipulation and of course advanced statistical analysis.
How to convert wide-format data to long-format data (a process called 'reshaping') is very important to understand. Certain statistical tests cannot be performed if the data is not in long format.
The article mentions reshape2. I also love tidyr. It's much easier to use than reshape2 when dealing with a large number of variables (i.e., columns).
Interesting, I'll have to watch out for it then. I found it more helpful than reshape2 when trying to convert a dataset with 55 variables from wide to long format. 8 of those variables needed to be collapsed into 1 variable. With reshape2, I had to identify all the ones I didn't want in the final result; with tidyr I just needed to identify the ones I wanted. Hence it was easier with tidyr.
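A small sketch of the "just name the columns you want collapsed" style, assuming a reasonably recent tidyr is installed. It uses `pivot_longer()`, the current replacement for the older `gather()`; the data frame and column names are made up.

```r
library(tidyr)

wide <- data.frame(
  id  = 1:3,
  age = c(30, 41, 25),
  q1  = c(5, 3, 4),
  q2  = c(2, 4, 4),
  q3  = c(1, 5, 3)
)

# Name only the columns to collapse; id and age are carried along untouched
long <- pivot_longer(wide, cols = c(q1, q2, q3),
                     names_to = "question", values_to = "score")
```

Each original row becomes three rows here, one per collapsed column.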
It's a bit lame, but I am so in love with RStudio... I can't think of a Python IDE that comes close for data science. Simple things like being able to see all the entities you've created in your session and being able to jump back and forth through plots are really nice.
Since the comment from the new account mentioning it may be modded away, I'll say (as a recent user, not affiliated with them in any way) that Rodeo is a fairly robust integrated analysis environment, a more immediate kind of Jupyter notebook. They recently added a native dataframe viz, too.
I am with you on this. While I feel like I should eventually move fully into Python for data, with it being a more multipurpose language, I can never seem to make the transition because Python does not have an RStudio equivalent.
I have tried yhat's rodeo and also yhat's port of ggplot for python but it just isn't there yet in my opinion.
I switched from Python/pandas to R because I was more interested in applying the algorithms/methods than in developing them. Python is fine if you want to develop the algorithm or implement a particular method. But if you are more interested in applying an algorithm/method, the availability of R packages to do so is hands down the best. Whatever you want to do in R, there is most probably already an R package for it.
This really strongly depends on your industry/focus, and what your end goal is. Some industries are completely dominated (library/community wise) by one or the other. I used to do finance work and strongly preferred python, so I tried hard to use it. (This was also before/right when pandas was out.) I was constantly plagued by needing some fitting routine e.g. for a vector GARCH model, and there was a package in R just sitting there. I was able to get very good mileage out of the RPy2 interface, though.
On the flip side, I've done a fair swath of work in the machine learning arena, and in particular the deep learning topics before it was called 'deep learning', and it was nearly hopeless to use anything except Python or MATLAB (both strongly tied in with C++/CUDA libraries). I think this is still largely the case.
As others have mentioned here, for me the biggest pull away from R was that it's not general purpose. Hard to ship someone R code. Hard to throw a web framework in front of your code. Hard to build a rich desktop GUI on top of it. I know you can do most of those things in R, but last time I dealt with it there was a massive ravine in usability/maturity. I'd also be insincere if I didn't admit that the pervasive R coding style just drove.me.fucking.crazy. That, and I found that while there was a mindbogglingly large pile of libraries, documentation was usually very lacking.
One of the aspects of data munging is visualization, which is much easier in R with ggplot. Matplotlib is painful and always requires more than double the code to accomplish the same thing.
There are a lot more options in Python for data viz than just matplotlib. Pandas has several built-ins as well, so in many cases simple plots on existing data frames can be a one-liner.
R is generally better for data science, so it's nice to be able to use 1 ecosystem for a whole project. But yes, I too have in the past used python for data munging and R for machine learning.