\d less efficient than [0-9]

0x0 · on May 19, 2013

I wonder what kind of security vulnerabilities could be looming in validators not expecting non-ascii 0-9 digits and using this regex?

duaneb · on May 20, 2013

I'm betting quite a few. People should use library number parsers, even if they reject all non-[0-9].

hdragomir · on May 19, 2013

That's exactly what I was thinking!

trebor · on May 20, 2013

As of PHP 5.3, PHP-powered software is safe. Using

    is_numeric('١٣٦٨') // -> false
    preg_match('/\d/', '١٣٦٨') // -> no match / false
    filter_var('١٣٦٨', FILTER_VALIDATE_INT) // -> false

Which I'm thankful for. I should hope that most people understand base-10 and ascii numbers. I don't want to have to worry about properly validating/handling unicode characters with number parsing.

0x0 · on May 20, 2013

In PHP 5.4.15, I get:

  var_dump(preg_match('/\d/u', '١٣٦٨')) -> 1
  var_dump(preg_match('/\d/', '١٣٦٨'))  -> 0

laumars · on May 20, 2013

Regex is a really powerful tool, but sometimes I wonder just how well people actually understand it as the vast majority of people (myself included) seem to be self taught in the syntax - only learning the bits they need as and when they need it.

The problem is, regular expressions is packed full of counter intuitive idiosyncrasies which make perfect sense once they're explained, but are far from obvious. Take this for example:

    s/(^\s+|\s+$)//g

is slower than running two separate regex, like so:

    s/^\s+//;
    s/\s+$//;

So it does make me wonder the number of bugs that have been introduced to software by bad regex.

xonea · on May 20, 2013

The speed difference is bigger than I would have expected - about one order of magnitude in perl with a simple test script : http://ideone.com/Yso23W

rhizome · on May 20, 2013

Use the s/, Luke

    s/^\s+?(.*)\s+?$/$1/g

conroe64 · on May 20, 2013

That wouldn't work. First, it will only grab at only one whitespace character at the beginning and at the end. Second, if there was whitespace at the beginning or the end but not both, it won't match at all. "^\s* (.* ?)\s* $/$1/g" would work.

rhizome · on May 20, 2013

point being: get rid of the anchors in the alternation.

Aurel1us · on May 19, 2013

Short answer: \d includes all the Unicode characters from http://www.fileformat.info/info/unicode/category/Nd/list.htm

ars · on May 19, 2013

Is that actually a good thing? If I'm using \d to validate numbers (for example to check before string to int conversion, or IP address, phone number, or any other use), other unicode digits are not helpful to me.

It's great to support unicode, but I don't think the \d should have been extended this way. Add a \ud or something.

Tuna-Fish · on May 19, 2013

Given that the category is specifically "decimal digit", I think it's good, so long as the number parsing code accepts them all too.

dllthomas · on May 20, 2013

Yes. Assuming that, it's good. I think that assumption is likely to be invalid in many cases, though.

rmc · on May 20, 2013

Yes it's a good thing. There are other places in the world that don't just use ascii. If you want European style numbers just use [0-9]

bellbind · on May 19, 2013

If you use a preg engine you can add the /a modifier which excludes unicode chars from matches.

chebucto · on May 20, 2013

Maybe specify the subset of unicode you're expecting in the headers, and have the compiler do the nitty gritty?

wging · on May 19, 2013

...at least in C# regexes.

ars · on May 19, 2013

Anyone know if this happens in other languages?

yahelc · on May 19, 2013

Doesn't appear to in JavaScript:

    "੧".match(/\d/); //null

(Incidentally, this may explain the finding from http://stackoverflow.com/a/16622773/172322, as to why adding the RegexOptions.ECMAScript flag in the C# code eliminates the performance gap)

deskglass · on May 19, 2013

Nor in python:

print re.match(r'\d','੧')

None

wulczer · on May 19, 2013

it does when using the re.U flag

  re.match(r'\d', u'੧', re.U)
  <_sre.SRE_Match at 0x3070ac0>

  sys.version
  2.7.3 (default, Mar  4 2013, 14:57:34) \n[GCC 4.7.2]

tcas · on May 20, 2013

Also, when using Python 3.2 it seems to be the default behavior

  Python 3.2.3 (default, Oct 19 2012, 20:10:41) 
  [GCC 4.6.3] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import re
  >>> re.match(r'\d', '੧')
  <_sre.SRE_Match object at 0x7f188f6d4850>

Falling3 · on May 20, 2013

Yes, but not by default.

jrabone · on May 20, 2013

Not true for Java. Docs even say:

  \d         A digit: [0-9]
  \p{Digit}  A decimal digit: [0-9]

which is actually somewhat depressing. I'd expect the named class to include the full Unicode digit set. It's surprising to see:

  ab1234567890cd matched 1234567890
  ab𝟣𝟤𝟥𝟦𝟧𝟨𝟩𝟪𝟫𝟢cd no match

from code using Pattern.compile("(\\p{Digit}+)");

EDIT: and perhaps more surprising to see in the logs:

  Exception in thread "main" java.lang.NumberFormatException: For input string: "𝟤𝟥𝟦𝟧"
  	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
  	at java.lang.Integer.parseInt(Integer.java:449)

That'll keep someone guessing for a while...

nspragmatic · on May 20, 2013

It happens in Objective-C:

    NSString *pattern = @"\\d", *string = @"੧";
    NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern
                                                                           options:NSRegularExpressionCaseInsensitive
                                                                             error:nil];

    NSUInteger numMatches = [regex numberOfMatchesInString:string
                                                   options:0
                                                     range:NSMakeRange(0, [string length])];

    numMatches ? NSLog(@"%@ found by %@", string, pattern) : NSLog(@"%@ not found", string);

    // 2013-05-20 09:38:42.650 Regexperiment[17848:c07] ੧ found by \d

LawnGnome · on May 19, 2013

Happens in PHP only if you enable Unicode regex handling via the /u modifier and are running libpcre 8.10 or later (which corresponds to PHP 5.3.4 and later, assuming you're using the bundled libpcre): http://3v4l.org/QD3k0

bodyfour · on May 19, 2013

If you're using pcre directly from C code, this is controlled by specifying the PCRE_UCP flag to pcre_compile(). By default, \d and friends only match ASCII characters even if the PCRE_UTF8 flag is set.

xudongz · on May 19, 2013

Not true for Go

http://play.golang.org/p/ls96RxJxpz

nknighthb · on May 19, 2013

I would be reluctant to rely on this until the Go documentation is clearer on the intended behavior. Right now it's very poorly specified. The regex doc[1] talks about "same general syntax" as Perl, but points to [2], which doesn't seem to understand what it's saying, describing '\d' in terms of its "Perl" meaning, but then saying that it's [0-9].

[1] http://golang.org/pkg/regexp/

[2] https://code.google.com/p/re2/wiki/Syntax

laumars · on May 19, 2013

As a Perl developer that's been making the switch to Go, I've been caught out a few times with Go's no-so-Perl-like regular expression syntax. In fact I wish I knew about your 2nd link before now, because that could have saved me a few hours over recent months.

jamesmiller5 · on May 20, 2013

Considering go's vocal support for UTF8 I'm surprised at this behavior and curious to the reason for excluding it.

masklinn · on May 20, 2013

Supporting UTF8 and correctly handling unicode are very, very different beasts. The former is absolutely trivial, the latter is extremely difficult.

Go is vocal about the former, but seems to not give a shit about the latter.

snogglethorpe · on May 20, 2013

I'm not particularly fond of go, but "correctly handling unicode" can be subjective and case-dependent... I think making only minimal guarantees and punting to the application is often the only sane course.

masklinn · on May 20, 2013

> "correctly handling unicode" can be subjective and case-dependent...

So is correctly handling integers.

> I think making only minimal guarantees and punting to the application is often the only sane course.

That is completely and utterly crazy, the average developer has neither the knowledge nor the resources to make anything but a mess out of it without proper tools and APIs. Even with these (including a complete implementation of the unicode standard and its technical reports) unicode is already complex enough to deal with.

snogglethorpe · on May 20, 2013

Of course it's not "completely and utterly crazy."

Not every app needs to deal with the enormous complexities implied by "full unicode support", and given the huge cost of that, there's a real place for a minimalist approach. If all I do with unicode is input strings from the user, store them in a database, and then later spit them out, I don't need to be able to do Turkish case-conversion, and I may not want to pay the cost of making it possible.

Certainly tools and APIs help for those cases where an app needs to do the sort of complicated text-processing that warrants "full" unicode support, but it's not at all clear that the proper place for such support is in the base language libraries. It's quite reasonable for the language implementors to say "if you want to do X, we'll support that, but if you want to do Y and Z, please use external library L."

masklinn · on May 20, 2013

> Not every app needs to deal with the enormous complexities implied by "full unicode support", and given the huge cost of that, there's a real place for a minimalist approach.

Not sure what point you're trying to make, I never said all applications had to make full use of all possible Unicode APIs, I said the language must expose them. Because if it doesn't, those who should use them will never become aware of them let alone use them.

> If all I do with unicode is input strings from the user, store them in a database, and then later spit them out, I don't need to be able to do Turkish case-conversion, and I may not want to pay the cost of making it possible.

So?

> It's quite reasonable for the language implementors to say "if you want to add numbers, we'll support that, but if you want to subtract or divide them, please use external library L."

Really?

Then again, considering Go's embedded contempt for non-US locales (see: datetime patterns) I'm not even sure why we're having this discussion, and since it's obvious they don't care for a non-US world it make sense that they wouldn't care for processing text.

And at the end of the day, you agree that Go has no provision for unicode handling, you just think it's all fine and dandy.

cwmma · on May 20, 2013

All the same speed in JavaScrip http://jsperf.com/regexcwm/2

jeltz · on May 19, 2013

Happens in Perl but not ruby or PostgreSQL.

pfedor · on May 20, 2013

Doesn't happen in Perl for me:

  pfedor@Pawels-iMac:~$ perl -ne 'print "Digit!\n" if /\d/'
  af
  3
  Digit!
  23fa3
  Digit!
  asdf
  ١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹
  ৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫
  ୧୨୩୪୫୬୭୮
  ౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯๐๑๒๓๔๕๖๗๘๙໐໑໒໓
  234
  Digit!

(perl from Macports and perl from /usr/bin/perl behave the same in this respect.)

xonea · on May 20, 2013

You have to tell to interpret stdin as UTF-8 (flag -C) - then it works: https://news.ycombinator.com/item?id=5734641

pfedor · on May 21, 2013

Good to know, thanks.

I'd argue that perl gets it right--as the default behavior, this behavior would gravely violate the principle of least surprise, but for the 0.01% of people who want \d to match ੧, there's no harm to making it available as an option you need to specifically request.

Falling3 · on May 20, 2013

Exactly what I was thinking.

Doesn't in Ruby:

/\d/.match "੧" #=> nil

Argorak · on May 20, 2013

Just for reference:

  /\p{Digit}/.match "੧" => #<MatchData "੧">

hkmurakami · on May 19, 2013

oh wow I had no idea that "full width digits" can actually be handled properly. (U+FF10 ~ U+FF19)

coldtea · on May 19, 2013

Or improperly. If you expect \d to be a shorthand for 0-9, your string can also contain junk.

joosters · on May 19, 2013

The quoted benchmarks all complete in fractions of a second. Not a good sign. They may be reliable results, performed accurately, but why risk it?

IMO you should be running something for much longer, to protect against random short spurious events. e.g. a task reschedule, interrupts, etc could add significant variances. It wouldn't hurt to add a few more zeros to the loop and wait a minute for the results.

stephencanon · on May 20, 2013

Those events are all on the order of micro-seconds, far shorter than the benchmark duration. A tenth of a second is an eternity on a modern CPU.

scott_s · on May 20, 2013

One of those events, yes. But it's possible for the system to be experiencing a bursty workload unrelated to your benchmark, and many of those events may happen. There's also the problems of startup effects, both at the high level (the VM, which in this case is .Net), the medium level (major and minor page faults) and the low level (caches).

My rule of thumb is that benchmarks which are supposed to be bound by the processor and memory should last at least 60 seconds.

stephencanon · on May 20, 2013

VM startup effects I can get behind as a confound; page faults and cache effects are below millisecond level (filling the beefiest Sandybridge Xeon L3 cache you can buy from a completely cold state is on the order of 1 millisecond, and a micro benchmark like this doesn’t come close to using that much data).

I would also note that one is sometimes in the position of needing to measure performance of a compute-intensive task that is latency-critical but will not be running constantly; in such a scenario, using long-running benchmarks can be misleading because the processor will become thermally constrained and drop in and out of lower voltage/frequency bands, further confounding measurements.

I agree with you that a tenth of a second is on the shorter side of what I would like to see in such a benchmark, but I don’t think the situation is as dire as your first post suggested; unless the system is exceptionally noisy, the measurements seem to be valid, despite the relatively short duration. 60 seconds is overkill for a simple task like this.

scott_s · on May 20, 2013

Again, it's repeated page faults and cache effects.

When you run experiments, you want to draw conclusions. To have confidence in your conclusions, you want to eliminate as many variables as possible. In my work, I set the time of the benchmark high enough that I am confident that it is very unlikely for these effects to have a significant influence on the results. When you're drawing conclusions and publishing the results that will be scrutinized by peers, "overkill" is the way to go.

Also note that I was not the first poster on this subject.

stephencanon · on May 20, 2013

You cannot eliminate confounds by simple over-measurement. “Overkill” provides false confidence.

The only way to eliminate confounds is to understand them, and either control for them or bound them to an acceptable error tolerance. For a simple benchmark such as this, cache misses and page faults reach steady state within the first millisecond of operation; the error they contribute to the measurement of a .1s benchmark (even in aggregate) is no more than 1% — almost surely acceptable.

I have no experience with .Net, and would not care to make any estimates on the contribution of VM startup time, but the experiment in question does not include the VM startup in the measurement.

If a system were so noisy as to have interrupt storms on the order of .1s, then I would not be comfortable with timings that run for 60s either. I would much rather have statistics on 100 measurements of .1s each, which would make clear the impact of such anomalies (while still being faster to gather). There are many events that can make such measurements slower, but almost none that can make them faster; the distribution of the measurements is typically well-modeled by a Poisson distribution with bias. If one is actually trying to eliminate the effect of those events from the measurement, taking the minimum over many short samples is actually much closer to the truth than averaging over one long sample. If instead one is trying to include the effect of such events, then a different statistic would be in order.

Erwin · on May 20, 2013

Python's methods on unicode strings also apply this logic. E.g.:

     >>>  u'١٣٦٨'.isdigit()
     True

    >>> int(u'١٣٦٨')
    1368

I suppose this could be potentially abused if you are storing and displayeing what is supposed to used as a number as unicode text, but later convert it to a number. E.g. an online shop where you are asked whether you want to pay '5꯸' for some item which looks like 5 plus some weird square, but is really int(u'5꯸') => 58 -- http://www.fileformat.info/info/unicode/char/abf8/index.htm

foobar__ · on May 20, 2013

The fact that character ranges like [a-z] can depend on the value of LC_COLLATE is also something not many people are aware of.

  $ echo "ä" | LC_COLLATE=C grep '[a-z]'
  $ echo "ä" | LC_COLLATE=en_US.UTF-8 grep '[a-z]'
  ä

For common values of LC_COLLATE, the range [a-z] does not exclude accented characters and umlauts.

rwmj · on May 19, 2013

I was a bit surprised that Perl does not seem to be matching Unicode digits. Anyone know why?

    $ echo '0' | perl -pe 'print "yes: " if m/\d/'
    yes: 0
    $ echo '੧' | perl -pe 'print "yes: " if m/\d/'
    ੧

xonea · on May 19, 2013

You have to tell perl to expect utf8 from stdin (switch -C).

  $ echo '੧' | perl -C -pe 'print "yes: " if m/\d/'
  and
  $ perl -e 'use utf8; print "yes\n" if "੧" =~ m/\d/;'

both work :)

pooriaazimi · on May 19, 2013

`man perlunicode` is chockfull of utf8-related stuff (and it's looong): http://perldoc.perl.org/5.14.0/perlunicode.html

damncabbage · on May 19, 2013

Ditto PHP:

  php > var_export(preg_match("/\d/", "1"));
  1
  php > var_export(preg_match("/\d/", "۳"));
  0

jpiasetz · on May 20, 2013

Add /u

    php > var_export(preg_match("/\d/u", "۳"));

bellbind · on May 19, 2013

It does. See http://ideone.com/Q1lf1M

ars · on May 19, 2013

Try:

    utf8::upgrade($string)

And/or:

    use feature 'unicode_strings'

netfeed · on May 20, 2013

The documentation says \d should match if you use /u on the regex

_3u10 · on May 19, 2013

The test code creates a new regex every time, would be interesting to see how it works with a compiled and reused regex.

joosters · on May 19, 2013

It compiles it once and then matches it against 10000 strings.

jnotarstefano · on May 20, 2013

There seems to be a tiny bit of difference in Ruby too. This code:

    require 'benchmark'

    def random_string(length)
      result = (1..length).map { (65+rand(26)).chr }.join
      result[rand(length)] = rand(10).to_s if rand > 0.5
      result
    end

    Benchmark.bmbm do |b|
      b.report("\\d") do 
        (1..1000).count { random_string(1000).match(/\d/) }     
      end

      b.report("[0-9]") do 
        (1..1000).count { random_string(1000).match(/[0-9]/) }
      end

      b.report("[0123456789]") do 
        (1..1000).count { random_string(1000).match(/[0123456789]/) }
      end
    end

gives:

    ~/Code/ruby% ruby regex.rb
    Rehearsal ------------------------------------------------
    \d             0.690000   0.000000   0.690000 (  0.712500)
    [0-9]          0.690000   0.000000   0.690000 (  0.703990)
    [0123456789]   0.680000   0.010000   0.690000 (  0.705759)
    --------------------------------------- total: 2.070000sec
    
                       user     system      total        real
    \d             0.710000   0.000000   0.710000 (  0.791722)
    [0-9]          0.700000   0.000000   0.700000 (  0.708210)
    [0123456789]   0.690000   0.010000   0.700000 (  0.713355)

justanotherbody · on May 20, 2013

Noted here that modifying \d to only include [0-9] yields \d more efficient http://stackoverflow.com/a/16622773/1943429

ams6110 · on May 20, 2013

I tend to use ranges (e.g. [0-9]) as they seem to me to be more standard than the token for "any digit" (often \d, but in elisp (Emacs) it's [:digit:])

belper · on May 20, 2013

Interesting to see these missing, which are 1 and 1, respectively: [一, 壹]

anonymous · on May 20, 2013

I think that's because they are treated as words, rather than digits. The same way that \d won't match "einz".

hdragomir · on May 19, 2013

in C#

hdragomir · on May 19, 2013

Here are some results in Javascript: http://jsperf.com/digit-regex

jeltz · on May 19, 2013

It does not match it in ruby or PostgreSQL (which uses a modified version of the tcl regexp implementation).

conchulio · on May 20, 2013

Maybe the order in which the Regexes are evaluated is also important due to caching etc. Has anyone tested if results are different when changing the order?

dbbolton · on May 20, 2013

What is the need for those "mathematical monospace/bold/sans" characters? Should that be a font issue?

claudius · on May 20, 2013

Fonts are about different representations of the same symbols. Monospace/bold/sans are different symbols in mathematics.