Fixing a Twenty Six Years Old Emacs Bug

They got it backward those days…

Created on [2024-02-10], last updated [2024-02-11]

A couple of weeks ago, an interesting bug report landed on the Emacs bug tracker:

Subject: bug#68762: 30.0.50; thing-at-point for an e-mail adress stops at “.”

When I use (thing-at-point ’email) and point is on an e-mail address like [email protected], thing-at-point only gets [email protected] and loses the first part before the “.” character.

This happens after I upgraded from 29.1 to emacs 30.0.50.

The reason this is a great bug report is that it’s easy to reproduce right there in the buffer showing the bug report itself. Namely, since I was reading this inside Emacs, this was as simple as placing the cursor right before where it says (thing-at-point 'email), hitting C-M-k to “kill” (copy) the whole expression, then C-s name to place the cursor in the middle of [email protected], and lastly M-: C-y RET to evaluate the expression. That’s it. You see the return value [email protected], instead of the expected [email protected], and that makes the issue apparent on the spot.

The thing-at-point function does just what it says on the tin: you specify a “thing” you’re looking for, and the function searches for that thing around your cursor (point). This function has different methods for finding different “things”. For email, it uses regular expression matching. In particular, it uses this regular expression:

(defvar thing-at-point-email-regexp
  "<?[-+_~a-zA-Z0-9/][-+_.~:a-zA-Z0-9/]*@[-a-zA-Z0-9]+[-.a-zA-Z0-9]*>?"
  "A regular expression probably matching an email address.
This does not match the real name portion, only the address, optionally
with angle brackets.")

This is slightly daunting at first look, but actually the regular expression itself is fine. Well, in general, trying to match email addresses with a regular expression is asking for troubles, really. This regular expression ain’t perfectly correct either, in the sense that there are valid email addresses that it fails to match, but it does match the example [email protected] from the bug report. So what gives?

This regular expression did change since Emacs 29.1, as we can quickly see by selecting the above variable definition with C-M-SPC and hitting C-x v h:

Date:   Sun Dec 3 13:49:07 2023 +0100

    Add slashes to 'thing-at-point-email-regexp'

    * lisp/thingatpt.el (thing-at-point-email-regexp): Allow for
    a (thing-at-point 'email) query to match addresses with slashes, as
    used by Sourcehut.  (Bug#67600)

diff --git a/lisp/thingatpt.el b/lisp/thingatpt.el
--- a/lisp/thingatpt.el
+++ b/lisp/thingatpt.el
@@ -654,5 +654,5 @@
 (defvar thing-at-point-email-regexp
-  "<?[-+_~a-zA-Z0-9][-+_.~:a-zA-Z0-9]*@[-a-zA-Z0-9]+[-.a-zA-Z0-9]*>?"
+  "<?[-+_~a-zA-Z0-9/][-+_.~:a-zA-Z0-9/]*@[-a-zA-Z0-9]+[-.a-zA-Z0-9]*>?"
   "A regular expression probably matching an email address.
 This does not match the real name portion, only the address, optionally
 with angle brackets.")
Date:   Wed Feb 15 12:16:11 2023 +0100

    Improve thing-at-point email detection

    * lisp/thingatpt.el (thing-at-point-email-regexp): Allow numbers at
    the start of the user portion, and disallow '.' at the start.  Also
    disallow '.' at the start of the domain portion.
    ...

diff --git a/lisp/thingatpt.el b/lisp/thingatpt.el
--- a/lisp/thingatpt.el
+++ b/lisp/thingatpt.el
@@ -647,5 +647,5 @@
 (defvar thing-at-point-email-regexp
-  "<?[-+_.~a-zA-Z][-+_.~:a-zA-Z0-9]*@[-.a-zA-Z0-9]+>?"
+  "<?[-+_~a-zA-Z0-9][-+_.~:a-zA-Z0-9]*@[-a-zA-Z0-9]+[-.a-zA-Z0-9]*>?"
   "A regular expression probably matching an email address.
 This does not match the real name portion, only the address, optionally
 with angle brackets.")

Neither of these changes should have directly caused thing-at-point-email-regexp to stop matching [email protected], though. The problem is with the specific way in which thing-at-point uses this regular expression, and others. Usually, one applies a regular expression to strings starting from their beginning, and indeed, the following form evaluates to 0, which shows that the example email address matches the regular expression:

(string-match thing-at-point-email-regexp "[email protected]")
=> 0

However, thing-at-point doesn’t know in advance where the “thing” starts. It needs to look for a match somewhere around point. To do that, it uses a helper function called thing-at-point-looking-at. This much I knew more or less before stumbling upon this bug report. So I decided to take a look at thing-at-point-looking-at, and see how it performs the fabled around-point regexp matching. And, what do you know, there it was! Among the trees of parentheses, a twenty six years old bug. Check it out:

(defun thing-at-point-looking-at (regexp &optional distance)
  "Return non-nil if point is in or just after a match for REGEXP.
Set the match data from the earliest such match ending at or after
point.

Optional argument DISTANCE limits search for REGEXP forward and
back from point."
  (save-excursion
    (let ((old-point (point))
          (forward-bound (and distance (+ (point) distance)))
          (backward-bound (and distance (- (point) distance)))
          match prev-pos new-pos)
      (and (looking-at regexp)
           (>= (match-end 0) old-point)
           (setq match (point)))
      ;; Search back repeatedly from end of next match.
      ;; This may fail if next match ends before this match does.
      (re-search-forward regexp forward-bound 'limit)
      (setq prev-pos (point))
      (while (and (setq new-pos (re-search-backward regexp backward-bound t))
                  ;; Avoid inflooping with some regexps, such as "^",
                  ;; matching which never moves point.
                  (< new-pos prev-pos)
                  (or (> (match-beginning 0) old-point)
                      (and (looking-at regexp)  ; Extend match-end past search start
                           (>= (match-end 0) old-point)
                           (setq match (point))))))
      (if (not match) nil
        (goto-char match)
        ;; Back up a char at a time in case search skipped
        ;; intermediate match straddling search start pos.
        (while (and (not (bobp))
                    (progn (backward-char 1) (looking-at regexp))
                    (>= (match-end 0) old-point)
                    (setq match (point))))
        (goto-char match)
        (looking-at regexp)))))

There are some details to unravel here, but before diving in I first did the C-x v h trick once more, to see if anything obvious broke since Emacs 29.1. Alas, thing-at-point-looking-at nearly haven’t changed since its introduction in the summer of 1997:

commit d9cc804bf8f440fa73f49abe7977518554126601
Author: Richard M. Stallman <[email protected]>
Date:   Fri Jul 4 19:59:49 1997 +0000

    ...
    (thing-at-point-looking-at): New function, adapted from old
    browse-url-looking-at.
    ...

diff --git a/lisp/thingatpt.el b/lisp/thingatpt.el
--- a/lisp/thingatpt.el
+++ b/lisp/thingatpt.el
@@ -200,0 +273,27 @@
+(defun thing-at-point-looking-at (regexp)
+  "Return non-nil if point is in or just after a match for REGEXP.
+Set the match data from the earliest such match ending at or after
+point."
+  (save-excursion
+    (let ((old-point (point)) match)
+      (and (looking-at regexp)
+          (>= (match-end 0) old-point)
+          (setq match (point)))
+      ;; Search back repeatedly from end of next match.
+      ;; This may fail if next match ends before this match does.
+      (re-search-forward regexp nil 'limit)
+      (while (and (re-search-backward regexp nil t)
+                (or (> (match-beginning 0) old-point)
+                    (and (looking-at regexp)  ; Extend match-end past search start
+                        (>= (match-end 0) old-point)
+                        (setq match (point))))))
+      (if (not match) nil
+       (goto-char match)
+       ;; Back up a char at a time in case search skipped
+       ;; intermediate match straddling search start pos.
+       (while (and (not (bobp))
+                  (progn (backward-char 1) (looking-at regexp))
+                  (>= (match-end 0) old-point)
+                  (setq match (point))))
+       (goto-char match)
+       (looking-at regexp)))))

I found it funny to think about how while RMS wrote this code, I… Well, I certainly wasn’t doing anything useful back then. Probably eating paste or some such.

Indeed the bug was hiding there all these years, until finally the change from last February—namely, the “disallow ‘.’ at the start” part—brought it to light. It’s a simple matter of orientation: just like the comments suggest, thing-at-point-looking-at would search forward from point, and then start searching backward. But since Emacs doesn’t really have a backward-regexp-matching capability, its re-search-backward works by moving backward and searching forward until it finds a match. That means that backward regexp matching is inherently non-greedy, it returns the shortest match that ends before point. thing-at-point-looking-at ventures to return the earliest match, as its docstring proclaims, so it employs an additional adjustment after calling re-search-backward, where it moves back one character at the time and checks whether it still matches from the new position. And that’s the crux of the issue. Extending the match backward that way, character by character, fails miserably when even one position between the starting point and the beginning of the full match is not a match. This is what happens when thing-at-point-looking-at tried to extend the match backward past the . in my.name, since the new value of thing-at-point-email-regexp in Emacs 30 (correctly) rejects addresses that start with a ..

To fix this issue, I took a stab at shaking up the decades old thing-at-point-looking-at. The idea seemed clear: backward regexp searching is, well, kinda backward. If we just start from the beginning and only search forward, we should find the earliest match without breaking a sweat. This leads to the following implementation:

(defun thing-at-point-looking-at (regexp &optional distance)
  "Return non-nil if point is in or just after a match for REGEXP.
Set the match data from the earliest such match ending at or after
point.

Optional argument DISTANCE limits search for REGEXP forward and
back from point."
  (let* ((old (point))
         (beg (if distance (max (point-min) (- old distance)) (point-min)))
         (end (if distance (min (point-max) (+ old distance))))
         prev match)
    (save-excursion
      (goto-char beg)
      (while (and (setq prev (point)
                        match (re-search-forward regexp end t))
                  (< (match-end 0) old))
        (goto-char (match-beginning 0))
        ;; Avoid inflooping when `regexp' matches the empty string.
        (unless (< prev (point)) (forward-char))))
    (and match (<= (match-beginning 0) old (match-end 0)))))

It even looks simpler, right? Less room for bugs to lurk. So I’ve proposed this change as a solution for Bug#68762, and a few days ago it landed on the Emacs master branch.

I really hope this isn’t too simplistic. All existing tests, and some new ones, still pass, but the tests don’t cover everything, and a lot of code could have come to rely on this function during the years. If you’re running Emacs 30 and see anything strange that you think might be related, please report it via M-x report-emacs-bug. And if you ever need to write some code to match email addresses, my tip for you is to stay away from regular expressions, and to use a proper parser instead.