Fixing a Twenty Six Years Old Emacs Bug
They got it backward those days…
A couple of weeks ago, an interesting bug report landed on the Emacs bug tracker:
Subject: bug#68762: 30.0.50; thing-at-point for an e-mail adress stops at “.”
When I use (thing-at-point ’email) and point is on an e-mail address like [email protected], thing-at-point only gets [email protected] and loses the first part before the “.” character.
This happens after I upgraded from 29.1 to emacs 30.0.50.
The reason this is a great bug report is that it’s easy to reproduce
right there in the buffer showing the bug report itself. Namely,
since I was reading this inside Emacs, this was as simple as placing
the cursor right before where it says (thing-at-point 'email)
,
hitting C-M-k
to “kill” (copy) the whole expression, then C-s name
to place the cursor in the middle of [email protected]
, and lastly
M-: C-y RET
to evaluate the expression. That’s it. You see the
return value [email protected]
, instead of the expected
[email protected]
, and that makes the issue apparent on the spot.
The thing-at-point
function does just what it says on the tin: you
specify a “thing” you’re looking for, and the function searches for
that thing around your cursor (point). This function has different
methods for finding different “things”. For email
, it uses regular
expression matching. In particular, it uses this regular expression:
(defvar thing-at-point-email-regexp "<?[-+_~a-zA-Z0-9/][-+_.~:a-zA-Z0-9/]*@[-a-zA-Z0-9]+[-.a-zA-Z0-9]*>?" "A regular expression probably matching an email address. This does not match the real name portion, only the address, optionally with angle brackets.")
This is slightly daunting at first look, but actually the regular
expression itself is fine. Well, in general, trying to match email
addresses with a regular expression is asking for troubles, really.
This regular expression ain’t perfectly correct either, in the sense
that there are valid email addresses that it fails to match, but it
does match the example [email protected]
from the bug report.
So what gives?
This regular expression did change since Emacs 29.1, as we can quickly
see by selecting the above variable definition with C-M-SPC
and
hitting C-x v h
:
Date: Sun Dec 3 13:49:07 2023 +0100 Add slashes to 'thing-at-point-email-regexp' * lisp/thingatpt.el (thing-at-point-email-regexp): Allow for a (thing-at-point 'email) query to match addresses with slashes, as used by Sourcehut. (Bug#67600) diff --git a/lisp/thingatpt.el b/lisp/thingatpt.el --- a/lisp/thingatpt.el +++ b/lisp/thingatpt.el @@ -654,5 +654,5 @@ (defvar thing-at-point-email-regexp - "<?[-+_~a-zA-Z0-9][-+_.~:a-zA-Z0-9]*@[-a-zA-Z0-9]+[-.a-zA-Z0-9]*>?" + "<?[-+_~a-zA-Z0-9/][-+_.~:a-zA-Z0-9/]*@[-a-zA-Z0-9]+[-.a-zA-Z0-9]*>?" "A regular expression probably matching an email address. This does not match the real name portion, only the address, optionally with angle brackets.")
Date: Wed Feb 15 12:16:11 2023 +0100 Improve thing-at-point email detection * lisp/thingatpt.el (thing-at-point-email-regexp): Allow numbers at the start of the user portion, and disallow '.' at the start. Also disallow '.' at the start of the domain portion. ... diff --git a/lisp/thingatpt.el b/lisp/thingatpt.el --- a/lisp/thingatpt.el +++ b/lisp/thingatpt.el @@ -647,5 +647,5 @@ (defvar thing-at-point-email-regexp - "<?[-+_.~a-zA-Z][-+_.~:a-zA-Z0-9]*@[-.a-zA-Z0-9]+>?" + "<?[-+_~a-zA-Z0-9][-+_.~:a-zA-Z0-9]*@[-a-zA-Z0-9]+[-.a-zA-Z0-9]*>?" "A regular expression probably matching an email address. This does not match the real name portion, only the address, optionally with angle brackets.")
Neither of these changes should have directly caused
thing-at-point-email-regexp
to stop matching [email protected]
,
though. The problem is with the specific way in which
thing-at-point
uses this regular expression, and others. Usually,
one applies a regular expression to strings starting from their
beginning, and indeed, the following form evaluates to 0
, which
shows that the example email address matches the regular expression:
(string-match thing-at-point-email-regexp "[email protected]") => 0
However, thing-at-point
doesn’t know in advance where the “thing”
starts. It needs to look for a match somewhere around point. To do
that, it uses a helper function called thing-at-point-looking-at
.
This much I knew more or less before stumbling upon this bug report.
So I decided to take a look at thing-at-point-looking-at
, and see
how it performs the fabled around-point regexp matching. And, what do
you know, there it was! Among the trees of parentheses, a twenty six
years old bug. Check it out:
(defun thing-at-point-looking-at (regexp &optional distance) "Return non-nil if point is in or just after a match for REGEXP. Set the match data from the earliest such match ending at or after point. Optional argument DISTANCE limits search for REGEXP forward and back from point." (save-excursion (let ((old-point (point)) (forward-bound (and distance (+ (point) distance))) (backward-bound (and distance (- (point) distance))) match prev-pos new-pos) (and (looking-at regexp) (>= (match-end 0) old-point) (setq match (point))) ;; Search back repeatedly from end of next match. ;; This may fail if next match ends before this match does. (re-search-forward regexp forward-bound 'limit) (setq prev-pos (point)) (while (and (setq new-pos (re-search-backward regexp backward-bound t)) ;; Avoid inflooping with some regexps, such as "^", ;; matching which never moves point. (< new-pos prev-pos) (or (> (match-beginning 0) old-point) (and (looking-at regexp) ; Extend match-end past search start (>= (match-end 0) old-point) (setq match (point)))))) (if (not match) nil (goto-char match) ;; Back up a char at a time in case search skipped ;; intermediate match straddling search start pos. (while (and (not (bobp)) (progn (backward-char 1) (looking-at regexp)) (>= (match-end 0) old-point) (setq match (point)))) (goto-char match) (looking-at regexp)))))
There are some details to unravel here, but before diving in I first
did the C-x v h
trick once more, to see if anything obvious broke
since Emacs 29.1. Alas, thing-at-point-looking-at
nearly haven’t
changed since its introduction in the summer of 1997:
commit d9cc804bf8f440fa73f49abe7977518554126601 Author: Richard M. Stallman <[email protected]> Date: Fri Jul 4 19:59:49 1997 +0000 ... (thing-at-point-looking-at): New function, adapted from old browse-url-looking-at. ... diff --git a/lisp/thingatpt.el b/lisp/thingatpt.el --- a/lisp/thingatpt.el +++ b/lisp/thingatpt.el @@ -200,0 +273,27 @@ +(defun thing-at-point-looking-at (regexp) + "Return non-nil if point is in or just after a match for REGEXP. +Set the match data from the earliest such match ending at or after +point." + (save-excursion + (let ((old-point (point)) match) + (and (looking-at regexp) + (>= (match-end 0) old-point) + (setq match (point))) + ;; Search back repeatedly from end of next match. + ;; This may fail if next match ends before this match does. + (re-search-forward regexp nil 'limit) + (while (and (re-search-backward regexp nil t) + (or (> (match-beginning 0) old-point) + (and (looking-at regexp) ; Extend match-end past search start + (>= (match-end 0) old-point) + (setq match (point)))))) + (if (not match) nil + (goto-char match) + ;; Back up a char at a time in case search skipped + ;; intermediate match straddling search start pos. + (while (and (not (bobp)) + (progn (backward-char 1) (looking-at regexp)) + (>= (match-end 0) old-point) + (setq match (point)))) + (goto-char match) + (looking-at regexp)))))
I found it funny to think about how while RMS wrote this code, I… Well, I certainly wasn’t doing anything useful back then. Probably eating paste or some such.
Indeed the bug was hiding there all these years, until finally the
change from last February—namely, the “disallow ‘.’ at the start”
part—brought it to light. It’s a simple matter of orientation: just
like the comments suggest, thing-at-point-looking-at
would search
forward from point, and then start searching backward. But since
Emacs doesn’t really have a backward-regexp-matching capability, its
re-search-backward
works by moving backward and searching
forward until it finds a match. That means that backward regexp
matching is inherently non-greedy, it returns the shortest match that
ends before point. thing-at-point-looking-at
ventures to return the
earliest match, as its docstring proclaims, so it employs an
additional adjustment after calling re-search-backward
, where it
moves back one character at the time and checks whether it still
matches from the new position. And that’s the crux of the issue.
Extending the match backward that way, character by character, fails
miserably when even one position between the starting point and the
beginning of the full match is not a match. This is what happens
when thing-at-point-looking-at
tried to extend the match backward
past the .
in my.name
, since the new value of
thing-at-point-email-regexp
in Emacs 30 (correctly) rejects
addresses that start with a .
.
To fix this issue, I took a stab at shaking up the decades old
thing-at-point-looking-at
. The idea seemed clear: backward regexp
searching is, well, kinda backward. If we just start from the
beginning and only search forward, we should find the earliest match
without breaking a sweat. This leads to the following implementation:
(defun thing-at-point-looking-at (regexp &optional distance) "Return non-nil if point is in or just after a match for REGEXP. Set the match data from the earliest such match ending at or after point. Optional argument DISTANCE limits search for REGEXP forward and back from point." (let* ((old (point)) (beg (if distance (max (point-min) (- old distance)) (point-min))) (end (if distance (min (point-max) (+ old distance)))) prev match) (save-excursion (goto-char beg) (while (and (setq prev (point) match (re-search-forward regexp end t)) (< (match-end 0) old)) (goto-char (match-beginning 0)) ;; Avoid inflooping when `regexp' matches the empty string. (unless (< prev (point)) (forward-char)))) (and match (<= (match-beginning 0) old (match-end 0)))))
It even looks simpler, right? Less room for bugs to lurk. So I’ve proposed this change as a solution for Bug#68762, and a few days ago it landed on the Emacs master branch.
I really hope this isn’t too simplistic. All existing tests, and
some new ones, still pass, but the tests don’t cover everything, and a
lot of code could have come to rely on this function during the years.
If you’re running Emacs 30 and see anything strange that you think
might be related, please report it via M-x report-emacs-bug
. And if
you ever need to write some code to match email addresses, my tip for
you is to stay away from regular expressions, and to use a proper
parser instead.