Make WordPress Core

Opened 6 years ago

Last modified 5 weeks ago

#44958 assigned defect (bug)

&nbsp character in title generates a permalink (and slug) with space

Reported by: ace2_heart's profile ace2_heart Owned by:
Milestone: 6.7 Priority: normal
Severity: normal Version: 4.8
Component: Permalinks Keywords: needs-testing has-patch has-testing-info has-screenshots needs-unit-tests
Focuses: Cc:

Description

If a title has an &nbsp character copy pasted or if the permalink has a &nbsp copy pasted
It will generate a permalink and slug with an space breaking some elements on the site

I generated an empty wordpress install without any plugins to test it, and it has the same problem

Attachments (3)

5aeaf99417f517d3b1c9bb120a3713a8.gif (144.2 KB) - added by audrasjb 6 years ago.
Works for me
44958.diff (825 bytes) - added by Presskopp 13 months ago.
oopsi.gif (1.6 MB) - added by Presskopp 11 months ago.

Download all attachments as: .zip

Change History (42)

#1 follow-up: @audrasjb
6 years ago

  • Keywords reporter-feedback added

Hi @ace2_heart thanks for the ticket and welcome to WordPress Trac,

I can't reproduce the issue on a fresh install (see screen capture above). I tested it in the permalink edition and in the title field and both are correctly escaping the non-breaking space. Can you provide us some further information to reproduce the bug please?

Cheers,

Jb

#2 @SergeyBiryukov
6 years ago

Hi @ace2_heart, thanks for the ticket!

  is supposed to be converted to a hyphen by sanitize_title_with_dashes().

Are you sure the issue still happens on a clean install with all plugins disabled and a default theme (Twenty Seventeen) activated?

Last edited 6 years ago by SergeyBiryukov (previous) (diff)

#3 @ace2_heart
6 years ago

Hi, I created an empty install here, you can access it

  • user: miguelesquirol
  • password: miguelesquirol

https://bug-testing-miguelesquirol.c9users.io/wp-ad- min

These are the pages created

The parmalink are as following:

/e -securite/
/ -test/
Last edited 6 years ago by ace2_heart (previous) (diff)

#4 in reply to: ↑ 1 @ace2_heart
6 years ago

Replying to audrasjb:

I just saw the demo you posted, instead of using the   text on the title, tried to copy/paste the actual space character.

e -securite

Hi @ace2_heart thanks for the ticket and welcome to WordPress Trac,

I can't reproduce the issue on a fresh install (see screen capture above). I tested it in the permalink edition and in the title field and both are correctly escaping the non-breaking space. Can you provide us some further information to reproduce the bug please?

Cheers,

Jb

#5 follow-up: @audrasjb
6 years ago

Ok thanks for your answer.

So the issue happens with control marks like %e2%80%. Indeed, I'm able to reproduce this specific issue.

By the way, you are running WP 4.8.7 which is quite an old version. But the issue also happens in 4.9.8.

Thanks,

Jb

#6 in reply to: ↑ 5 @ace2_heart
6 years ago

So it's a really a bug? It's my first time submitting one.

Replying to audrasjb:

Ok thanks for your answer.

So the issue happens with control marks like %e2%80%. Indeed, I'm able to reproduce this specific issue.

By the way, you are running WP 4.8.7 which is quite an old version. But the issue also happens in 4.9.8.

Thanks,

Jb

Last edited 6 years ago by ace2_heart (previous) (diff)

#7 @audrasjb
6 years ago

I'd say it's more like an edge case, but yep, thank you for your ticket :)

#8 @tsewlliw
6 years ago

I'm hoping to submit a patch for this ticket in the next couple of days, I have seen it pop up in other places with other characters with a clear case for being sanitized out of URIs, such as the unicode line separator.

#11 @Presskopp
13 months ago

  • Keywords has-patch added

Here's a patch to get around any %e2%80 constructions like:

%e2%80
%e2%80test
%e2%80%
%e2%80%93
%e2%80%94
%e2%80%943
%e2%80%3
%e2%80%94slug
my%e2%80%94slug
my%e28094slug
my%e2%80%A8slug
...

While I think the regex itself is quite robust, this needs testing to fit well into the sanitize_title_with_dashes function, but I hope it helps on the way

@Presskopp
13 months ago

#12 @audrasjb
13 months ago

  • Keywords has-testing-info added; reporter-feedback removed
  • Milestone changed from Awaiting Review to 6.4

Thanks for the patch! Moving for 6.4 consideration.

This ticket was mentioned in Slack in #core-test by oglekler. View the logs.


11 months ago

#14 @zunaid321
11 months ago

Reproduction Report

This report validates that the issue cannot be reproduced.

Environment

  • OS: Windows 11 (22H2)
  • Web Server: nginx/1.25.1
  • PHP: 7.4.33
  • WordPress: 6.4-alpha-56267-src
  • Browser: Chrome Version 113.0.5672.126 (Official Build) (64-bit)
  • Theme: Twenty Twenty-Three

Actual Results

  • ❌ Was not able to reproduce the issue. Even after modifying the title with a space, the slug seems to be okay.

Supplemental Artifacts

Image: https://prnt.sc/VZnSqmEl8IHA

#15 @iammehedi1
11 months ago

Test Report

This report validates that the indicated issue can not be reproduced

Environment

  • OS: Windows 11 Pro (22H2)
  • Web Server: nginx/1.25.1
  • PHP: 7.4.33
  • WordPress: 6.4-alpha-56267-src
  • Browser: Chrome Version 116.0.5845.98 (Official Build) (64-bit)
  • Theme: Twenty Twenty-Three

Actual Results

  • ❌ Was not able to reproduce the issue. Tried to edit the permalinks as guided, found nothing.

Supplemental Artifacts

https://i.ibb.co/cCCCk6d/image.png

#16 follow-up: @Presskopp
11 months ago

it can be reproduced, but not with a space alone, see comment No 5

Last edited 11 months ago by Presskopp (previous) (diff)

#17 in reply to: ↑ 16 @zunaid321
11 months ago

I did check that comment but didn't quite understand how to test it out. Can you please clarify how you have tested it?

Replying to Presskopp:

it can be reproduced, but not with a space alone, see comment No 5

@Presskopp
11 months ago

#18 @Presskopp
11 months ago

  • Keywords has-screenshots added

This ticket was mentioned in Slack in #core by zunaid321. View the logs.


11 months ago

#20 @huzaifaalmesbah
11 months ago

Test Report

I Test 44958.diff

Environment

OS: macOS m1
WordPress 6.4-alpha-56267-src
PHP 7.4.33
nginx/1.25.2
MySQL 5.7.43
Browser: Chrome 116.0.5845.140
Theme: Twenty Twenty v2.3
Active Plugins: No plugins activated.

Before Applying Patch

https://i.ibb.co/98stRpF/Screenshot-2023-09-16-at-12-26-45-AM.png

After Applying Patch

https://i.ibb.co/Ks2Ckxr/Screenshot-2023-09-16-at-12-27-11-AM.png

#21 @oglekler
10 months ago

I checked all cases that are mentioned in the comment#11 and the patch is working as expected, but I cannot evaluate that these are all possible cases. This patch is preventing invisible symbols to be shown in the URL, and the most accurate way to test it would have been to copy-paste text from somewhere with this invisible symbols inside while having the whole list of these symbols.

This ticket was mentioned in Slack in #core by nicolefurlan. View the logs.


10 months ago

#24 @dmsnell
10 months ago

the regex is replacing %e280 and then even more sequences with an optional percent sign but none of those are valid URL-encodings. do we have any examples of any system producing those outputs? At a minimum that should be interpreted/rendered as �80 but it would be valid to simply ignore decoding / leave it as %e280.

given that %e2 by itself is invalid UTF-8/invalid URL encoding, it's a bit easier to discuss here, but I'd be worried about consuming too much input because it looks like it shouldn't be there and then further corrupting data that's otherwise fine. in other words, if we don't have any examples of systems producing output like %e280 then it might be a good call for us to limit unintentional breakage by only replacing %e2%80 and successive sequences that all carry the percent-sign.

#25 @nicolefurlan
10 months ago

  • Milestone changed from 6.4 to 6.5

This ticket was discussed in today's bug scrub.

We're going to bump this to 6.5.

Per #comment:21, it would be great if we could also find more test cases so that we can be more confident in this fix.

Props to @hellofromTonya :)

#26 @dmsnell
10 months ago

After some additional testing I was unable to reproduce through the block editor, but when attempting to paste some of these characters into the title field, another bug appeared, which I'll file in the Gutenberg repository (the contents of the title duplicate).

With the Classic Editor plugin I was able to reproduce, and I assume that API calls and manual calls from within PHP would have similarly exhibited the broken behavior.

In deeper review of this patch I do think it's worth taking out the "also %e280 to be on the safe side," though on a second thought it's not unrealistic to keep it in there, but for the different reason that "%e280" only appears when something else is already broken. This is more about the question of whether we want to make a bad situation potentially worse or leave it only as bad as it is; I generally find that leaving broken things the way they are leads to less confusion and corruption.

---

If we want to explore the broader issue for 6.5 I will volunteer my time to look into it, as the function is already fairly ad-hoc and this particular ticket seems part of the broader problem that we're hitting specific cases instead of taking a systematic or formal approach in handling the slug creation.

I've started exploring a PCRE-based approach in https://github.com/WordPress/wordpress-develop/pull/5466 that relies on Unicode character classes. If that approach works, it could simplify the function while boosting its reliability and maintainability. There are thankfully a fair number of unit tests and there's a fair bit of work to do there, but it seems tractable that we could collect a lot of currently-scattered logic and cover many bases in one go.

This ticket was mentioned in Slack in #core by chaion07. View the logs.


6 months ago

#28 @swissspidy
6 months ago

  • Milestone changed from 6.5 to 6.6
  • Severity changed from major to normal

This ticket was mentioned in Slack in #core by oglekler. View the logs.


7 weeks ago

#30 @oglekler
7 weeks ago

Possibly Unit tests needs to be checked if they are covering all possible cases and then last patch can be tested as the way forward.

#31 @hmbashar
7 weeks ago

Test Report

Patch tested 44958.diff

Environment

  • WordPress: 6.6-alpha-57778-src
  • PHP: 8.3.7
  • Server: nginx/1.25.4
  • Database: mysqli (Server: 8.3.0 / Client: mysqlnd 8.3.7)
  • Browser: Chrome 125.0.0.0
  • OS: macOS
  • Theme: Twenty Fifteen 3.7
  • MU Plugins: None activated
  • Plugins:
    • Classic Editor 1.6.3
    • FakerPress 0.6.6
    • Test Reports 1.1.0

Actual Results

  1. ✅ Issue resolved with patch.

Before Patch

https://i.ibb.co/rcWz2Fp/Before-Patch.png

After Patch

https://i.ibb.co/DW5hBkD/After-Patch.png

This ticket was mentioned in PR #5466 on WordPress/wordpress-develop by @dmsnell.


7 weeks ago
#32

Trac ticket: Core-44958

From time to time new issues arise with the sanitization of a post title for slug creation. The existing algorithm builds takes a hit-and-miss approach of handling specific cases of known string elements that cause problems and replacing them with normalized characters.

In this patch a given title is first converted into a normalized form and then processed with a Unicode-aware PCRE pattern which formalizes the _kinds_ of replacements which are supposed to occur. For example, instead of removing "%c2%ab" the code now removes "invisible characters" as defined by Unicode itself.

This update, if it works without breaking existing dependencies, poses a more comprehensive solution to the problem of slug generation, one that updates with advancements to the Unicode specification provided by system libraries and PHP itself instead of through custom WordPress code.

Trac ticket:

This ticket was mentioned in Slack in #core by nhrrob. View the logs.


6 weeks ago

#34 @tremidkhar
6 weeks ago

We need to have a unit test of all the possible use cases as Olga suggested.

#35 @nhrrob
6 weeks ago

  • Keywords needs-unit-tests added

#36 @dmsnell
5 weeks ago

Today I took some time to expand the work in #5466. The new slugify() function is the important part to examine. Here are some thoughts:

  • converting all forms of allowable input escaping into Unicode code points does indeed collapse much of the logic and eliminate by default cases where something could escape the checks, for example, because it's a numeric character reference.
  • this is another potentially great use case for the WP_Token_Map for rewriting given characters in the title.
  • slugify() runs in one single pass through the input string and its main loop represents all of the core logic at each decoded cahracter. to my head this is clearer than jumping between multiple full passes and string-replaces

There's a lot to consider, but I think this has promise, and the changes to the HTML API in WordPress 6.6 are going to make it much easier to do the conversion. Looks like we could improve Core's UTF-8 functions though.

This ticket was mentioned in Slack in #core by nhrrob. View the logs.


5 weeks ago

#38 @hellofromTonya
5 weeks ago

  • Milestone changed from 6.6 to 6.7

This is ticket is for a bug report. 44958.diff is a bugfix.

However, the current direction with this PR is an enhancement which may also resolve the bug.

In discussing during today's bug scrub, @davidbaumwald suggested creating a new ticket for the enhancement and moving PR 5466 to that new ticket:

Right, but that's a separate task. "Bug: Fix this thing with simple code change" versus "ENH: Refactor this code to be better even though simpler code fixed original issue"

All that to say, I think current ticket is OK as-is, but maybe add a comment to encourage a new ticket for the ENH part

@dmsnell want to open a new Trac ticket for the enhancement scope of work?

Last edited 5 weeks ago by hellofromTonya (previous) (diff)

#39 @dmsnell
5 weeks ago

@hellofromTonya we can open a new ticket for 6.7 if we need to, but I don't understand why we are calling PR #5466 an enhancement instead of a bug fix.

The smaller patch here proposes fixing one very specific incarnation of a broader bug, and unfortunately at the current time introduces more breakage (and so I think it would be best to ensure that if we merge it, we fix that remaining issue before doing so).

The goal of the alternative patch in the PR is to comprehensively solve the category of defects that this reported issue is a manifestation of. It's goal has been to maintain the same interface and intent of the existing code, though I created sluggify() the new function mostly to make it possible to examine the older and newer code at the same time during review.

So I don't really care what goes on, but if we create a new ticket I feel like it should still be a bug fix, since the core issue reported here is that the slug generation does not adequately parse the inbound content and therefore all of the replacement logic is fragile and prone to only work in very specific cases. Do we classify a patch as a bug fix if it resolves 1 bug, but an enhancement if it resolves 1000?

Note: See TracTickets for help on using tickets.