
Around a fifth of the post submissions that I receive contain ridiculous amounts of hidden formatting.

For example, here is some of it from a recent post:

<!--[if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:TrackMoves/>
<w:TrackFormatting/>
<w:PunctuationKerning/>
<w:ValidateAgainstSchemas/>
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:DoNotPromoteQF/>
<w:LidThemeOther>EN-US</w:LidThemeOther>
<w:LidThemeAsian>X-NONE</w:LidThemeAsian>
<w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
<w:Compatibility>
<w:BreakWrappedTables/>
<w:SnapToGridInCell/>
<w:WrapTextWithPunct/>
<w:UseAsianBreakRules/>
<w:DontGrowAutofit/>
<w:SplitPgBreakAndParaMark/>
<w:EnableOpenTypeKerning/>
<w:DontFlipMirrorIndents/>
<w:OverrideTableStyleHps/>
</w:Compatibility>
<m:mathPr>
<m:mathFont m:val="Cambria Math"/>
<m:brkBin m:val="before"/>
<m:brkBinSub m:val="&#45;-"/>
<m:smallFrac m:val="off"/>
<m:dispDef/>
<m:lMargin m:val="0"/>
<m:rMargin m:val="0"/>
<m:defJc m:val="centerGroup"/>
<m:wrapIndent m:val="1440"/>
<m:intLim m:val="subSup"/>
<m:naryLim m:val="undOvr"/>
</m:mathPr></w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
DefSemiHidden="true" DefQFormat="false" DefPriority="99"
LatentStyleCount="267">
<w:LsdException Locked="false" Priority="0" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Normal"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="heading 1"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 2"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 3"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 4"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 5"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 6"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 7"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 8"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 9"/>
<w:LsdException Locked="false" Priority="39" Name="toc 1"/>
<w:LsdException Locked="false" Priority="39" Name="toc 2"/>
<w:LsdException Locked="false" Priority="39" Name="toc 3"/>
<w:LsdException Locked="false" Priority="39" Name="toc 4"/>

It actually runs to 650 lines; view it all here.

Also, random HTML formatting is added to tags like:

<p class="MsoNormal">

Upon further research, it appears that this happens when the author pastes content from MS Word directly into the TinyMCE visual editor. And as detailed:

The bad news isn’t evident until someone attempts to view that page with a different browser and the page is totally misformatted or appears blank. Ironically, this latter scenario happens most often when the page is viewed in Microsoft Internet Explorer [Good!].

A way to solve it may be to use the Paste from Word button.

However, that is not a viable solution when 20% of submissions have this issue. Is there any way to strip this nonsense formatting upon paste?

2 Answers


I am interpreting the question to mean that you already have Word markup in your post and so you need to clean that up via PHP. If so...

  1. You can see the code that cleans up Word content here: http://core.trac.wordpress.org/browser/trunk/src/wp-includes/js/tinymce/plugins/paste/editor_plugin_src.js#L375 That is JavaScript. With some work, you could convert that to PHP.
  2. PHP Tidy, if available, will clean that up.
  3. I believe that HTML Tidy can do it.
  4. strip_tags will just get rid of the code, but unless you pass a list of allowed tags it removes legitimate tags as well. (Tested)
  5. wp_kses will get rid of much of it but will take some tweaking to work well, at least as indicated by my simple test. Maybe with the right arguments it can do what you want.
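
To give a rough idea of what such a cleanup involves, here is a minimal JavaScript sketch. It is not the code from the TinyMCE paste plugin linked above; the function name `stripWordMarkup` and the choice of patterns are my own assumptions, and a regex-based approach like this will miss edge cases that a real HTML parser or Tidy would handle.

```javascript
// Minimal sketch: strip common MS Word residue from an HTML string.
// stripWordMarkup is a hypothetical helper, not part of WordPress or TinyMCE.
function stripWordMarkup(html) {
  return html
    // Remove conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    .replace(/<!--\[if[\s\S]*?<!\[endif\]-->/gi, '')
    // Remove any remaining HTML comments
    .replace(/<!--[\s\S]*?-->/g, '')
    // Remove namespaced Office tags such as <o:p>, </o:p>, <w:View>
    .replace(/<\/?[a-z]+:[^>]*>/gi, '')
    // Remove class attributes whose value starts with "Mso"
    .replace(/\s+class=(["'])Mso[^"']*\1/gi, '')
    // Remove whole style attributes that contain mso-* declarations
    .replace(/\s+style=(["'])[^"']*mso-[^"']*\1/gi, '');
}

const pasted =
  '<!--[if gte mso 9]><xml>\n<w:WordDocument>\n<w:View>Normal</w:View>\n' +
  '</w:WordDocument>\n</xml><![endif]--><p class="MsoNormal">Hello<o:p></o:p></p>';

console.log(stripWordMarkup(pasted)); // prints "<p>Hello</p>"
```

Note that the last rule drops the entire style attribute, not just the mso-* declarations, so any ordinary inline styles on the same element are lost too. The same patterns should translate to PHP's preg_replace fairly directly if you need to clean posts that are already saved.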

Here is a "zero development solution": I would instruct your users that if they paste content from Word, they should paste it in the HTML tab, not the "Visual" tab. They can switch to the Visual tab afterwards. This will only paste the visible text, not its markup.

  • Ah... yes... if only users would do what you tell them :)
    – s_ha_dum
    Commented Sep 16, 2013 at 20:00
