Converting Word documents to HTML

word.jpg
Eeek! Who needs all this extra code?

While I write my own copy for this blog, and some of my other sites, much of what I post on the Web is written by others. This material comes to me in a variety of formats, from Open Office to .pdf files, but most of it is in Microsoft Word, and all of it needs to be converted to HTML. There are a variety of ways to do this, but I'll just review three—two common approaches and my preferred method.

Three approaches to converting Word documents
  • Open the file in Word and save as HTML. This is not recommended. When you do this, Microsoft Word adds all sorts of extra coding, much of which is not what you originally intended. I tried that with this entry just to see how weird it would be and it turned out to be 450 lines long (as opposed to 82 with my HTML).
  • Copy and paste to Dreamweaver. Reliability varies. Copy the text from your Word file, open an existing HTML file in Dreamweaver, save it with a new name, select the text you wish to replace, switch to Design view, and paste in your content. Switch to Code view to see how it worked. If your Word document was perfectly formatted, this may turn out fine. If the author did a lot of editing, you may find mysterious characters, extra spaces, or the wrong codes (for items such as headers). If there are only a few, you can delete them. If there are a lot, you may want to start over using the next method.
  • Use Word's Find and Replace feature to substitute HTML for Word formatting. This is what I usually do. The following instructions will show you how.
Clean up any odd or special characters
  • Open your Word file
  • Find and replace & with &amp;
  • Look for any other special characters such as trademarks, umlauts, em dashes or percent signs and replace with HTML character or text as appropriate. Charts to look up characters are available at http://www.webstandards.org/learn/reference/charts/entities/.
  • Find and replace ’ with ' and ” with " to remove curly apostrophes and curly quotes (if appropriate). Curly apostrophes and quotes are typographically correct and can be replaced by special characters, but straight quotes work more consistently in some situations, such as HTML e-mail.
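
For anyone who would rather script this cleanup than run each Find and Replace by hand, here is a minimal Python sketch of the same steps. The function name and both lookup tables are my own illustration, not part of the Word procedure above. It also assumes you want opening curly quotes straightened along with closing ones, and it covers only a handful of entities.

```python
# Curly punctuation mapped to straight equivalents (the "if appropriate" step).
# Assumption: opening quotes are straightened as well as closing ones.
STRAIGHT_QUOTES = {
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote / apostrophe
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
}

# A few other special characters mapped to named HTML entities.
# Illustrative only; extend this from an entity chart as needed.
ENTITIES = {
    "\u2014": "&mdash;",  # em dash
    "\u2122": "&trade;",  # trademark sign
    "\u00fc": "&uuml;",   # u with umlaut
}


def clean_special_characters(text: str) -> str:
    """Escape ampersands, straighten quotes, and swap in HTML entities."""
    # Ampersands go first so the entities added below are not escaped again.
    text = text.replace("&", "&amp;")
    for curly, straight in STRAIGHT_QUOTES.items():
        text = text.replace(curly, straight)
    for char, entity in ENTITIES.items():
        text = text.replace(char, entity)
    return text
```

The order matters: if the ampersand replacement ran last, it would turn every entity you had just inserted into broken text such as &amp;mdash;.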
Add coding for bold and italic

Put <strong> immediately before each bold entity and </strong> after and <em> before each italic entity and </em> after. I usually color these red so that I can easily see if I've closed any tags that I opened.

Add HTML paragraph formatting
  • Find and replace paragraph marks (^p) with </p>^p<p>.
  • Move the extra <p> from the end of the last paragraph to the beginning of the first paragraph.
  • If necessary replace </p> <p> with blank space.
  • Replace manual line breaks (^l) with <br /> ^l.
  • Manually change p> to h3>, h5> or the appropriate code for heads and subheads.
  • Replace p> with li> for any bulleted text. Add <ul> before and </ul> after the bulleted sections.
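
The paragraph and list steps above can also be sketched in Python, working on text that has already been copied out of Word as plain text. The function names are mine. I am assuming a newline stands in for Word's paragraph mark (^p) and a vertical tab character for the manual line break (^l), so check what your own export actually contains.

```python
PARAGRAPH_MARK = "\n"  # stands in for Word's ^p
LINE_BREAK = "\x0b"    # assumption: manual line breaks (^l) arrive as vertical tabs


def wrap_paragraphs(text: str) -> str:
    """Wrap each non-empty paragraph in <p> tags and convert manual line breaks."""
    html_paragraphs = []
    for paragraph in text.split(PARAGRAPH_MARK):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip blank lines rather than emit an empty <p></p>
        paragraph = paragraph.replace(LINE_BREAK, "<br />\n")
        html_paragraphs.append(f"<p>{paragraph}</p>")
    return "\n".join(html_paragraphs)


def wrap_list(items: list[str]) -> str:
    """Wrap bulleted text in <li> tags, with <ul> before and </ul> after."""
    lines = ["<ul>"] + [f"<li>{item.strip()}</li>" for item in items] + ["</ul>"]
    return "\n".join(lines)
```

Because each paragraph is wrapped as a whole, there is no stray <p> left over to move from the end to the beginning. Heads and subheads still need to be changed by hand, since plain text carries no clue as to which lines they are.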
Save file then open an existing HTML file (from your site) in Dreamweaver
  • In Code view, save the Dreamweaver file with a new name (thus creating a new file).
  • Copy and paste the coded text from your Word file to replace the main text in your HTML file.
Add links
  • In Dreamweaver, select the text you would like to link, copy the URL to which it will link, then paste this into the link box in the properties panel. In the case of e-mail links you need to add mailto: to the beginning of the address (instead of http://).
  • When a sentence ends with a link, check to make sure that it is followed by a period. The period should come immediately after the </a> without any space preceding it.
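
If you are hand-coding links instead of using the properties panel, the two rules above look roughly like this in Python. Both function names are hypothetical, and the test for "this is an e-mail address" is a simple guess based on the @ sign, so treat it as a sketch and not a robust parser.

```python
import re


def make_link(text: str, target: str) -> str:
    """Build an <a> tag, adding mailto: when the target looks like an e-mail address."""
    looks_like_email = "@" in target and "://" not in target
    if looks_like_email and not target.startswith("mailto:"):
        target = "mailto:" + target
    return f'<a href="{target}">{text}</a>'


def tighten_link_punctuation(markup: str) -> str:
    """Remove any whitespace between a closing </a> and the period that follows it."""
    return re.sub(r"</a>\s+\.", "</a>.", markup)
```

For example, make_link("Write to me", "someone@example.com") yields a mailto: link, while a target that already begins with http:// is passed through unchanged.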

Now give your code a quick review; if it looks clean, post it to the Web. View the page in your Web browser, then validate it using the W3C Markup Validation Service to find any errors you may have missed. If everything checks out, you're done!

7 Comments
  1. Great information. I had one more suggestion that seems to work pretty well for me (and it is fast). I upload the Word document, or email it to myself, to my Google Docs area. From there, I can view the document in HTML. View source and you have a pretty clean version of HTML to copy and paste. Bill Charleston web site design

    Comment by Bill Nixon — April 26, 2007 @7:57 am

  2. I just drop my Word Doc in Notepad, kill the formatting, then I drop it into Dreamweaver and go from there.

    Comment by George Morris — May 4, 2007 @10:07 pm

  3. George, dropping a Word Doc into Notepad doesn't get rid of the curly quotes.

    Comment by ted baxter — August 1, 2007 @4:38 am

  4. Good Morning, When I try to copy and paste HTML codes into a Web document the code will not convert to a button etc. I also get the following error message: FTP Folder Error An error occurred copying a file to the FTP server. Make sure you have permission to put files on this server. Details The process cannot access the file because it is being used by another process. What can I do to remedy this problem? Peace, Carl, http://www.psychezpublishing.com

    Comment by Carl — November 7, 2007 @7:47 am

  5. Great recommendations. I use a similar method to George with great success.

    Comment by Bryan Boettiger — November 9, 2007 @5:20 pm

  6. Thanks for this nice post, really useful to all of us. I just bookmarked this post in my Digg profile; hope you will post more updates soon. I really liked your blog! Regards, Shaza

    Comment by Crossing Arbogast — July 9, 2009 @2:21 am

  7. Heidi, I do what George Morris (poster above) does as well. I can say I used to try your first suggestion (Open the file in Word and save as HTML) and it was always a bloody mess! These days there's also a lot of free tools online that do this for you. textfixer.com is one that I've used. -Declan Owner of NightWave Sleep Assistant

    Comment by Declan — June 4, 2011 @6:05 pm
