Monday, July 4, 2016

Use Git to Version Control DOCX Files for Windows / Cygwin

Git is a wonderful tool for version control of texts and documents such as source code. Unfortunately, git cannot by default produce a meaningful diff for Microsoft Word (.docx) files. In this post, I will go over how to make git diff work with docx files on Cygwin. This post is based on this.

Let's first see what happens when we try to use git on docx files. Download the Cygwin installer that matches your system, x86 for 32-bit or x64 for 64-bit, from the Cygwin download page.

Run the installer. Go through the installation with the default settings until you reach the Select Packages section. There, search for and install unzip (Archive), git (Devel), and vim (Editors). I will assume the default Cygwin installation directory, c:\cygwin. If you installed the 64-bit version, the default directory is c:\cygwin64 instead.

Let's run Cygwin. You should see a bash terminal. Say your docx files are saved in the c:\Users\unixnme\Documents\docx folder. To change into this folder, run
$ cd /cygdrive/c/Users/unixnme/Documents/docx
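The /cygdrive/c prefix is how Cygwin exposes the C: drive by default. Cygwin ships a cygpath utility that performs this conversion (cygpath -u). The function below is only a hand-written sketch of the same mapping for simple drive-letter paths, and the name win_to_cygdrive is my own, not something provided by Cygwin.

```shell
#!/bin/bash
# Hypothetical helper: map a Windows path such as c:\Users\me
# to the Cygwin form /cygdrive/c/Users/me.
# Only handles plain drive-letter paths (no UNC paths, no relative paths).
win_to_cygdrive() {
    # First character is the drive letter; lower-case it.
    wtc_drive=$(printf '%s' "$1" | cut -c1 | tr 'A-Z' 'a-z')
    # Skip the "c:" part and turn backslashes into forward slashes.
    wtc_rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
    printf '/cygdrive/%s%s\n' "$wtc_drive" "$wtc_rest"
}
```

For example, cd "$(win_to_cygdrive 'c:\Users\unixnme\Documents\docx')" should land in the same folder as the cd command above. Keep the Windows path in single quotes so that bash leaves the backslashes alone.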

Let's assume that there is a docx file called test.docx in the folder.
$ ls

Let's initialize a git repository and commit the test.docx file.
$ git init
Initialized empty Git repository in /cygdrive/c/Users/unixnme/Documents/docx/.git/
$ git add test.docx
$ git config --global user.name "your_name"
$ git config --global user.email "your_email"
$ git commit -m "initial commit"
[master (root-commit) 90cf9f2] initial commit
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100755 test.docx

Next, make any change to test.docx, save it, and run the git diff command:
$ git diff
diff --git a/test.docx b/test.docx
index 41c6200..ce81ceb 100755
Binary files a/test.docx and b/test.docx differ

As expected, git diff only reports that test.docx is a binary file and that it has changed. Now, let us enable git diff for docx files.
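Before changing anything, it may help to see why git treats the file as binary: a .docx file is a zip archive that contains XML files, so its raw bytes are compressed data rather than text. Here is a small sketch that checks for the zip signature. The function name looks_like_zip is my own invention for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the file starts with the zip signature.
# Zip archives, and therefore .docx files, begin with the two bytes "PK".
looks_like_zip() {
    [ "$(head -c 2 "$1")" = "PK" ]
}
```

With the unzip package installed earlier, looks_like_zip test.docx && unzip -l test.docx should list the archive members, including word/document.xml, which holds the main text.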

First, we need to install the docx2txt utility. Download the docx2txt-1.4.tgz file from here to your download folder, say c:\Users\unixnme\Downloads. Extract the files by running the tar command
$ cd /cygdrive/c/Users/unixnme/Downloads
$ tar vxzf docx2txt-1.4.tgz

Go into the extracted directory and set the execute flag on the Windows installer batch file.
$ cd docx2txt-1.4
$ chmod u+x WInstall.bat

We are ready to install it. Enter the installation folder path and Perl path as follows:
$ ./WInstall.bat
Welcome to command line installer for docx2txt.

Where should the docx2txt tool be installed? Specify the location
without surrounding quotes.

Installation Folder :c:\cygwin\bin

Please specify fully qualified paths to utilities when requested.
Perl.exe is required for docx2txt tool as well as for this installation.

Path to Perl.exe : c:\cygwin\bin\perl.exe

Continuing with simple installation ....

Copying script files to "c:\cygwin\bin" ....

Please adjust perl, unzip and cakecmd paths (as needed) in
"c:\cygwin\bin\docx2txt.bat" and "c:\cygwin\bin\docx2txt.config"

Note that I simply installed docx2txt in c:\cygwin\bin, but you may choose a different folder. Cygwin's perl executable lives in its bin directory, which in this case is c:\cygwin\bin\perl.exe. If you installed the 64-bit version of Cygwin, replace cygwin with cygwin64 in these paths. If the installer asks for the unzip and cakecmd paths, simply leave them blank, as they are not necessary.

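Next, we need to create the /usr/bin/docx2txt file by
$ vim /usr/bin/docx2txt

with the following content, which runs the installed docx2txt.pl script on the file given as the first argument and writes the extracted text to standard output (the trailing -).
#!/bin/bash
docx2txt.pl "$1" -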

Make sure to set the executable flag
$ chmod u+x /usr/bin/docx2txt
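Git calls a textconv program with the path of the file to convert as its only argument and reads the converted text from standard output. If you prefer a wrapper that fails with a clear message instead of a Perl error, here is a slightly more defensive sketch. The function name docx_textconv and the DOCX2TXT override variable are my own additions, not part of docx2txt or git. I am assuming docx2txt.pl is on the PATH, as it is when installed into c:\cygwin\bin.

```shell
#!/bin/bash
# Sketch of a defensive textconv wrapper.
# DOCX2TXT is a hypothetical override; by default docx2txt.pl is used.
docx_textconv() {
    # Git passes exactly one argument: the file to convert.
    if [ "$#" -ne 1 ]; then
        echo "usage: docx_textconv FILE.docx" >&2
        return 2
    fi
    if [ ! -r "$1" ]; then
        echo "docx_textconv: cannot read $1" >&2
        return 1
    fi
    # "-" as the output name makes docx2txt.pl print to standard output.
    "${DOCX2TXT:-docx2txt.pl}" "$1" -
}
```

To use this as the /usr/bin/docx2txt script itself, end the file with a line that calls docx_textconv "$@".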

Next, go back to the git repository folder and create a .gitattributes file that assigns a diff driver named word to docx files
$ cd /cygdrive/c/Users/unixnme/Documents/docx
$ echo "*.docx diff=word" > .gitattributes

Finally, edit the git config so that the word driver converts files with the /usr/bin/docx2txt script
$ git config diff.word.textconv docx2txt
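If the diff still shows up as binary, it is worth checking that both pieces are wired together. The sketch below repeats the two steps in a throwaway repository and asks git what it sees, using git check-attr and git config --get. The scratch directory and variable names are only for illustration.

```shell
#!/bin/bash
# Sketch: reproduce the .gitattributes and config wiring in a scratch repository.
repo=$(mktemp -d)                 # throwaway directory, not your real repository
cd "$repo" || exit 1
git init -q
echo "*.docx diff=word" > .gitattributes
git config diff.word.textconv docx2txt

# Which diff driver does git assign to test.docx? (the file need not exist)
attr_line=$(git check-attr diff -- test.docx)
# Which command is registered for that driver?
textconv_cmd=$(git config --get diff.word.textconv)

echo "$attr_line"
echo "$textconv_cmd"
```

In your real repository, only the git check-attr and git config --get lines are needed. The first should name word as the diff attribute for test.docx, and the second should print docx2txt.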

If everything worked, you should see a diff for the modified docx file, just as for a regular plain text file
$ git diff
diff --git a/test.docx b/test.docx
index 41c6200..ce81ceb 100755
--- a/test.docx
+++ b/test.docx
@@ -1 +1 @@
-This is a test file
+This is a test file, which I have modified after commit.

Enjoy git on your Microsoft Word documents!