WYO: Write your own diff tool
A diff tool generates an output given two input files. This output can be thought of as a script which when applied to first input returns the second input. Diff tool is nothing but a straight forward implementation for finding edit distance at its core. In addition to finding minimum edit distance we also find actions which when applied first input would lead to second input. These actions constitute diff tools output. Lets first define some classes which denote these actions
trait EditAction
case class Delete(line:Int) extends EditAction
case class Insert(line:Int) extends EditAction
case class Copy(line1:Int,line2:Int) extends EditAction
Above we declare three forms of edit actions which can be performed on first input to arrive at second input. Delete action contains line number of the line which is deleted from first input. Insert action contains line number from second input which is inserted into first. Copy action contains two line numbers. First is the line number from first input which is copied and second is line number in second input where it is copied. Although, there would be many ways to do this but we would like a sequence of these action with minimum cost. Now, we assume that cost of Delete and Insert is more that Copy action since in case of Copy line in first and second input is same and we did not make a change. And since while finding minimum cost only relative cost matters we can assign cost 1 to Delete and Insert Actions and cost of 0 to Copy action. Below is cost function for these edit actions
def cost(action:EditAction) = {
case _:Delete => 1
case _:Insert => 1
case _:Copy => 0
}
Now, lets code up the actual diff method which takes line numbers in first and second input at which starts calculating the diff. The actual diff of two file can be found by invoking this function at position 0,0. This method also assumes that lines from first input are available in lines1 and lines from second input are available in lines2 variables.
def diff(pos1:Int,pos2:Int):List[EditAction]={
// all lines processed, return empty list
if(pos1>=lines1.length && pos2>=lines2.length)
List[EditAction]()
// all input 1 lines processed, insert lines from input 2
else if(pos1 >= lines1.length)
Insert(pos2)::diff(pos1,pos2+1)
// all input 2 lines processed, delete lines from input 1
else if(pos2 >= lines2.length)
Delete(pos1)::diff(pos1+1,pos2)
else if(lines1(pos1) == lines2(pos2))
// both lines from input1,input2 are same, we can copy from
// input 1 to input 2
Copy(pos1,pos2)::diff(pos1+1,pos2+1)
else {
val diff1 = Insert(pos2)::diff(pos1,pos2+1)
val diff2 = Delete(pos1)::diff(pos1+1,pos2)
val diff1Cost = (diff1 map cost).sum
val diff2Cost = (diff2 map cost).sum
if(diff1Cost < diff2Cost) diff1
else diff2
}
}
Above is a naive implementation of our diff algorithm where in first if we just return empty list of EditAction since pos1 and pos2 are both equal to or greater than number of lines in files, which means that we have already processed all lines. In second if we see that we have already processed all lines in first file so we have no option but to insert lines from second file. Third if handles the case when we have processed all lines in second file, in this case we have no option but to delete remaining lines from first input. Next we handle case when current line in input 1 and input 2 is same, in this case it would be best to simply copy this line from file one to file two. Next we handle more complex case by calculating cost of both performing Insert and Delete and then comparing the diff cost and returning the diff with minimum cost. Now lets also add code to read input files and call diff method.
object DiffTool extends App {
val lines1 = io.Source.fromFile(args(0)).getLines.toArray
val lines2 = io.Source.fromFile(args(1)).getLines.toArray
trait EditAction
case class Delete(line:Int) extends EditAction
case class Insert(line:Int) extends EditAction
case class Copy(line1:Int,line2:Int) extends EditAction
def cost(action:EditAction)= action match {
case _:Delete => 1
case _:Insert => 1
case _:Copy => 0
}
def diff(pos1:Int,pos2:Int):List[EditAction]={
if(pos1>=lines1.length && pos2>=lines2.length)
List[EditAction]()
else if(pos1 >= lines1.length)
Insert(pos2)::diff(pos1,pos2+1)
else if(pos2 >= lines2.length)
Delete(pos1)::diff(pos1+1,pos2)
else if(lines1(pos1) == lines2(pos2))
Copy(pos1,pos2)::diff(pos1+1,pos2+1)
else {
val diff1 = Insert(pos2)::diff(pos1,pos2+1)
val diff2 = Delete(pos1)::diff(pos1+1,pos2)
val diff1Cost = (diff1 map cost).sum
val diff2Cost = (diff2 map cost).sum
if(diff1Cost < diff2Cost) diff1
else diff2
}
}
println(diff(0,0))
}
When running above program two files, the program literally prints a list of actions to be performed to convert input one into input two. Now this list of actions can be easily translated to either text patch format or html diff or html split view diff. Now, let us add a method to convert these actions to html.
def toHtml(diffActions:List[EditAction]) = {
<html>
<body>
<table border="0" cellspacing="0" >
{(diffActions map {
case Delete(line)=>{<tr style="background-color:pink;"><td>{line+", -"}</td><td>{lines1(line)}</td></tr>}
case Insert(line)=>{<tr style="background-color:lightgreen;"><td>{"- ,"+line}</td><td>{lines2(line)}</td></tr>}
case Copy(line1,line2)=>{<tr><td>{line1+","+line2}</td><td>{lines1(line1)}</td></tr>}
})}
</table></body></html>
}
println(toHtml(diff(0,0)))
Although, this is conceptually complete implementation for generating HTML diff of two files, this implementation is usable only for creating diffs of small files. For large files this implementation is likely to fail with Stack Overflow Exception, since the definition is recursive. One way to fix this is to have an implementation which is tail recursive, such implementations are converted to iteration by Scala compiler. Following is such an implementation of diff routine. We also define a type alias DiffScript for List[EditAction] type
Tail Recursive Diff implementation
// type alias
type DiffScript = List[EditAction]
// an empty diff script
val emptyDiffScript = List[EditAction]()
// an empty list of diff scripts
val empty = List(emptyDiffScript)
// return diff script with minimum cost
def minDiff(a:DiffScript,b:DiffScript)={
val aCost = (a map cost).sum
val bCost = (b map cost).sum
if(aCost < bCost) a
else b
}
def diff():DiffScript={
val emptyList = Seq.fill(lines2.length+1)(emptyDiffScript).toList
diff(0,0,emptyList)
}
/**
* @param pos1 current line being processed in file1
* @param pos2 current line being processed in file2
* @param last last calculated DiffScripts
* @param curr list of DiffScripts being created
* @return final optimal DiffScript
*/
@tailrec
def diff(pos1:Int,pos2:Int,last:List[DiffScript],curr:List[DiffScript]=empty):DiffScript={
if(pos1 >= lines1.length)
last.reverse.head.reverse
else if(pos2 < lines2.length){
val newCurrent = if(lines1(pos1) == lines2(pos2)){
Copy(pos1,pos2)::last.head
}else {
val diff1 = Insert(pos2)::current.head
val diff2 = Delete(pos1)::last.tail.head
minDiff(diff1,diff2)
}
diff(pos1,pos2+1,last.tail,newCurrent::curr)
} else
diff(pos1+1,0,curr.reverse)
}
Above diff method is tail recursive implementation of iterative lcs implementation. Instead of calculating diff(a,b) in terms of diff(a-1,b),diff(a,b-1) and diff(a-1,b-1), it sequentially calculates diff(a,0),diff(a,1)….diff(a,lines2.length) and stores it in last variable. It then uses this list to calculate diff(a+1,0),diff(a+1,1)…diff(a+1,lines2.length). Proceeding this way it calculates diff(lines1.length,lines2.length) which is the final result.