Histogram representing the number of substitutions, insertions and deletions in sequences

Question

I have two columns: the correct sequence and the predicted sequence. I want to compute statistics on the number of deletions, substitutions and insertions by comparing each correct sequence with its predicted sequence.

I computed the Levenshtein distance to get the number of characters that differ (see the function below), and wrote an error_dist function to get the most common errors (in terms of substitutions):

Here is a sample of my data:

de               de
date            date
pour             pour
etoblissemenls  etablissements
avec           avec
code           code
communications  communications
r               r
seiche          seiche
titre           titre
publiques      publiques
ht             ht
bain           bain
du             du
ets            ets
premier        premier
dans           dans
snupape        soupape
minimum        minimum
blanc          blanc
fr             fr
nos            nos
au             au
bl             bl
consommations   consommations
somme           somme
euro            euro
votre           votre
offre           offre
forestier       forestier
cs              cs
de              de
pour            pour
de              de
paye            r
cette           cette
votre           votre
valeurs         valeurs
des             des
gfda            gfda
tva             tva
pouvoirs        pouvoirs
de              de
revenus         revenus
offre           offre
ht              ht
card            card
noe             noe
montant         montant
r               r
comprises   comprises
quantite    quantite
nature       nature
ticket       ticket
ou           ou
rapide      rapide
de          de
sous        sous
identification  identification
du               du
document      document
suicide      suicide
bretagne     bretagne
tribunal    tribunal
services    services
cif           cif
moyen         moyen
gaec         gaec
total         total
lorsque     lorsque
contact     contact
fermeture   fermeture
la           la
route        route
tva          tva
ia           ia
noyal       noyal
brie        brie
de          de
nanterre    nanterre
charcutier  charcutier
semestre    semestre
de  de
rue rue
le  le
bancaire    bancaire
martigne    martigne
recouvrement    recouvrement
la  la
sainteny    sainteny
de  de
franc   franc
rm  rm
vro vro

Here is my code:

import pandas as pd
import collections
import numpy as np
import matplotlib.pyplot as plt
import distance

def error_dist():
    df = pd.read_csv('data.csv', sep=',')
    df = df.astype(str)
    df = df.replace(['é', 'è', 'È', 'É'], 'e', regex=True)
    df = df.replace(['à', 'â', 'Â'], 'a', regex=True)
    dictionnary = []
    for i in range(len(df)):
        if df.manual_raw_value[i] != df.raw_value[i]:
            text = df.manual_raw_value[i]
            text2 = df.raw_value[i]
            x = len(df.manual_raw_value[i])
            y = len(df.raw_value[i])
            z = min(x, y)
            for t in range(z):
                if text[t] != text2[t]:
                    d = (text[t], text2[t])
                    dictionnary.append(d)
                    #print(dictionnary)

    dictionnary_new = dict(collections.Counter(dictionnary).most_common(25))

    pos = np.arange(len(dictionnary_new.keys()))
    width = 1.0

    ax = plt.axes()
    ax.set_xticks(pos + (width / 2))
    ax.set_xticklabels(dictionnary_new.keys())

    plt.bar(range(len(dictionnary_new)), dictionnary_new.values(), width, color='g')

    plt.show()

And the Levenshtein distance:

def levenstein_dist():
    df = pd.read_csv('data.csv', sep=',')
    df=df.astype(str)
    df['string diff'] = df.apply(lambda x: distance.levenshtein(x['raw_value'], x['manual_raw_value']), axis=1)
    plt.hist(df['string diff'])
    plt.show()

Now I want to make a histogram with three bins: the number of substitutions, the number of insertions and the number of deletions. How can I proceed?

Thank you

Hi @Goyo, by bins I mean the histogram I want to build to present the substitution, deletion and insertion occurrences, not the ones I showed. Hope that's clear — vincent, Aug 01 '17 at 09:30
Do you want to get the number of substitutions, insertions and deletions behind a Levenshtein distance? Something like `f('rain', 'shine') = {2, 1, 0}`? — Yohanes Gultom, Aug 01 '17 at 13:44
@YohanesGultom, yes, that is what I'm looking for. I want to get that for every pair, then sum the number of substitutions over all pairs (and likewise for insertions and deletions), then make a histogram that shows the overall substitution, deletion and insertion counts — vincent, Aug 01 '17 at 13:56
This question gives a thorough explanation of it: https://stackoverflow.com/questions/10638597/minimum-edit-distance-reconstruction. Basically, you can do it by tracing back through the matrix built by the [Wagner-Fischer algorithm](http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm) and counting the operations. You can also reuse the [nltk code](http://www.nltk.org/_modules/nltk/metrics/distance.html) `nltk.metrics.distance.edit_distance` to build the matrix — Yohanes Gultom, Aug 01 '17 at 14:04

score 1 · Answer 1 · answered Aug 01 '17 at 14:14

Thanks to the suggestions of @YohanesGultom, the answer to the problem can be found here:

http://www.nltk.org/_modules/nltk/metrics/distance.html

or

https://gist.github.com/kylebgorman/1081951
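
For reference, here is a minimal sketch of the approach described in the comments: build the Wagner-Fischer distance matrix, then trace it back from the bottom-right corner and count each operation. It is not taken from either link. The names `edit_op_counts` and `total_op_counts` are made up for illustration, and the direction (first string transformed into second string) is an assumption, so swap the arguments if your columns are the other way round. A minimal edit script is not always unique; this sketch prefers the diagonal step (match or substitution), then insertion, then deletion.

```python
def edit_op_counts(source, target):
    """Count the operations in one minimal edit script turning source into target."""
    m, n = len(source), len(target)
    # dist[i][j] is the Levenshtein distance between source[:i] and target[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(1, n + 1):
        dist[0][j] = j  # insert every character of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Trace back from the bottom-right corner to the origin.
    counts = {"substitution": 0, "insertion": 0, "deletion": 0}
    i, j = m, n
    while i > 0 or j > 0:
        differs = i > 0 and j > 0 and source[i - 1] != target[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + int(differs):
            if differs:
                counts["substitution"] += 1
            i -= 1
            j -= 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            counts["insertion"] += 1
            j -= 1
        else:
            counts["deletion"] += 1
            i -= 1
    return counts


def total_op_counts(pairs):
    """Sum the per-pair operation counts over an iterable of (source, target) pairs."""
    totals = {"substitution": 0, "insertion": 0, "deletion": 0}
    for source, target in pairs:
        for op, count in edit_op_counts(source, target).items():
            totals[op] += count
    return totals


# A few rows from the sample data in the question, read as (first column, second column).
sample_pairs = [
    ("etoblissemenls", "etablissements"),
    ("snupape", "soupape"),
    ("paye", "r"),
    ("de", "de"),
]
totals = total_op_counts(sample_pairs)
print(totals)
```

With the DataFrame from the question, the pairs could come from something like `zip(df['raw_value'], df['manual_raw_value'])`, and the three totals can then be drawn with `plt.bar(list(totals.keys()), list(totals.values()))` to get the three-bin chart.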

vincent

