{"id":2745,"date":"2022-01-24T18:22:41","date_gmt":"2022-01-24T18:22:41","guid":{"rendered":"https:\/\/www.psyctc.org\/psyctc\/?p=2745"},"modified":"2022-01-26T10:57:19","modified_gmt":"2022-01-26T10:57:19","slug":"why-kappa-or-how-simple-agreement-rates-are-deceptive","status":"publish","type":"post","link":"https:\/\/www.psyctc.org\/psyctc\/2022\/01\/24\/why-kappa-or-how-simple-agreement-rates-are-deceptive\/","title":{"rendered":"Why kappa? or How simple agreement rates are deceptive"},"content":{"rendered":"\n<p class=\"has-small-font-size\">Created 24.i.22<\/p>\n\n\n\n<p>I did a peer review of a paper recently and met an old chestnut: that the inter-rater agreement reported was good because the simple agreement rates were &#8220;good&#8221;.  This is nonsense, as has been pointed out for probably a century, and alternative ways of summarising agreement rates have been around for a long time.  Jacob Cohen invented his &#8220;chance corrected&#8221; &#8220;coefficient of agreement for nominal scales&#8221;, kappa, in 1960 (Cohen, 1960).  That made me think it might be useful to have a blog post here, perhaps actually several, linking with demonstrations of the issues in my &#8220;R SAFAQ&#8221; (Self-Answered Frequently (self) Asked Questions, a.k.a. Rblog).  <\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Background<\/h4>\n\n\n\n<p>The issue is very simple: if the prevalence of the thing that is rated is far from 50:50, then agreement even by chance is going to be high.  Let&#8217;s say two raters are asked to rate a series of photos of facial expressions for the presence of &#8220;quizzically raised eyebrows&#8221;, that only 10% of the photos they are given look even remotely quizzical, and that they are told that the rate is about 10% and use that information. <\/p>\n\n\n\n<p>Now if they have absolutely no agreement, i.e. 
only chance agreement about what constitutes a &#8220;quizzically raised eyebrow&#8221;, they may well still each rate about 10% of the photos as quizzical and about 90% as not quizzical. In that case by chance alone rater B will rate as quizzical 10% of the 10% of photos that rater A rated as quizzical: rate of agreement 10% * 10% = one in a hundred, 1% of all the photos.  However, rater B will rate as not quizzical 90% of the 90% of photos that rater A rated as not quizzical: rate of agreement 90% * 90% = 81%.  So their raw agreement rate is 1% + 81% = 82%, which sounds pretty good until we realise that it arose by pure chance.  Here&#8217;s an aesthetically horrible table of that for n = 100 and the perfect chance level of agreement.  (In real life, sampling vagaries mean it wouldn&#8217;t be quite as neat as this but it wouldn&#8217;t be far off this.)<\/p>\n\n\n\n<div class=\"wpdt-c row wpDataTableContainerSimpleTable wpDataTables wpDataTablesWrapper\n\"\n    >\n        <table id=\"wpdtSimpleTable-10\"\n           style=\"border-collapse:collapse;\n                   border-spacing:0px;\"\n           class=\"wpdtSimpleTable wpDataTable\"\n           data-column=\"4\"\n           data-rows=\"4\"\n           data-wpID=\"10\"\n           data-responsive=\"0\"\n           data-has-header=\"1\">\n\n                    <thead>        <tr class=\"wpdt-cell-row \" >\n                                <th class=\"wpdt-cell wpdt-italic\"\n                                            data-cell-id=\"A1\"\n                    data-col-index=\"0\"\n                    data-row-index=\"0\"\n                    style=\" width:25%;                    padding:10px;\n                    \"\n                    >\n                                        n                    <\/th>\n                                                <th class=\"wpdt-cell \"\n                                            data-cell-id=\"B1\"\n                    data-col-index=\"1\"\n                    data-row-index=\"0\"\n                    style=\" width:25%;                    
padding:10px;\n                    \"\n                    >\n                                        Rated quizzical by rater B                    <\/th>\n                                                <th class=\"wpdt-cell \"\n                                            data-cell-id=\"C1\"\n                    data-col-index=\"2\"\n                    data-row-index=\"0\"\n                    style=\" width:25%;                    padding:10px;\n                    \"\n                    >\n                                        Rated NOT quizzical by rater B                    <\/th>\n                                                <th class=\"wpdt-cell \"\n                                            data-cell-id=\"D1\"\n                    data-col-index=\"3\"\n                    data-row-index=\"0\"\n                    style=\" width:25%;                    padding:10px;\n                    \"\n                    >\n                                        Row totals                    <\/th>\n                                        <\/tr>\n                    <tbody>        <tr class=\"wpdt-cell-row \" >\n                                <td class=\"wpdt-cell wpdt-align-left\"\n                                            data-cell-id=\"A2\"\n                    data-col-index=\"0\"\n                    data-row-index=\"1\"\n                    style=\"                    padding:10px;\n                    \"\n                    >\n                                        Rated quizzical by rater A                    <\/td>\n                                                <td class=\"wpdt-cell \"\n                                            data-cell-id=\"B2\"\n                    data-col-index=\"1\"\n                    data-row-index=\"1\"\n                    style=\"                    padding:10px;\n                    \"\n                    >\n                                        1                    <\/td>\n                                     
           <td class=\"wpdt-cell \"\n                                            data-cell-id=\"C2\"\n                    data-col-index=\"2\"\n                    data-row-index=\"1\"\n                    style=\"                    padding:10px;\n                    \"\n                    >\n                                        9                    <\/td>\n                                                <td class=\"wpdt-cell \"\n                                            data-cell-id=\"D2\"\n                    data-col-index=\"3\"\n                    data-row-index=\"1\"\n                    style=\"                    padding:10px;\n                    \"\n                    >\n                                        10                    <\/td>\n                                        <\/tr>\n                            <tr class=\"wpdt-cell-row \" >\n                                <td class=\"wpdt-cell wpdt-align-left\"\n                                            data-cell-id=\"A3\"\n                    data-col-index=\"0\"\n                    data-row-index=\"2\"\n                    style=\"                    padding:10px;\n                    \"\n                    >\n                                        Rated NOT quizzical by rater A                    <\/td>\n                                                <td class=\"wpdt-cell \"\n                                            data-cell-id=\"B3\"\n                    data-col-index=\"1\"\n                    data-row-index=\"2\"\n                    style=\"                    padding:10px;\n                    \"\n                    >\n                                        9                    <\/td>\n                                                <td class=\"wpdt-cell \"\n                                            data-cell-id=\"C3\"\n                    data-col-index=\"2\"\n                    data-row-index=\"2\"\n                    style=\"                    padding:10px;\n      
              \"\n                    >\n                                        81                    <\/td>\n                                                <td class=\"wpdt-cell \"\n                                            data-cell-id=\"D3\"\n                    data-col-index=\"3\"\n                    data-row-index=\"2\"\n                    style=\"                    padding:10px;\n                    \"\n                    >\n                                        90                    <\/td>\n                                        <\/tr>\n                            <tr class=\"wpdt-cell-row \" >\n                                <td class=\"wpdt-cell \"\n                                            data-cell-id=\"A4\"\n                    data-col-index=\"0\"\n                    data-row-index=\"3\"\n                    style=\"                    padding:10px;\n                    \"\n                    >\n                                        Column totals:                    <\/td>\n                                                <td class=\"wpdt-cell \"\n                                            data-cell-id=\"B4\"\n                    data-col-index=\"1\"\n                    data-row-index=\"3\"\n                    style=\"                    padding:10px;\n                    \"\n                    >\n                                        10                    <\/td>\n                                                <td class=\"wpdt-cell \"\n                                            data-cell-id=\"C4\"\n                    data-col-index=\"2\"\n                    data-row-index=\"3\"\n                    style=\"                    padding:10px;\n                    \"\n                    >\n                                        90                    <\/td>\n                                                <td class=\"wpdt-cell \"\n                                            data-cell-id=\"D4\"\n                    
data-col-index=\"3\"\n                    data-row-index=\"3\"\n                    style=\"                    padding:10px;\n                    \"\n                    >\n                                        100                    <\/td>\n                                        <\/tr>\n                    <\/table>\n<\/div><style id='wpdt-custom-style-10'>\n<\/style>\n<style>\n                    \n                                                                                        \/* table font color *\/\n    .wpdt-c.wpDataTablesWrapper table.wpdtSimpleTable,\n    .wpdt-c .wpDataTablesWrapper table.wpDataTable {\n        font-family: Lucida Sans Unicode, Lucida Grande, sans-serif !important;\n    }\n\n            \/* table font size *\/\n    .wpdt-c.wpDataTablesWrapper table.wpdtSimpleTable,\n    .wpdt-c .wpDataTablesWrapper table.wpDataTable {\n        font-size: 20px !important;\n    }\n\n            \n                <\/style>\n\n\n\n\n<p>That&#8217;s why Cohen invented his kappa as a &#8220;chance corrected&#8221; coefficient of agreement.  It actually covers ratings with any number of categories, not just binary &#8220;quizzical\/not-quizzical&#8221;.  There are arguments that it&#8217;s an imperfect way to handle things but it is easy to compute (look it up: the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Cohen's_kappa\">Wikipedia entry<\/a>, as so often for stats, takes some beating).  
Pretty much any statistics package or system will compute it for you and there are online calculators that will do it too (<a href=\"https:\/\/idostatistics.com\/cohen-kappa-free-calculator\/#calcolobox\">https:\/\/idostatistics.com\/cohen-kappa-free-calculator\/#calcolobox<\/a>, <a href=\"https:\/\/www.statology.org\/cohens-kappa-calculator\/\">https:\/\/www.statology.org\/cohens-kappa-calculator\/<\/a> and <a href=\"https:\/\/labplantvirol.com\/kappa\/online\/calculator.html\">https:\/\/labplantvirol.com\/kappa\/online\/calculator.html<\/a> were the first three that gurgle found for me; the last has some advantages over the first two.)  <\/p>\n\n\n\n<p>The arguments against it are sound but fairly fine print and it&#8217;s orders of magnitude better than raw agreement.  Kappa is (observed agreement - chance agreement) \/ (1 - chance agreement).  In that table the observed agreement is .82 and the agreement expected by chance is also .82, so kappa is (.82 - .82) \/ (1 - .82) = zero, as it should be.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">See it for different rates of the rated quality from R<\/h4>\n\n\n\n<p>This plot illustrates the issue pretty clearly.  The x axis has the prevalence of the quality rated (assuming both raters use that rate). The red line shows that raw agreement does drop to .5, i.e. random, 50\/50 agreement, where the prevalence is 50% but that it rises to near 1, i.e. to near perfect agreement, as prevalence tends to zero or 100%.  
By contrast, and as a sensible agreement index should, kappa remains on or near zero across all prevalence rates.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"647\" src=\"https:\/\/www.psyctc.org\/psyctc\/wp-content\/uploads\/2022\/01\/kappa1_20220126-1024x647.png\" alt=\"\" class=\"wp-image-2748\" srcset=\"https:\/\/www.psyctc.org\/psyctc\/wp-content\/uploads\/2022\/01\/kappa1_20220126-1024x647.png 1024w, https:\/\/www.psyctc.org\/psyctc\/wp-content\/uploads\/2022\/01\/kappa1_20220126-300x190.png 300w, https:\/\/www.psyctc.org\/psyctc\/wp-content\/uploads\/2022\/01\/kappa1_20220126-768x485.png 768w, https:\/\/www.psyctc.org\/psyctc\/wp-content\/uploads\/2022\/01\/kappa1_20220126-1536x971.png 1536w, https:\/\/www.psyctc.org\/psyctc\/wp-content\/uploads\/2022\/01\/kappa1_20220126-2048x1294.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>See my &#8220;Rblog&#8221; or &#8220;R SAFAQ&#8221; <a href=\"https:\/\/www.psyctc.org\/Rblog\/posts\/2022-01-24-chance-corrected-agreement\/\">entry about this<\/a> for more detail and plots.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">References<\/h4>\n\n\n\n<p>Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37\u201346.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Created 24.i.22 I did a peer review of a paper recently and met an old chestnut: that the inter-rater agreement reported was good because the simple agreement rates were &#8220;good&#8221;. This is nonsense, as has been pointed out for probably a century, and alternative ways of summarising agreement rates have been around for a long &hellip; <a href=\"https:\/\/www.psyctc.org\/psyctc\/2022\/01\/24\/why-kappa-or-how-simple-agreement-rates-are-deceptive\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Why kappa? 
or How simple agreement rates are deceptive<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":2752,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-2745","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorised"],"_links":{"self":[{"href":"https:\/\/www.psyctc.org\/psyctc\/wp-json\/wp\/v2\/posts\/2745","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.psyctc.org\/psyctc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.psyctc.org\/psyctc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.psyctc.org\/psyctc\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.psyctc.org\/psyctc\/wp-json\/wp\/v2\/comments?post=2745"}],"version-history":[{"count":4,"href":"https:\/\/www.psyctc.org\/psyctc\/wp-json\/wp\/v2\/posts\/2745\/revisions"}],"predecessor-version":[{"id":2751,"href":"https:\/\/www.psyctc.org\/psyctc\/wp-json\/wp\/v2\/posts\/2745\/revisions\/2751"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.psyctc.org\/psyctc\/wp-json\/wp\/v2\/media\/2752"}],"wp:attachment":[{"href":"https:\/\/www.psyctc.org\/psyctc\/wp-json\/wp\/v2\/media?parent=2745"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.psyctc.org\/psyctc\/wp-json\/wp\/v2\/categories?post=2745"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.psyctc.org\/psyctc\/wp-json\/wp\/v2\/tags?post=2745"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}