Ultrasonography is the best available tool for the initial work-up of thyroid nodules. Substantial interobserver variability has been documented in the recognition and reporting of some of the lesion characteristics. A number of classification systems have been developed to estimate the likelihood of malignancy: several of them have been endorsed by scientific societies, but their reproducibility is yet to be assessed. We evaluated the interobserver variability of the AACE/ACE/AME, ACR, ATA, EU-TIRADS and K-TIRADS classification systems and the interobserver concordance in the indication to FNA biopsy. Two raters independently evaluated 1055 ultrasound images of thyroid nodules identified in 265 patients at multiple time points, in two separate sets (501 and 554 images). After the first set of nodules, a joint reading was performed to reach a consensus in the feature definitions. The interobserver agreement (Krippendorff alpha) in the first set of nodules was 0.47, 0.49, 0.49, 0.61 and 0.53, for AACE/ACE/AME, ACR, ATA, EU-TIRADS and K-TIRADS systems, respectively. The agreement for the indication to biopsy was substantial to near-perfect, being 0.73, 0.61, 0.75, 0.68 and 0.82, respectively (Cohen’s kappa). For all systems, agreement on the nodules of the second set increased. Despite the wide variability in the description of single ultrasonographic features, the classification systems may improve the interobserver agreement that further ameliorates after a specific training. When selecting nodules to be submitted to FNA biopsy, that is main purpose of these classifications, the interobserver agreement is substantial to almost perfect.