Word composition

A nucleic acid or amino acid sequence can be seen as composed of a
number of possibly overlapping k-mers, or words of length k, for some
k ≥ 1. The k-mer composition of a sequence is given by the number of
times each possible k-mer occurs within the sequence. The
1-mer composition is related to the GC content of a DNA sequence, and
the 2-mer, 3-mer, and 4-mer compositions are also known as the
di-nucleotide, tri-nucleotide, and tetra-nucleotide compositions of a
DNA sequence. For example, the di-nucleotide composition of TATAAT is
given by one occurrence of AA, two occurrences of AT, and two occurrences
of TA.
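The definition above can be sketched in Python as follows. This is only
an illustrative sketch, not a reference solution: the function name
word_composition and the choice of a dictionary as the return type are
assumptions. It slides a window of length k over the sequence, so a
sequence of length n contributes n - k + 1 words.

```python
def word_composition(s, k):
    """Count every (possibly overlapping) k-mer of s.

    Iterative and free of input/output; the name and the dictionary
    return type are assumptions made for this sketch.
    """
    counts = {}
    # A sequence of length n has n - k + 1 windows of length k.
    for i in range(len(s) - k + 1):
        word = s[i:i + k]
        counts[word] = counts.get(word, 0) + 1
    return counts
```

Under these assumptions, word_composition("TATAAT", 2) yields a count of
1 for AA and a count of 2 for each of AT and TA, matching the example
above.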

Write pseudocode, Python code, and C++ code for the word composition
problem. The program must implement and use the word composition
function given in the pseudocode, which must be iterative and must not
perform input/output operations. Make two submissions, including the
pseudocode as a comment in both the Python and the C++ code.

Input

The input is a string s (a genomic sequence) over the alphabet
Σ = {A, C, G, T} and an integer k with 1 ≤ k ≤ |s|.

Output

The output is a sorted list of all the k-mers of s and their
frequencies.
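One way to produce such a listing from a table of counts is sketched
below. The statement does not fix the sort order or the line format, so
two assumptions are made here: k-mers are sorted lexicographically, and
each line holds a k-mer and its count separated by a space. The helper
name format_composition is hypothetical.

```python
def format_composition(counts):
    """Return one "kmer count" line per k-mer, sorted by k-mer.

    Assumptions of this sketch: lexicographic order and a single
    space as separator; the statement specifies neither.
    """
    return [f"{word} {counts[word]}" for word in sorted(counts)]
```

Keeping the formatting separate from the counting leaves the word
composition function free of input/output, as the statement requires;
the main program can then print the returned lines.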

Problem information

Author: Gabriel Valiente

Generation: 2026-01-25T17:28:15.490Z

© Jutge.org, 2006–2026.
https://jutge.org
