Last.fm Datasets: Unlocking Music Recommendations Through Listening History and Social Connections

The article explores the significance of Last.fm datasets in developing music recommendation systems, highlighting their value as benchmarks for modeling implicit feedback, sequential listening behavior, and social influence. It breaks down what’s included in these datasets (such as user listening history, social graphs, and tags) and why they matter for music personalization research. It also walks through how teams can bring these datasets into Shaped to build real-time ranking models, covering schema setup, event ingestion, and optional use of tags or social data, demonstrating how Shaped makes it easy to prototype and productionize music recommenders using this rich, real-world data.

In the realm of recommender systems, understanding user preferences for dynamic content like music requires specialized datasets. The Last.fm datasets are pivotal resources in this area, providing large-scale insights into music listening behavior, user social networks, and community-driven tagging.

These datasets are typically curated and released by researchers (for example, through academic projects or collections hosted by groups such as GroupLens) and are built from data collected or sampled from the Last.fm music platform. They are crucial benchmarks for developing and evaluating music recommendation algorithms, particularly those leveraging implicit feedback signals and social influence.

What is the Last.fm Data?

"Last.fm dataset" typically refers to several different collections derived from the platform over time. They don't usually represent the entirety of Last.fm's data but rather significant snapshots tailored for research. Common components include:

  1. User Listening History: The core data, recording which artists or tracks users have listened to. This is usually the primary source of implicit feedback.
    • user_id, artist_id (or sometimes track_id)
    • A measure of listening frequency (e.g., playcount) or simply binary interaction.
    • Timestamps (timestamp) for listening events (crucial for sequential models).
  2. User Social Network: Anonymized information about friendship links between users on the platform.
    • Pairs of user_ids representing a friendship connection.
  3. User-Applied Tags: Tags (genres, moods, user-defined labels) that users have applied to artists or tracks.
    • user_id, artist_id/track_id, tag (textual tag).
  4. Artist/Track Metadata: Basic information about the music items (though often less detailed than dedicated music metadata datasets like the Million Song Dataset, MSD).
  5. User Profile Information (Limited): Sometimes basic, anonymized user profile data like country or signup date.
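
The relationship between timestamped listening events and per-artist play counts can be sketched in a few lines of pandas. This is a minimal illustration, not the layout of any particular Last.fm release: the column names, the toy values, and the `to_playcounts` helper are all assumptions made for the example.

```python
import pandas as pd

# Toy listening events (hypothetical values; real files hold millions of rows).
events = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1", "u2", "u2"],
        "artist_id": ["a1", "a1", "a2", "a1", "a3"],
        "timestamp": [
            "2009-05-04T23:08:57Z",
            "2009-05-04T13:54:10Z",
            "2009-05-03T15:48:25Z",
            "2009-05-04T12:00:00Z",
            "2009-05-05T09:30:00Z",
        ],
    }
)


def to_playcounts(events: pd.DataFrame) -> pd.DataFrame:
    """Collapse one-row-per-listen events into one row per (user, artist) pair."""
    return (
        events.groupby(["user_id", "artist_id"])
        .size()
        .reset_index(name="playcount")
    )


playcounts = to_playcounts(events)
print(playcounts)
```

Event-level data can always be reduced to counts this way, but not the reverse, which is why versions that ship timestamps are the ones used for sequential modeling.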

Key Characteristics & Popular Versions

Last.fm datasets are characterized by:

  • Domain: Music Listening & Discovery.
  • Primary Signal: Implicit Feedback (listening counts/events). Explicit ratings are generally absent.
  • Social Dimension: Often includes a user friendship graph, enabling social recommendation research.
  • Rich User Tagging: Provides folksonomy data reflecting user perception of music.
  • Temporal Dynamics: Timestamped listening events allow for modeling sequential patterns and user preference evolution.
  • Scale: Varies significantly between versions, from under a hundred thousand user-artist pairs in the smallest releases to tens of millions of listening records in the largest.

Popular Versions:

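  • Last.fm-1K dataset: Contains the timestamped listening history (roughly 19 million plays) of ~1,000 users, plus basic user profiles. A widely used benchmark, especially for sequential recommendation.
  • Last.fm-360K dataset: A much larger dataset of (user, artist, playcount) records for roughly 360,000 users, plus basic user profiles. It does not include timestamps or a social graph.
  • HetRec 2011 Last.fm dataset (~2,000 users): Distributed through GroupLens; includes user-artist listening counts, friend relations, and user tag assignments, making it the usual choice for social and tag-based experiments.
  • Various smaller subsets associated with specific research papers.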

Why is Last.fm Data Important for Recommender Systems?

These datasets are vital for several reasons:

  1. Benchmark for Implicit Feedback Algorithms: As explicit ratings are rare in many real-world systems (especially music streaming), Last.fm provides a standard testbed for algorithms designed for implicit signals (e.g., ALS, BPR, LightGCN).
  2. Standard for Music Recommendation: Serves as a go-to dataset for evaluating algorithms specifically tailored to the nuances of music preference (e.g., discovery, genre exploration).
  3. Sequential Recommendation Research: Timestamped data is ideal for developing models that capture listening sequences and predict the next song/artist (e.g., RNNs, Transformers like SASRec).
  4. Social Recommendation Exploration: The presence of a social graph allows researchers to investigate how friend influence affects listening behavior and recommendations.
  5. Leveraging User-Generated Tags: Provides opportunities to integrate collaborative tagging information into recommendation models, capturing user-defined semantics.
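
To make the sequential use case concrete, here is a minimal sketch of how timestamped events might be turned into per-user sequences with a leave-last-out evaluation split, a common protocol for next-item models. The event tuples, the helper names, and the `min_len` threshold are assumptions for illustration only.

```python
from collections import defaultdict

# (user_id, artist_id, epoch_seconds) tuples; values are made up for illustration.
events = [
    ("u1", "a1", 100),
    ("u1", "a2", 300),
    ("u1", "a3", 200),
    ("u1", "a1", 400),
    ("u2", "a2", 50),
    ("u2", "a1", 60),
]


def build_sequences(events):
    """Group events by user and order each user's plays chronologically."""
    by_user = defaultdict(list)
    for user, artist, _ts in sorted(events, key=lambda e: (e[0], e[2])):
        by_user[user].append(artist)
    return dict(by_user)


def leave_last_out(sequences, min_len=3):
    """Hold out each user's final play as the next-item prediction target."""
    train, test = {}, {}
    for user, seq in sequences.items():
        if len(seq) < min_len:
            train[user] = seq  # too short to evaluate; keep for training only
            continue
        train[user], test[user] = seq[:-1], seq[-1]
    return train, test


sequences = build_sequences(events)
train, test = leave_last_out(sequences)
print(train, test)
```

A sequential model would be trained on `train` and scored on whether it ranks each user's held-out item in `test` near the top.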

Strengths of Last.fm Datasets

  • Real-World Implicit Data: Based on actual user listening behavior.
  • Music Domain Focus: Specifically suited for music recommendation challenges.
  • Sequential Information: Timestamps enable modeling user preference evolution and session dynamics.
  • Social Graph Inclusion (often): Facilitates research into social influence.
  • Rich Tag Data: Offers user-generated semantic information about music.
  • Established Benchmarks: Widely used, allowing for comparison across studies.

Weaknesses & Considerations

  • Implicit Feedback Ambiguity: High play counts strongly suggest preference, but low counts or missing interactions do not necessarily mean dislike (the user may simply not have discovered the artist, or may have niche taste). This requires careful modeling and negative sampling.
  • Data Sparsity: Users listen to only a fraction of available music.
  • Cold-Start Problem: Recommending music to new users or suggesting newly released tracks remains challenging.
  • Potential Biases: Popularity bias is significant; data may reflect specific demographics or periods of Last.fm usage.
  • Static Snapshots: Represent data from a specific time; don't capture the absolute latest trends or catalog changes.
  • Metadata Variability: The richness of artist/track metadata can vary between dataset versions.
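
The ambiguity and sparsity points above can be illustrated with a short sketch. It measures matrix density, applies a linear confidence weight of the form 1 + alpha * plays (a weighting commonly used in implicit-feedback matrix factorization), and samples unobserved artists as negatives. The toy data, the `alpha` value, and the helper names are assumptions; uniform negative sampling is only one of several possible strategies.

```python
import random

# Toy playcounts: user -> {artist: plays}. Values are made up for illustration.
playcounts = {
    "u1": {"a1": 120, "a2": 3},
    "u2": {"a1": 1, "a3": 45},
    "u3": {"a2": 7},
}
all_artists = sorted({a for plays in playcounts.values() for a in plays})


def density(playcounts, n_items):
    """Fraction of the user x artist matrix that has any observed play."""
    observed = sum(len(plays) for plays in playcounts.values())
    return observed / (len(playcounts) * n_items)


def confidence(plays, alpha=0.5):
    """Linear confidence weight: more plays means more trust in the signal."""
    return 1.0 + alpha * plays


def sample_negatives(user, k, rng):
    """Uniformly sample artists the user has never played.

    These are treated as weak negatives, not as confirmed dislikes.
    """
    unseen = [a for a in all_artists if a not in playcounts[user]]
    return rng.sample(unseen, min(k, len(unseen)))


rng = random.Random(0)
print(density(playcounts, len(all_artists)))
print(confidence(playcounts["u1"]["a1"]), confidence(playcounts["u2"]["a1"]))
print(sample_negatives("u1", 2, rng))
```

On real Last.fm data the density is far lower than in this toy matrix, which is why the choice of negative-sampling strategy tends to matter.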

Common Use Cases & Applications

  • Developing and evaluating implicit feedback collaborative filtering algorithms.
  • Building sequential music recommenders to predict next plays or session continuations.
  • Implementing social recommendation models incorporating friend listening patterns.
  • Creating tag-based recommenders or hybrid models using tags.
  • Analyzing music listening patterns, artist popularity dynamics, and genre trends.
  • Researching music discovery and serendipity in recommendations.
  • Evaluating hybrid models combining collaborative, sequential, social, and tag information.
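
As one illustration of the social use case, the sketch below scores artists a user has not yet heard by summing log-damped play counts across that user's friends. It is a deliberately simple baseline under assumed toy data, not a method taken from any specific paper; the `friends` and `playcounts` structures and the `friend_scores` helper are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical friendship graph and playcounts for illustration.
friends = {"u1": {"u2", "u3"}, "u2": {"u1"}, "u3": {"u1"}}
playcounts = {
    "u1": {"a1": 10},
    "u2": {"a1": 5, "a2": 20},
    "u3": {"a2": 2, "a3": 8},
}


def friend_scores(user, top_n=5):
    """Rank unheard artists by friends' listening, damped with log1p."""
    heard = playcounts.get(user, {})
    scores = defaultdict(float)
    for friend in friends.get(user, ()):
        for artist, plays in playcounts.get(friend, {}).items():
            if artist not in heard:
                scores[artist] += math.log1p(plays)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


print(friend_scores("u1"))
```

The log damping keeps a single heavy listener from dominating the ranking; a hybrid model would typically blend a score like this with a collaborative filtering score.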

How to Access Last.fm Datasets

Several popular versions are available from academic or data-sharing platforms:

  • GroupLens Datasets (University of Minnesota): Often hosts or links to datasets used in their research, potentially including versions of Last.fm data.
  • Konect (University of Koblenz-Landau): May host network-focused datasets, including the Last.fm social graph.
  • Zenodo / Figshare: Researchers often upload specific dataset versions used in their papers to these repositories.
  • Direct links from relevant research papers: The paper introducing a specific version usually provides access details.

Important: Always check the specific license and terms of use associated with any dataset version before downloading or using it. Citation requirements are common.

Connecting Last.fm Data to Shaped

Shaped is well-suited for modeling the implicit, sequential, and potentially social data found in Last.fm datasets. Connecting this data involves mapping the listening history and optionally incorporating social or tag information to build powerful music recommendation models. Let's assume you have acquired a Last.fm dataset file (e.g., containing user-artist listening events with timestamps).

1. Dataset Preparation: Load your Last.fm data file. Common formats include TSV or CSV. Identify the key columns and map them to Shaped's requirements:

  • user_id -> user_id
  • artist_id (or track_id) -> item_id
  • timestamp -> created_at (Ensure this is converted to Unix epoch seconds or milliseconds).
  • Optional: playcount or other interaction metrics can be kept as event features.

prepare_lastfm_data.py

import pandas as pd

data_dir = "path/to/lastfm/data"
listening_file = f"{data_dir}/user_artist_data.tsv"

# Load TSV with expected columns: user_id, item_id, playcount, created_at (epoch)
listen_df = pd.read_csv(
    listening_file,
    sep='\t',
    names=['user_id', 'item_id', 'playcount', 'created_at'],
    header=0  # replace the file's header row; use header=None if there is none
)

# Select relevant columns for Shaped
shaped_listen_df = listen_df[['user_id', 'item_id', 'created_at', 'playcount']]

# Write one JSON object per line (JSONL) for the upload step
prepared_file_path = f'{data_dir}/shaped_ready_lastfm_listens.jsonl'
shaped_listen_df.to_json(prepared_file_path, orient='records', lines=True)

print(f"Last.fm listening data prepared at: {prepared_file_path}")

# Additional: prepare social or tag datasets if needed
# social_df = pd.read_csv(...)  -> save as social_graph.jsonl
# tag_df = pd.read_csv(...)     -> save as artist_tags.jsonl
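
If your copy of the data stores timestamps as ISO 8601 strings rather than epoch values, the conversion mentioned in the column mapping could look like the sketch below. The sample rows and the `to_epoch_seconds` helper are assumptions for illustration; adjust the parsing to the timestamp format your dataset version actually uses.

```python
import pandas as pd

# Hypothetical rows with ISO 8601 timestamps in UTC.
raw = pd.DataFrame(
    {
        "user_id": ["u1", "u1"],
        "item_id": ["a1", "a2"],
        "timestamp": ["2009-05-04T23:08:57Z", "2009-05-04T13:54:10Z"],
    }
)


def to_epoch_seconds(ts: pd.Series) -> pd.Series:
    """Parse timestamp strings as UTC and return whole seconds since 1970-01-01."""
    parsed = pd.to_datetime(ts, utc=True)
    epoch = pd.Timestamp("1970-01-01", tz="UTC")
    return (parsed - epoch) // pd.Timedelta(seconds=1)


raw["created_at"] = to_epoch_seconds(raw["timestamp"])
print(raw[["user_id", "item_id", "created_at"]])
```

Dividing the time difference by a one-second `Timedelta` avoids depending on the internal resolution of the datetime column.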

2. Create Shaped Dataset using URI: Use the create-dataset-from-uri command to upload the prepared listening history data. Repeat for social graph or tag data if prepared separately.

upload_lastfm_datasets.sh

shaped create-dataset-from-uri --name lastfm_listens \
                                --path path/to/lastfm/data/shaped_ready_lastfm_listens.jsonl \
                                --type jsonl

# Optionally upload social graph (if prepared)
# shaped create-dataset-from-uri --name lastfm_social \
#                                --path path/to/lastfm/data/social_graph.jsonl \
#                                --type jsonl

# Optionally upload tag data (if prepared)
# shaped create-dataset-from-uri --name lastfm_artist_tags \
#                                --path path/to/lastfm/data/artist_tags.jsonl \
#                                --type jsonl

3. Create Shaped Model: Define the model schema (.yaml), connecting the listening data and potentially other datasets like tags or social connections.

prepare_lastfm_model.py

import yaml
import os

dir_path = "lastfm_assets"  # Create if needed
os.makedirs(dir_path, exist_ok=True)

lastfm_music_model_schema = {
    "model": {
        "name": "lastfm_music_recommendations"
    },
    # Each connector exposes an uploaded dataset under a short alias
    "connectors": [
        {
            "type": "Dataset",
            "id": "lastfm_listens",
            "name": "listens"
        }
        # ,{ "type": "Dataset", "id": "lastfm_artist_tags", "name": "tags" }
        # ,{ "type": "Dataset", "id": "lastfm_social", "name": "social" }
    ],
    # The events query selects the interaction columns from the "listens" alias
    "fetch": {
        "events": """
            SELECT
                user_id,
                item_id,
                created_at,
                playcount
            FROM listens
        """
        # "items": """ SELECT artist_id AS item_id, LISTAGG(DISTINCT tag, ',') AS artist_tags FROM tags GROUP BY artist_id """
    }
}

# Write the schema to YAML, keeping the key order defined above
with open(f'{dir_path}/lastfm_music_model_schema.yaml', 'w') as file:
    yaml.dump(lastfm_music_model_schema, file, sort_keys=False)

Create the model using the CLI:

create-model.sh

shaped create-model --file lastfm_assets/lastfm_music_model_schema.yaml

Shaped will then process the listening history, implicitly learning user and artist/track representations. Including features like playcount can add weight to interactions, and incorporating tag or social data via additional connectors and fetch queries can further enrich the model for more nuanced music recommendations.
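
If you prefer to aggregate tags before upload rather than in the fetch query, a pandas equivalent of the commented-out `items` query might look like the sketch below. The `tag_df` rows and the `aggregate_tags` helper are assumptions for illustration; the output column names mirror the `item_id` and `artist_tags` aliases used in the schema comment.

```python
import pandas as pd

# Hypothetical tag assignments: several users may apply the same tag to an artist.
tag_df = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3", "u1"],
        "artist_id": ["a1", "a1", "a1", "a2"],
        "tag": ["rock", "indie", "rock", "jazz"],
    }
)


def aggregate_tags(tag_df: pd.DataFrame) -> pd.DataFrame:
    """Build one row per artist with its distinct tags joined by commas."""
    return (
        tag_df.drop_duplicates(["artist_id", "tag"])
        .sort_values(["artist_id", "tag"])
        .groupby("artist_id")["tag"]
        .agg(",".join)
        .reset_index()
        .rename(columns={"artist_id": "item_id", "tag": "artist_tags"})
    )


items_df = aggregate_tags(tag_df)
print(items_df)
```

The resulting frame can be written to JSONL with the same `to_json(..., orient='records', lines=True)` call used for the listening data.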

Conclusion: An Essential Resource for Music & Implicit Recommendations

The Last.fm datasets are foundational resources for advancing music recommender systems. Their strength lies in providing large-scale, real-world implicit feedback data (listening history), often augmented with valuable social network information and user-generated tags. They serve as critical benchmarks for evaluating algorithms designed for implicit signals, sequential user behavior, and social influence within the dynamic music domain. While requiring careful handling due to the nature of implicit data and potential biases, Last.fm datasets remain indispensable for researchers and practitioners pushing the boundaries of personalized music discovery.

Request a demo of Shaped today to see it in action with your specific use case. Or, start exploring immediately with our free trial sandbox.
