First steps to text mining


Date
Apr 17, 2024 8:30 AM — 1:00 PM
Event
DH-IGNITE Northern Region

WHAT TO EXPECT

In this workshop we will look at some fundamental skills that will get you started on your journey with text mining. To kick off, we will learn how to tell the computer what to search for. We will start with simple search operations and explore their limitations. After that, we will look at more complex search operations. We will also introduce the first data wrangling steps, for example converting text data into other formats for further processing. The workshop is accessible to humanities and social sciences students and researchers with no prior exposure to programming. We will not be covering any advanced text mining strategies or tools. The skills learned will also be applicable in other aspects of research, e.g. literature reviews.

AIM OF WORKSHOP

Most people know how to perform simple searches in text using a computer. However, search (as well as search and replace) operations allow for quite complex applications. In this workshop, we will look at how to get from relatively simple search operations to more complex search and search and replace operations, which can be used, for instance, for initial text analysis.

After following this workshop, you will be able to perform simple and more complex search operations in texts. You will be able to automatically identify patterns in text. You will also be able to perform search and replace operations that allow for simple data wrangling (in other words, to convert text into data formats for further processing).
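As a rough illustration of the kind of pattern search and search and replace described above, here is a minimal Python sketch. The sample text, the "Name (year)" layout, and the variable names are made up for this example and are not taken from the workshop materials; the workshop itself may use different tools.

```python
import re

# Hypothetical sample text: one "Name (year)" entry per line.
text = """Jane Austen (1775)
Charles Dickens (1812)
Olive Schreiner (1855)"""

# Pattern search: find every standalone four-digit number in the text.
years = re.findall(r"\b\d{4}\b", text)

# Search and replace: rewrite each "Name (year)" line as "Name,year",
# a comma-separated format that a spreadsheet can open.
csv_text = re.sub(r"^(.+) \((\d{4})\)$", r"\1,\2", text, flags=re.MULTILINE)

print(years)
print(csv_text)
```

The replace step keeps the two captured groups (the name and the year) and changes only the punctuation around them, which is the basic idea behind converting free text into a data format.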

WORKSHOP FORMAT

The workshop will combine practical examples with a bit of the theory underlying search. The main focus will be on practical examples and exercises, which will be tackled with the whole group, in smaller groups, and individually. As the background knowledge of the participants may vary, we will tackle the different topics slowly, so that everybody can follow.

PREREQUISITES

Participants should be familiar with word processing software (like Microsoft Word or Google Docs) or text editors (like Notepad, Notepad++, Emacs, or Vim).

PARTICIPANTS WHO WOULD BE INTERESTED IN EVENT

The workshop will be relevant to anybody who regularly works with text on a computer. In particular, people who want to identify specific parts of a text, such as pronouns (relevant to gender studies) or diminutives (relevant to linguistics), will benefit from this workshop.
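To make the pronoun example concrete, the following Python sketch contrasts a simple search with a pattern-based one. The sentence and the short pronoun list are invented for illustration only, and the list is far from complete.

```python
import re

# Hypothetical example sentence.
sentence = "She said that he gave the other book to her."

# A simple substring search also counts "he" inside other words
# such as "She", "the", "other", and "her".
simple_hits = sentence.lower().count("he")

# A regular expression with word boundaries (\b) matches whole words only.
pronoun_pattern = r"\b(?:he|she|her|him|his|hers)\b"
pronouns = re.findall(pronoun_pattern, sentence, flags=re.IGNORECASE)

print(simple_hits)
print(pronouns)
```

The simple search reports five hits, although only one of them is the pronoun "he"; the word-boundary pattern returns just the three pronouns in the sentence.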

REQUIREMENTS

  • A laptop
  • Internet connectivity
  • An installed web browser, such as Google Chrome

TOPICS THAT WILL BE INCLUDED

  • Simple search
  • Limitations of simple search
  • More powerful search options
  • Regular expressions
  • Search and replace
  • Finite-state machines
  • (Several exercises will be tackled in between)
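Regular expressions and finite-state machines appear together in the list above because a regular expression can be recognised by a finite-state machine. The Python sketch below is only an assumed illustration of that link, not workshop material: the state names and the toy pattern (ha)+, which matches "ha", "haha", "hahaha", and so on, are chosen for this example.

```python
import re

# Transition table: (current state, input character) -> next state.
TRANSITIONS = {
    ("start", "h"): "seen_h",
    ("seen_h", "a"): "seen_ha",
    ("seen_ha", "h"): "seen_h",
}

# The machine accepts a word only if it stops in one of these states.
ACCEPTING = {"seen_ha"}


def accepts(word: str) -> bool:
    """Return True if the machine accepts the whole word."""
    state = "start"
    for char in word:
        state = TRANSITIONS.get((state, char))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING


# Compare the machine with the equivalent regular expression.
for word in ["ha", "hahaha", "hah", "haa", ""]:
    print(repr(word), accepts(word), bool(re.fullmatch(r"(ha)+", word)))
```

For every word in the loop, the hand-written machine and the regular expression give the same answer.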

TOPICS THAT WILL NOT BE INCLUDED

  • Searching using search engines
  • Text or information extraction (although initial steps are covered)